# Measures of Query cost

# Estimation of Query-Processing Cost

- To choose a strategy based on reliable information, the database system may store statistics for each relation
*r*:- – the number of tuples in
*r*. - – the size in bytes of a tuple of
*r*(for fixed-length records). *V*(*A*,*r*) – the number of distinct values that appear in relation*r*for attribute*A*.

- – the number of tuples in
- The first two quantities allow us to estimate accurately the size of a Cartesian product.
- The Cartesian product contains tuples.
- Each tuple of occupies bytes.
- The third statistic is used to estimate how many tuples satisfy a selection predicate of the form
`<`*attribute-name*> = <*value*> - We need to know how often each value appears in a column.
- If we assume each value appears with equal probability, then
is estimated to have

tuples.

- This may not be the case, but it is a good approximation of reality in many relations.
- We assume such a uniform distribution for the rest of this chapter.
- Estimation of the size of a natural join is more difficult.
- Let and be relations on schemes and .
- If (no common attributes), then is the same as and we can estimate the size of this accurately.
- If is a key for , then we know that a tuple of will join with exactly one tuple of .
- Thus the number of tuples in will be no greater than .
- If is
**not**a key for or , things are more difficult. - We use the third statistic and the assumption of uniform distribution.
- Assume .
- We assume there are
tuples in with an

*A*value of*t*[*A*] for tuple*t*in . - So tuple t of produces
tuples in

- Considering
**all**the tuples in , we estimate that there aretuples in total in

- If we reverse the roles of and in this equation, we get a different estimate
if .

- If this occurs, there are likely to be some dangling tuples that do not participate in the join.
- Thus the lower estimate is probably the better one.
- This estimate may still be high if the values in have few values in common with the values in .
- However, it is unlikely that the estimate is far off, as dangling tuples are likely to be a small fraction of the tuples in a real world relation.

- To maintain accurate statistics, it is necessary to update the statistics whenever a relation is modified.
This can be substantial, so most systems do this updating during periods of light load on the system.