Measures of Query cost
- To choose a strategy based on reliable information, the database system may store statistics for each relation r:
- – the number of tuples in r.
- – the size in bytes of a tuple of r (for fixed-length records).
- V(A, r) – the number of distinct values that appear in relation r for attribute A.
- The first two quantities allow us to estimate accurately the size of a Cartesian product.
- The Cartesian product contains tuples.
- Each tuple of occupies bytes.
- The third statistic is used to estimate how many tuples satisfy a selection predicate of the form
<attribute-name> = <value>
- We need to know how often each value appears in a column.
- If we assume each value appears with equal probability, then
is estimated to have
- This may not be the case, but it is a good approximation of reality in many relations.
- We assume such a uniform distribution for the rest of this chapter.
- Estimation of the size of a natural join is more difficult.
- Let and be relations on schemes and .
- If (no common attributes), then is the same as and we can estimate the size of this accurately.
- If is a key for , then we know that a tuple of will join with exactly one tuple of .
- Thus the number of tuples in will be no greater than .
- If is not a key for or , things are more difficult.
- We use the third statistic and the assumption of uniform distribution.
- Assume .
- We assume there are
tuples in with an A value of t[A] for tuple t in .
- So tuple t of produces
- Considering all the tuples in , we estimate that there are
tuples in total in
- If we reverse the roles of and in this equation, we get a different estimate
- If this occurs, there are likely to be some dangling tuples that do not participate in the join.
- Thus the lower estimate is probably the better one.
- This estimate may still be high if the values in have few values in common with the values in .
- However, it is unlikely that the estimate is far off, as dangling tuples are likely to be a small fraction of the tuples in a real world relation.
- To maintain accurate statistics, it is necessary to update the statistics whenever a relation is modified.
This can be substantial, so most systems do this updating during periods of light load on the system.