Cathleen Ramson : Estimating Metadata of Query Results using Histograms
Answering questions on a given database with SQL-queries can take hours. Firstly, the user has to wait for the query result. Depending on the size of the database, even this step can take multiple hours. Secondly, after the query result is calculated, the user still needs to interpret it. For large answer sets this can also be very time consuming. Therefore, I present Histimate, which is a technique for approximating metadata of query results. Histimate calculates metadata for a query, such as the number of distinct or null values of an attribute, much faster than the database could answer the complete query. Moreover, the metadata summarises the query result and therefore, the user can get a rough idea of the complete answer or sometimes even answer the complete question based on the metadata.
For this estimation, the runtime of Histimate does not increase with an increasing number of tuples in the database. This is achieved by approximating the actual attributes with histograms. Therefore, the runtime as well as the accuracy of the estimation depends on the level of detail of the histograms.
The main idea of Histimate is to use histograms that describe the attributes from the database and simulate complete SQL-queries on these histograms. In order to approximate the result for a given query, each SQL-operator of the query is applied separately on the histograms. As SQL-operators can remove existing tuples from the result as well as adding new ones, Histimate adds or removes values from the histograms depending on the type of the operator. After applying all operators to the histograms, Histimate extracts the metadata from the histograms.