With the recent shift of research and businesses from centralized databases to global information systems and autonomous information sources, research has also turned from traditional query optimization to the field of query planning. Query planning is the problem of finding query execution plans across distributed, heterogeneous, and autonomous information sources. We find that the main discriminator for different query execution strategies is no longer response time but information quality (IQ). We examine the exploitation of IQ criteria to answer user queries to mediator based information systems. We discuss how IQ metadata can be used to (i) improve the quality of query results and (ii) improve the performance of query planning algorithms.
It is a well acknowledged fact that consideration of information quality is an important issue for large-scale distributed information systems. Further, there is much research on capturing and modeling information quality. However, with one exception, there is no research that applies quality reasoning to query planning on the Web.
This section explains in more detail the architecture of our system, the tasks it performs and some evaluation (future work
- The Architecture of HiQIQ.
- The Tasks of HiQIQ
- Select good sources prior to query planning
- Speed up and enhance query planning
- Select best plans for execution
- Merge query results to high quality responses to the user
We assume a mediator based information system, i.e., a set of
wrapped WWW information sources, covering different aspects of a certain application domain, and a mediator with the task of responding to user queries by using the underlying sources. The sources may be overlapping both intensionally and extensionally. They are modeled as views against the global relational schema of the mediator. Each view is rated by a set of IQ criteria such as completeness, understandability, or accuracy. We aim to produce a system that efficiently finds the optimal query results, i.e., results with maximal quality regarding the IQ criteria.
A simple application for these research issues is a meta search
engine which uses existing search engines as its distributed information sources. Quality criteria for search engines include
completeness, update-frequency, and several others. More complex examples include stock information systems, traveling guides, or distributed molecular biology databases.
The enormous number of information sources available to users makes it necessary to query only the most appropriate sources. The information quality offered by these sources can and must be a criterion for source selection. However, information quality has many dimensions, both subjective and objective, and it is thus difficult to directly compare sources with one another or give a ranking of sources. We use several multicriteria decision-making methods to solve these problems.
To achieve efficient planning, there is a further use for IQ
criteria. Current algorithms, for instance the bucket algorithm by Levy et al., are exponential in the size of the user query and the number of views that participate in the system. We have developed a branch & bound planning algorithm that
has two advantages: By intelligent branching the algorithm
efficiently traverses the exponential search space and by computing upper quality bounds it efficiently prunes the search
space. The upper quality bound of a subplan is the maximal IQ that any complete plan containing this subplan can reach. If the upper bound of a subplan is lower than the quality of all of the top N plans that are already complete, the algorithm can drop that subplan and thus prune the search space. Even though the worst case for this algorithm is still exponential, experiments have shown that the upper quality bounds for subplans are very tight. In fact, in typical settings the algorithm overcomes the exponential behaviour and we observe linear runtime.
To compare plans we introduce the concept of merge functions for quality scores. Imagine two views of a plan that
are connected with a join operator. For each individual view we have quality scores in each criterion, represented as an IQ
vector. To determine the quality of the join result of the two we ``merge'' the two IQ vectors using special merge functions for each criterion. Consider the accuracy criterion which is given as a percentage and represents the probability that the data is accurate. The merge function for accuracy is the multiplication, because the probability that either source delivers inaccurate data is the product of the individual probabilities. In such a way we specify a function for each criterion and aggregate IQ scores along the plan, merging them at each join operation. The result is an IQ vector for the entire plan, which we then compare with the vectors of all other plans using one of the decision making methods mentioned above.
Once results from information sources have arrived, they can and should be ordered according to there usefulness to the user and if that is not possible according to their applicability to the user query. Again, IQ criteria are used to achieve this ordering. Users can give weightings to different quality aspects, for instance favoring young information over older information.
top of page