Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

High Quality Information Querying

These are old paragraphs recovered from the project site www.hiqiq.de

Project Overview

The development of the Internet and the World Wide Web during recent years has made it possible and useful to access many different information systems anywhere in the world to obtain desired information. Traditionally, access to digitized information is handled by local, centralized information management systems that are under the direct control of the owner, who is in most cases identical with, or within the same organization as, the user. This has changed: much information obtained by a user today is distributed and no longer under the user's control. Information systems accessed through the Internet and the WWW are locally autonomous (i) in what information they provide and (ii) in how they provide it. Content, quality, access cost, and many other parameters of these systems may differ greatly and cannot be influenced by the user. Finally, a user typically has several similar information sources to choose from and must decide which ones to query. We address the problem of querying multiple autonomous and heterogeneous information sources and delivering an overall "satisfying" answer to the user.

We use information quality criteria (see the Information Quality section below) to:

  • Select good sources prior to query planning
  • Speed up and enhance query planning
  • Select best plans for execution
  • Merge query results to high quality responses to the user

See the Project Details section below for more on these topics.

The HiQIQ project is developed and maintained by the dbis database group at Humboldt University of Berlin.

This research was supported by the German Research Foundation (DFG) through the Berlin-Brandenburg Graduate School in Distributed Information Systems (GkVI, DFG grant no. GRK 316).


Project Details

With the recent shift of research and businesses from centralized databases to global information systems and autonomous information sources, research has also turned from traditional query optimization to the field of query planning. Query planning is the problem of finding query execution plans across distributed, heterogeneous, and autonomous information sources. We find that the main discriminator for different query execution strategies is no longer response time but information quality (IQ). We examine the exploitation of IQ criteria to answer user queries to mediator based information systems. We discuss how IQ metadata can be used to (i) improve the quality of query results and (ii) improve the performance of query planning algorithms.

It is widely acknowledged that the consideration of information quality is an important issue for large-scale distributed information systems. Further, there is much research on capturing and modeling information quality. However, with one exception, there is no research that applies quality reasoning to query planning on the Web.

This section explains in more detail the architecture of our system, the tasks it performs, and some evaluation and future work:

  • The Architecture of HiQIQ
  • The Tasks of HiQIQ
    • Select good sources prior to query planning
    • Speed up and enhance query planning 
    • Select best plans for execution 
    • Merge query results to high quality responses to the user

The Architecture

We assume a mediator based information system, i.e., a set of
wrapped WWW information sources, covering different aspects of a certain application domain, and a mediator with the task of responding to user queries by using the underlying sources. The sources may be overlapping both intensionally and extensionally. They are modeled as views against the global relational schema of the mediator. Each view is rated by a set of IQ criteria such as completeness, understandability, or accuracy. We aim to produce a system that efficiently finds the optimal query results, i.e., results with maximal quality regarding the IQ criteria.
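
For illustration, the source model just described can be sketched in a few lines of Python. All class and field names below are invented for this sketch and do not reflect the project's actual implementation.

    # A minimal sketch of the source model described above; all names are
    # illustrative assumptions, not the project's actual implementation.
    from dataclasses import dataclass
    from typing import Dict

    @dataclass
    class SourceView:
        """A wrapped WWW source, modeled as a view against the mediator's
        global relational schema and rated by an IQ vector."""
        name: str             # e.g., the wrapper of one stock quote site
        view_definition: str  # the view against the global schema
        iq: Dict[str, float]  # criterion name -> quality score

    # Two overlapping sources covering the same aspect of the domain:
    s1 = SourceView("SourceA", "quote(Symbol, Price) :- sourceA(Symbol, Price)",
                    {"accuracy": 0.95, "completeness": 0.4})
    s2 = SourceView("SourceB", "quote(Symbol, Price) :- sourceB(Symbol, Price)",
                    {"accuracy": 0.90, "completeness": 0.7})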

The Tasks

A simple application for these research issues is a meta search engine that uses existing search engines as its distributed information sources. Quality criteria for search engines include completeness, update frequency, and several others. More complex examples include stock information systems, travel guides, or distributed molecular biology databases.
 

Source Selection

The enormous number of information sources available to users makes it necessary to query only the most appropriate sources. The information quality offered by these sources can and must be a criterion for source selection. However, information quality has many dimensions, both subjective and objective, and it is thus difficult to directly compare sources with one another or give a ranking of sources. We use several multicriteria decision-making methods to solve these problems. 
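
As one concrete possibility, the following sketch ranks sources with simple additive weighting (SAW), one of several multicriteria decision-making methods. The min-max scaling scheme, the weights, and all names in this sketch are illustrative assumptions.

    # A sketch of source selection with simple additive weighting (SAW):
    # scale each criterion across all sources, then rank by weighted sum.

    def scale(scores):
        """Min-max scale raw criterion scores to [0, 1] across all sources."""
        lo, hi = min(scores), max(scores)
        return [(s - lo) / (hi - lo) if hi > lo else 1.0 for s in scores]

    def rank_sources(sources, weights):
        """Order sources by their weighted sum of scaled IQ scores."""
        criteria = list(weights)
        # One column of scaled scores per criterion.
        columns = {c: scale([s["iq"][c] for s in sources]) for c in criteria}
        scored = []
        for i, src in enumerate(sources):
            total = sum(weights[c] * columns[c][i] for c in criteria)
            scored.append((round(total, 3), src["name"]))
        return sorted(scored, reverse=True)

    engines = [
        {"name": "EngineA", "iq": {"completeness": 0.8, "timeliness": 0.3}},
        {"name": "EngineB", "iq": {"completeness": 0.5, "timeliness": 0.9}},
    ]
    print(rank_sources(engines, {"completeness": 0.7, "timeliness": 0.3}))
    # [(0.7, 'EngineA'), (0.3, 'EngineB')] -- EngineA wins under this weighting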
 

Query Planning

To achieve efficient planning, there is a further use for IQ criteria. Current algorithms, for instance the bucket algorithm by Levy et al., are exponential in the size of the user query and the number of views that participate in the system. We have developed a branch & bound planning algorithm that has two advantages: by intelligent branching the algorithm efficiently traverses the exponential search space, and by computing upper quality bounds it efficiently prunes the search space. The upper quality bound of a subplan is the maximal IQ that any complete plan containing this subplan can reach. If the upper bound of a subplan is lower than the quality of all of the top N plans that are already complete, the algorithm can drop that subplan and thus prune the search space. Even though the worst case for this algorithm is still exponential, experiments have shown that the upper quality bounds for subplans are very tight. In fact, in typical settings the algorithm overcomes the exponential behavior and we observe linear runtime.
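
The sketch below illustrates this pruning idea under simplifying assumptions: plan quality is a single additive score, a plan is a set of views, and the bound of a subplan is its own quality plus the best conceivable contribution of the remaining views. The actual HiQIQ data structures and bound computation are not shown here; all names are invented for the sketch.

    import heapq
    import itertools

    _tie = itertools.count()  # tie-breaker so the heap never compares plans

    def quality(plan):
        return sum(v["quality"] for v in plan)

    def upper_bound(subplan, remaining):
        """Maximal quality any complete plan containing this subplan can reach."""
        return quality(subplan) + sum(v["quality"] for v in remaining)

    def top_n_plans(views, is_complete, n):
        best = []                      # min-heap holding the current top N plans
        stack = [([], views)]          # (subplan, views not yet considered)
        while stack:
            subplan, remaining = stack.pop()
            if subplan and is_complete(subplan):
                heapq.heappush(best, (quality(subplan), next(_tie), subplan))
                if len(best) > n:
                    heapq.heappop(best)        # keep only the N best
                continue
            # Prune: no completion can beat all of the current top N plans.
            if len(best) == n and upper_bound(subplan, remaining) < best[0][0]:
                continue
            for i, view in enumerate(remaining):   # branch on one more view
                stack.append((subplan + [view], remaining[i + 1:]))
        return sorted(best, reverse=True)

    views = [{"name": "A", "quality": 3}, {"name": "B", "quality": 2},
             {"name": "C", "quality": 4}]
    # Toy completeness test: any two views form a complete plan.
    best = top_n_plans(views, lambda p: len(p) == 2, n=2)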
 

Plan Selection

To compare plans we introduce the concept of merge functions for quality scores. Imagine two views of a plan that are connected with a join operator. For each individual view we have quality scores in each criterion, represented as an IQ vector. To determine the quality of the join result of the two, we "merge" the two IQ vectors using special merge functions for each criterion. Consider the accuracy criterion, which is given as a percentage and represents the probability that the data is accurate. The merge function for accuracy is multiplication because, assuming independent sources, the probability that the joined data is accurate is the product of the individual accuracy probabilities. In this way we specify a function for each criterion and aggregate IQ scores along the plan, merging them at each join operation. The result is an IQ vector for the entire plan, which we then compare with the vectors of all other plans using one of the decision-making methods mentioned above.
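
A small sketch of such merge functions follows. Only the multiplication rule for accuracy is taken from the description above; the merge functions for the other criteria are plausible assumptions for this sketch.

    # Criterion-specific merge functions for joining two IQ vectors.
    # Accuracy follows the text (product of independent probabilities);
    # the remaining functions are assumptions for this sketch.
    MERGE = {
        "accuracy":     lambda a, b: a * b,      # both sides must be accurate
        "availability": lambda a, b: a * b,      # both sources must respond
        "price":        lambda a, b: a + b,      # costs accumulate
        "responseTime": lambda a, b: max(a, b),  # the slower source dominates
    }

    def merge_iq(iq1, iq2):
        """IQ vector of the join of two views rated iq1 and iq2."""
        return {c: MERGE[c](iq1[c], iq2[c]) for c in iq1}

    v1 = {"accuracy": 0.95, "availability": 0.99, "price": 0.0, "responseTime": 2.0}
    v2 = {"accuracy": 0.90, "availability": 0.95, "price": 0.1, "responseTime": 5.0}
    joined = merge_iq(v1, v2)   # accuracy of the join: 0.95 * 0.90 = 0.855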
 

Result Merging

Once results from information sources have arrived, they can and should be ordered according to their usefulness to the user or, if that is not possible, according to their applicability to the user query. Again, IQ criteria are used to achieve this ordering. Users can give weightings to different quality aspects, for instance favoring recent information over older information.
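
For illustration, the sketch below orders result tuples by a weighted combination of a recency score and an accuracy score. The concrete recency formula and weights are assumptions for this example, not the project's actual ranking function.

    from datetime import datetime

    def usefulness(result, weights, now):
        age_days = (now - result["last_updated"]).days
        recency = 1.0 / (1.0 + age_days)   # newer information scores higher
        return (weights["recency"] * recency
                + weights["accuracy"] * result["accuracy"])

    now = datetime(2000, 6, 1)
    results = [
        {"value": "quote A", "accuracy": 0.90, "last_updated": datetime(2000, 5, 31)},
        {"value": "quote B", "accuracy": 0.99, "last_updated": datetime(2000, 3, 1)},
    ]
    weights = {"recency": 0.6, "accuracy": 0.4}
    results.sort(key=lambda r: usefulness(r, weights, now), reverse=True)
    # With these weights the fresh "quote A" outranks the more accurate "quote B".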


People

The HiQIQ project is developed and maintained by the dbis database group at Humboldt University of Berlin.

A description of a sample demo is here.

Information Quality

Defining information quality (IQ) is extremely difficult due to the subjective nature of quality. Definitions such as "fitness for use" are general enough to be correct, but they are unusable when IQ is to be evaluated.

Information quality (IQ) has many facets. To capture these many dimensions, quality is often described as a set of criteria, for instance by Wang and Strong (1996). What follows is a long list of IQ criteria as they have surfaced during research. Not all criteria must be used in all systems, and new criteria can easily be added to this list.

  •  Content-related Criteria 
    •  Accuracy 
    •  Completeness 
    •  Customer Support 
    •  Documentation 
    •  Interpretability 
    •  Relevancy 
    •  Value added 
  •  Technical Criteria 
    •  Availability 
    •  Latency 
    •  Price 
    •  Response time 
    •  Security 
    •  Timeliness 
  •  Intellectual Criteria 
    •  Believability 
    •  Objectivity 
    •  Reputation 
  •  Instantiation-related Criteria 
    •  Amount of data 
    •  Representational conciseness 
    •  Representational consistency 
    •  Understandability 
    •  Verifiability 

For links to other sites and projects concerning information quality, see the Related Sites section at the end of this page.


Content-related Criteria

Accuracy 

is the quotient of the number of correct values in the source and the overall number of values in the source. For our context, this is the percentage of data without data errors such as non-unique keys or out-of-range values.
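
Restated as a formula (a direct transcription of the definition above), with V(S) denoting the set of values provided by source S:

    \mathrm{accuracy}(S) \;=\; \frac{\lvert \{\, v \in V(S) : v \text{ is correct} \,\} \rvert}{\lvert V(S) \rvert}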


 

Increasing accuracy is a main goal of many research efforts. Accuracy is often used synonymously with data quality, as opposed to information quality. For us, data quality or accuracy is only one aspect of the overall information quality, which includes the entire set of criteria in this list. 

 

The accuracy criterion is as important in a WWW information system setting as it is for traditional databases. Accuracy is one of the main intrinsic properties of information. Incorrect information is hard to detect, useless, and in many cases even harmful.

 

Synonyms: data quality, error rate, correctness, reliability, integrity, precision 

Completeness

is the quotient of the number of non-null values in a source and the number of non-null values in the combination of all available sources. Applied to a global relational schema, the number of non-null values in the combination of all available sources corresponds to the size of the universal relation: the number of attributes multiplied by the number of tuples if all available sources were queried. The number of non-null values in a source is then the number of values the source can insert into the schema of the universal relation.
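
As a formula (a transcription of the definition above), where |A| is the number of attributes and |T| the number of tuples of the universal relation, and V(S) the set of values source S contributes to it:

    \mathrm{completeness}(S) \;=\; \frac{\lvert \{\, v \in V(S) : v \neq \mathrm{null} \,\} \rvert}{\lvert A \rvert \cdot \lvert T \rvert}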

 

We define completeness more formally in Chapter . There we analyze this criterion in great depth and apply it to several application domains. In Chapter  we perform optimization to maximize completeness. 

 

Completeness is of great importance in information systems that integrate multiple information sources. One of the main goals for integration is to increase completeness: Querying only one source typically gives only one part of the result. Querying another source will provide another, possibly overlapping part. The more sources we query, the more complete the result will be. 

 

Synonyms: coverage, scope, granularity, comprehensiveness, density, extent 

Customer support 

is the amount and usefulness of human help via email or phone. This criterion is closely related to the documentation criterion below. It is one part of an overall help system that guides users in understanding and using information. Depending on the type of support, one part of the measure could be the average waiting time for a response. Another part, more difficult to assess, is how useful the help actually is.

 

For a discussion on the importance of this criterion, see the documentation criterion below, where the same arguments apply. 

Documentation

is the amount and usefulness of documents with meta information. For WWW information systems, documentation usually takes the form of "help" links that lead to Web pages explaining the provided data. As a simple measure we count the number of words in the documentation. Issues of usefulness and understandability are already covered by other criteria; we extend the scope of those scores to the documentation part of the source.

 

The importance of the documentation criterion depends on the application: Often the presentation of information is self-describing and it is not necessary to measure how well a source documents its information. For instance, this is the case for search engines. On the other hand, there are domains where integration and use of the source is not possible without good documentation. Molecular biology information sources have great problems with synonyms, homonyms, and other types of heterogeneity. Without good documentation, query results are very prone to misunderstanding.

 

Synonyms: clarity of definition, traceability 

Interpretability

is the degree to which the information conforms to the technical abilities of the consumer. Technical abilities include languages spoken, units understood, etc. A highly interpretable source must also provide clear and simple definitions of all elements of the information. In this sense interpretability is similar to documentation and understandability.

 

In integrated information systems, the interpretability of a source is not as important as other criteria, because we assume that many of these issues are hidden by the wrappers and the mediator. The wrapper of a source can already convert units to suit the user, text can be automatically translated at least to a useful extent, etc. It is then up to the wrapper to present the integrated information in an interpretable way. In conclusion, an information source with high interpretability is easier to include in a mediated system, but the criterion plays a less important role once the source is successfully integrated.

 

Synonyms: clarity of definition, simplicity 

Relevancy (or relevance)

is the degree to which the provided information satisfies the user's needs. Relevancy is an often-used criterion in the field of information retrieval. There, a document or piece of information is considered relevant to the query if the keywords of the query appear often and/or in prominent positions in the document.


 

The importance of the relevancy criterion depends on the application domain. For search engines, relevancy is quite important: returned Web page links should be as relevant as possible, even though this is difficult to achieve. For instance, a query for the term "jaguar" at any WWW search engine will retrieve document links both for the animal and the automobile. If the user had the animal in mind, the links to automobile sites should be considered not relevant. In other application domains, relevancy is implicitly high. For instance, a query for IBM stock quotes in an integrated stock information system will only return relevant results, namely IBM stock quotes. The reason for this discrepancy is the definition of the domain: Search engines have the entire WWW as their domain and thus provide much information that is of no interest to the user. The domain of a stock information system is much more clear-cut and much smaller, so a query is less likely to produce irrelevant results.

For our purposes we reduce the relevancy criterion to a correctness criterion: If a result is correct with respect to the user query, we assume it is also relevant. If it is actually not relevant, the user query was either incorrect with respect to what the user had in mind or not specific enough. Relevance feedback techniques were developed by Salton and McGill to make a query more specific and thus increase relevancy.

Synonyms: domain precision, minimum redundancy, applicability, helpfulness 

Value-Added

is a criterion that measures the amount of monetary benefit the use of the information provides. This criterion is typical for decision-support types of information systems where a cost-benefit calculation is undertaken. The value-added criterion must be considered when there is a cost involved in obtaining the information and when the nature of the information is as yet unknown.

 

Often value-added cannot be attributed to the source of the information but only to the information itself. A stock information system will provide stock quotes but cannot influence them and thus cannot increase "value-addedness"; a search engine has no influence on how useful its results are. For this reason this criterion is often not considered for WWW information systems. 



Technical Criteria

Availability

of an information source is the probability that a feasible query is correctly answered in a given time range. Availability is a technical measure concerning hardware and software of the source and the network connections between user, mediator, wrappers, and sources. Typically, availability is also time-dependent due to different usage patterns of the information source. 
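
Written as a formula, for a time bound t (the probabilistic form is a transcription of the definition above; the probe-based estimator is an illustrative assumption):

    \mathrm{availability}(S) \;=\; P\big(\text{a feasible query to } S \text{ is answered correctly within time } t\big)
      \;\approx\; \frac{\#\,\text{successfully answered probe queries}}{\#\,\text{probe queries}}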

 

Availability is an important criterion for WWW information sources for many reasons: time-of-day and day-of-week dependent network congestion; world-wide distribution of servers; high concurrent usage; denial-of-service attacks; planned maintenance interruptions. Query execution in integrated systems is especially vulnerable to low availability because usually all participating sources of a query execution plan must be available for the query to be executed correctly. In Section  we pay special attention to the availability criterion and propose an algorithm that dynamically adapts its optimization strategy in case of an unavailable source.

 

Synonyms: accessibility, technical reliability, retrievability, performability 

Latency

is the amount of time in seconds from issuing the query until the first information reaches the user. If the result of the query is only one piece of information, e.g., one stock quote, latency equals response time (see below). 

 

Latency is an important criterion in WWW information system settings for two reasons. First, information is sent over the Internet using the hypertext transfer protocol (HTTP). This protocol sends data packaged in chunks of up to 64 kilobytes; if the entire response is larger, the first package can be displayed before further packages arrive. Additionally, many sources withhold the entire result and only return the first part. For instance, search engines typically return only the first 10 links; if the user desires more results, another query must be posed by following a link. The second reason is that in many applications the user is actually only interested in the first part of the information or only in an overview of the result. Again, search engines are a good example: often the first 10 results are enough to satisfy the user, especially if the results are ranked well. For many other applications, not the actual result but the number of results is the only interest of the user. Consider a user querying a stock information system for companies whose stocks have risen more than 50% during the last year. Most often, not the actual companies but their number is of interest.

 

Synonyms: Often response time and latency are used synonymously. 

Price

is the amount of money a user has to pay for a query, as determined by the provider. Commercial data sources usually charge either on a subscription basis or on a pay-per-query or pay-per-byte basis. Often there is a direct tradeoff between price and other IQ criteria: Free stock information services provide stock quotes with some delay (usually 15 minutes), while subscription systems provide the quotes in real time. Also, there may be hidden costs in retrieving information: users spend time online paying for the Internet connection, and users are exposed to advertisements.

 

Considering price is important if at least one integrated information source charges money for information. It is common opinion that the World Wide Web has prospered due to its free information services; information sources earn money by displaying advertisements. Experts predict a change towards high-quality information sources that charge money for their services.

 

Synonyms: query value-to-cost ratio, cost-effectivity 

Response time

measures the delay in seconds between submission of a query by the user and reception of the complete response from the information system. The score for this criterion depends on unpredictable factors such as network traffic, server workload, etc. Another factor is the type and complexity of the user query. Again, this cannot be predicted; however, it can be taken into account once the query is posed and a query execution plan is developed. Finally, the technical equipment of the information server plays a role as well. However, in WWW settings, network delay usually dominates all other factors.

 

Response time is the main criterion for traditional database optimizers. While for WWW information systems it is just one aspect among many other IQ criteria, it is still of some significance. Because of frequent time-outs and unknown availability of sources, users who wait long for a response from a WWW information source are more prone to abort the query than database users. This cancellation can be prevented by low latency, which gives users at least some results early on. Another reason for the importance of low response time is the potential competition on the Web: with many alternative sites, users will quickly switch from one source to another to find the desired information. An integrated system such as a meta search engine avoids this effect but must also consider response time when deciding which sources to use to answer a query.

 

Synonyms: performance, turnaround time 

Security

is the degree to which information is passed privately from users to the information source and back. Security covers technical aspects such as cryptography, secure login, etc., but also the possibility of anonymization and authentication of the information source by a trusted organization. Most WWW information sources publish a privacy policy to show that they are concerned with the topic.

 

The importance of security is very much application-domain dependent: Users of search engines are typically not concerned about privacy; quite the contrary: the meta search engine MetaCrawler provides a utility that allows users to watch queries as they are passed to the engine. In other application domains, users are very sensitive towards security: users typically prefer their stock quote lookups to be secure, and complex queries against molecular biology information systems can already give away a valuable idea.

 

Synonyms: privacy, access security 

Timeliness

is the average age of the information in a source. The unit of timeliness depends on the application: for some, seconds are appropriate; for others, days are sufficiently precise. Here, the age of data is not the time between the creation of the data and now, but the time since the last update or verification of the data. For instance, the timeliness of search engines is their update frequency, i.e., the frequency with which they re-index Web pages; it is not the age of the Web pages themselves. For stock information systems, timeliness is a measure of the delay with which stock quotes are presented. Typical free services have a 15-minute delay between the occurrence of a quote and its delivery to the user, while subscription quote services have much less or even no delay. In a fast-growing area such as molecular biology, it is reasonable to use the update frequency of a data source rather than the average age of the data as the criterion.
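
As a formula (a transcription of the average-age definition above, with notation invented for this sketch), where V(S) is the set of values in source S and t_upd(v) is the time of the last update or verification of value v:

    \mathrm{timeliness}(S) \;=\; \frac{1}{\lvert V(S) \rvert} \sum_{v \in V(S)} \big( t_{\mathrm{now}} - t_{\mathrm{upd}}(v) \big)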

 

Timeliness is arguably one of the most important criteria for WWW information sources. The main advantage of the Internet over traditional information sources like newspapers or journals is its ability to provide new information almost instantly and world-wide. A main reason for users to turn to WWW information services is to obtain up-to-date information. For search engines, high timeliness means, for instance, fewer dead links; for stock information systems, high timeliness allows quicker reactions to changes on the stock market.

 

Synonyms: up-to-date, freshness, currentness 



Intellectual Criteria

Believability

is the degree to which the information is accepted as correct by the user. In a sense, believability is the expected accuracy. Therefore, it can be measured in the same unit as accuracy, but in general the believability criterion is influenced by many other factors, so that a generic "grade" is more appropriate.

 

When querying autonomous information sources, believability is an important criterion. Apart from simply providing information, a source must convince the user that this information is "accepted or regarded as true, real, and credible".

 

Synonyms: error rate, credibility, trustworthiness 

Objectivity

is the degree to which information is unbiased and impartial. The criterion score mainly depends on the affiliation of the information provider. Also, the criterion is strongly related to the verifiability criterion: the more verifiable a source is, the more objective it will be. Again, objectivity is measured by some grade, as there is no "real" unit for this criterion.

 

Objectivity is an important criterion if users fear some malice of the information source. This fear can be addressed by simply not using an information source with low objectivity, or at least by verifying the information. Search engines often display biased information for two reasons: (i) Web pages indexed by the search engine add certain keywords to their pages to be ranked higher for searches; a popular example is to repeat the word "sex" thousands of times on a Web page. (ii) Search engines can be paid by Web site providers to purposefully rank their pages higher than others, overriding the standard ranking algorithm employed by the search engine. Such bias is difficult to detect since search engines do not publish their ranking algorithms. Stock quotes, on the other hand, can easily be verified, so bias is not very likely, and thus objectivity is not an important criterion for that domain.

Reputation

is the degree to which the information or its source is in high standing. For instance, the Yahoo stock quote service might have a higher reputation than that of some offshore bank; the CNN news server might have a higher reputation than that of the Phoenix Gazette. Reputation increases with a higher level of awareness among users. Thus, older, long-established information sources will typically have a higher reputation.

 

The reputation criterion can be important in some applications. For instance, we observed that most biologists actually prefer certain sources over others because of their higher reputation. Also, people tend to trust data from their own institute more than external data, just as they tend to prefer well-known sources. This fact can be expressed with the help of reputation scores.

 

Synonyms: credibility



Instantiation-related Criteria

Amount of data

is the size of the query result, measured in bytes. Whenever appropriate, amount can also be measured as the number of result tuples. For instance, the number of links a search engine can return for a single request typically varies from 10 to 100. Note that this is independent of the actual number of hits a search engine discovers; we know of no search engine that will actually return more than 100 links, even if more were found. When querying a stock information service for company profiles, amount is the length of the profile in bytes.

 

We argue that the larger the amount of data, the better we consider the source or response. There are methods to reduce the amount of data in a sophisticated way. For instance, techniques from the information retrieval area can be applied to find the best links from a set returned by a search engine. The more input such techniques have, the better their results will be: the probability of finding a relevant link (by hand or automatically) is larger if more links are returned.

 

The importance of the amount criterion depends on the type of query. In a query for the stock quote of a certain company, the amount of data returned is of no importance; it is simply a number. However, in a query for all information on a company, including profiles, press releases, etc., amount can be quite important.

 

Synonyms: essentialness 

Representational conciseness 

is the degree to which the structure of the information matches the information itself. Search engines typically have a high conciseness: their main results are link lists, which are represented as such. Molecular biology information systems, on the other hand, often have a low conciseness, with incomprehensible data formats, many abbreviations, and unclear graphical representations. Also, with most results these systems deliver a large amount of historical data that is no longer valid.

 

In our context of a mediator-wrapper architecture, representational conciseness is only of marginal importance. Wrappers extract the information from the sources and restructure it according to the global schema of the mediator. Any representational inconciseness would be caught by the wrapper and hidden from the user. Note, however, that representational conciseness is a measure of the complexity and stability of a wrapper. The less concise the representation, the more difficult it is to build a wrapper around the source, possibly to the degree that parts or all of the information cannot be extracted. Low conciseness regarding previous information makes a wrapper highly unstable, i.e., the wrapper must be maintained and updated frequently.

 

Synonyms: attribute granularity, occurrence identifiability, structural consistency, appropriateness, format precision 

Representational consistency

is the degree to which the structure of the information conforms to previously returned information. Since we review multiple sources, we extend this definition to not only compare compatibility with previous data but also with data of other sources. Thus representational consistency is also the degree to which the structure of the information conforms to that of other sources. 

 

We assume wrappers to deliver a relational export schema that is always consistent with the global schema against which we query. Representational consistency is thus a criterion that measures the work necessary for the wrapper to parse files, transform units and scales, or translate identifiers into canonical object names.

 

Synonyms: integrity, homogeneity, semantic consistency, value consistency, portability, compatibility 

Understandability

is the degree to which the information can be easily comprehended by the user. Thus, understandability measures how well a source presents its information, so that the user is able to comprehend its semantic value. Understandability is measured as a grade. The grade must be specified only once by the user and remains the same as long as the source does not undergo major changes in its appearance. The grade could possibly be determined with the help of a questionnaire containing questions on structure, language, layout etc. 

 

Understandability is only marginally important for mediated information systems, for the same reason as representational conciseness: A wrapper extracts information from the information source and transforms it according to the relational schema of the mediator, so any good or bad understandability is lost in this process. However, there are application domains or types of information where the understandability score is retained. For instance, the understandability of a news article remains the same, independent of any representational changes. Also, graphics are typically not changed by the wrapper or mediator, so their understandability remains unchanged as well.

 

Synonyms: ease of understanding 

Verifiability

is the degree and ease with which the information can be checked for correctness. When information is mistrusted, it should be verified with the help of a third party, which should be unbiased if possible. Verifiability is high if the information source either names the actual source of the information or points to a trusted third-party source where the information can be checked for correctness. Note that verifiability differs from believability in that verification can find a piece of information correct or incorrect, while belief trusts the information without checking.

 

Verifiability is an important factor if the mediated system includes sources with a low believability or reputation. Especially WWW information sources can suffer low scores in these criteria simply because they have not had the time to establish a good reputation.

 

Synonyms: naturalness, traceability, provability

 

Related sites

Other IQ Projects

  • TDQM Total Data Quality Management
  • DWQ Welcome page
  • Qualität von Informationsdiensten und elektronischen Märkten (Quality of Information Services and Electronic Markets)
  • DaQuinCIS: Data Quality in Cooperative Information Systems

IQ resources

  • Quality Definition Frame Definition Page
  • Information Quality WWW VL
  • Information Quality Resources 

Query Planning in Mediated Information Systems (link lists)

  •  Mediation by Gio Wiederhold
  •  Related projects to database integration by Ulf Leser

Miscellaneous

  • The Metaphysics of Quality - The Philosophy of Robert M. Pirsig
  • The Quality Information Checklist
