Date of defense: 09.03.2011
The development of the Internet in recent years has made it possible and useful to obtain information from many different information systems anywhere in the world. While there is much research on the integration of heterogeneous information systems, most of it stops short of integrating the actual data.
Data fusion is the process of combining multiple records representing the same real-world object into a single, consistent, and clean representation. The central problem in fusing data is handling the data conflicts (uncertainties and contradictions) that may exist among the different representations. In preliminary steps, duplicate detection techniques help to identify different representations of the same real-world object, and schema mappings are used to find and represent equivalent descriptions if the data originates from different sources.
This thesis first formalizes data fusion as part of a data integration process, and describes and categorizes different conflict handling strategies for fusing data. Based on this categorization, three database operators are presented in more detail. First, we present the minimum union operator and the complement union operator, each of which handles a different special case of uncertainty and implements one specific strategy. Both condense the set of multiple representations by removing unnecessary ‘null’ values. With minimum union, subsumed tuples are removed, because a subsumed tuple does not contain any new information. With complement union, two tuples are combined into one if, for each attribute, the values coincide or only one of the tuples has a value. The result of either operator contains fewer ‘null’ values. We show how both operators can be moved around in query plans. Second, we present the data fusion operator, which can implement many different conflict handling strategies and can be used to resolve conflicts, resulting in a single representation for each real-world object. We also show how this operator can be moved around in query plans. For all three operators we give algorithms and show how to implement them as primitives in a database management system.
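As an illustration of the two conditions described above, the following Python sketch checks subsumption and complementation on plain tuples, with `None` standing in for a ‘null’ value. It is a minimal sketch and not the algorithms given in the thesis: the function names (`subsumes`, `remove_subsumed`, `complement`) are hypothetical, subsumption is assumed to mean "agrees on every non-null attribute and has strictly more nulls", and the complement check only covers a single pair of tuples rather than a whole relation.

```python
def subsumes(t1, t2):
    """Assumed definition: t1 subsumes t2 if t2 agrees with t1 wherever
    t2 is non-null and t2 has strictly more nulls than t1."""
    agrees = all(b is None or a == b for a, b in zip(t1, t2))
    more_nulls = sum(v is None for v in t2) > sum(v is None for v in t1)
    return agrees and more_nulls


def remove_subsumed(rows):
    """Drop exact duplicates and every row subsumed by another row.

    Naive quadratic comparison, meant only to show the semantics.
    """
    unique = list(dict.fromkeys(rows))  # keeps first-seen order
    return [r for r in unique if not any(subsumes(o, r) for o in unique)]


def complement(t1, t2):
    """Combine two tuples if, for each attribute, the values coincide or
    only one tuple has a value; otherwise return None."""
    if all(a is None or b is None or a == b for a, b in zip(t1, t2)):
        return tuple(b if a is None else a for a, b in zip(t1, t2))
    return None  # the tuples contradict each other on some attribute


if __name__ == "__main__":
    rows = [
        ("Ann", "Berlin", None),
        ("Ann", None, None),       # subsumed by the first row
        ("Bob", None, "555"),
        ("Ann", "Berlin", None),   # exact duplicate
    ]
    print(remove_subsumed(rows))
    print(complement(("Ann", "Berlin", None), ("Ann", None, "555")))
    print(complement(("Ann", "Berlin", None), ("Ann", "Paris", None)))
```

Neither function resolves contradictions: two tuples that disagree on a non-null value are left as they are, which matches the description of both operators as handling uncertainties only.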
The feasibility and usefulness of the three operators are shown by including and evaluating them in two integrated information systems developed within our research group. The HumMer system is an integrated information system that allows the user to integrate and fuse data from different data sources using a wizard-like interface. The FuSem system allows the user to apply different fusion semantics and their corresponding operators to the data and to visualize the differences between the fused results, thus guiding the user to the most satisfying result for the task at hand.
The thesis covers the process of data fusion in its entirety: the user chooses one or more strategies to fuse data and expresses them using a relational integrated information system. The system then executes the operation efficiently and finally lets the user inspect the result and explore the differences between fusion strategies.
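The idea of choosing a conflict handling strategy and obtaining one representation per real-world object can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the data fusion operator defined in the thesis or the interface of HumMer or FuSem: the names `fuse`, `coalesce`, and `vote` are hypothetical, records are assumed to share one schema, and duplicates are assumed to have already been identified by a common key attribute.

```python
from collections import defaultdict


def coalesce(values):
    """Strategy: take the first non-null value, or None if there is none."""
    return next((v for v in values if v is not None), None)


def vote(values):
    """Strategy: take the most frequent non-null value (first seen on ties)."""
    non_null = [v for v in values if v is not None]
    return max(non_null, key=non_null.count) if non_null else None


def fuse(records, key, resolvers, default=coalesce):
    """Group records by `key` and resolve each attribute with its strategy.

    `resolvers` maps attribute names to resolution functions; attributes
    without an entry fall back to `default`.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    fused = []
    for group in groups.values():
        result = {}
        for attr in group[0]:  # assumes all records share the same schema
            resolve = resolvers.get(attr, default)
            result[attr] = resolve([r[attr] for r in group])
        fused.append(result)
    return fused


if __name__ == "__main__":
    records = [
        {"name": "Ann", "city": "Berlin", "age": 31},
        {"name": "Ann", "city": None, "age": 32},
        {"name": "Ann", "city": "Berlin", "age": 32},
        {"name": "Bob", "city": "Potsdam", "age": None},
    ]
    for row in fuse(records, key="name", resolvers={"age": vote}):
        print(row)
```

Swapping the function registered for an attribute changes the fusion semantics without touching the grouping logic, which is the kind of comparison between strategies the paragraph above describes.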
Dissertation available for download (PDF)