Hasso-Plattner-Institut
Prof. Dr. Felix Naumann

Arvid Heise - Data Cleansing and Integration Operators for a Parallel Data Analytics Platform (2015)

Link to thesis

The data quality of real-world datasets needs to be constantly monitored and maintained to allow organizations and individuals to reliably use their data. In particular, data integration projects suffer from poor initial data quality and consequently consume more effort and money. Commercial products and research prototypes for data cleansing and integration help users to improve the quality of individual and combined datasets. They can be divided into standalone systems and database management system (DBMS) extensions. On the one hand, standalone systems do not interact well with DBMSs and require time-consuming data imports and exports. On the other hand, DBMS extensions are often limited by the underlying system and do not cover the full set of data cleansing and integration tasks.

We overcome both limitations by implementing a concise set of five data cleansing and integration operators on the parallel data analytics platform Stratosphere. We define the semantics of the operators, present their parallel implementation, and devise optimization techniques for individual operators and combinations thereof. Users specify declarative queries in our query language METEOR with our new operators to improve the data quality of individual datasets or to integrate them into larger datasets. By integrating the data cleansing operators into the higher-level language layer of Stratosphere, users can easily combine cleansing operators with operators from other domains, such as information extraction, into complex data flows. Through a generic description of the operators, the Stratosphere optimizer can reorder operators even across domains to find better query plans.
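
As a rough illustration of what composing such cleansing operators into a data flow looks like, here is a minimal Python sketch. The operator names, record structure, and fusion rule are hypothetical placeholders; they do not reflect METEOR syntax or the actual Stratosphere operators.

```python
# Minimal sketch of composing cleansing operators into a data flow.
# Operator names and record structure are illustrative placeholders,
# not actual METEOR syntax or Stratosphere operators.

def scrub(records, rules):
    """Normalize attribute values according to simple per-attribute rules."""
    return [{k: rules.get(k, lambda v: v)(v) for k, v in r.items()} for r in records]

def detect_duplicates(records, key):
    """Group records that agree on a (simplified) blocking key."""
    groups = {}
    for r in records:
        groups.setdefault(r[key], []).append(r)
    return [g for g in groups.values() if len(g) > 1]

def fuse(group):
    """Resolve each duplicate group to one record, preferring non-empty values."""
    fused = {}
    for r in group:
        for k, v in r.items():
            if v and not fused.get(k):
                fused[k] = v
    return fused

politicians = [
    {"name": " Angela Merkel ", "party": "CDU"},
    {"name": "angela merkel", "party": ""},
]
cleaned = scrub(politicians, {"name": lambda v: v.strip().title()})
result = [fuse(g) for g in detect_duplicates(cleaned, "name")]
print(result)  # [{'name': 'Angela Merkel', 'party': 'CDU'}]
```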

As a case study, we reimplemented part of the large Open Government Data integration project GovWILD with our new operators and show that our queries run significantly faster than the original GovWILD queries, which rely on relational operators. Our evaluation reveals that our operators exhibit good scalability on up to 100 cores, so that even larger inputs can be processed efficiently by scaling out to more machines. Finally, our scripts are considerably shorter than the original GovWILD scripts, which makes them easier to maintain.

Alexander Albrecht - Understanding and Managing Extract-Transform-Load Systems (2014)

Extract-Transform-Load (ETL) systems are visual programming tools that allow the definition of complex workflows to extract, transform, and load heterogeneous data from one or more sources into target databases by placing, specifying, and connecting sources, transformations, and targets. ETL workflows are stored in repositories to be executed periodically, e.g., every hour, daily, or once a week. In the course of a complex data warehousing project, up to several hundred ETL workflows are created by different individuals and stored in such repositories. Moreover, over time the created ETL workflows often become larger and more complex, and thus difficult for developers to comprehend.

In this thesis we address the problem of ETL workflow comprehension in the context of managing large repositories of ETL workflows. The main contributions of this thesis include the implementation of a generic ETL management framework and the development of novel techniques for ETL workflow comprehension, such as schema decryption and ETL workflow abstraction. Our schema decryption method supports ETL developers in understanding cryptic schema labels of sources, targets, and ETL transformations. For a given ETL system, our recommender-like approach leverages the large number of mapped attribute labels in existing ETL workflows to produce good and meaningful decryptions. Our ETL workflow abstraction approach provides a simplified, easier-to-read ETL workflow by partitioning all transformations in an ETL workflow into a few informative sub-workflows. Furthermore, our approach supports an interactive zoom-in/out capability allowing an ETL developer to identify and focus only on relevant parts of a given ETL workflow.
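
To give a flavor of recommender-style schema decryption, the sketch below learns token expansions from attribute label pairs that are already mapped in existing workflows, using a naive positional alignment. The training pairs, tokenization, and scoring are hypothetical simplifications of the approach described above, not the thesis's actual method.

```python
from collections import Counter, defaultdict

# Hypothetical training data: (cryptic label, verbose label) pairs taken from
# attribute mappings found in existing ETL workflows.
mapped_labels = [
    ("CUST_NM", "customer name"),
    ("CUST_NO", "customer number"),
    ("ORD_NO", "order number"),
]

# Learn candidate expansions for each cryptic token from co-mapped verbose words.
expansions = defaultdict(Counter)
for cryptic, verbose in mapped_labels:
    tokens, words = cryptic.lower().split("_"), verbose.split()
    for t, w in zip(tokens, words):          # naive positional alignment
        expansions[t][w] += 1

def decrypt(label):
    """Suggest the most frequent expansion for each token of a cryptic label."""
    return " ".join(expansions[t].most_common(1)[0][0] if expansions[t] else t
                    for t in label.lower().split("_"))

print(decrypt("CUST_NO"))  # customer number
```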

Johannes Lorey - What’s in a Query: Analyzing, Predicting, and Managing Linked Data Access (2014)

Link to thesis

The term Linked Data refers to connected information sources comprising structured data about a wide range of topics and for a multitude of applications. In recent years, the conceptual and technical foundations of Linked Data have been formalized and refined. To this end, well-known technologies have been established, such as the Resource Description Framework (RDF) as a Linked Data model and the SPARQL Protocol and RDF Query Language (SPARQL) for retrieving this information.

Whereas most research has been conducted in the area of generating and publishing Linked Data, this thesis presents novel approaches for improved management. In particular, we illustrate new methods for analyzing and processing SPARQL queries. Here, we present two algorithms suitable for identifying structural relationships between these queries. Both algorithms are applied to a large number of real-world requests to evaluate the performance of the approaches and the quality of their results. Based on this, we introduce different strategies enabling optimized access to Linked Data sources. We demonstrate how the presented approach facilitates effective utilization of SPARQL endpoints by prefetching results relevant for multiple subsequent requests.
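
As a minimal illustration of measuring structural relationships between SPARQL queries, the sketch below compares queries by the Jaccard similarity of their variable-normalized triple patterns. The parsing and the measure are hypothetical simplifications, not the algorithms developed in the thesis.

```python
import re

def triple_patterns(query):
    """Extract triple patterns from a (very) simplified SPARQL WHERE clause
    and normalize variable names, so that structurally equal patterns match."""
    body = re.search(r"\{(.*)\}", query, re.S).group(1)
    patterns = set()
    for part in body.split("."):
        tokens = part.split()
        if len(tokens) == 3:
            patterns.add(tuple("?v" if t.startswith("?") else t for t in tokens))
    return patterns

def structural_similarity(q1, q2):
    """Jaccard similarity over the normalized triple patterns of two queries."""
    a, b = triple_patterns(q1), triple_patterns(q2)
    return len(a & b) / len(a | b) if a | b else 0.0

q1 = "SELECT ?x WHERE { ?x rdf:type dbo:City . ?x dbo:country dbr:Germany }"
q2 = "SELECT ?y WHERE { ?y rdf:type dbo:City . ?y dbo:populationTotal ?p }"
print(structural_similarity(q1, q2))  # 0.333... -- one of three distinct patterns is shared
```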

Furthermore, we contribute a set of metrics for determining technical characteristics of such knowledge bases. To this end, we devise practical heuristics and validate them through a thorough analysis of real-world data sources. We discuss the findings and evaluate their impact on utilizing the endpoints. Moreover, we detail the adoption of a scalable infrastructure for improving Linked Data discovery and consumption. As we outline in an exemplary use case, this platform is suitable both for processing and for provisioning the corresponding information.

Ziawasch Abedjan - Improving RDF Data with Data Mining (2014)

Link to thesis

Linked Open Data (LOD) comprises numerous, often large public datasets and knowledge bases. Those datasets are mostly represented in the RDF triple structure of subject, predicate, and object, where each triple represents a statement or fact. Unfortunately, the heterogeneity of available open data requires significant integration steps before it can be used in applications. Meta information, such as ontological definitions and exact range definitions of predicates, is desirable and ideally provided by an ontology. However, in the context of LOD, ontologies are often incomplete or simply not available. Thus, it is useful to automatically generate meta information, such as ontological dependencies, range definitions, and topical classifications.

Association rule mining, which was originally applied for sales analysis on transactional databases, is a promising and novel technique to explore such data. We designed an adaptation of this technique for mining RDF data and introduce the concept of “mining configurations”, which allows us to mine RDF datasets in various ways. Different configurations enable us to identify schema and value dependencies that in combination result in interesting use cases. To this end, we present rule-based approaches for auto-completion, data enrichment, ontology improvement, and query relaxation. Auto-completion remedies the problem of inconsistent ontology usage, providing an editing user with a sorted list of commonly used predicates. A combination of different configurations extends this approach to create completely new facts for a knowledge base. We present two approaches for fact generation: a user-based approach, where a user selects the entity to be amended with new facts, and a data-driven approach, where an algorithm discovers entities that have to be amended with missing facts.
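
To make the idea of a mining configuration concrete, the following Python sketch treats each subject as a transaction and its predicates as items (one plausible configuration) and derives simple co-occurrence rules for predicate suggestion. The data, threshold, and confidence scoring are illustrative simplifications.

```python
from collections import defaultdict
from itertools import combinations

# Toy RDF triples (subject, predicate, object).
triples = [
    ("Berlin", "rdf:type", "City"), ("Berlin", "population", "3645000"),
    ("Berlin", "country", "Germany"),
    ("Hamburg", "rdf:type", "City"), ("Hamburg", "population", "1841000"),
    ("Hamburg", "country", "Germany"),
    ("Elbe", "rdf:type", "River"), ("Elbe", "length", "1094"),
]

# One mining configuration: each subject is a transaction, its predicates are items.
transactions = defaultdict(set)
for s, p, o in triples:
    transactions[s].add(p)

# Count predicate co-occurrence to derive simple association rules p1 -> p2.
support = defaultdict(int)
single = defaultdict(int)
for preds in transactions.values():
    for p in preds:
        single[p] += 1
    for p1, p2 in combinations(sorted(preds), 2):
        support[(p1, p2)] += 1
        support[(p2, p1)] += 1

def suggest(present_predicates, min_conf=0.6):
    """Rank predicates that frequently co-occur with the ones already used."""
    scores = defaultdict(float)
    for p in present_predicates:
        for (a, b), cnt in support.items():
            if a == p and b not in present_predicates:
                scores[b] = max(scores[b], cnt / single[a])
    return [p for p, c in sorted(scores.items(), key=lambda x: -x[1]) if c >= min_conf]

print(suggest({"rdf:type", "population"}))  # ['country'] -- suggested next predicate
```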

As knowledge bases constantly grow and evolve, another way to improve the usage of RDF data is to improve existing ontologies. Here, we present an association-rule-based approach to reconcile ontology and data. Interlacing different mining configurations, we derive an algorithm to discover synonymously used predicates. Those predicates can be used to expand query results and to support users during query formulation.

We provide a wide range of experiments on real-world datasets for each use case. The experiments and evaluations show the added value of association rule mining for the integration and usability of RDF data and confirm the appropriateness of our mining configuration methodology.

Christoph Böhm - Enriching the Web of Data with Topics and Links (2013)

Link to thesis

This thesis presents novel ideas and research findings for the Web of Data – a global data space spanning many so-called Linked Open Data sources. Linked Open Data adheres to a set of simple principles to allow easy access and reuse of data published on the Web. Linked Open Data is by now an established concept, and many (mostly academic) publishers have adopted the principles, building a powerful web of structured knowledge available to everybody. However, so far, Linked Open Data does not play a significant role among the common Web technologies that currently facilitate a high-standard Web experience.

In this work, we thoroughly discuss the state of the art for Linked Open Data and highlight several shortcomings, some of which we tackle in the main part of this work. First, we propose a novel type of data source meta-information, namely the topics of a dataset. This information could be published with dataset descriptions and support a variety of use cases, such as data source exploration and selection. For topic retrieval, we present an approach coined Annotated Pattern Percolation (APP), which we evaluate with respect to topics extracted from Wikipedia portals.

Second, we contribute to entity linking research by presenting an optimization model for joint entity linking, showing its hardness, and proposing three heuristics implemented in the LINked Data Alignment (LINDA) system. Our first solution exploits multi-core machines, whereas the second and third approaches are designed to run in a distributed shared-nothing environment. We discuss and evaluate the properties of our approaches, leading to recommendations on which algorithm to use in a specific scenario. The distributed algorithms are among the first of their kind, i.e., approaches for joint entity linking in a distributed fashion. Also, we illustrate that we can tackle the entity linking problem at very large scale, with data comprising more than 100 million entity representations from a large number of sources.

Finally, we approach a sub-problem of entity linking, namely the alignment of concepts. We again target a method that looks at the data in its entirety and does not neglect existing relations. Also, this concept alignment method must execute very fast to serve as preprocessing for further computations. Our approach, called Holistic Concept Matching (HCM), achieves the required speed by grouping the input based on a comparison of so-called knowledge representations. Within the groups, we perform complex similarity computations, draw relation conclusions, and detect semantic contradictions. The quality of our result is again evaluated on a large and heterogeneous dataset from the real Web. In summary, this work contributes a set of techniques for enhancing the current state of the Web of Data. All approaches have been tested on large and heterogeneous real-world input.

Jana Bauckmann - Dependency Discovery for Data Integration (2013)

Link to thesis

Data integration aims to combine data from different sources and to provide users with a unified view of these data. This task is as challenging as it is valuable. In this thesis, we propose algorithms for dependency discovery to provide necessary information for data integration. We focus on inclusion dependencies (INDs) in general and a special form named conditional inclusion dependencies (CINDs): (i) INDs enable the discovery of structure in a given schema. (ii) INDs and CINDs support the discovery of cross-references or links between schemas.

An IND “A in B” simply states that all values of attribute A are included in the set of values of attribute B. We propose an algorithm that discovers all inclusion dependencies in a relational data source. The challenge of this task lies in the complexity of testing all attribute pairs and, further, of comparing all values of each attribute pair.

The complexity of existing approaches depends on the number of attribute pairs, while ours depends only on the number of attributes. Thus, our algorithm makes it possible to profile entirely unknown data sources with large schemas by discovering all INDs. Furthermore, we provide an approach to extract foreign keys from the identified INDs.
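
The following Python sketch illustrates, under simplifying assumptions, how IND discovery can be made to depend on the number of attributes rather than attribute pairs: each distinct value is indexed once, and the candidate right-hand sides of every attribute are intersected per value. It is a toy illustration of the general idea, not the thesis's algorithm.

```python
from collections import defaultdict

# Toy relational instance: attribute name -> set of values.
attributes = {
    "orders.customer_id": {1, 2, 3},
    "customers.id":       {1, 2, 3, 4, 5},
    "customers.zip":      {10115, 20095},
}

# Index every distinct value once: which attributes contain it?
value_index = defaultdict(set)
for attr, values in attributes.items():
    for v in values:
        value_index[v].add(attr)

# For each attribute A, candidate right-hand sides start as "all other attributes"
# and are intersected with the attribute set of every value of A.
candidates = {a: set(attributes) - {a} for a in attributes}
for attrs_with_value in value_index.values():
    for a in attrs_with_value:
        candidates[a] &= attrs_with_value

for a, rhs in candidates.items():
    for b in rhs:
        print(f"{a} ⊆ {b}")   # prints: orders.customer_id ⊆ customers.id
```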

We extend our IND discovery algorithm to also find three special types of INDs: (i) composite INDs, such as “AB in CD”, (ii) approximate INDs, which allow a certain number of values of A not to be included in B, and (iii) prefix and suffix INDs, which represent special cross-references between schemas.

Conditional inclusion dependencies are inclusion dependencies with a limited scope defined by conditions over several attributes. Only the matching part of the instance must adhere to the dependency. We generalize the definition of CINDs by distinguishing covering and completeness conditions and define quality measures for conditions. We propose efficient algorithms that identify covering and completeness conditions conforming to given quality thresholds. The challenge of this task is twofold: (i) Which (and how many) attributes should be used for the conditions? (ii) Which attribute values should be chosen for the conditions? Previous approaches rely on pre-selected condition attributes or can only discover conditions that meet a quality threshold of 100%.

Our approaches were motivated by two application domains: data integration in the life sciences and link discovery for linked open data. We show the efficiency and the benefits of our approaches for use cases in these domains.

Dustin Lange - Effective and Efficient Similarity Search in Databases (2013)

Given a large set of records in a database and a query record, similarity search aims to find all records sufficiently similar to the query record. To solve this problem, two main aspects need to be considered: First, to perform effective search, the set of relevant records is defined using a similarity measure. Second, an efficient access method is needed that performs only a few database accesses and comparisons using the similarity measure. This thesis addresses both aspects, with an emphasis on the latter.

In the first part of this thesis, a frequency-aware similarity measure is introduced. Compared record pairs are partitioned according to the frequencies of their attribute values. For each partition, a different similarity measure is created: machine learning techniques combine a set of base similarity measures into an overall similarity measure. After that, a similarity index for string attributes is proposed, the State Set Index (SSI), which is based on a trie (prefix tree) that is interpreted as a nondeterministic finite automaton. For processing range queries, the notion of query plans is introduced in this thesis to describe which similarity indexes to access and which thresholds to apply. The query result should be as complete as possible under some cost threshold. Two query planning variants are introduced: (1) Static planning selects a plan at compile time that is used for all queries. (2) Query-specific planning selects a different plan for each query. For answering top-k queries, the Bulk Sorted Access Algorithm (BSA) is introduced, which retrieves large chunks of records from the similarity indexes using fixed thresholds and focuses its efforts on records that are ranked high in more than one attribute and are thus promising candidates.
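
A minimal sketch of the frequency-aware idea follows, assuming hand-picked partition boundaries and base measures; in the thesis, the per-partition measures are learned with machine-learning techniques, and the data here is illustrative only.

```python
from collections import Counter
from difflib import SequenceMatcher

# Toy attribute values; frequencies and thresholds are illustrative only.
records = ["Mueller", "Mueller", "Mueller", "Müller", "Müller", "Müller", "Schmidt", "Schmitt"]
freq = Counter(records)

def string_sim(a, b):
    return SequenceMatcher(None, a, b).ratio()

def frequency_aware_sim(a, b, threshold=3):
    """Route a record pair to a partition by the frequency of its attribute values
    and apply a partition-specific measure (here hand-written, not learned)."""
    if min(freq[a], freq[b]) >= threshold:
        # Frequent values: agreement is less informative, so judge more strictly.
        return 1.0 if a == b else 0.5 * string_sim(a, b)
    return string_sim(a, b)

print(round(frequency_aware_sim("Schmidt", "Schmitt"), 2))  # rare values: ~0.86
print(round(frequency_aware_sim("Mueller", "Müller"), 2))   # frequent values: ~0.38
```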

The described components form a complete similarity search system. Based on prototypical implementations, this thesis shows comparative evaluation results for all proposed approaches on different real-world data sets, one of which is a large person data set from a German credit rating agency.

Mohammed AbuJarour - Enriched Service Descriptions: Sources, Approaches, and Usages (2011)

Service descriptions play a crucial role in several tasks in Service-oriented Architecture (SOA), e.g., for service discovery, service selection, service composition, etc. However, it has been observed that service providers – the main source of information about web services – typically release poor service descriptions, i.e., mostly technically oriented descriptions. To tackle this lack of rich service descriptions, several approaches have been proposed to enrich poor service descriptions with additional information from other sources, e.g., community annotations, domain experts, etc. In our research, we introduce a novel approach to gather and generate additional information about web services used in collaborative application domains from multiple sources and to integrate it into unified, rich service descriptions. These sources are the websites of service providers, invocation analysis, and the business processes of service consumers.

We have developed a focused crawler that automatically collects public web services from the websites of their providers and registers them in our public service registry, Depot. In addition to web services, textual annotations are extracted from the same crawled web pages and attached to their corresponding web services. This information represents the technical perspective of web services.

With the increasing number and complexity of web services, the importance of instance-level information about web services has increased. Such instance-level information, e.g., examples of input values, formats, etc., is usually not incorporated into service descriptions. We have introduced a set of methods to analyze invocations of data web services and to extract useful instance-level information about them based on these invocations, e.g., tags, valid values for inputs, etc.

Traditionally, service consumers might provide ratings and feedback about web services they use in their business processes. Although this information is useful, additional valuable information can be generated from business processes, e.g., context and consuming tasks. To extract this knowledge about web services from their consuming business processes, we integrate Depot with the open-source online modeling framework Oryx. The information extracted from business processes of service consumers represents the business perspective of the used web services.

The information we gather and generate from the aforementioned sources is used to enrich poor service descriptions. With these enriched service descriptions, several tasks can be enhanced, e.g., service discovery, selection, recommendation, etc. For instance, service discovery can be enhanced by enabling service exploration through automatic categorization and tagging of web services. Moreover, we are able to provide multi-dimensional service exploration by combining the category, provider, and tags of web services and enable additional refinement through keywords for full-text search.

Armin Roth - Efficient Query Answering in Peer Data Management Systems (2010)

Peer data management systems (PDMS) consist of a highly dynamic set of autonomous, heterogeneous peers connected by schema mappings. Queries submitted at a peer are answered with data residing at that peer and by passing the queries to neighboring peers. PDMS are the most general architecture for distributed integrated information systems. With no need for central coordination, PDMS are highly flexible. However, due to the typically massive redundancy in mapping paths, PDMS tend to be very inefficient in computing the complete query result as the number of peers increases. Additionally, information loss accumulates along mapping paths due to selections and projections in the mappings.

Users usually accept concessions on the completeness of query answers in large-scale data sharing settings. Our approach turns completeness into an optimization goal and thus trades off the benefit and cost of query answering. To this end, we propose several strategies that guide peers in deciding to which neighbors rewritten queries should be sent. In effect, the peers prune mappings that are expected to contribute little data. We propose a query optimization strategy that limits resource consumption and show that it can drastically increase efficiency while still yielding satisfying completeness of the query result.
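
As an illustration of trading off benefit and cost, the sketch below greedily forwards a query along the mappings with the best estimated benefit-per-cost ratio until a budget is exhausted. The estimates and the greedy strategy are illustrative stand-ins, not the strategies developed in the thesis.

```python
# Hypothetical neighbor mappings with an estimated result cardinality (benefit)
# and an estimated query-processing cost; all values are illustrative only.
mappings = [
    {"neighbor": "peer_a", "est_tuples": 900, "est_cost": 10},
    {"neighbor": "peer_b", "est_tuples": 120, "est_cost": 8},
    {"neighbor": "peer_c", "est_tuples": 30,  "est_cost": 9},
    {"neighbor": "peer_d", "est_tuples": 500, "est_cost": 6},
]

def select_mappings(mappings, budget):
    """Greedily forward the query along mappings with the best
    benefit/cost ratio until the cost budget is exhausted."""
    chosen, spent = [], 0
    for m in sorted(mappings, key=lambda m: m["est_tuples"] / m["est_cost"], reverse=True):
        if spent + m["est_cost"] <= budget:
            chosen.append(m["neighbor"])
            spent += m["est_cost"]
    return chosen

print(select_mappings(mappings, budget=20))  # ['peer_a', 'peer_d'] -- peers b and c are pruned
```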

To estimate the potential data contribution of mappings, we adopted self-tuning histograms for cardinality estimation. We developed techniques that ensure sufficient query feedback to adapt these statistics to massive changes in a PDMS. Additionally, the histograms can serve to maintain statistics on the data overlap between alternative mapping paths. Building on them, redundant query processing is reduced by avoiding overlapping areas of the multi-dimensional data space.

Falk Brauer - Adaptive Extraktion und Identifikation von Entitäten in Textdaten im Umfeld der Enterprise Search (2010)

Automatic information extraction (IE) from unstructured texts enables entirely new ways of accessing relevant information and analyzing its content, going far beyond previous approaches to keyword-based document search. However, building programs that extract machine-readable data from text still requires the development of domain-specific extraction programs.

Particularly in the area of enterprise search (information retrieval in a corporate environment), where a large number of heterogeneous document types exists, it is often necessary to develop ad-hoc program modules for extracting business-relevant entities, which are then combined with generic modules in monolithic IE systems. This is especially critical because, potentially, an entirely new IE system must be built for every single use case.

This dissertation investigates the efficient development and execution of IE systems in the context of enterprise search, as well as effective methods for exploiting known structured enterprise data for the extraction and identification of business-relevant entities in documents. The foundation of this work is a novel platform for composing IE systems based on a description of the data flow between generic and application-specific IE modules. In particular, the platform supports the development and reuse of generic IE modules and offers greater flexibility and expressiveness than previous methods.

A document-processing technique developed in this dissertation interprets the data exchange between IE modules as data streams and thereby enables extensive parallelization of individual modules. The autonomous execution of the modules leads to a substantial speed-up in processing individual documents and to improved response times, e.g., for extraction services. Previous approaches only consider increasing the average document throughput by executing distributed instances of an IE system.

Information extraction in the context of enterprise search differs from extraction from the World Wide Web (WWW) in that structured reference data, e.g., in the form of company databases or terminologies, is usually available and often also describes the relationships between entities. Entities in the enterprise environment furthermore have particular characteristics: one class of relevant entities follows certain formation rules, which are not always known but can be inferred from known example entities, so that unknown entities of the same type can be extracted. The identifiers of the other class of entities have a more descriptive character. The corresponding descriptions in texts may vary, which often makes the identification of such entities difficult.

To make the development of IE systems more efficient, this dissertation studies a technique that learns effective regular expressions for extracting unknown entities from example entities alone, thereby minimizing the manual effort in such use cases. Various generalization and specialization heuristics detect patterns at different levels of abstraction and thereby balance precision and recall of the extraction. Existing rule-learning approaches in information extraction do not address this setting, as they require an (annotated) document corpus.
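
A minimal Python sketch of learning a regular expression from example entities follows, using a single simple generalization step (letter runs and digit runs become character classes). The thesis's heuristics operate on several abstraction levels and balance precision and recall, which this toy version does not attempt; the example identifiers are hypothetical.

```python
import re

def learn_pattern(examples):
    """Generalize example entities into a regular expression by replacing
    letter runs and digit runs with character classes (one abstraction level)."""
    generalized = set()
    for ex in examples:
        p = re.sub(r"[A-Za-z]+", "[A-Za-z]+", ex)   # generalize letter runs first
        p = re.sub(r"\d+", r"\\d+", p)              # then digit runs
        generalized.add(p)
    return "^(?:" + "|".join(sorted(generalized)) + ")$"

# Hypothetical business identifiers serving as example entities.
examples = ["INV-2017-0042", "INV-2018-1137", "PO-2016-0007"]
pattern = learn_pattern(examples)
print(pattern)                                    # ^(?:[A-Za-z]+-\d+-\d+)$
print(bool(re.match(pattern, "ORD-2019-0815")))   # True: unseen entity of the same type
```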

As a third focus, a method for identifying entities that are predefined by graph-structured reference data is investigated. Techniques are devised that go beyond exact string matching between text and reference data and exploit partial matches and relationships between entities for identification and disambiguation. The approach presented in this work outperforms previous approaches in terms of precision and recall of the identification.

Jens Bleiholder - Data Fusion and Conflict Resolution in Integrated Information Systems (2010)

The development of the Internet in recent years has made it possible and useful to access many different information systems anywhere in the world to obtain information. While there is much research on the integration of heterogeneous information systems, most of it stops short of the actual integration of the available data.

Data fusion is the process of combining multiple records representing the same real-world object into a single, consistent, and clean representation. The problem considered when fusing data is that of handling data conflicts (uncertainties and contradictions) that may exist among the different representations. In preliminary steps, duplicate detection techniques help identify different representations of the same real-world objects, and schema mappings are used to identify common representations if the data originates from different sources.

This thesis first formalizes data fusion as part of a data integration process and describes and categorizes different conflict handling strategies for fusing data. Based on this categorization, three database operators are presented in more detail. First, the minimum union operator and the complement union operator, both of which handle different special cases of uncertainty and implement one specific strategy. They condense the set of multiple representations by removing unnecessary null values: in the case of minimum union, subsumed tuples are removed, as a subsumed tuple does not contain any new information; in the case of complement union, two tuples are combined into one if their attribute values complement each other, i.e., if for each attribute the values coincide or there is a value in only one of the tuples. The result of both operators contains fewer null values. We show how the operators can be moved around in query plans. Second, we present the data fusion operator, which can implement many different conflict handling strategies and can be used to actually resolve conflicts, resulting in one single representation for each real-world object. We also show how this operator can be moved around in query plans. For all three operators we give algorithms and show how to implement them as primitives in a database management system.
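
The semantics of subsumption and complementation can be sketched in a few lines of Python (tuples as dicts, None for null). This is a simplified illustration of the operator definitions above, not the database-level implementation described in the thesis.

```python
def subsumes(t, u):
    """t subsumes u if t agrees with u on every non-null value of u
    and t carries strictly more information (tuples as dicts, None = null)."""
    return all(u[k] is None or u[k] == t[k] for k in u) and t != u

def complement(t, u):
    """Return the combined tuple if t and u complement each other: on every
    attribute the values coincide or only one of the two values is non-null."""
    if all(t[k] == u[k] or t[k] is None or u[k] is None for k in t):
        return {k: t[k] if t[k] is not None else u[k] for k in t}
    return None

def minimum_union(tuples):
    """Union that removes subsumed tuples (they carry no extra information)."""
    return [t for t in tuples if not any(subsumes(o, t) for o in tuples)]

r = {"name": "Jane Doe", "phone": "030-123", "city": None}
s = {"name": "Jane Doe", "phone": None,      "city": "Berlin"}
t = {"name": "Jane Doe", "phone": None,      "city": None}

print(minimum_union([r, s, t]))   # t is subsumed by r (and by s) and removed
print(complement(r, s))           # one combined tuple with fewer null values
```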

The feasibility and usefulness of the operators are shown by including and evaluating them in two different integrated information systems developed within our group: the HumMer system is an integrated information system that allows the user to integrate and fuse data from different data sources using a wizard-like interface; the FuSem system allows the user to apply different fusion semantics and their corresponding operators to the data, visualize the differences between the fused results, and finally arrive at the most satisfying result for the task at hand.

The thesis covers the process of data fusion in its entirety: the user chooses one or more strategies to fuse data and expresses the strategies using a relational integrated information system. Then, the system executes the operation in an efficient way and finally lets the user look at the result and explore differences between different fusion techniques/strategies.

Melanie Herschel (b. Weis) - Duplicate Detection in XML Data (2007)

Duplicate detection is the task of detecting multiple representations of the same real-world object, for every object represented in a data source. It is relevant in data cleaning and data integration applications and has been studied extensively for relational data describing a single type of object in a single table. Our research focuses on iterative duplicate detection in XML data. We consider detecting duplicates in multiple types of objects related to each other and devise methods adapted to semi-structured XML data. Relationships between different types of objects either form a hierarchical structure (given by the hierarchical XML structure) or a graph structure (e.g., given by referential constraints).

Iterative duplicate detection requires a similarity measure to compare pairs of object representations, called candidates, based on the descriptive information of a candidate. First, the distinction between candidates and their descriptions is not straightforward in XML, but we show that we can determine these descriptions semi-automatically using heuristics and conditions.

Second, we define a similarity measure that is suited for XML data and that considers relationships between candidates. It takes into account data comparability, data relevance, and data similarity, and it distinguishes between missing and contradictory data. Experimental evaluation shows that our similarity measure outperforms existing similarity measures in terms of effectiveness.
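
To illustrate the distinction between missing and contradictory data, here is a small Python sketch of a description-based similarity in that spirit: only attributes present in both candidates can contradict, while missing values merely reduce the evidence. The weights and the scoring formula are hypothetical, not the measure defined in the thesis.

```python
def description_similarity(a, b):
    """Compare two candidate descriptions (dicts of attribute -> value).
    Shared attributes with equal values count as supporting evidence,
    shared attributes with different values as contradictions, and
    attributes present on only one side are treated as missing data."""
    shared = set(a) & set(b)
    agree = sum(1 for k in shared if a[k] == b[k])
    contradict = len(shared) - agree
    missing = len(set(a) ^ set(b))
    # Contradictions weigh more heavily than missing data (illustrative weights).
    return agree / (agree + 2 * contradict + 0.5 * missing) if shared else 0.0

movie1 = {"title": "Heat", "year": "1995", "director": "Michael Mann"}
movie2 = {"title": "Heat", "year": "1996"}                  # contradiction on year
movie3 = {"title": "Heat", "director": "Michael Mann"}      # only missing data

print(round(description_similarity(movie1, movie2), 2))  # 0.29 -- contradiction hurts
print(round(description_similarity(movie1, movie3), 2))  # 0.8  -- missing data hurts less
```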

To avoid pairwise comparisons and thereby improve efficiency without compromising effectiveness, we propose three comparison strategies: the top-down algorithm is suited for hierarchically related candidates where nesting represents 1:N relationships, whereas the bottom-up algorithm is suited when nested candidates actually exist in M:N relationships in the real world. When candidate relationships form a graph, the Reconsidering Algorithm re-compares candidate pairs to improve effectiveness. Using a comparison order that reduces the number of re-comparisons nevertheless allows efficient and effective duplicate detection.

To scale to large amounts of data, we propose methods that interact with a database to handle the retrieval, classification, and update of iteratively compared candidate pairs. Empirical evaluation shows that these methods scale linearly in time with the number of candidates, the connectivity of the graph, and the fraction of duplicates among all candidates. We are the first to obtain linear behavior for all these parameters. In summary, this thesis presents novel methods for effective, efficient, and scalable duplicate detection in graph data, and in particular in XML data.

Finally, we present XClean, the first system for declarative XML data cleaning, for which we defined operators and a specification language that is compiled to XQuery.

Alexander Bilke - Duplicate-based Matching of Schemas (2006)

Integrating independently developed data sources poses many problems that result from various kinds of heterogeneity. One of the biggest challenges is schema matching: the semi-automatic process of detecting semantic relationships between attributes of heterogeneous schemas.

Various solutions that exploit schema information or extract specific properties from attribute values have been described in the literature. This dissertation presents a novel schema matching algorithm that exploits ‘fuzzy’ duplicates, i.e., different representations of the same real-world entity.

This work describes the DUMAS table matcher, which establishes attribute correspondences between two tables. Finding the duplicates that can then be used for schema matching is a challenging task, because the semantic relationships between the tables are unknown and established duplicate detection techniques therefore cannot be applied. The new problem of duplicate detection between unaligned tables is introduced, together with an algorithm that finds the top-k duplicates. In a subsequent step, the attribute correspondences between the two tables are extracted from these duplicates.
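
The step from detected duplicates to attribute correspondences can be illustrated with a small Python sketch: value similarities are averaged per attribute pair across the duplicate record pairs, and high-scoring pairs become correspondences. The similarity function, the threshold, and the example tables are illustrative simplifications, not the DUMAS algorithm itself.

```python
from itertools import product

def value_sim(a, b):
    """Token-set Jaccard similarity of two attribute values (simplified)."""
    ta = set(str(a).lower().replace(",", " ").replace("-", " ").split())
    tb = set(str(b).lower().replace(",", " ").replace("-", " ").split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def attribute_correspondences(duplicate_pairs, threshold=0.7):
    """Average value similarity per attribute pair over all detected duplicates;
    attribute pairs above the threshold are reported as correspondences."""
    cols1 = duplicate_pairs[0][0].keys()
    cols2 = duplicate_pairs[0][1].keys()
    scores = {}
    for c1, c2 in product(cols1, cols2):
        sims = [value_sim(r1[c1], r2[c2]) for r1, r2 in duplicate_pairs]
        scores[(c1, c2)] = sum(sims) / len(sims)
    return [(c1, c2) for (c1, c2), s in scores.items() if s >= threshold]

# Hypothetical fuzzy duplicates found across two unaligned tables.
duplicates = [
    ({"name": "John Smith", "tel": "030 123"}, {"fullname": "Smith, John", "phone": "030-123"}),
    ({"name": "Mary Major", "tel": "040 987"}, {"fullname": "Major, Mary", "phone": "040-987"}),
]
print(attribute_correspondences(duplicates))
# [('name', 'fullname'), ('tel', 'phone')]
```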

The DUMAS schema matcher extends the duplicate-based matching approach to complex schemas consisting of multiple tables. Finding correspondences between complex schemas raises new problems that do not occur for single tables, so the DUMAS table matcher cannot be applied directly. Instead, heuristics are used to decide whether a matching between two tables can be trusted. Based on this, an algorithm is developed that finds attribute correspondences between complex schemas.

The two algorithms described so far are limited to simple (1:1) correspondences. Because complex (1:n or m:n) correspondences occur in practice, the DUMAS complex matcher was developed. This matcher takes the result of the DUMAS table matcher and improves it by combining individual attributes, thereby forming complex correspondences. Because the space of possible complex matchings is very large, heuristics were developed to limit the number of attribute combinations that need to be considered.