Prof. Dr. Felix Naumann

Hazar Harmouch - Single-column Data Profiling (2020)

Link to thesis​​​​​​

The research area of ​​data profiling consists of a large set of methods and processes to examine a given dataset and determine metadata about it. Typically, different data profiling tasks address different kinds of metadata, comprising either various statistics about individual columns (Single-column Analysis) or relationships among them (Dependency Discovery). Among the basic statistics about a column are data type, header, the number of unique values ​​(the column's cardinality), maximum and minimum values, the number of NULL values, and the value distribution. Dependencies involve, for instance, functional dependencies (FDs), inclusion dependencies (INDs), and their approximate versions.

Data profiling has a wide range of conventional use cases, namely data exploration, cleaning, and integration. The produced metadata is also useful for database management and schema reverse engineering. Data profiling has also more novel use cases, such as big data analytics. The generated metadata describes the structure of the data at hand, how to import it, what it is about, and how much of it there is. Thus, data profiling can be considered as an important preparatory task for many data analysis and mining scenarios to assess which data might be useful and to reveal and understand a new dataset's characteristics.

In this thesis, the main focus is on the single-column analysis class of data profiling tasks. We study the impact and the extraction of three of the most important metadata about a column, namely the cardinality, the header, and the number of NULL values.

First, we present a detailed experimental study of twelve cardinality estimation algorithms. We classify the algorithms and analyse their efficiency, scaling far beyond the original experiments and testing theoretical guarantees. Our results highlight their trade-offs and point out the possibility to create a parallel or a distributed version of these algorithms to cope with the growing size of modern datasets.

Then, we present a fully automated, multi-phase system to discover human-understandable, representative, and consistent headers for a target table in cases where headers are missing, meaningless, or unrepresentative for the column values. Our evaluation on Wikipedia tables shows that 60% of the automatically discovered schemata are exact and complete. Considering more schema candidates, top-5 for example, increases this percentage to 72%.

Finally, we formally and experimentally show the ghost and fake FDs phenomenon caused by FD discovery over datasets with missing values. We propose two efficient scores, probabilistic and likelihood-based, for estimating the genuineness of a discovered FD. Our extensive set of experiments on real-world and semi-synthetic datasets show the effectiveness and efficiency of these scores.

Sebastian Kruse - Scalable Data Profiling (2018)

Link to thesis

Data profiling is the act of extracting structural metadata from datasets. Structural metadata, such as data dependencies and statistics, can support data management operations, such as data integration and data cleaning. Data management often is the most time-consuming activity in any data- related project. Its support is extremely valuable in our data-driven world, so that more time can be spent on the actual utilization of the data, e.g., building analytical models. In most scenarios, however, structural metadata is not given and must be extracted first. Therefore, efficient data profiling methods are highly desirable.

Data profiling is a computationally expensive problem; in fact, most depen- dency discovery problems entail search spaces that grow exponentially in the number of attributes. To this end, this thesis introduces novel discovery algorithms for various types of data dependencies – namely inclusion dependencies, conditional inclusion dependencies, partial functional dependencies, and partial unique column combinations – that considerably improve over state-of-the-art algorithms in terms of efficiency and that scale to datasets that cannot be processed by existing algorithms. The key to those improvements are not only algorithmic innovations, such as novel pruning rules or traversal strategies, but also algorithm designs tailored for distributed execution. While distributed data profiling has been mostly neglected by previous works, it is a logical consequence on the face of recent hardware trends and the computational hardness of dependency discovery.

To demonstrate the utility of data profiling for data management, this thesis furthermore presents Metacrate, a database for structural metadata. Its salient features are its flexible data model, the capability to integrate various kinds of structural metadata, and its rich metadata analytics library. We show how to perform a data anamnesis of unknown, complex datasets based on this technology. In particular, we describe in detail how to reconstruct the schemata and assess their quality as part of the data anamnesis.

The data profiling algorithms and Metacrate have been carefully implemented, integrated with the Metanome data profiling tool, and are available as free software. In that way, we intend to allow for easy repeatability of our research results and also provide them for actual usage in real-world data-related projects.

Zhe Zuo - From unstructured to structured: Context-based Named Entity Mining from Text (2018)

Link to thesis

With recent advances in the area of information extraction, automatically extracting structured information from a vast amount of unstructured textual data becomes an important task, which is infeasible for humans to capture all information manually. Named entities (e.g., persons, organizations, and locations), which are crucial compo- nents in texts, are usually the subjects of structured information from textual documents. Therefore, the task of named entity mining receives much attention. It consists of three major subtasks, which are named entity recognition, named entity linking, and relation extraction.

These three tasks build up an entire pipeline of a named entity mining system, where each of them has its challenges and can be employed for further applications. As a fundamental task in the natural language processing domain, studies on named entity recognition have a long history, and many existing approaches produce reliable results. The task is aiming to extract mentions of named entities in text and identify their types. Named entity linking recently received much attention with the development of knowledge bases that contain rich information about entities. The goal is to disambiguate mentions of named entities and to link them to the corresponding entries in a knowledge base. Relation extraction, as the final step of named entity mining, is a highly challenging task, which is to extract semantic relations between named entities, e.g., the ownership relation between two companies.

In this thesis, we review the state-of-the-art of named entity mining domain in detail, including valuable features, techniques, evaluation methodologies, and so on. Furthermore, we present two of our approaches that focus on the named entity linking and relation extraction tasks separately.

To solve the named entity linking task, we propose the entity linking technique, BEL, which operates on a textual range of relevant terms and aggregates decisions from an ensemble of simple classifiers. Each of the classifiers operates on a randomly sampled subset of the above range. In extensive experiments on hand-labeled and benchmark datasets, our approach outperformed state-of-the-art entity linking techniques, both in terms of quality and efficiency.

For the task of relation extraction, we focus on extracting a specific group of difficult relation types, business relations between companies. These relations can be used to gain valuable insight into the interactions between companies and perform complex analytics, such as predicting risk or valuating companies. Our semi-supervised strategy can extract business relations between companies based on only a few user-provided seed company pairs. By doing so, we also provide a solution for the problem of determining the direction of asymmetric relations, such as the ownership of relation. We improve the reliability of the extraction process by using a holistic pattern identification method, which classifies the generated extraction patterns. Our experiments show that we can accurately and reliably extract new entity pairs occurring in the target relation by using as few as five labeled seed pairs.

Toni Grütze - Adding Value to Text with User-Generated Content (2018)

Link to thesis

In recent years, the ever-growing amount of documents on the Web as well as in closed systems for private or business contexts led to a considerable increase of valuable textual information about topics, events, and entities. It is a truism that the majority of information (i.e., business-relevant data) is only available in unstructured textual form. The text mining research field comprises various practice areas that have the common goal of harvesting high-quality information from textual data. These information help addressing users’ information needs.

In this thesis, we utilize the knowledge represented in user-generated content (ugc) originating from various social media services to improve text mining results. These social media platforms provide a plethora of information with varying focuses. In many cases, an essential feature of such platforms is to share relevant content with a peer group. Thus, the data exchanged in these communities tend to be focused on the interests of the user base. The popularity of social media services is growing continuously and the inherent knowledge is available to be utilized. We show that this knowledge can be used for three different tasks.

Initially, we demonstrate that when searching persons with ambiguous names, the information from Wikipedia can be bootstrapped to group web search results according to the individuals occurring in the documents. We introduce two models and different means to handle persons missing in the ugc source. We show that the proposed approaches outperform traditional algorithms for search result clustering. Secondly, we discuss how the categorization of texts according to continuously changing community-generated folksonomies helps users to identify new information related to their interests. We specifically target temporal changes in the ugc and show how they influence the quality of different tag recommendation approaches. Finally, we introduce an algorithm to attempt the entity linking problem, a necessity for harvesting entity knowledge from large text collections. The goal is the linkage of mentions within the documents with their real-world entities. A major focus lies on the efficient derivation of coherent links.

For each of the contributions, we provide a wide range of experiments on various text corpora as well as different sources of ugc. The evaluation shows the added value that the usage of these sources provides and confirms the appropriateness of leveraging user-generated content to serve different information needs.

Tobias Zieger (geb. Vogel) - Self-Adaptive Data Quality: Automating Duplicate Detection (2018)

Link to thesis

Carrying out business processes successfully is closely linked to the quality of the data inventory in an organization. Lacks in data quality lead to problems: Incorrect address data prevents (timely) shipments to customers. Erroneous orders lead to returns and thus to unnecessary effort. Wrong pricing forces companies to miss out on revenues or to impair customer satisfaction. If orders or customer records cannot be retrieved, complaint management takes longer. Due to erroneous inventories, too few or too much supplies might be reordered.

A special problem with data quality and the reason for many of the issues mentioned above are duplicates in databases. Duplicates are different representations of same real-world objects in a dataset. However, these representations differ from each other and are for that reason hard to match by a computer. Moreover, the number of required comparisons to find those duplicates grows with the square of the dataset size. To cleanse the data, these duplicates must be detected and removed. Duplicate detection is a very laborious process. To achieve satisfactory results, appropriate software must be created and configured (similarity measures, partitioning keys, thresholds, etc.). Both requires much manual effort and experience. This thesis addresses automation of parameter selection for duplicate detection and presents several novel approaches that eliminate the need for human experience in parts of the duplicate detection process.

A pre-processing step is introduced that analyzes the datasets in question and classifies their attributes semantically. Not only do these annotations help understanding the respective datasets, but they also facilitate subsequent steps, for example, by selecting appropriate similarity measures or normalizing the data upfront. This approach works without schema information.

Following that, we show a partitioning technique that strongly reduces the number of pair comparisons for the duplicate detection process. The approach automatically finds particularly suitable partitioning keys that simultaneously allow for effective and efficient duplicate retrieval. By means of a user study, we demonstrate that this technique finds partitioning keys that outperform expert suggestions and additionally does not need manual configuration. Furthermore, this approach can be applied independently of the attribute types.

To measure the success of a duplicate detection process and to execute the described partitioning approach, a gold standard is required that provides information about the actual duplicates in a training dataset. This thesis presents a technique that uses existing duplicate detection results and crowdsourcing to create a near gold standard that can be used for the purposes above. Another part of the thesis describes and evaluates strategies how to reduce these crowdsourcing costs and to achieve a consensus with less effort.

Thorsten Papenbrock - Data Profiling - Efficient Discovery of Dependencies (2017)

Link to thesis

Data profiling is the computer science discipline of analyzing a given dataset for its metadata. The types of metadata range from basic statistics, such as tuple counts, column aggregations, and value distributions, to much more complex structures, in particular inclusion dependencies (INDs), unique column combinations (UCCs), and functional dependencies (FDs). If present, these statistics and structures serve to efficiently store, query, change, and understand the data. Most datasets, however, do not provide their metadata explicitly so that data scientists need to profile them.

While basic statistics are relatively easy to calculate, more complex structures present difficult, mostly NP-complete discovery tasks; even with good domain knowledge, it is hardly possible to detect them manually. Therefore, various profiling algorithms have been developed to automate the discovery. None of them, however, can process datasets of typical real-world size, because their resource consumptions and/or execution times exceed effective limits.

In this thesis, we propose novel profiling algorithms that automatically discover the three most popular types of complex metadata, namely INDs, UCCs, and FDs, which all describe different kinds of key dependencies. The task is to extract all valid occurrences from a given relational instance. The three algorithms build upon known techniques from related work and complement them with algorithmic paradigms, such as divide & conquer, hybrid search, progressivity, memory sensitivity, parallelization, and additional pruning to greatly improve upon current limitations. Our experiments show that the proposed algorithms are orders of magnitude faster than related work. They are, in particular, now able to process datasets of real-world, i.e., multiple gigabytes size with reasonable memory and time consumption.

Due to the importance of data profiling in practice, industry has built various profiling tools to support data scientists in their quest for metadata. These tools provide good support for basic statistics and they are also able to validate individual dependencies, but they lack real discovery features even though some fundamental discovery techniques are known for more than 15 years. To close this gap, we developed Metanome, an extensible profiling platform that incorporates not only our own algorithms but also many further algorithms from other researchers. With Metanome, we make our research accessible to all data scientists and IT-professionals that are tasked with data profiling. Besides the actual metadata discovery, the platform also offers support for the ranking and visualization of metadata result sets.Being able to discover the entire set of syntactically valid metadata naturally introduces the subsequent task of extracting only the semantically meaningful parts. This is challenge, because the complete metadata results are surprisingly large (sometimes larger than the datasets itself) and judging their use case dependent semantic relevance is difficult. To show that the completeness of these metadata sets is extremely valuable for their usage, we finally exemplify the efficient processing and effective assessment of functional dependencies for the use case of schema normalization.

Arvid Heise – Data Cleansing and Integration Operators for a Parallel Data Analytics Platform (2015)

Link to thesis

The data quality of real-world datasets need to be constantly monitored and maintained to allow organizations and individuals to reliably use their data. Especially, data integration projects suffer from poor initial data quality and as a consequence consume more effort and money. Commercial products and research prototypes for data cleansing and integration help users to improve the quality of individual and combined datasets. They can be divided into either standalone systems or database management system (DBMS) extensions. On the one hand, standalone systems do not interact well with DBMS and require time-consuming data imports and exports. On the other hand, DBMS extensions are often limited by the underlying system and do not cover the full set of data cleansing and integration tasks.

We overcome both limitations by implementing a concise set of five data cleansing and integration operators on the parallel data analytics platform Stratosphere. We define the semantics of the operators, present their parallel implementation, and devise optimization techniques for individual operators and combinations thereof. Users specify declarative queries in our query language METEOR with our new operators to improve the data quality of individual datasets or integrate them to larger datasets. By integrating the data cleansing operators into the higher level language layer of Stratosphere, users can easily combine cleansing operators with operators from other domains, such as information extraction, to complex data flows. Through a generic description of the operators, the Stratosphere optimizer reorders operators even from different domains to find better query plans.

As a case study, we reimplemented a part of the large Open Government Data integration project GovWILD with our new operators and show that our queries run significantly faster than the original GovWILD queries, which rely on relational operators. Evaluation reveals that our operators exhibit good scalability on up to 100 cores, so that even larger inputs can be efficiently processed by scaling out to more machines. Finally, our scripts are considerably shorter than the original GovWILD scripts, which results in better maintainability of the scripts.

Alexander Albrecht - Understanding and Managing Extract-Transform-Load Systems (2014)

Extract-Transform-Load (ETL) systems are visual programming tools that allow the definition of complex workflows to extract, transform, and load heterogeneous data from one or more sources into target databases by placing, specifying, and connecting sources, transformations, and targets. ETL workflows are stored in repositories to be executed periodically, e.g., every hour, daily, or once a week. In the course of a complex data warehousing project up to several hundred ETL workflows are created by different individuals and stored in such repositories. Moreover, over time the created ETL workflows often become larger, more complex and thus difficult to comprehend for developers.

In this thesis we address the problem of ETL workflow comprehension in the context of managing large repositories of ETL workflows. The main contributions of this thesis include the implementation of a generic ETL management framework and the development of novel techniques for ETL workflow comprehension, such as schema decryption and ETL workflow abstraction. Our schema decryption method supports ETL developers in understanding cryptic schema labels of sources, targets, and ETL transformations. For a given ETL system, our recommender-like approach leverages the large number of mapped attribute labels in existing ETL workflows to produce good and meaningful decryptions. Our ETL workflow abstraction approach provides a simplified, easier to read ETL workflow by partitioning all transformations in an ETL workflow into a few informative sub-workflows. Furthermore, our approach supports an interactive zoom-in/out capability allowing an ETL developer to identify and focus only on relevant parts of a given ETL workflow.

Johannes Lorey - What’s in a Query: Analyzing, Predicting, and Managing Linked Data Access (2014)

Link to thesis

The term Linked Data refers to connected information sources comprising structured data about a wide range of topics and for a multitude of applications. In recent years, the conceptional and technical foundations of Linked Data have been formalized and refined. To this end, well-known technologies have been established, such as the Resource Description Framework (RDF) as a Linked Data model or the SPARQL Protocol and RDF Query Language (Sparql) for retrieving this information.

Whereas most research has been conducted in the area of generating and publishing Linked Data, this thesis presents novel approaches for improved management. In particular, we illustrate new methods for analyzing and processing Sparql queries. Here, we present two algorithms suitable for identifying structural relationships between these queries. Both algorithms are applied to a large number of real-world requests to evaluate the performance of the approaches and the quality of their results. Based on this, we introduce different strategies enabling optimized access of Linked Data sources. We demonstrate how the presented approach facilitates effective utilization of Sparql endpoints by prefetching results relevant for multiple subsequent requests.

Furthermore, we contribute a set of metrics for determining technical characteristics of such knowledge bases. To this end, we devise practical heuristics and validate them through thorough analysis of real-world data sources. We discuss the findings and evaluate their impact on utilizing the endpoints. Moreover, we detail the adoption of a scalable infrastructure for improving Linked Data discovery and consumption. As we outline in an exemplary use case, this platform is eligible both for processing and provisioning the corresponding information.

Ziawasch Abedjan - Improving RDF Data with Data Mining (2014)

Gewinner des Dissertationspreises der Universitätsgesellschaft für die beste Dissertation an der Universität Potsdam in 2014

Link to thesis

Linked Open Data (LOD) comprises very many and often large public data sets and knowledge bases. Those datasets are mostly presented in the RDF triple structure of subject, predicate, and object, where each triple represents a statement or fact. Unfortunately, the heterogeneity of available open data requires significant integration steps before it can be used in applications. Meta information, such as ontological definitions and exact range definitions of predicates, are desirable and ideally provided by an ontology. However in the context of LOD, ontologies are often incomplete or simply not available. Thus, it is useful to automatically generate meta information, such as ontological dependencies, range definitions, and topical classifications.

Association rule mining, which was originally applied for sales analysis on transactional databases, is a promising and novel technique to explore such data. We designed an adaptation of this technique for mining RDF data and introduce the concept of “mining configurations”, which allows us to mine Rdf data sets in various ways. Different configurations enable us to identify schema and value dependencies that in combination result in interesting use cases. To this end, we present rule-based approaches for auto-completion, data enrichment, ontology improvement, and query relaxation. Auto-completion remedies the problem of inconsistent ontology usage, providing an editing user with a sorted list of commonly used predicates. A combination of different configurations step extends this approach to create completely new facts for a knowledge base. We present two approaches for fact generation, a user-based approach where a user selects the entity to be amended with new facts and a data-driven approach where an algorithm discovers entities that have to be amended with missing facts.

As knowledge bases constantly grow and evolve, another approach to improve the usage of RDF data is to improve existing ontologies. Here, we present an association rule based approach to reconcile ontology and data. Interlacing different mining configurations, we infer an algorithm to discover synonymously used predicates. Those predicates can be used to expand query results and to support users during query formulation.

We provide a wide range of experiments on real world datasets for each use case. The experiments and evaluations show the added value of association rule mining for the integration and usability of Rdf data and confirm the appropriateness of our mining configuration methodology.

Christoph Böhm - Enriching the Web of Data with Topics and Links (2013)

Link to thesis

This thesis presents novel ideas and research findings for the Web of Data – a global data space spanning many so-called Linked Open Data sources. Linked Open Data adheres to a set of simple principles to allow easy access and reuse for data published on the Web. Linked Open Data is by now an established concept and many (mostly academic) publishers adopted the principles building a powerful web of structured knowledge available to everybody. However, so far, Linked Open Data does not yet play a significant role among common Web technologies that currently facilitate a high-standard Web experience.

In this work, we thoroughly discuss the state-of-the-art for Linked Open Data and highlight several shortcomings – some of them we tackle in the main part of this work. First, we propose a novel type of data source meta-information, namely the topics of a dataset. This information could be published with dataset descriptions and support a variety of use cases, such as data source exploration and selection. For the topic retrieval, we present an approach coined Annotated Pattern Percolation (APP), which we evaluate with respect to topics extracted from Wikipedia portals.

Second, we contribute to entity linking research by presenting an optimization model for joint entity linking, showing its hardness, and proposing three heuristics implemented in the LINked Data Alignment (LINDA) system. Our first solution can exploit multi-core machines, whereas the second and third approach are designed to run in a distributed shared-nothing environment. We discuss and evaluate the properties of our approaches leading to recommendations which algorithm to use in a specific scenario. The distributed algorithms are among the first of their kind, i.e., approaches for joint entity linking in a distributed fashion. Also, we illustrate that we can tackle the entity linking problem on the very large scale with data comprising more than 100 millions of entity representations from very many sources.

Finally, we approach a sub-problem of entity linking, namely the alignment of concepts. We again target a method that looks at the data in its entirety and does not neglect existing relations. Also, this concept alignment method shall execute very fast to serve as a preprocessing for further computations. Our approach, called Holistic Concept Matching (HCM), achieves the required speed through grouping the input by comparing so-called knowledge representations. Within the groups, we perform complex similarity computations, relation conclusions, and detect semantic contradictions. The quality of our result is again evaluated on a large and heterogeneous dataset from the real Web. In summary, this work contributes a set of techniques for enhancing the current state of the Web of Data. All approaches have been tested on large and heterogeneous real-world input.

Jana Bauckmann - Dependency discovery for data integration (2013)

Link to thesis

Data integration aims to combine data of different sources and to provide users with a unified view on these data. This task is as challenging as valuable. In this thesis we propose algorithms for dependency discovery to provide necessary information for data integration. We focus on inclusion dependencies (INDs) in general and a special form named conditional inclusion dependencies (CINDs): (i) INDs enable the discovery of structure in a given schema. (ii) INDs and CINDs support the discovery of cross-references or links between schemas.

An IND “A in B” simply states that all values of attribute A are included in the set of values of attribute B. We propose an algorithm that discovers all inclusion dependencies in a relational data source. The challenge of this task is the complexity of testing all attribute pairs and further of comparing all of each attribute pair's values.

The complexity of existing approaches depends on the number of attribute pairs, while ours depends only on the number of attributes. Thus, our algorithm enables to profile entirely unknown data sources with large schemas by discovering all INDs. Further, we provide an approach to extract foreign keys from the identified INDs.

We extend our IND discovery algorithm to also find three special types of INDs: (i) Composite INDs, such as “AB in CD”, (ii) approximate INDs that allow a certain amount of values of A to be not included in B, and (iii) prefix and suffix INDs that represent special cross-references between schemas.

Conditional inclusion dependencies are inclusion dependencies with a limited scope defined by conditions over several attributes. Only the matching part of the instance must adhere the dependency. We generalize the definition of CINDs distinguishing covering and completeness conditions and define quality measures for conditions. We propose efficient algorithms that identify covering and completeness conditions conforming to given quality thresholds. The challenge for this task is twofold: (i) Which (and how many) attributes should be used for the conditions? (ii) Which attribute values should be chosen for the conditions? Previous approaches rely on pre-selected condition attributes or can only discover conditions applying to quality thresholds of 100%.

Our approaches were motivated by two application domains: data integration in the life sciences and link discovery for linked open data. We show the efficiency and the benefits of our approaches for use cases in these domains.

Dustin Lange - Effective and efficient similarity search in databases (2013)

Given a large set of records in a database and a query record, similarity search aims to find all records sufficiently similar to the query record. To solve this problem, two main aspects need to be considered: First, to perform effective search, the set of relevant records is defined using a similarity measure. Second, an efficient access method is to be found that performs only few database accesses and comparisons using the similarity measure. This thesis solves both aspects with an emphasis on the latter.

In the first part of this thesis, a frequency-aware similarity measure is introduced. Compared record pairs are partitioned according to frequencies of attribute values. For each partition, a different similarity measure is created: machine learning techniques combine a set of base similarity measures into an overall similarity measure. After that, a similarity index for string attributes is proposed, the State Set Index (SSI), which is based on a trie (prefix tree) that is interpreted as a nondeterministic finite automaton. For processing range queries, the notion of query plans is introduced in this thesis to describe which similarity indexes to access and which thresholds to apply. The query result should be as complete as possible under some cost threshold. Two query planning variants are introduced: (1) Static planning selects a plan at compile time that is used for all queries. (2) Query-specific planning selects a different plan for each query. For answering top-k queries, the Bulk Sorted Access Algorithm (BSA) is introduced, which retrieves large chunks of records from the similarity indexes using fixed thresholds, and which focuses its efforts on records that are ranked high in more than one attribute and thus promising candidates.

The described components form a complete similarity search system. Based on prototypical implementations, this thesis shows comparative evaluation results for all proposed approaches on different real-world data sets, one of which is a large person data set from a German credit rating agency.

Mohammed AbuJarour - Enriched Service Descriptions: Sources, Approaches, and Usages (2011)

Service descriptions play a crucial role in several tasks in Service-oriented Architecture (SOA), e.g., for service discovery, service selection, service composition, etc. However, it has been observed that service providers – the main source of information about web services – typically release poor service descriptions, i.e., mostly technical-oriented descriptions. To tackle this lack of rich service descriptions, several approaches have been proposed to enrich poor service descriptions with additional information from other sources, e.g., community annotations, domain experts, etc. In our research, we introduce a novel approach to gather and generate additional information about web services used in collaborative application domains from multiple sources and integrate them in unified, rich service descriptions. These sources are: Websites of service providers, invocation analysis, and business processes of service consumers.

We have developed a focused crawler that automatically collects public web services from the websites of their providers and registers them in our public service registry, Depot. In addition to web services, textual annotations are extracted from the same crawled web pages and attached to their corresponding web services. This information represents the technical perspective of web services.

With the increasing number and complexity of web services, the importance of instance-level information about web services has increased. Such instance-level information, e.g., examples of input values, formats, etc., are usually not incorporated in service descriptions. We have introduced a set of methods to analyze invocations of data web services and extract useful instance-level information about them based on their invocations, e.g., tags, valid values for inputs, etc.

Traditionally, service consumers might provide ratings and feedback about web services they use in their business processes. Although this information is useful, additional valuable information can be generated from business processes, e.g., context and consuming tasks. To extract this knowledge about web services from their consuming business processes, we integrate Depot with the open-source online modeling framework Oryx. The information extracted from business processes of service consumers represents the business perspective of the used web services.

The information we gather and generate from the aforementioned sources is used to enrich poor service descriptions. With these enriched service descriptions, several tasks can be enhanced, e.g., service discovery, selection, recommendation, etc. For instance, service discovery can be enhanced by enabling service exploration through automatic categorization and tagging of web services. Moreover, we are able to provide multi-dimensional service exploration by combining the category, provider, and tags of web services and enable additional refinement through keywords for full-text search.

Armin Roth - Efficient Query Answering in Peer Data Management Systems (2010)

Peer data management systems (PDMS) consist of a highly dynamic set of autonomous, heterogeneous peers connected with schema mappings. Queries submitted at a peer are answered with data residing at that peer and by passing the queries to neighboring peers. Pdms are the most general architecture for distributed integrated information systems. With no need for central coordination, PDMS are highly flexible. However, due to the typical massive redundancy in mapping paths, PDMS tend to be very inefficient in computing the complete query result as the number of peers increases. Additionally, information loss is cumulated along mapping paths due to selections and projections in the mappings.

Users usually accept concessions on the completeness of query answers in large-scale data sharing settings. Our approach turns completeness into an optimization goal and thus trades off benefit and cost of query answering. To this end, we propose several strategies that guide peers in their decision to which neighbors rewritten queries should be sent. In effect, the peers prune mappings that are expected to contribute few data. We propose a query optimization strategy that limits resource consumption and show that it can drastically increase efficiency while still yielding satisfying completeness of the query result.

To estimate the potential data contribution of mappings, we adopted self-tuning histograms for cardinality estimation. We developed techniques that ensure sufficient query feedback to adapt these statistics to massive changes in a PDMS. Additionally, histograms can serve to maintain statistics on data overlap between alternative mapping paths. Building on them, redundant query processing is reduced by avoiding overlapping areas of the multi-dimensional data space

Falk Brauer - Adaptive Extraktion und Identifikation von Entitäten in Textdaten im Umfeld der Enterprise Search (2010)

Die automatische Informationsextraktion (IE) aus unstrukturierten Texten ermöglicht völlig neue Wege, auf relevante Informationen zuzugreifen und deren Inhalte zu analysieren, die weit über bisherige Verfahren zur Stichwort-basierten Dokumentsuche hinausgehen. Die Entwicklung von Programmen zur Extraktion von maschinenlesbaren Daten aus Texten erfordert jedoch nach wie vor die Entwicklung von domänenspezifischen Extraktionsprogrammen.

Insbesondere im Bereich der Enterprise Search (der Informationssuche im Unternehmensumfeld), in dem eine große Menge von heterogenen Dokumenttypen existiert, ist es oft notwendig, ad-hoc Programmmodule zur Extraktion von geschäftsrelevanten Entitäten zu entwickeln, die mit generischen Modulen in monolithischen IE-Systemen kombiniert werden. Dieser Umstand ist insbesondere kritisch, da potenziell für jeden einzelnen Anwendungsfall ein von Grund auf neues IE-System entwickelt werden muss.

Die vorliegende Dissertation untersucht die effiziente Entwicklung und Ausführung von IE-Systemen im Kontext der Enterprise Search und effektive Methoden zur Ausnutzung bekannter strukturierter Daten im Unternehmenskontext für die Extraktion und Identifikation von geschäftsrelevanten Entitäten in Dokumenten. Grundlage der Arbeit ist eine neuartige Plattform zur Komposition von IE-Systemen auf Basis der Beschreibung des Datenflusses zwischen generischen und anwendungsspezifischen IE-Modulen. Die Plattform unterstützt insbesondere die Entwicklung und Wiederverwendung von generischen IE-Modulen und zeichnet sich durch eine höhere Flexibilität und Ausdrucksmächtigkeit im Vergleich zu vorherigen Methoden aus.

Ein in der Dissertation entwickeltes Verfahren zur Dokumentverarbeitung interpretiert den Datenaustausch zwischen IE-Modulen als Datenströme und ermöglicht damit die weitgehende Parallelisierung von einzelnen Modulen. Die autonome Ausführung der Module führt zu einer wesentlichen Beschleunigung der Verarbeitung von Einzeldokumenten und verbesserten Antwortzeiten, z.B. für Extraktionsdienste. Bisherige Ansätze untersuchen lediglich die Steigerung des durchschnittlichen Dokumentendurchsatzes durch verteilte Ausführung von Instanzen eines IE-Systems.

Die Informationsextraktion im Kontext der Enterprise Search unterscheidet sich z.B. von der Extraktion aus dem World Wide Web (WWW) dadurch, dass in der Regel strukturierte Referenzdaten, z.B. in Form von Unternehmensdatenbanken oder Terminologien zur Verfügung stehen, die oft auch die Beziehungen von Entitäten beschreiben. Entitäten im Unternehmensumfeld haben weiterhin bestimmte Charakteristiken: Eine Klasse von relevanten Entitäten folgt bestimmten Bildungsvorschriften, die nicht immer bekannt sind, auf die aber mit Hilfe von bekannten Beispielentitäten geschlossen werden kann, so dass unbekannte Entitäten desselben Typs extrahiert werden können. Die Bezeichner der anderen Klasse von Entitäten haben eher umschreibenden Charakter. Die korrespondierenden Umschreibungen in Texten können variieren, wodurch eine Identifikation derartiger Entitäten oft erschwert wird.

Zur effizienteren Entwicklung von IE-Systemen wird in der Dissertation ein Verfahren untersucht, das alleine anhand von Beispielentitäten effektive Reguläre Ausdrücke zur Extraktion von unbekannten Entitäten erlernt und damit den manuellen Aufwand in derartigen Anwendungsfällen minimiert. Verschiedene Generalisierungs- und Spezialisierungsheuristiken erkennen Muster auf verschiedenen Abstraktionsebenen und schaffen dadurch einen Ausgleich zwischen Genauigkeit und Vollständigkeit bei der Extraktion. Bekannte Regellernverfahren im Bereich der Informationsextraktion unterstützen die beschriebenen Problemstellungen nicht, sondern benötigen einen (annotierten) Dokumentenkorpus.

Eine Methode zur Identifikation von Entitäten, die durch Graph-strukturierte Referenzdaten vordefiniert sind, wird als dritter Schwerpunkt untersucht. Es werden Verfahren konzipiert, welche über einen exakten Zeichenkettenvergleich zwischen Text und Referenzdatensatz hinausgehen und Teilübereinstimmungen und Beziehungen zwischen Entitäten zur Identifikation und Disambiguierung heranziehen. Das in der Arbeit vorgestellte Verfahren ist bisherigen Ansätzen hinsichtlich der Genauigkeit und Vollständigkeit bei der Identifikation überlegen.

Jens Bleiholder - Data Fusion and Conflict Resolution in Integrated Information Systems (2010)

The development of the Internet in recent years has made it possible and useful to access many diferent information systems anywhere in the world to obtain information. While there is much research on the integration of heterogeneous information systems, most stops short of the actual integration of available data.

Data fusion is the process of combining multiple records representing the same real-world object into a single, consistent, and clean representation. The problem that is considered when fusing data is the problem of handling data conficts (uncertainties and contradictions) that may exist among the different representations. In preliminary steps, duplicate detection techniques help in identifying different representations of same real-world objects and schema mappings are used to identify common representations if data originates in diferent sources.

This thesis first formalizes data fusion as part of a data integration process and describes and categorizes different conflict handling strategies when fusing data. Based on this categorization, three different database operators are presented in more detail: First, the minimum union operator and the complement union operator, which both are able to handle different special cases of uncertainties and implement one specific strategy. They condense the set of multiple representations by removing unnecessary null values. In case of minimum union, subsumed tuples are removed as a subsumed tuple does not contain any new information. In case of complement union, two tuples are combined into one, if the attribute values complement each other, i.e., if for each attribute the values coincide or there is a value in only one of the tuples. The result of both operators contains fewer null values. We show how the operators can be moved around in query plans. Second, we present the data fusion operator, which is able to implement many different conflict handling strategies and can be used to really resolve conflicts, resulting in one single representation for each real-world object. We also show how the operator can be moved around in query plans. For all three operators we give algorithms and show how to implement them as primitives in a database management system.

The feasibility and usefulness of the operators is shown by including and evaluating them in two different integrated information systems developed within our group: the HumMer system is an integrated information system that allows the user to integrate and fuse data from diferent data sources using a wizard-like interface. The FuSem system allows the user to apply different fusion semantics and their corresponding operators to its data, visualize the differences of the fused results and finally come up with the most satisfying result for the task at hand.

The thesis covers the process of data fusion in its entirety: the user chooses one or more strategies to fuse data and expresses the strategies using a relational integrated information system. Then, the system executes the operation in an efficient way and finally lets the user look at the result and explore differences between different fusion techniques/strategies.

Melanie Herschel (b. Weis) - Duplicate Detection in XML Data (2007)

Duplicate detection consists in detecting multiple representations of a same real-world object, and that for every object represented in a data source. Duplicate detection is relevant in data cleaning and data integration applications and has been studied extensively for relational data describing a single type of object in a single table. Our research focuses on iterative duplicate detection in XML data. We consider detecting duplicates in multiple types of objects related to each other and devise methods adapted to semi-structured XML data. Relationships between different types of objects either form a hierarchical structure (given by the hierarchical XML structure) or a graph structure (e.g.,given by referential constraints).

Iterative duplicate detection require a similarity measure to compare pairs of object representations, called candidates, based on descriptive information of a candidate. The distinction between candidates and their description is not straight-forward in XML, but we show that we can semi-automatically determine these descriptions using heuristics and conditions.

Second, we define a similarity measure that is suited for XML data and that considers relationships between candidates. It considers data comparability, data relevance, data similarity, and distinguishes between missing and contradictory data. Experimental evaluation shows that our similarity measure outperforms existing similarity measures in terms of effectiveness.

To avoid pairwise comparisons and thereby improve efficiency, but without compromising effectiveness, we propose three comparison strategies: The top-down algorithm is suited for hierarchically related candidates where nesting represents 1:N relationships, whereas the bottom-up algorithm is suited when nested candidates actually exist in an M:N relationships in the real world. When candidate relationships form a graph, the Reconsidering Algorithm re-compares candidate pairs to improve effectiveness. Using a comparison order that reduces the number of re-comparisons nevertheless allows efficient and effective duplicate detection.

To scale to large amounts of data, we propose methods that interact with a database to handle retrieval, classification, and update of iteratively compared candidate pairs. Empirical evaluation shows that these methods scale linearly in time with the number of candidates, the connectivity of the graph, and the fraction of duplicates among all candidates. We are the first to obtain linear behavior for all these parameters, so, in summary, this thesis presents novel methods for effective, efficient, and scalable duplicate detection in graph, and in particular in XML data.

Finally, we present XClean, the first system for declarative XML data cleaning, for which we defined operators and a specification language that is compiled to XQuery.

Alexander Bilke - Duplicate-based Matching of Schemas (2006)

Die Integration unabhängig voneinander entwickelter Datenquellen stellt uns vor viele Probleme, die das Ergebnis verschiedener Arten von Heterogenität sind. Eine der größten Herausforderungen ist Schema Matching: der halbautomatische Prozess, in dem semantische Beziehungen zwischen Attributen in heterogenen Schemata erkannt werden.

Verschiedene Lösungen, die Schemainformationen ausnutzen oder spezifische Eigenschaften aus Attributwerten extrahieren, wurden in der Literatur beschrieben. In dieser Dissertation wird ein neuartiger Schema-Matching-Algorithmus vorgestellt, welcher ‘unscharfe’ Duplikate, also unterschiedliche Repräsentationen der gleichen Realwelt-Entität, ausnutzt.

In dieser Arbeit wird der DUMAS table matcher, welcher Attributkorrespondenzen zwischen zwei Tabellen herstellt, beschrieben. Das Auffinden der Duplikate, die dann für das Schema Matching benutzt werden können, ist eine herausfordernde Aufgabe, weil die semantischen Beziehungen zwischen den Tabellen nicht bekannt sind und somit bekannte Duplikaterkennungsverfahren nicht angewandt werden können. Das neue Problem der Duplikaterkennung zwischen nicht angeglichenen Tabellen und ein Algorithmus, der die Top-k Duplikate findet, wird beschrieben. Die Attributkorrespondenzen zwischen den beiden Tabellen werden in einem folgenden Schritt aus den Duplikaten extrahiert.

Der DUMAS schema matcher erweitert den duplikat-basierten MatchingAnsatz auf komplexe Schemata, welche aus mehreren Tabellen bestehen. Das Auffinden von Korrespondenzen zwischen komplexen Schemata wirft neue Probleme auf, die bei einzelnen Tabellen nicht auftreten. Somit ist die direkte Anwendung des DUMAS table matcher nicht möglich. Stattdessen werden Heuristiken benutzt, mit deren Hilfe entschieden werden kann, ob einem Matching zwischen zwei Tabellen vertraut werden kann. Basierend darauf wird ein Algorithmus entwickelt, der Attributkorrespondenzen zwischen komplexen Schemata findet.

Die beiden bisher beschriebenen Algorithmen sind auf einfache (1:1) Korrespondenzen beschränkt. Weil komplexe (1:n oder m:n) Korrespondenzen in der Praxis vorkommen, wurde der DUMAS complex matcher entwickelt. Dieser Matcher benutzt das Ergebnis des DUMAS table matcher und verbessert das Ergebnis, indem einzelne Attribute kombiniert werden. Auf diese Weise werden komplexe Korrespondenzen gebildet. Weil der Raum der möglichen komplexen Matchings sehr groß ist, wurden Heuristiken entwickelt, mit deren Hilfe die Anzahl der zu betrachtenden Attributkombinationen eingeschränkt werden.