- Patrick Schulze : Depth-First Discovery of Functional Dependencies
The discovery of functional dependencies (FDs) on relational data sources is part of structural data analysis and an important problem in information systems research. A functional dependency exists in a database table if the values of a column combination uniquely determine the values of another column. Despite the variety of established FD discovery algorithms, no approach is able to efficiently determine the complete set of functional dependencies, especially on large datasets. In this work, we analyse Tane and FastFDs in depth as representatives of two different strategies for FD discovery. Subsequently, the thesis highlights the relationship between unique column combinations and functional dependencies. Based on these insights, the principles of DUCC, the latest and most efficient algorithm for the discovery of unique column combinations, are used to design a novel FD discovery approach. This thesis reveals that on most real-world datasets, this approach, called DFD, performs better than Tane and FastFDs.
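The core operation underlying all of these algorithms, checking whether a candidate FD holds on a table, can be sketched in a few lines (a naive illustration with hypothetical names; Tane and DFD rely on far more efficient partition-based validation):

```python
def holds_fd(rows, lhs, rhs):
    """Check whether the columns in `lhs` functionally determine `rhs`.

    `rows` is a list of dicts mapping column names to values. The FD
    X -> A holds iff no two rows agree on X but differ on A.
    """
    seen = {}  # projection on lhs -> observed value of rhs
    for row in rows:
        key = tuple(row[c] for c in lhs)
        if key in seen and seen[key] != row[rhs]:
            return False  # same determinant, different dependent value
        seen[key] = row[rhs]
    return True

table = [
    {"zip": "14482", "city": "Potsdam", "name": "A"},
    {"zip": "14482", "city": "Potsdam", "name": "B"},
    {"zip": "10115", "city": "Berlin",  "name": "A"},
]
```

On this toy table, `zip -> city` holds while `name -> city` does not. The efficiency challenge addressed by DFD is that the number of candidate FDs grows exponentially with the number of columns, so such checks must be pruned aggressively.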
- Sebastian Kruse : Estimating the Complexity and Effort of Data Integration
Information is nowadays a valuable asset for companies, so proper data management is an important success factor for businesses. Data integration is one such data management activity. It has classical applications like ETL processes or unifying databases after a company merger, but also gains significance through the increasing importance of recent trends like "Big Data". However, data integration is typically a laborious task, and although research has devised many tools that alleviate it, there is no replacement for human integration practitioners. Data integration projects therefore remain complex, work-intensive, and expensive. Consequently, they must be subjected to project management in practice, which includes assessing the economic feasibility of projects, budgeting, and scheduling. But no good project management is possible without a proper upfront estimate of the effort a project requires, and this is indeed a problem often faced in practice.
Therefore, this thesis presents Efes, an extensible framework for the automatic effort estimation of mapping and cleaning activities in data integration projects with multiple sources and possibly non-empty target databases. Efes first searches for objective complexity factors of a given data integration scenario by inspecting the involved schemas and data, and then converts these complexity factors into an effort estimate, for which it can incorporate external factors like available data integration tools. Additionally, Efes decomposes the effort estimation problem by the different types of problems that can arise in data integration and provides models for three such problem types, namely structural conflicts, value heterogeneities, and the baseline effort for creating an executable mapping between the sources and the target. Together, these components form a versatile effort estimation tool. All design decisions of Efes have furthermore been motivated by a case study that is also presented in this thesis. Moreover, Efes has been tested with a set of different scenarios that show that it indeed follows a viable approach.
- Claudia Lehmann : Discovery of Strong- and Weak-Unique Column Combinations in Datasets
We live in a digital era where data is generated constantly around us in applications such as social networks, government, and transaction systems. We need methods to extract the meaningful information contained in this data. Here, the field of data profiling is an essential starting point for all processing tasks. Data profiling helps to understand the structure and characteristics of unknown datasets. The discovery of all unique and non-unique column combinations in datasets is a key task of data profiling. Approaches that focus on this discovery assume clean datasets, yet real-world data is dirty.
In this thesis, we introduce the new concept of strong- and weak-unique column combinations in datasets, a refinement of the traditional uniqueness concept. Strong-unique column combinations contain neither identical nor similar records, whereas weak-unique column combinations contain no identical records but do contain similar ones. Furthermore, we propose the lineage-unaware and the lineage-aware approach to discover all strong- and weak-unique column combinations in datasets. For both approaches we add a record cluster list and a similarity index to optimize performance.
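A naive pairwise check of this refined uniqueness notion could look as follows (an illustrative sketch only; the function and predicate names are hypothetical, and the thesis's approaches avoid exactly this quadratic comparison through cluster lists and a similarity index):

```python
def one_char_off(a, b):
    """Toy similarity predicate: same length, at most one differing character."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1

def classify_uniqueness(rows, columns, similar):
    """Classify a column combination as 'non-unique' (identical projected
    records exist), 'weak-unique' (no identical but some similar records),
    or 'strong-unique' (neither identical nor similar records).

    Naive O(n^2) pairwise check; `similar` compares two attribute values.
    """
    proj = [tuple(str(r[c]) for c in columns) for r in rows]
    found_similar = False
    for i in range(len(proj)):
        for j in range(i + 1, len(proj)):
            if proj[i] == proj[j]:
                return "non-unique"
            if all(similar(a, b) for a, b in zip(proj[i], proj[j])):
                found_similar = True
    return "weak-unique" if found_similar else "strong-unique"

people = [{"last": "Smith"}, {"last": "Smyth"}, {"last": "Jones"}]
```

Here `["last"]` is only weak-unique: "Smith" and "Smyth" are not identical but similar under the toy predicate.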
Finally, we evaluate all thresholds involved in the discovery process and the behavior of the presented approaches with an increasing number of rows, columns, and levels in the lattice of column combinations. For this purpose, we use the synthetic TPC-H lineitem table and the real-world NCVoter dataset. We show that both approaches outperform the naive approach that does not use any of our proposed optimization strategies. Additionally, we show that the lineage-aware approach surpasses the lineage-unaware approach with an increasing number of levels in the lattice.
- Andrina Mascher : Discovering Matching Dependencies
Data quality constraints are widely represented as data dependencies. Matching Dependencies have recently shown great potential to support data management tasks such as data cleansing, inconsistency checking, master data alignment, and fraud detection. Matching Dependencies extend Functional Dependencies in that they span a second relation and include fuzzy semantics by matching not only equal but also similar elements. Approximate dependencies can detect data inconsistencies, which need to be checked and corrected. Data experts define Matching Dependencies manually for given data instances, which is a time-consuming task and yields results that are biased by the chosen samples.
This thesis analyzes Matching Dependencies and proposes an algorithm that automatically discovers Matching Dependencies between two relations using a bottom-up lattice traversal. We introduce the notion of high-quality Matching Dependencies to filter only relevant dependencies from the exponential solution space and further provide a ranking of the results. By employing pruning rules and dependency groups, we reduce the set of possible dependencies and the computational effort to an extent that makes the discovery process feasible for modern data analytics and data cleansing tasks. Because of their capabilities in the data cleansing area, we evaluate the discovered dependencies in a duplicate detection scenario. Our experiments show the behavior and complexity of the discovery process. We observe a possible reduction ratio of 90% to 99% on the considered data.
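The basic validation step behind such a discovery, checking whether a given matching dependency holds on data, might be sketched like this (a naive quadratic illustration with difflib's string similarity standing in for the thesis's measures; all names are hypothetical):

```python
from difflib import SequenceMatcher

def sim(a, b):
    """String similarity in [0, 1] (difflib's ratio, used here as a
    stand-in for the similarity measures a matching dependency uses)."""
    return SequenceMatcher(None, a, b).ratio()

def md_holds(left, right, lhs, rhs, threshold):
    """Check a matching dependency across two relations: whenever two
    records are similar (>= threshold) on every LHS attribute, they must
    also be similar on the RHS attribute. Naive pairwise sketch."""
    for r in left:
        for s in right:
            if all(sim(r[a], s[a]) >= threshold for a in lhs):
                if sim(r[rhs], s[rhs]) < threshold:
                    return False
    return True

customers = [
    {"name": "John Smith", "phone": "555-1234"},
    {"name": "Jon Smith",  "phone": "555-1234"},
    {"name": "Alice Wong", "phone": "555-9999"},
]
```

The MD "similar names imply similar phone numbers" holds on `customers` but breaks as soon as the near-duplicate records disagree on the phone number, which is exactly the inconsistency signal exploited for duplicate detection.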
- Thorben Lindhauer : A Content-Based Serendipity Model for News Recommendation
Online news portals use recommender systems to help readers find interesting articles for further reading in their vast archives. One desired property of these systems is to induce serendipity, which occurs when a reader is not expecting a recommended article but is interested in it. Consequently, such articles are likely to be read, and systems that regularly make serendipitous recommendations are useful to readers. In order to estimate the serendipity potential of an article, a recommender has to be aware of the reader's expectations. However, news recommenders possess little knowledge about their users apart from the currently read article.
Therefore, this thesis proposes a model that estimates the unexpectedness of news articles based solely on their content. Backed by an exploratory user study on how surprising different content features are perceived to be, the model treats rarely occurring topic combinations as unexpected. This reflects the assumption that the amount of media coverage on a subject determines whether readers expect articles on it. The model is integrated with an existing content-based recommendation algorithm to make recommendations that are unexpected, but also related to the article currently being read.
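The idea of treating rare topic combinations as unexpected can be illustrated with a small scoring sketch (the concrete formula and all names are assumptions for illustration, not the thesis's exact model):

```python
import math
from collections import Counter
from itertools import combinations

def pair_counts(corpus_topic_sets):
    """Count how often each topic pair co-occurs in the corpus."""
    counts = Counter()
    for topics in corpus_topic_sets:
        counts.update(combinations(sorted(topics), 2))
    return counts

def unexpectedness(article_topics, corpus_topic_sets):
    """Score an article by the rarity of its topic pairs: rarely
    co-occurring combinations get high scores (a smoothed inverse
    frequency, averaged over the article's topic pairs)."""
    counts = pair_counts(corpus_topic_sets)
    n = len(corpus_topic_sets)
    pairs = list(combinations(sorted(article_topics), 2))
    if not pairs:
        return 0.0
    return sum(math.log(n / (1 + counts[p])) for p in pairs) / len(pairs)

corpus = ([{"politics", "sports"}] * 5
          + [{"politics", "economy"}] * 5
          + [{"quantum", "physics"}] * 2)
```

An article combining "politics" with "quantum" scores higher than one pairing "politics" with "sports", mirroring the assumption that heavily covered subject combinations are expected by readers.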
In a concluding user study, the recommender integrating the serendipity model is evaluated against baseline approaches. It is found that the proposed model does not make more surprising recommendations than an algorithm that recommends articles by their similarity. However, it finds surprising articles of a different kind: while the similarity-based recommender provides serendipitous articles on the same subject, the serendipity model provides serendipitous articles on less related subjects. A strategy to combine the strengths of both algorithms with improved performance is outlined. This demonstrates the possibility of improving existing similarity-based recommenders so that they recommend surprising articles on less related subjects.
- Gary Yao : Large-Scale Twitter Hashtag Recommendation for Documents
Over the last few years, collaborative tagging has emerged as a new approach to organizing documents. The principle of this approach is that documents are classified collectively into freely chosen categories (tags) by users. Tags found in collaborative tagging systems are inherently diverse and often guided by current events. Therefore, building recommendation models that, given a previously unseen document, can predict appropriate and up-to-date tags is a challenging task. We propose two recommendation models, ALM and HSD, which imitate user tagging behavior and can be learned incrementally on a stream of observed associations between tags and documents. We evaluate our models using a dataset compiled from Twitter, one of the most popular microblogging services, and empirically show that both ALM and HSD deliver tags of high quality.
Finally, we demonstrate that ALM scales horizontally and can be learned incrementally on amounts of data equivalent to typical Twitter workloads.
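The general principle of incrementally learning tag-document associations from a stream can be sketched as follows (a toy model with hypothetical names, not ALM or HSD themselves):

```python
from collections import defaultdict, Counter

class IncrementalTagger:
    """Toy incremental tag recommender: it maintains term->tag
    co-occurrence counts that can be updated one (document, tags)
    observation at a time, so no batch retraining is needed."""

    def __init__(self):
        self.cooc = defaultdict(Counter)  # term -> tag -> count

    def observe(self, text, tags):
        """Update the model with one observed document and its tags."""
        for term in set(text.lower().split()):
            for tag in tags:
                self.cooc[term][tag] += 1

    def recommend(self, text, k=3):
        """Recommend the k tags most associated with the document's terms."""
        scores = Counter()
        for term in set(text.lower().split()):
            scores.update(self.cooc[term])
        return [tag for tag, _ in scores.most_common(k)]
```

Because each observation only increments counters, the model can be updated on a live stream, and the per-term counters partition naturally across machines, which is the kind of property that makes horizontal scaling feasible.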
- Zuo Zhe : BEL: Bagging for Entity Linking
With recent advances in the areas of knowledge engineering and information extraction, the task of linking textual mentions of named entities to corresponding ones in a knowledge base has received much attention. The rich, structured information in state-of-the-art knowledge bases can be leveraged to facilitate this task. Although recent approaches achieve satisfactory accuracy results, they typically suffer from at least one of the following issues: (1) the linking quality is highly sensitive to the amount of textual information; typically, long textual fragments are needed to capture the context of a mention, (2) the disambiguation uncertainty is not explicitly addressed and often only implicitly represented by the ranking of entities to which a mention could be linked, (3) complex, joint reasoning negatively affects the efficiency.
We propose an entity linking technique that addresses the above issues by (1) operating on a textual range of relevant terms, (2) aggregating decisions from an ensemble of simple classifiers, each of which operates on a randomly sampled subset from the above range, (3) following local reasoning by exploiting previous decisions whenever possible. In extensive experiments on hand-labeled and benchmark datasets, our approach outperformed state-of-the-art entity linking techniques, both in terms of quality and efficiency.
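The ensemble idea in point (2) can be illustrated with a minimal sketch (the term-overlap scoring and all names are assumptions for illustration, not BEL's actual classifiers):

```python
import random
from collections import Counter

def link_entity(context_terms, candidates, n_classifiers=15, sample_size=5, seed=0):
    """Bagging-style linking sketch: each simple classifier votes for the
    candidate entity whose description overlaps most with a randomly
    sampled subset of the mention's context terms; the majority wins."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_classifiers):
        sample = rng.sample(context_terms, min(sample_size, len(context_terms)))
        best = max(candidates, key=lambda e: len(set(sample) & candidates[e]))
        votes[best] += 1
    return votes.most_common(1)[0][0]

# hypothetical knowledge-base entries: entity -> description terms
kb = {
    "Paris_(France)": {"france", "capital", "seine", "city", "europe"},
    "Paris_Hilton":   {"celebrity", "hotel", "heiress", "socialite"},
}
```

The vote distribution also gives an explicit handle on disambiguation uncertainty: a near-unanimous vote signals a confident link, while a split vote flags an ambiguous mention.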
- Peter Büchler : Indexing Genomic Data on Hadoop
In recent years, Hadoop has become a standard backend for big data applications. Its best-known component, MapReduce, provides a powerful parallel programming paradigm. Big companies storing petabytes of data, such as Facebook and Yahoo, have deployed their own Hadoop distributions for data analytics, interactive services, etc. Nevertheless, the simplicity of MapReduce's map stage always entails a full scan of the input data and thus potentially wastes resources.
Recently, new sources of big data, e.g. the 4K video format or genomic data, have appeared. Genomic data in its raw file format (FASTQ) can take up hundreds of gigabytes per file. Simply using MapReduce for a population analysis would easily result in a full data scan over terabytes of data. Obviously, there is a need for more efficient ways of accessing the data by reducing the amount of data considered for the computation. Existing approaches introduce indexing structures into their respective Hadoop distributions. While some of them are specifically made for certain data structures, e.g. key-value pairs, others strongly depend on the existence of a MapReduce framework.
We integrated an indexing structure into HDFS, independently of MapReduce and YARN. This structure supports the definition of custom input formats and individual indexing strategies. The building of an index is integrated into the file writing process and is independent of software working in the higher layers of Hadoop. As a proof of concept, MapReduce and YARN have nevertheless been enabled to use these indexing structures by simply adding a new parameter, the index request/query, to the job definition. A prototype and its evaluation show the advantages of using those structures with genomic data (FASTQ and SAM files) as a use case.
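The benefit of such an index can be illustrated with a toy block-level min/max index over position-sorted reads (a simplified sketch, not the actual HDFS integration; all names are hypothetical):

```python
def build_block_index(records, block_size):
    """Coarse block-level index: for every block of records, remember the
    min/max of the indexed key (here a genomic position) so that range
    queries can skip whole blocks without reading them."""
    index = []
    for start in range(0, len(records), block_size):
        keys = [pos for pos, _ in records[start:start + block_size]]
        index.append((start, min(keys), max(keys)))
    return index

def query_range(records, index, block_size, lo, hi):
    """Return all records whose position lies in [lo, hi], scanning only
    blocks whose key range intersects the query."""
    hits = []
    for start, kmin, kmax in index:
        if kmax < lo or kmin > hi:
            continue  # block cannot contain matches: skipped entirely
        hits.extend(r for r in records[start:start + block_size]
                    if lo <= r[0] <= hi)
    return hits

# toy aligned reads: (position, read id), sorted by position
reads = [(i * 10, "read%d" % i) for i in range(100)]
idx = build_block_index(reads, block_size=10)
```

A query touching a small genomic region then reads only the one or two blocks that overlap it, which is the scan reduction a plain MapReduce job over the raw file cannot achieve.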
- Ahmad Samiei : A semi-supervised method for topic extraction from Twitter data in space and time through a visual analytics approach
Due to advances in information technology, in particular the proliferation of Internet-connected personal devices such as smartphones, social networking sites such as Facebook and Twitter have become the most preferred means of communication among private individuals and even in business communication. Social networking accounts such as Twitter provide vast amounts of real-time feedback on various services and enable us to learn about and improve services more quickly than ever before. However, tweets are very short and noisy text fragments with an extremely diverse set of writing styles and slang, as well as a significant share of unintentional and intentional misspellings and grammatical errors. Due to this very nature of tweets, extracting meaningful information from them is a very challenging task. Even best-of-breed contemporary text analysis and topic modeling approaches struggle with the analysis of tweet-based corpora. Therefore, there is a clear need for novel and improved analysis methods for text mining on Twitter data sources.
This thesis’ contributions are two-fold. First, an extensive literature survey has been conducted to provide a comprehensive overview of the current state of the art in text and topic modeling and of the various challenges that related studies have faced in deriving a satisfactory solution. Moreover, the LDA topic model, widely considered the leading approach at present, has been evaluated in detail to identify the challenges involved in reaching a more meaningful solution. For this purpose, LDA has been applied to a Twitter corpus generated using a Twitter crawler also developed in the course of this thesis.
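For reference, the collapsed Gibbs sampling procedure commonly used to fit LDA can be sketched compactly (an illustrative stdlib-only implementation; the thesis's evaluation would rely on an optimized library rather than this toy version):

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.

    `docs` is a list of token lists. Returns per-document topic counts
    and per-topic word counts, from which the usual theta/phi
    distributions can be derived.
    """
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    doc_topic = [[0] * n_topics for _ in docs]                # doc-topic counts
    topic_word = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    topic_total = [0] * n_topics
    assign = []                                               # topic of each token
    for di, doc in enumerate(docs):                           # random initialization
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[di][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for di, doc in enumerate(docs):
            for wi, w in enumerate(doc):
                z = assign[di][wi]                            # remove token from counts
                doc_topic[di][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [                                   # full conditional p(z=k | rest)
                    (doc_topic[di][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[di][wi] = z                            # re-add with new topic
                doc_topic[di][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

tweets = [
    "goal match team soccer".split(),
    "soccer team match win".split(),
    "stock market price falls".split(),
    "market price stock rises".split(),
]
doc_topic, topic_word = lda_gibbs(tweets, n_topics=2)
```

The sparsity of tweets is visible even here: with only a handful of tokens per document, the per-document counts that drive the sampler carry very little signal, which is one reason vanilla LDA struggles on tweet corpora.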
Second, this thesis proposes a novel semi-supervised method for topic modeling based on a visual analytics approach to address the identified shortcomings of contemporary approaches, including LDA and its variants. In the proposed approach, results from a runtime-efficient automatic data analysis component are presented visually to provide a first intuition regarding relevant topics appearing in the corpus. After the automatic data analysis has been carried out, the human analyst uses a visual data analysis component to explore and refine the results under his or her supervision. The proposed approach has furthermore been applied in a real data analysis setting concerning the extraction of personal places from geo-located tweets. The results show that the proposed tool facilitated the extraction of more meaningful topics for the tweet corpus related to “work” and “home” places, and moreover revealed the need to improve tweet filtering in the preprocessing stage of the presented case study.
- Markus Freitag : Optimization and Parallelization of Foodborne Disease Outbreak Analyses
Foodborne diseases have a history as long as mankind itself. A disease outbreak has enormous consequences for both society and the economy, such as high death rates and increased costs to the healthcare system. Efforts on the national and international level counteract disease outbreaks with varying success. However, the number of pathogens, their resistances, and the diseases they cause have risen steadily over the last decades, resulting in new challenges for the risk assessment of foodborne diseases. Current assessment methods rely mostly on manual surveys and do not exploit the benefits of modern information technology to support the investigations. Hence, new approaches need to be developed that make use of computer-based solutions to deliver insights rapidly while also considering current challenges.
In this thesis, we introduce the Minimum Confidence Level, a new method for the outbreak analysis of foodborne diseases. This method allows risk assessors to determine a set of possibly causative food products with a certain confidence. For this purpose, virtual outbreak scenarios are generated and evaluated based on available sales data. The method uses a likelihood-based approach to identify outbreak sources and evaluates the resulting probability for each food product in relation to the information provided by the given dataset. The computed confidence makes it possible to account for real-world challenges during outbreak analyses, such as correlated product distributions in the available data and data incompleteness. To address the arising computational costs and to enable a rapid execution of the method, we also introduce a parallelized implementation developed with the Stratosphere framework. Furthermore, we present an integrated workflow for the application and implementation of the method using the KNIME platform.
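The likelihood-based ranking idea can be illustrated with a minimal multinomial sketch (names and the exact score are assumptions for illustration; the Minimum Confidence Level additionally evaluates generated outbreak scenarios on top of such a score):

```python
import math

def rank_sources(case_counts, sales_shares):
    """Rank candidate food products by how well their regional sales
    distribution explains the observed regional case counts, using a
    simple multinomial log-likelihood."""
    scores = {}
    for product, shares in sales_shares.items():
        ll = 0.0
        for region, cases in case_counts.items():
            p = shares.get(region, 0.0)
            if p == 0.0 and cases > 0:
                ll = float("-inf")  # product not sold where cases occurred
                break
            if cases > 0:
                ll += cases * math.log(p)
        scores[product] = ll
    return sorted(scores, key=scores.get, reverse=True)

cases = {"north": 9, "south": 1}
sales = {
    "product_X": {"north": 0.9, "south": 0.1},  # sold mostly where cases are
    "product_Y": {"north": 0.5, "south": 0.5},  # sold uniformly
}
```

A product whose sales concentrate where the cases occur explains the data better and ranks first; correlated product distributions are what makes this ranking ambiguous in practice and motivates the confidence computation.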
We evaluate the Minimum Confidence Level with respect to various influencing factors, such as data incompleteness. The evaluation shows that the quality of the results depends mostly on the number of products and the spatial granularity of the data. Additionally, we determine essential parameters for our method, such as the optimal number of generated outbreak scenarios. Moreover, we evaluate the performance of our implementation in a scalable environment as well as the influence of certain characteristics of the data on the execution time. The implementation scales almost linearly with the number of nodes utilized in the evaluation environment.
- Lars Kolb : Effiziente MapReduce-Parallelisierung von Entity Resolution-Workflows
In recent years, the newly emerged paradigm Infrastructure as a Service has massively changed the IT world. The provisioning of computing infrastructure by external providers makes it possible to acquire, on demand and at short notice, large amounts of computing power, storage, and bandwidth without upfront investments. At the same time, both the amount of freely available data and the amount of data to be managed within companies are growing dramatically. The need to manage and analyze these data volumes efficiently required an evolution of existing IT technologies and led to the emergence of new research areas and a multitude of innovative systems. A typical characteristic of these systems is distributed storage and data processing in large clusters of commodity hardware. The MapReduce programming model in particular has gained increasing importance over the past ten years. It enables the distributed processing of large data volumes and abstracts from the details of distributed computing and the handling of hardware failures. This dissertation focuses on the use of the MapReduce concept for the automatic parallelization of compute-intensive entity resolution tasks. Entity resolution is an important subfield of information integration whose goal is the detection of records from one or more data sources that describe the same real-world object. The dissertation presents, step by step, techniques that solve different subproblems of the MapReduce-based execution of entity resolution workflows.
To detect duplicates, entity resolution approaches typically compare pairs of records using several similarity measures. Evaluating the Cartesian product of n records leads to a quadratic complexity of O(n²) and is therefore only practical for small to medium-sized data sources. For data sources with more than 100,000 records, runtimes of several hours arise even with distributed execution. So-called blocking techniques are therefore employed to reduce the search space. The underlying assumption is that records falling below a certain minimum similarity need not be compared with each other. The thesis presents a MapReduce-based implementation of the evaluation of the Cartesian product as well as of several well-known blocking techniques. After the records have been compared, the candidate pairs are finally classified as match or non-match. With a growing number of attribute values and similarity measures in use, manually defining a high-quality strategy for combining the resulting similarity values becomes hardly manageable. For this reason, the thesis investigates the integration of machine learning techniques into MapReduce-based entity resolution workflows.
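The blocking idea can be sketched in a few lines of Python (an illustrative single-machine version with hypothetical names; the thesis maps this pattern onto MapReduce):

```python
from collections import defaultdict
from itertools import combinations

def blocked_pairs(records, key):
    """Standard blocking sketch: group records by a semantic key derived
    from their attribute values and generate candidate pairs only within
    each block, instead of over the full Cartesian product."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r)
    for block in blocks.values():
        yield from combinations(block, 2)

products = [
    {"maker": "acme",   "title": "Phone X"},
    {"maker": "acme",   "title": "Phone X (black)"},
    {"maker": "globex", "title": "Tablet Z"},
    {"maker": "globex", "title": "Tablet Z 2024"},
]
```

Blocking by manufacturer yields 2 candidate pairs instead of the 6 pairs of the full Cartesian product; the grouping step corresponds directly to MapReduce's shuffle on the semantic key.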
Implementing blocking techniques with MapReduce requires partitioning the set of pairs to be compared and assigning the partitions to the available processes. The assignment is based on a semantic key that is derived from the records' attribute values according to the specific blocking strategy. For example, when deduplicating product records, one might compare only products of the same manufacturer. Having a single process handle all records with the same key leads, in the presence of data skew, to severe load balancing problems, which are aggravated by the inherent quadratic complexity. This drastically reduces the runtime efficiency and scalability of the corresponding MapReduce programs, since a large part of a cluster's resources remains idle while a few processes have to perform most of the work. Providing techniques for the even utilization of the available resources is another focus of the thesis.
Blocking strategies must always trade off efficiency against data quality. A large reduction of the search space promises a significant speed-up, but causes similar records, e.g. ones with erroneous attribute values, not to be compared with each other. It is therefore helpful to generate, for each record, several semantic keys derived from different attributes. This, however, causes similar records to be compared with each other multiple times, unnecessarily, under different keys. The thesis therefore presents algorithms that avoid such redundant similarity computations.
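The redundancy-avoidance idea can be sketched as follows (a single-machine illustration with hypothetical key functions; the thesis realizes this inside MapReduce): a pair that falls into several common blocks is emitted only under the smallest key the two records share.

```python
from collections import defaultdict
from itertools import combinations

def multi_key_pairs(records, key_funcs):
    """Multi-pass blocking without redundant comparisons: every record
    receives one key per blocking function, and a candidate pair is
    emitted only in the block of the smallest key both records share,
    so it is compared exactly once."""
    keys = [{f(r) for f in key_funcs} for r in records]
    blocks = defaultdict(list)
    for i, ks in enumerate(keys):
        for k in ks:
            blocks[k].append(i)
    for k, members in sorted(blocks.items()):
        for i, j in combinations(members, 2):
            if min(keys[i] & keys[j]) == k:  # smallest shared block only
                yield i, j

names = ["anna", "alda", "mona"]  # toy records: keyed by first and last character
key_funcs = [lambda s: "first:" + s[0], lambda s: "last:" + s[-1]]
```

Here "anna" and "alda" share both keys, so a naive multi-pass run would compare them twice; the shared-key check emits the pair only once, and the decision needs no coordination between blocks, which is what makes it suitable for parallel execution.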
As the result of this work, the entity resolution framework Dedoop is presented, which abstracts from the developed MapReduce algorithms and enables a high-level specification of complex entity resolution workflows. Dedoop combines all techniques and optimizations presented in this work in a user-friendly system. The prototype automatically translates user-defined workflows into a set of MapReduce jobs and manages their parallel execution on MapReduce clusters. Through the full integration of the cloud services Amazon EC2 and Amazon S3 into Dedoop, and by making the system publicly available, end users without MapReduce knowledge can execute complex entity resolution workflows on private or dynamically provisioned external MapReduce clusters.