For bachelor students we offer German lectures on database systems in addition with paper- or project-oriented seminars. Within a one-year bachelor project students finalize their studies in cooperation with external partners. For master students we offer courses on information integration, data profiling, search engines and information retrieval enhanced by specialized seminars, master projects and advised master theses.
Most of our research is conducted in the context of larger research projects, in collaboration across students, across groups, and across universities. We strive to make available most of our data sets and source code.
AbstractUnique column combinations (UCCs) are groups of attributes in relational datasets that contain no value-entry more than once. Hence, they indicate keys and serve data management tasks, such as schema normalization, data integration, and data cleansing. Because the unique column combinations of a particular dataset are usually unknown, UCC discovery algorithms have been proposed to find them. All previous such discovery algorithms are, however, inapplicable to datasets of typical real-world size, e.g., datasets with more than 50 attributes and a million records. We present the hybrid discovery algorithm HyUCC, which uses the same discovery techniques as the recently proposed functional dependency discovery algorithm HyFD: A hybrid combination of fast approximation techniques and efficient validation techniques. With it, the algorithm discovers all minimal unique column combinations in a given dataset. HyUCC does not only outperform all existing approaches, it also scales to much larger datasets.
Progressive Duplicate Detection.Papenbrock, Thorsten; Heise, Arvid; Naumann, Felix in IEEE Transactions on Knowledge and Data Engineering (TKDE) (2015). 27(5) 1316-1329.
AbstractDuplicate detection is the process of identifying multiple representations of same real world entities. Today, duplicate detection methods need to process ever larger datasets in ever shorter time: maintaining the quality of a dataset becomes increasingly difficult. We present two novel, progressive duplicate detection algorithms that significantly increase the efficiency of finding duplicates if the execution time is limited: They maximize the gain of the overall process within the time available by reporting most results much earlier than traditional approaches. Comprehensive experiments show that our progressive algorithms can double the efficiency over time of traditional duplicate detection and significantly improve upon related work.
SOFA: An Extensible Logical Optimizer for UDF-heavy Data Flows.Rheinländer, Astrid; Heise, Arvid; Hueske, Fabian; Leser, Ulf; Naumann, Felix in Information Systems (2015). 52 96-125.
AbstractRecent years have seen an increased interest in large-scale analytical data flows on non-relational data. These data flows are compiled into execution graphs scheduled on large compute clusters. In many novel application areas the predominant building blocks of such data flows are user-defined predicates or functions (UDFs). However, the heavy use of UDFs is not well taken into account for data flow optimization in current systems. SOFA is a novel and extensible optimizer for UDF-heavy data flows. It builds on a concise set of properties for describing the semantics of Map/Reduce-style UDFs and a small set of rewrite rules, which use these properties to find a much larger number of semantically equivalent plan rewrites than possible with traditional techniques. A salient feature of our approach is extensibility: We arrange user-defined operators and their properties into a subsumption hierarchy, which considerably eases integration and optimization of new operators. We evaluate SOFA on a selection of UDF-heavy data flows from different domains and compare its performance to three other algorithms for data flow optimization. Our experiments reveal that SOFA finds efficient plans, outperforming the best plans found by its competitors by a factor of up to six.
Estimating the Number and Sizes of Fuzzy-Duplicate Clusters.Heise, Arvid; Kasneci, Gjergji; Naumann, Felix (2014). 959-968.
AbstractWe present Stratosphere, an open-source software stack for parallel data analysis. Stratosphere brings together a unique set of features that allow the expressive, easy, and efficient programming of analytical applications at very large scale. Stratosphere’s features include “in situ” data processing, a declarative query language, treatment of user-defined functions as first-class citizens, automatic program parallelization and optimization, support for iterative programs, and a scalable and efficient execution engine. Stratosphere covers a variety of “Big Data” use cases, such as data warehousing, information extraction and integration, data cleansing, graph analysis, and statistical analysis applications. In this paper, we present the overall system architecture design decisions, introduce Stratosphere through example queries, and then dive into the internal workings of the system’s components that relate to extensibility, programming model, optimization, and query execution. We experimentally compare Stratosphere against popular open-source alternatives, and we conclude with a research outlook for the next years.
Versatile optimization of UDF-heavy data flows with SOFA (demo).Rheinländer, Astrid; Beckmann, Martin; Kunkel, Anja; Heise, Arvid; Stoltmann, Thomas; Leser, Ulf (2014). 685-688.
AbstractThe discovery of all unique (and non-unique) column combinations in a given dataset is at the core of any data profiling effort. The results are useful for a large number of areas of data management, such as anomaly detection, data integration, data modeling, duplicate detection, indexing, and query optimization. However, discovering all unique and non-unique column combinations is an NP-hard problem, which in principle requires to verify an exponential number of column combinations for uniqueness on all data values. Thus, achieving efficiency and scalability in this context is a tremendous challenge by itself. In this paper, we devise DUCC, a scalable and efficient approach to the problem of finding all unique and non-unique column combinations in big datasets. We first model the problem as a graph coloring problem and analyze the pruning effect of individual combinations. We then present our hybrid column-based pruning technique, which traverses the lattice in a depth-first and random walk combination. This strategy allows DUCC to typically depend on the solution set size and hence to prune large swaths of the lattice. DUCC also incorporates row-based pruning to run uniqueness checks in just few milliseconds. To achieve even higher scalability, DUCC runs on several CPU cores (scale-up) and compute nodes (scale-out) with a very low overhead. We exhaustively evaluate DUCC using three datasets (two real and one synthetic) with several millions rows and hundreds of attributes. We compare DUCC with related work: Gordian and HCA. The results show that DUCC is up to more than 2 orders of magnitude faster than Gordian and HCA (631x faster than Gordian and 398x faster than HCA). Finally, a series of scalability experiments shows the efficiency of DUCC to scale up and out.
Applying Stratosphere for Big Data Analytics.Leich, Marcus; Adamek, Jochen; Schubotz, Moritz; Heise, Arvid; Rheinlander, Astrid; Markl, Volker (2013).
AbstractGovernments are increasingly publishing their data to enable organizations and citizens to browse and analyze thedata. However, the heterogeneity of this Open Government Data hinders meaningful search, analysis, and integrationand thus limits the desired transparency.In this article, we present the newly developed data integration operators of the Stratosphere parallel data analysisframework to overcome the heterogeneity. With declaratively specified queries, we demonstrate the integration ofwell-known government data sources and other large open data sets at technical, structural, and semantic levels.Furthermore, we publish the integrated data on theWeb in a form that enables users to discover relationships betweenpersons, government agencies, funds, and companies. The evaluation shows that linking person entities of dierentdata sets results in a good precision of 98.3% and a recall of 95.2%. Moreover, the integration of large data sets scaleswell on up to eight machines.
GovWILD: Integrating Open Government Data for Transparency (demo).Böhm, Christoph; Freitag, Markus; Heise, Arvid; Lehmann, Claudia; Mascher, Andrina; Naumann, Felix; Hernandez, Mauricio; Ercegovac, Vuk; Haase, Peter (2012).