For bachelor students, we offer German-language lectures on database systems in addition to paper- or project-oriented seminars. In a one-year bachelor project, students complete their studies in cooperation with external partners. For master students, we offer courses on information integration, data profiling, and information retrieval, complemented by specialized seminars and master projects, and we advise master theses.
Most of our research is conducted in the context of larger research projects, in collaboration across students, across groups, and across universities. We strive to make available most of our datasets and source code.
Abstract: Unique column combinations (UCCs) are groups of attributes in relational datasets in which no combination of values occurs more than once. Hence, they indicate keys and serve data management tasks, such as schema normalization, data integration, and data cleansing. Because the unique column combinations of a particular dataset are usually unknown, UCC discovery algorithms have been proposed to find them. All previous discovery algorithms are, however, inapplicable to datasets of typical real-world size, e.g., datasets with more than 50 attributes and a million records. We present the hybrid discovery algorithm HyUCC, which uses the same discovery techniques as the recently proposed functional dependency discovery algorithm HyFD: a hybrid combination of fast approximation techniques and efficient validation techniques. With it, the algorithm discovers all minimal unique column combinations in a given dataset. HyUCC not only outperforms all existing approaches, but also scales to much larger datasets.
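To make the notion of a minimal unique column combination concrete, here is a minimal Python sketch. It is not HyUCC: it is a brute-force, level-wise search that is exponential in the number of attributes and only suitable for toy data. The function names `is_unique` and `minimal_uccs` and the sample rows are assumptions made for illustration.

```python
from itertools import combinations


def is_unique(rows, columns):
    """Return True if no combination of values in `columns` occurs in more than one row."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_uccs(rows, attributes):
    """Brute-force, level-wise search for all minimal UCCs (illustration only)."""
    found = []
    for size in range(1, len(attributes) + 1):
        for candidate in combinations(attributes, size):
            # A superset of an already-found UCC is unique but not minimal, so skip it.
            if any(set(ucc) <= set(candidate) for ucc in found):
                continue
            if is_unique(rows, candidate):
                found.append(candidate)
    return found


# Hypothetical sample data: no single column is unique on its own.
rows = [
    {"first": "Ann", "last": "Lee", "zip": "10115"},
    {"first": "Ann", "last": "Kim", "zip": "10115"},
    {"first": "Bob", "last": "Lee", "zip": "14482"},
]
result = minimal_uccs(rows, ["first", "last", "zip"])
print(result)  # [('first', 'last'), ('last', 'zip')]
```

In this toy table, the two-column combinations (first, last) and (last, zip) are the minimal UCCs; the full three-column combination is unique as well but is skipped because it is not minimal.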
Data Profiling (tutorial). Abedjan, Ziawasch; Golab, Lukasz; Naumann, Felix (2017).
Abstract: Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher and is necessary for various use-cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
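The simpler single-column statistics mentioned in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only; the function names `value_pattern` and `profile_column`, the pattern alphabet ('A' for letters, '9' for digits), and the sample values are assumptions, not taken from any particular profiling tool.

```python
import re
from collections import Counter


def value_pattern(value):
    """Abstract a string: letters become 'A', digits become '9', other characters are kept."""
    return re.sub(r"[0-9]", "9", re.sub(r"[A-Za-z]", "A", value))


def profile_column(values):
    """Compute null count, distinct count, and the most frequent value pattern of a column."""
    non_null = [v for v in values if v is not None]
    patterns = Counter(value_pattern(str(v)) for v in non_null)
    return {
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "top_pattern": patterns.most_common(1)[0][0] if patterns else None,
    }


# Hypothetical column of postal codes with one null and one irregular value.
stats = profile_column(["10115", "14482", None, "10115", "D-144"])
print(stats)  # {'nulls': 1, 'distinct': 3, 'top_pattern': '99999'}
```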
Profiling and Mining RDF Data with ProLOD++. Abedjan, Ziawasch; Gruetze, Toni; Jentzsch, Anja; Naumann, Felix (2014).
Abstract: The discovery of unknown functional dependencies in a dataset is of great importance for database redesign, anomaly detection, and data cleansing applications. However, because the problem is exponential in the number of attributes, none of the existing approaches can be applied to large datasets. We present a new algorithm, DFD, for discovering all functional dependencies in a dataset following a depth-first traversal strategy of the attribute lattice that combines aggressive pruning and efficient result verification. Our approach is able to scale far beyond existing algorithms for up to 7.5 million tuples, and is up to three orders of magnitude faster than existing approaches on smaller datasets.
Winner of the CIKM 2014 Best Student Paper Award.
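As a reminder of what verifying a single functional dependency involves, the following Python sketch checks whether a dependency X -> A holds: all rows that agree on the attributes in X must also agree on A. It is a plain hash-based check written for illustration, not the verification technique used in DFD; the function name `holds` and the sample rows are assumptions.

```python
def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in `rows`."""
    mapping = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        # setdefault stores the first rhs value seen for this lhs value
        # and returns the stored value on every later occurrence.
        if mapping.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


# Hypothetical sample data.
rows = [
    {"zip": "10115", "city": "Berlin", "name": "Ann"},
    {"zip": "10115", "city": "Berlin", "name": "Bob"},
    {"zip": "14482", "city": "Potsdam", "name": "Ann"},
]
zip_to_city = holds(rows, ["zip"], "city")    # True: each zip maps to one city
name_to_city = holds(rows, ["name"], "city")  # False: "Ann" maps to two cities
```

A discovery algorithm has to decide which of the exponentially many candidate dependencies to check in this way, which is where lattice traversal and pruning come in.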
Abstract: The discovery of all unique (and non-unique) column combinations in a given dataset is at the core of any data profiling effort. The results are useful for a large number of areas of data management, such as anomaly detection, data integration, data modeling, duplicate detection, indexing, and query optimization. However, discovering all unique and non-unique column combinations is an NP-hard problem, which in principle requires verifying an exponential number of column combinations for uniqueness on all data values. Thus, achieving efficiency and scalability in this context is a tremendous challenge in itself. In this paper, we devise DUCC, a scalable and efficient approach to the problem of finding all unique and non-unique column combinations in big datasets. We first model the problem as a graph coloring problem and analyze the pruning effect of individual combinations. We then present our hybrid column-based pruning technique, which traverses the lattice in a combination of depth-first search and random walk. This strategy typically makes the runtime of DUCC depend on the size of the solution set and hence allows it to prune large swaths of the lattice. DUCC also incorporates row-based pruning to run uniqueness checks in just a few milliseconds. To achieve even higher scalability, DUCC runs on several CPU cores (scale-up) and compute nodes (scale-out) with very low overhead. We exhaustively evaluate DUCC using three datasets (two real and one synthetic) with several million rows and hundreds of attributes. We compare DUCC with related work: Gordian and HCA. The results show that DUCC is more than two orders of magnitude faster than Gordian and HCA (up to 631x faster than Gordian and 398x faster than HCA). Finally, a series of scalability experiments shows that DUCC scales up and out efficiently.
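Column-based pruning in the lattice rests on two inference rules: every superset of a unique combination is unique, and every subset of a non-unique combination is non-unique. The Python sketch below illustrates only these two rules, not DUCC's traversal strategy or its row-based pruning; the function name `infer` and the example column names are assumptions.

```python
def infer(candidate, uniques, non_uniques):
    """Return True/False if uniqueness follows from known results, else None."""
    candidate = frozenset(candidate)
    if any(u <= candidate for u in uniques):
        return True   # superset of a known unique combination
    if any(candidate <= n for n in non_uniques):
        return False  # subset of a known non-unique combination
    return None       # cannot be inferred; must be checked against the data


# Hypothetical results gathered so far during a lattice traversal.
uniques = {frozenset({"last", "zip"})}
non_uniques = {frozenset({"first", "zip"})}

a = infer({"first", "last", "zip"}, uniques, non_uniques)  # True, no data access needed
b = infer({"zip"}, uniques, non_uniques)                   # False, no data access needed
c = infer({"first", "last"}, uniques, non_uniques)         # None, requires a uniqueness check
```

Each combination whose status can be inferred this way saves one uniqueness check on the data, which is why the order in which the lattice is traversed matters.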