For bachelor students we offer German lectures on database systems in addition to paper- or project-oriented seminars. Within a one-year bachelor project, students finalize their studies in cooperation with external partners. For master students we offer courses on information integration, data profiling, search engines, and information retrieval, enhanced by specialized seminars, master projects, and advised master theses.
The Web Science group focuses on various topics related to the Web, such as Information Retrieval, Natural Language Processing, Data Mining, Knowledge Discovery, Social Network Analysis, Entity Linking, and Recommender Systems. The group is particularly interested in Text Mining to deal with the vast amount of unstructured and semi-structured information available on the Web.
Most of our research is conducted in the context of larger research projects, in collaboration across students, across groups, and across universities. We strive to make available most of our data sets and source code.
Data profiling comprises a broad range of methods to efficiently analyze a given data set. In a typical scenario, which mirrors the capabilities of commercial data profiling tools, tables of a relational database are scanned to derive metadata, such as data types and value patterns, completeness and uniqueness of columns, keys and foreign keys, and occasionally functional dependencies and association rules. Individual research projects have proposed several additional profiling tasks, such as the discovery of inclusion dependencies or conditional functional dependencies.
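The single-column part of this scenario can be illustrated with a minimal sketch. It assumes a column is given as a list of strings with None marking missing values; the function names and the returned dictionary keys are hypothetical and are not taken from Metanome or any commercial tool.

```python
def infer_type(value):
    """Classify one string value as 'integer', 'float', or 'string'."""
    for label, parser in (("integer", int), ("float", float)):
        try:
            parser(value)
            return label
        except ValueError:
            continue
    return "string"


def profile_column(values):
    """Derive simple metadata for one column (None marks a missing value)."""
    non_null = [v for v in values if v is not None]
    types = {infer_type(v) for v in non_null}
    if not types:
        data_type = "unknown"          # column holds no values at all
    elif types == {"integer"}:
        data_type = "integer"
    elif types <= {"integer", "float"}:
        data_type = "float"            # mixed integers and floats
    else:
        data_type = "string"
    distinct = len(set(non_null))
    return {
        "data_type": data_type,
        # Share of rows that carry a value.
        "completeness": len(non_null) / len(values) if values else 0.0,
        "distinct_values": distinct,
        # A column is unique if no non-missing value repeats.
        "is_unique": distinct == len(non_null),
    }


if __name__ == "__main__":
    print(profile_column(["1", "2", None, "2"]))
```

Multi-column metadata such as keys, foreign keys, and functional dependencies cannot be derived column by column in this way, which is what makes their discovery considerably more expensive.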
The Metanome project is a project at HPI in cooperation with the Qatar Computing Research Institute (QCRI). Metanome provides a fresh view on data profiling by developing and integrating efficient algorithms into a common tool, expanding on the functionality of data profiling, and addressing performance and scalability issues for Big Data. A vision of the Metanome project appeared in SIGMOD Record ("Data Profiling Revisited"), and a demo of the Metanome profiling tool was given at VLDB 2015 ("Data Profiling with Metanome").
Tanja Bergmann (Backend, Frontend, and Architecture)
Vincent Schwarzer (Backend and Architecture)
Maxi Fischer (Backend and Frontend)
Moritz Finke (Backend and Architecture)
Carl Ambroselli (Frontend)
Jakob Zwiener (Backend and Architecture)
Claudia Exeler (Frontend)
Projects within Metanome
Unique column combination discovery: As a prerequisite for unique constraints and keys, UCCs are a basic piece of metadata for any table. The problem is particularly complex because the number of column combinations grows exponentially with the number of columns. We address the problem with parallelization and pruning strategies. This work is in collaboration with QCRI.
Inclusion dependency discovery: As a prerequisite for foreign keys, INDs can tell us how tables within a schema can be connected. When regarding tables of different data sources, conditional IND discovery is of particular relevance. See also the completed Aladin project and publications by Jana Bauckmann et al., in particular our Spider algorithm.
Incremental dependency discovery: We are extending our work on UCC and IND discovery to tables that receive incremental updates. The goal is to avoid a complete re-computation and restrict processing to relevant columns, records, and dependencies.
Profiling and Mining RDF data: The <subject, predicate, object> data model of RDF necessitates new approaches to basic profiling and data mining methods. See also: ProLOD++ demo
Functional dependency discovery: Functional dependencies express relationships between attributes of a database relation and are extensively used in data analysis and database design, especially schema normalization. We contribute to research in this area by evaluating current state-of-the-art algorithms and developing faster and more scalable approaches. See also: FD algorithms
Order dependency discovery: Order dependencies (ODs) describe a relationship of order between lists of attributes in a relational table. ODs can help to understand the semantics of datasets and the applications producing them. The existence of an OD in a table can provide hints on which integrity constraints are valid for the domain of the data at hand. Moreover, order dependencies have applications in the field of query optimization by suggesting query rewrites. See also: OD algorithms
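To make two of the dependency types above concrete, the following is a naive sketch of minimal UCC discovery and of a functional dependency check. It assumes a table is a list of equally long tuples and that columns are addressed by index. The function names are hypothetical, and the level-wise search with superset pruning shown here is a textbook baseline, not the parallelized algorithms developed within Metanome.

```python
from itertools import combinations


def is_unique(rows, columns):
    """True if no two rows agree on all of the given column indexes."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_uccs(rows, num_columns):
    """Enumerate column combinations by size and keep the minimal unique ones."""
    found = []
    for size in range(1, num_columns + 1):
        for combo in combinations(range(num_columns), size):
            # Pruning: a superset of a known UCC is unique but not minimal.
            if any(set(ucc) <= set(combo) for ucc in found):
                continue
            if is_unique(rows, combo):
                found.append(combo)
    return found


def fd_holds(rows, lhs, rhs):
    """True if equal values in the lhs columns always imply the same rhs value."""
    mapping = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        if mapping.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


if __name__ == "__main__":
    table = [("a", 1, "x"), ("a", 2, "x"), ("b", 1, "y")]
    print(minimal_uccs(table, 3))
    print(fd_holds(table, (0,), 2))
```

Even with pruning, the search visits up to 2^n - 1 column combinations for n columns in the worst case, which is why more aggressive pruning and parallelization matter for wide tables.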
Papenbrock, Thorsten; Kruse, Sebastian; Quiane-Ruiz, Jorge-Arnulfo; Naumann, Felix
Proceedings of the VLDB Endowment
The discovery of all inclusion dependencies (INDs) in a dataset is an important part of any data profiling effort. Apart from the detection of foreign key relationships, INDs can help to perform data integration, query optimization, integrity checking, or schema (re-)design. However, the detection of INDs gets harder as datasets become larger in terms of number of tuples as well as attributes. To this end, we propose BINDER, an IND detection system that is capable of detecting both unary and n-ary INDs. It is based on a divide & conquer approach, which allows it to handle very large datasets, an important property in the face of the ever increasing size of today’s data. In contrast to most related works, we neither rely on existing database functionality nor assume that inspected datasets fit into main memory. This renders BINDER an efficient and scalable competitor. Our exhaustive experimental evaluation shows the superiority of BINDER over the state-of-the-art in both unary (SPIDER) and n-ary (MIND) IND discovery. BINDER is up to 26x faster than SPIDER and more than 2500x faster than MIND.
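The divide & conquer idea can be sketched for the unary case as follows. This is a simplified illustration and not BINDER's actual implementation: it assumes all columns are given as an in-memory dictionary from column name to values, hash-partitions the values into a fixed number of buckets, and validates IND candidates bucket by bucket. The function name, the num_buckets parameter, and the column names in the usage example are hypothetical. Because each bucket is checked on its own, only one bucket has to be held in memory during validation, which is what would allow a disk-backed variant to process data that does not fit into main memory.

```python
def unary_inds(columns, num_buckets=4):
    """Return all pairs (dep, ref) of distinct columns with dep's values contained in ref's."""
    # Divide: distribute every value into a bucket chosen by its hash, so that
    # equal values from different columns always land in the same bucket.
    buckets = [{} for _ in range(num_buckets)]
    for name, values in columns.items():
        for value in values:
            if value is None:
                continue  # missing values are ignored in this sketch
            bucket = buckets[hash(value) % num_buckets]
            bucket.setdefault(name, set()).add(value)

    # Start with every ordered pair of distinct columns as a candidate.
    candidates = {(a, b) for a in columns for b in columns if a != b}

    # Conquer: a candidate survives only if it holds within every bucket.
    for bucket in buckets:
        for dep, ref in list(candidates):
            if not bucket.get(dep, set()) <= bucket.get(ref, set()):
                candidates.discard((dep, ref))
    return candidates


if __name__ == "__main__":
    data = {
        "orders.customer_id": [1, 2, 2],
        "customers.id": [1, 2, 3],
        "customers.name": ["a", "b", "c"],
    }
    print(unary_inds(data))
```

The result does not depend on how values happen to be distributed over the buckets, since an inclusion holds overall exactly when it holds in each bucket.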