Information Systems Group

The research goal of the Information Systems Group is the efficient and effective management of heterogeneous information in large, autonomous systems. This includes methods for data profiling, data preparation and cleansing, information integration, and metadata management. In addition, we conduct research in text mining and distributed computing. In 2016 we published overview articles on our research in English (SIGMOD Record) and German (Datenbankspektrum).

Our Team

From left to right: Phillip Wenig, Nitisha Jain, Sebastian Schmidl, Mazhar Hameed, Anna Zobel, Felix Naumann, Leon Bornemann, Gerardo Vitagliano, Hazar Harmouch, Alejandro Sierra (not in photo: Tobias Bleifuß, Youri Kaminsky, Sedir Mohammed, Kerstin Neubert, Diana Stephan)

Research

The list below gives an overview of our research topics. Further details can be found on our project and our publications pages. In addition, we maintain a repeatability site to publish code and data.

Data Profiling: When integrating heterogeneous sources, details of the schema, such as keys, functional dependencies, and foreign key dependencies, are often unknown. We are developing efficient and scalable data profiling methods to automatically detect these and other dependencies in very large databases. Our Metanome project collects various high-efficiency methods into a common framework.
Links: Metanome project, Vision paper, tutorial slides, survey, ProLOD++ tool
Completed projects: Spider algorithm, Aladin project
German: ProLOD seminar, seminars, lecture
Change Exploration: Data and metadata undergo many different kinds of change: values are inserted, deleted, or updated; entities appear and disappear; properties are added or repurposed, etc. Explicitly recognizing, exploring, and evaluating such change can alert users to changes in data ingestion procedures, can help assess data quality, and can improve the general understanding of a dataset and its behavior over time.
Links: DBChEx, Project Janus, VLDB vision paper
Data Preparation: Data preparation is a tedious task and accounts for about 80% of the work of data scientists. Our research is concerned with developing automated, easy-to-use data preparation systems and algorithms that cover the different data preparation steps.
Links: Projects page, bibliography
Data quality / information quality: The quality of data is measured in many different dimensions. Quality values can be aggregated along data operations, for instance to calculate the quality of query results.
Duplicate detection: Duplicates are multiple, different representations of the same real-world object, for instance, multiple records of a customer in a CRM database. In duplicate detection, we try to build systems that efficiently and effectively find such duplicates in large data sets.
Links: Synthesis lecture, repeatability, DuDe
German: Duplikaterkennung allgemeinverständlich
Text Mining: The analysis of text data, through which high-quality information can be extracted, is known as text mining. It helps understand, compare, and categorize vast quantities of textual data.
Links: AI4Art
Deep Learning for Natural Language Processing: New ways of representing textual data beyond simple bag-of-words have led to significant performance increases in text mining tasks. Embeddings and deep neural networks are increasingly used for natural language processing and text analytics.
Links: Knowledge Graphs
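
As a rough illustration of the data profiling and duplicate detection topics listed above, the following minimal Python sketch checks whether a functional dependency or a key candidate holds in a small table and finds similar-looking records by pairwise string comparison. It is not code from Metanome or DuDe: all function names, the similarity threshold, and the sample records are hypothetical, and the quadratic pairwise comparison is only meant to show the idea, not the scalable algorithms our research aims for.

```python
from difflib import SequenceMatcher
from itertools import combinations


def holds_fd(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        # The first rhs value seen for a key must match all later ones.
        if seen.setdefault(key, value) != value:
            return False
    return True


def is_unique(rows, columns):
    """Return True if no two rows agree on all given columns (key candidate)."""
    projected = [tuple(row[c] for c in columns) for row in rows]
    return len(set(projected)) == len(projected)


def find_duplicates(rows, column, threshold=0.85):
    """Naive duplicate detection: compare every pair of rows on one column."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(rows), 2):
        sim = SequenceMatcher(None, a[column].lower(), b[column].lower()).ratio()
        if sim >= threshold:
            pairs.append((i, j))
    return pairs


# Hypothetical sample records, invented for this illustration only.
customers = [
    {"id": 1, "name": "Jane Smith", "zip": "14482", "city": "Potsdam"},
    {"id": 2, "name": "Jane Smyth", "zip": "14482", "city": "Potsdam"},
    {"id": 3, "name": "Carlos Ortega", "zip": "10115", "city": "Berlin"},
]

print(holds_fd(customers, ["zip"], ["city"]))
print(is_unique(customers, ["id"]))
print(find_duplicates(customers, "name"))
```

In this sketch, a dependency or key is only checked for one given column combination; discovering all of them in a large database means searching an exponential space of combinations, which is where dedicated profiling algorithms come in.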

Teaching

Bachelor: We regularly offer German-language lectures in database systems, namely Datenbanksysteme I (DBS I) and Datenbanksysteme II (DBS II). In addition, we offer a regular introductory seminar on selected database topics, and other occasional project-oriented seminars.
One-year bachelor projects with 6-8 students finalize bachelor studies at HPI. Our group offers one or two such projects per year in cooperation with external partners.
Master: We frequently offer courses in "Information Integration", "Data Profiling", "Distributed Data Management", "Search Engines", and "Information Retrieval". In addition, we offer diverse specialized seminars, some theoretical, some project-oriented.
Half-year master projects with 3-6 students examine a specific research question, usually resulting in a submission to an international conference. Half-year master's theses are the final step before graduation.



News

03.04.2024 | Congratulations on the EDBT Best Paper Award!

05.03.2024 | Another paper marked as reproducible by the pVLDB Reproducibility Committee

21.01.2024 | Paper accepted at W-NUT 2024

19.12.2023 | Congratulations Dr. Gerardo Vitagliano!

13.12.2023 | Two papers accepted at EDBT Conference 2024

Project highlights

People and open positions