Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Information Systems Group

The research goal of the Information Systems Group is the efficient and effective management of heterogeneous information in large, autonomous systems. This includes methods for data profiling, data preparation and cleansing, information integration, and metadata management. In addition, we perform research in the area of text mining and distributed computing. In 2016 we published overview articles of our research in English (SIGMOD Record) and German (Datenbankspektrum).

Our Team

From left to right: Phillip Wenig, Nitisha Jain, Sebastian Schmidl, Mazhar Hameed, Anna Zobel, Felix Naumann, Leon Bornemann, Gerardo Vitagliano, Hazar Harmouch, Alejandro Sierra (not in photo: Tobias Bleifuß, Youri Kaminsky, Sedir Mohammed, Kerstin Neubert, Diana Stephan)

Research

The list below gives an overview of our research topics. Further details can be found on our project and our publications pages. In addition, we maintain a repeatability site to publish code and data.

  • Data Profiling: When integrating heterogeneous sources, details of the schema, such as keys, functional dependencies, and foreign key dependencies, are often unknown. We are developing efficient and scalable data profiling methods to automatically detect these and other dependencies in very large databases. Our Metanome project collects various high-efficiency methods into a common framework.
    Links: Metanome project, Vision paper, tutorial slides, survey, ProLOD++ tool
    Completed projects: Spider algorithm, Aladin project
    German: ProLOD seminar, seminars, lecture
  • Change Exploration: Data and metadata undergo many different kinds of change: values are inserted, deleted, or updated; entities appear and disappear; properties are added or re-purposed, etc. Explicitly recognizing, exploring, and evaluating such change can alert users to changes in data ingestion procedures, can help assess data quality, and can improve the general understanding of the dataset and its behavior over time.
    Links: DBChEx, Project Janus, VLDB vision paper
  • Data Preparation: Data preparation is a tedious task that is commonly estimated to account for about 80% of the work of data scientists. Our research develops automated, easy-to-use data preparation systems and algorithms that cover the different data preparation steps.
    Links: Projects page, bibliography
  • Data quality / information quality: The quality of data is measured in many different dimensions. Quality values can be aggregated along data operations, for instance to calculate the quality of query results.
  • Duplicate detection: Duplicates are multiple, different representations of the same real-world object, for instance, multiple records of a customer in a CRM database. In our duplicate detection research, we build systems that efficiently and effectively find such duplicates in large data sets.
    Links: Synthesis lecture, repeatability, DuDe
    German: Duplikaterkennung allgemeinverständlich (duplicate detection explained for a general audience)
  • Text Mining: The analysis of text data, through which high-quality information can be extracted, is known as text mining. It helps understand, compare, and categorize vast quantities of textual data.
    Links: AI4Art
  • Deep Learning for Natural Language Processing: New ways of representing textual data beyond simple bag-of-words have led to significant performance increases in text mining tasks. Embeddings and deep neural networks are increasingly used for natural language processing and text analytics.
    Links: Knowledge Graphs

Teaching

  • Bachelor: We offer regular German-language lectures in database systems, namely Datenbanksysteme I (DBS I) and Datenbanksysteme II (DBS II). In addition, we offer a regular introductory seminar on selected database topics, and other occasional project-oriented seminars.
    One-year bachelor projects with 6-8 students finalize bachelor studies at HPI. Our group offers one or two such projects per year in cooperation with external partners.
  • Master: We frequently offer courses in "Information Integration", "Data Profiling", "Distributed Data Management", "Search Engines", and "Information Retrieval". In addition, we offer diverse specialized seminars, some theoretical, some project-oriented.
    Half-year master projects with 3-6 students examine a specific research question, usually resulting in a submission to an international conference. Half-year master's theses are the final step before graduation.