Prof. Dr. Felix Naumann

Information Systems Group

The research goal of the Information Systems Group is the efficient and effective management of heterogeneous information in large, autonomous systems. This includes methods for data profiling, data preparation and cleansing, information integration, and metadata management. In addition, we perform research in the areas of web science, text mining, and distributed computing. In 2016 we published overview articles about our research in English (SIGMOD Record) and German (Datenbankspektrum).

Our Team

Research topics

The list below gives an overview of our research topics. Further details can be found on our project and our publications pages. In addition, we maintain a repeatability site to publish code and data.

Data Management

  • Data Profiling: When integrating heterogeneous sources, details of the schema, such as keys, functional dependencies, and foreign key dependencies, are often unknown. We are developing efficient and scalable data profiling methods to automatically detect these and other dependencies in very large databases. Our Metanome project collects various high-efficiency methods into a common framework.
    Links: Metanome project, vision paper, tutorial slides, survey, ProLOD++ tool
    Completed projects: Spider algorithm, Aladin project
    German: ProLOD
    Teaching: seminar, advanced data profiling seminar, lecture
  • Change Exploration: Data and metadata undergo many different kinds of change: values are inserted, deleted, or updated; entities appear and disappear; properties are added or re-purposed; etc. Explicitly recognizing, exploring, and evaluating such change can alert users to changes in data-ingestion procedures, help assess data quality, and improve the general understanding of a dataset and its behavior over time.
    Links: DBChEx, Project Janus, VLDB vision paper
  • Data Preparation: Data preparation is a tedious task and accounts for about 80% of the work of data scientists. We are developing an easy-to-use data preparation system that supports many different preparation steps.
    Links: Project page, bibliography
  • Data quality / information quality: The quality of data is measured in many different dimensions. Quality values can be aggregated along data operations, for instance to calculate the quality of query results.
  • Duplicate detection: Duplicates are multiple, different representations of the same real-world object, for instance, multiple records of a customer in a CRM database. In duplicate detection, we build systems that efficiently and effectively find such duplicates in large data sets.
    Links: Synthesis lecture, repeatability, DuDe
    German: Duplikaterkennung allgemeinverständlich
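
    The data-profiling idea above can be illustrated with a toy example. The sketch below naively checks whether a functional dependency holds in a small table; it is for illustration only, and the table and column names are invented — the actual Metanome algorithms are far more efficient and scale to very large databases.

    ```python
    def holds_fd(rows, lhs, rhs):
        """Check whether the functional dependency lhs -> rhs holds in `rows`
        (a list of dicts): equal lhs values must imply equal rhs values."""
        seen = {}
        for row in rows:
            key = tuple(row[col] for col in lhs)
            value = tuple(row[col] for col in rhs)
            # setdefault records the first rhs value seen for this lhs value;
            # any later, different rhs value violates the dependency.
            if seen.setdefault(key, value) != value:
                return False
        return True

    rows = [
        {"zip": "14482", "city": "Potsdam", "name": "Ada"},
        {"zip": "14482", "city": "Potsdam", "name": "Ben"},
        {"zip": "10115", "city": "Berlin",  "name": "Ada"},
    ]
    print(holds_fd(rows, ["zip"], ["city"]))   # zip -> city holds: True
    print(holds_fd(rows, ["name"], ["zip"]))   # name -> zip violated: False
    ```

    Checking a single candidate dependency is easy; the research challenge addressed by Metanome is searching the exponentially large space of all candidate dependencies efficiently.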

Web Science

  • Text Mining: The analysis of text data, through which high-quality information can be extracted, is known as text mining. It helps understand, compare, and categorize vast quantities of textual data. Links: Comment Analysis, Patent Classification
  • Information Retrieval: Providing access to information was for a long time the task of libraries. With the rise of the Web, search engines became a tool for everyone to use every day. Information retrieval deals with searching and finding information not only on the Web, but also in digital libraries and other information systems. Links: Knowledge Graphs
  • Recommender Systems: With the huge amount of information available today, recommender systems play an increasingly important role in everyday life. They enable personalized filtering of, e.g., news, products, or Web content. Links: Book Recommendation
  • Social Network Analysis: Social networks, such as Facebook or Twitter, but also email, connect people and content with each other. Understanding these connections and the flow of information in a network is relevant for many application areas, e.g. advertisement, emergency response, or community detection. Links: Corpus Exploration
  • Deep Learning and Topic Models: New ways of representing textual data beyond simple bags of words have led to significant performance increases in text-mining tasks. Embeddings and deep neural networks are increasingly used for natural language processing and text analytics, and topic models are used to summarize huge document collections. Links: Topic Models
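
The bag-of-words representation mentioned above can be sketched in a few lines. The example below compares two documents by the cosine similarity of their term-frequency vectors; it is a minimal illustration with invented example sentences, not one of the group's actual text-mining systems, which use much richer representations such as embeddings.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity of two documents' bag-of-words (term-frequency) vectors."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Dot product over the terms the two documents share.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

sim = cosine_similarity("data profiling on large databases",
                        "profiling large heterogeneous databases")
print(round(sim, 2))  # prints 0.67: the documents share three of their terms
```

A similarity score like this is a basic building block for text-mining tasks such as categorizing documents or finding near-duplicate texts.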

Teaching

  • Bachelor: We offer regular German lectures on database systems, namely Datenbanksysteme I (DBS I) and Datenbanksysteme II (DBS II). In addition, we offer a regular introductory seminar on selected database topics and other occasional project-oriented seminars.
    One-year bachelor projects with 6-8 students conclude the bachelor's studies at HPI. Our group offers one or two such projects per year in cooperation with external partners.
  • Master: We frequently offer courses in "Information Integration", "Data Profiling", "Distributed Data Management", "Search Engines", and "Information Retrieval". In addition, we offer diverse specialized seminars, some theoretical and some project-oriented.
    Half-year master projects with 3-6 students examine a specific research question, usually resulting in a submission to an international conference. Half-year master's theses are the final step before graduation.