Prof. Dr. Felix Naumann

Thorsten Papenbrock

Research Assistant, PhD Candidate

für Softwaresystemtechnik
Prof.-Dr.-Helmert-Straße 2-3
D-14482 Potsdam
Room: G-3.1.09


Phone: +49 331 5509 294
Email: thorsten.papenbrock(a)hpi.de
Profiles: Xing
Research: GoogleScholar, DBLP, ResearchGate



Research Interests

Data Profiling:

Solving computationally complex tasks is a central challenge in data profiling. It primarily involves discovering metadata in datasets that are many gigabytes in size, which is why algorithms developed for this purpose need to be efficient and robust. Because data profiling offers a plethora of challenging, as yet unsolved tasks, I have chosen it as my primary research area. I am particularly interested in the discovery of data dependencies, such as inclusion dependencies, unique column combinations, functional dependencies, order dependencies, and matching dependencies.

Data Cleansing:

Data is one of the most important assets in any company. Therefore, it is crucial to ensure its quality and reliability. Data cleansing and data profiling are two essential tasks that - if performed correctly and frequently - help to guarantee data fitness. In this area, I am particularly interested in (semi-)automatic duplicate detection methods and normalization techniques as well as their efficient implementation.

Parallel and Distributed Systems:

Due to the complexity of many tasks in IT, a clever algorithm alone often cannot deliver a solution in time. In these cases, parallel and distributed systems are needed. Especially when facing ever larger datasets, i.e., big data, we need to consider technologies such as map-reduce (e.g., Spark and Flink), actors (e.g., Akka), and GPUs (e.g., CUDA and OpenCL) to build scalability into our solutions.



Teaching

Lectures:

  • Database Systems I (2013, 2014, 2015, 2016, 2017)
  • Database Systems II (2013)
  • Data Profiling and Data Cleansing (2014)
  • Information Integration (2015)
  • Data Profiling (2017)


Seminars:

  • Advanced Data Profiling (2013)
  • Proseminar Information Systems (2014)

Bachelor Projects:

  • Data Refinery - Scalable Offer Processing with Apache Spark (2015/2016)

Master Projects:

  • Joint Data Profiling - Holistic Discovery of INDs, FDs, and UCCs (2013)
  • Metadata Trawling - Interpreting Data Profiling Results (2014)
  • Approximate Data Profiling - Efficient Discovery of approximate INDs and FDs (2015)
  • Profiling Dynamic Data - Maintaining Metadata under Inserts, Updates, and Deletes (2016)

Master Theses:

  • Discovering Matching Dependencies (Andrina Mascher, 2013)
  • Discovery of Conditional Unique Column Combination (Jens Ehrlich, 2014)
  • Spinning a Web of Tables through Inclusion Dependencies (Fabian Tschirschnitz, 2014)
  • Multivalued Dependency Detection (Tim Draeger, 2016)

Online Courses:

  • Datenmanagement mit SQL (openHPI, 2013)


Publications

Divide & Conquer-based Inclusion Dependency Discovery

Papenbrock, Thorsten; Kruse, Sebastian; Quiane-Ruiz, Jorge-Arnulfo; Naumann, Felix. In: Proceedings of the VLDB Endowment, 2015.

The discovery of all inclusion dependencies (INDs) in a dataset is an important part of any data profiling effort. Apart from the detection of foreign key relationships, INDs can help to perform data integration, query optimization, integrity checking, or schema (re-)design. However, the detection of INDs gets harder as datasets become larger in terms of both the number of tuples and the number of attributes. To address this, we propose BINDER, an IND detection system that is capable of detecting both unary and n-ary INDs. It is based on a divide & conquer approach, which allows it to handle very large datasets, an important property in the face of the ever increasing size of today's data. In contrast to most related work, we neither rely on existing database functionality nor assume that the inspected datasets fit into main memory. This renders BINDER an efficient and scalable competitor. Our exhaustive experimental evaluation shows the superiority of BINDER over the state of the art in both unary (SPIDER) and n-ary (MIND) IND discovery. BINDER is up to 26x faster than SPIDER and more than 2500x faster than MIND.
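
To make the divide & conquer idea more concrete, the following is a minimal Python sketch of unary IND discovery. It is not BINDER's implementation: it keeps all buckets in memory, whereas the abstract stresses that BINDER does not assume the data fits into main memory. The function name `discover_unary_inds`, the `num_buckets` parameter, and the sample column names are assumptions made purely for illustration. The sketch hash-partitions all values into buckets (divide) and then checks each bucket independently, removing every candidate that a value contradicts (conquer).

```python
from collections import defaultdict


def discover_unary_inds(columns, num_buckets=4):
    """Return all pairs (dep, ref) where every value of column `dep`
    also occurs in column `ref`.

    `columns` maps a column name to an iterable of its values.
    Illustrative sketch only; names and parameters are hypothetical.
    """
    names = list(columns)
    # Initially, every column is a candidate for inclusion in every other one.
    candidates = {a: {b for b in names if b != a} for a in names}

    # Divide: hash-partition the values so that equal values from
    # different columns always land in the same bucket.
    buckets = [defaultdict(set) for _ in range(num_buckets)]
    for name, values in columns.items():
        for value in values:
            if value is None:
                continue  # this sketch ignores NULLs
            buckets[hash(value) % num_buckets][value].add(name)

    # Conquer: validate each bucket on its own. A value that occurs in
    # column a but not in column b refutes the candidate a ⊆ b.
    for bucket in buckets:
        for attrs in bucket.values():
            for a in attrs:
                candidates[a] &= attrs

    return sorted((a, b) for a, refs in candidates.items() for b in refs)


sample = {
    "orders.customer_id": [1, 2, 2, 3],
    "customers.id": [1, 2, 3, 4],
    "customers.name": ["a", "b", "c", "d"],
}
print(discover_unary_inds(sample))
# prints [('orders.customer_id', 'customers.id')]
```

Because each bucket is checked independently, the result does not depend on the number of buckets. That independence is what would let a disk-based variant load and process one bucket at a time.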