Hasso-Plattner-Institut
  
Prof. Dr. Felix Naumann
  
 

Thorsten Papenbrock

Research Assistant, PhD Candidate

Hasso-Plattner-Institut
für Softwaresystemtechnik
Prof.-Dr.-Helmert-Straße 2-3
D-14482 Potsdam
Room: E-2-01.2

 

Phone: +49 331 5509 294
Email:  thorsten.papenbrock(a)hpi.de
Profiles: Xing
Research: GoogleScholar, DBLP, ResearchGate


Projects

Metanome

Research Interests

Data Profiling:

Solving computationally complex tasks is a challenge and a central activity in data profiling. This primarily involves the discovery of metadata in datasets that are many gigabytes in size, which is why the algorithms developed for this purpose need to be efficient and robust. Because data profiling offers such a plethora of challenging, as yet unsolved tasks, I have chosen it as my primary research area. I am particularly interested in the discovery of data dependencies, such as inclusion dependencies, unique column combinations, functional dependencies, order dependencies, matching dependencies, and many more.

Data Cleansing:

Data is one of the most important assets in any company. Therefore, it is crucial to ensure its quality and reliability. Data cleansing and data profiling are two essential tasks that, if performed correctly and frequently, help to guarantee data fitness. In this area, I am particularly interested in (semi-)automatic duplicate detection methods and normalization techniques as well as their efficient implementation.

Parallel and Distributed Systems:

Due to the complexity of many tasks in IT, a clever algorithm alone is often not able to deliver a solution in time. In these cases, parallel and distributed systems are needed. Especially when facing ever larger datasets, i.e., big data, we need to consider technologies such as MapReduce-style processing (e.g., Spark and Flink), actors (e.g., Akka), and GPUs (e.g., CUDA and OpenCL) to make our solutions scalable.

Teaching

Lectures:

Seminars:

  • Advanced Data Profiling (2013)
  • Proseminar Information Systems (2014)

Bachelor Projects:

  • Data Refinery - Scalable Offer Processing with Apache Spark (2015/2016)

Master Projects:

  • Joint Data Profiling - Holistic Discovery of INDs, FDs, and UCCs (2013)
  • Metadata Trawling - Interpreting Data Profiling Results (2014)
  • Approximate Data Profiling - Efficient Discovery of approximate INDs and FDs (2015)
  • Profiling Dynamic Data - Maintaining Metadata under Inserts, Updates, and Deletes (2016)

Master Theses:

  • Discovering Matching Dependencies (Andrina Mascher, 2013)
  • Discovery of Conditional Unique Column Combination (Jens Ehrlich, 2014)
  • Spinning a Web of Tables through Inclusion Dependencies (Fabian Tschirschnitz, 2014)
  • Multivalued Dependency Detection (Tim Dräger, 2016)

Online Courses:

  • Datenmanagement mit SQL (openHPI, 2013)

Publications

    Data Profiling with Metanome (demo)

    Thorsten Papenbrock, Tanja Bergmann, Moritz Finke, Jakob Zwiener, Felix Naumann
    Proceedings of the VLDB Endowment, vol. 8(12):1860-1871, 2015

    Abstract:

    Data profiling is the discipline of discovering metadata about given datasets. The metadata itself serve a variety of use cases, such as data integration, data cleansing, or query optimization. Due to the importance of data profiling in practice, many tools have emerged that support data scientists and IT professionals in this task. These tools provide good support for profiling statistics that are easy to compute, but they are usually lacking automatic and efficient discovery of complex statistics, such as inclusion dependencies, unique column combinations, or functional dependencies. We present Metanome, an extensible profiling platform that incorporates many state-of-the-art profiling algorithms. While Metanome is able to calculate simple profiling statistics in relational data, its focus lies on the automatic discovery of complex metadata. Metanome’s goal is to provide novel profiling algorithms from research, perform comparative evaluations, and to support developers in building and testing new algorithms. In addition, Metanome is able to rank profiling results according to various metrics and to visualize the at times large metadata sets.

    Keywords:

    metanome, profiling, hpi

    BibTeX file

    @article{papenbrock2015metanome,
    author = { Thorsten Papenbrock and Tanja Bergmann and Moritz Finke and Jakob Zwiener and Felix Naumann },
    title = { Data Profiling with Metanome (demo) },
    journal = { Proceedings of the VLDB Endowment },
    year = { 2015 },
    volume = { 8 },
    number = { 12 },
    pages = { 1860--1871 },
    month = { 0 },
    abstract = { Data profiling is the discipline of discovering metadata about given datasets. The metadata itself serve a variety of use cases, such as data integration, data cleansing, or query optimization. Due to the importance of data profiling in practice, many tools have emerged that support data scientists and IT professionals in this task. These tools provide good support for profiling statistics that are easy to compute, but they are usually lacking automatic and efficient discovery of complex statistics, such as inclusion dependencies, unique column combinations, or functional dependencies. We present Metanome, an extensible profiling platform that incorporates many state-of-the-art profiling algorithms. While Metanome is able to calculate simple profiling statistics in relational data, its focus lies on the automatic discovery of complex metadata. Metanome’s goal is to provide novel profiling algorithms from research, perform comparative evaluations, and to support developers in building and testing new algorithms. In addition, Metanome is able to rank profiling results according to various metrics and to visualize the at times large metadata sets. },
    keywords = { metanome,profiling,hpi },
    publisher = { VLDB Endowment },
    booktitle = { Proceedings of the International Conference on Very Large Data Bases (PVLDB) },
    issn = { 2150-8097 },
    priority = { 0 }
    }

    Copyright Notice

    This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

    last change: Tue, 12 Apr 2016 15:38:29 +0200