Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Dr. Thorsten Papenbrock

Senior Researcher
Head of the Distributed Computing group

Hasso-Plattner-Institut
für Softwaresystemtechnik
Prof.-Dr.-Helmert-Straße 2-3
D-14482 Potsdam
Office: F-2.04, Campus II

 

Phone: +49 331 5509 294
Email:  thorsten.papenbrock(a)hpi.de
Profiles: Xing, LinkedIn
Research: ORCID, GoogleScholar, DBLP, ResearchGate

Dissertation: Data Profiling - Efficient Discovery of Dependencies


Projects

Metanome

Research Interests

Technology Interests

  • Data flow engines

  • Message passing systems

  • Parallel hardware toolkits

Teaching

Lectures:

  • Distributed Data Management (2018, 2019)
  • Distributed Data Analytics (2017)
  • Data Profiling (2017)
  • Information Integration (2015)
  • Data Profiling and Data Cleansing (2014)
  • Database Systems I (2013, 2014, 2015, 2016, 2017)
  • Database Systems II (2013)

Seminars:

  • Reliable Distributed Systems Engineering (2019)
  • Mining Streaming Data (2019)
  • Actor Database Systems (2018)
  • Proseminar Information Systems (2014)
  • Advanced Data Profiling (2013, 2017)

Bachelor Projects:

  • Data Refinery - Scalable Offer Processing with Apache Spark (2015/2016)

Master Projects:

  • Profiling Dynamic Data - Maintaining Metadata under Inserts, Updates, and Deletes (2016)
  • Approximate Data Profiling - Efficient Discovery of approximate INDs and FDs (2015)
  • Metadata Trawling - Interpreting Data Profiling Results (2014)
  • Joint Data Profiling - Holistic Discovery of INDs, FDs, and UCCs (2013)

Master Theses:

  • Distributed Unique Column Combination Discovery (Benjamin Feldmann, 2019)
  • Reactive Inclusion Dependency Discovery (Frederic Schneider, 2019)
  • Inclusion Dependency Discovery on Streaming Data (Alexander Preuss, 2019)
  • Generating Data for Functional Dependency Profiling (Jennifer Stamm, 2018)
  • Efficient Detection of Genuine Approximate Functional Dependencies (Moritz Finke, 2018)
  • Efficient Discovery of Matching Dependencies (Philipp Schirmer, 2017)
  • Discovering Interesting Conditional Functional Dependencies (Maximilian Grundke, 2017)
  • Multivalued Dependency Detection (Tim Draeger, 2016)
  • Spinning a Web of Tables through Inclusion Dependencies (Fabian Tschirschnitz, 2014)
  • Discovery of Conditional Unique Column Combination (Jens Ehrlich, 2014)
  • Discovering Matching Dependencies (Andrina Mascher, 2013)

Online Courses:

  • Datenmanagement mit SQL (openHPI, 2013)

Publications

Data Profiling with Metanome

Papenbrock, Thorsten; Bergmann, Tanja; Finke, Moritz; Zwiener, Jakob; Naumann, Felix, in Proceedings of the VLDB Endowment, 2015.

Data profiling is the discipline of discovering metadata about given datasets. The metadata itself serves a variety of use cases, such as data integration, data cleansing, or query optimization. Due to the importance of data profiling in practice, many tools have emerged that support data scientists and IT professionals in this task. These tools provide good support for profiling statistics that are easy to compute, but they usually lack automatic and efficient discovery of complex statistics, such as inclusion dependencies, unique column combinations, or functional dependencies. We present Metanome, an extensible profiling platform that incorporates many state-of-the-art profiling algorithms. While Metanome is able to calculate simple profiling statistics in relational data, its focus lies on the automatic discovery of complex metadata. Metanome’s goal is to provide novel profiling algorithms from research, to perform comparative evaluations, and to support developers in building and testing new algorithms. In addition, Metanome is able to rank profiling results according to various metrics and to visualize the metadata sets, which are at times large.
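
To make the three kinds of complex metadata named in the abstract concrete, the following is a minimal, naive sketch in Python. It is not Metanome's API and not one of its algorithms; the toy tables, column names, and function names (is_unique, holds_fd, holds_ind, minimal_uccs) are all assumptions made for illustration. It only shows what it means for a unique column combination (UCC), a functional dependency (FD), or an inclusion dependency (IND) to hold on a small relation.

```python
from itertools import combinations

# Hypothetical toy relations; all names and values are made up for illustration.
people = [
    {"id": 1, "zip": "14482", "city": "Potsdam"},
    {"id": 2, "zip": "14482", "city": "Potsdam"},
    {"id": 3, "zip": "10115", "city": "Berlin"},
]
zip_codes = [{"zip": "14482"}, {"zip": "10115"}, {"zip": "80331"}]


def is_unique(rows, cols):
    """UCC check: no two rows agree on all columns in cols."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            return False
        seen.add(key)
    return True


def holds_fd(rows, lhs, rhs):
    """FD check lhs -> rhs: equal lhs values always imply equal rhs values."""
    mapping = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        if mapping.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def holds_ind(dep_rows, dep_col, ref_rows, ref_col):
    """IND check: every value of dep_col also occurs in ref_col."""
    return {r[dep_col] for r in dep_rows} <= {r[ref_col] for r in ref_rows}


def minimal_uccs(rows):
    """Brute-force enumeration of minimal UCCs (exponential in the column count)."""
    cols = sorted(rows[0])
    found = []
    for size in range(1, len(cols) + 1):
        for combo in combinations(cols, size):
            # Skip supersets of an already found UCC; they are not minimal.
            if any(set(u) <= set(combo) for u in found):
                continue
            if is_unique(rows, combo):
                found.append(combo)
    return found


print(minimal_uccs(people))
print(holds_fd(people, ["zip"], "city"))
print(holds_ind(people, "zip", zip_codes, "zip"))
```

The brute-force enumeration checks every column combination, which is exactly the exponential search space that makes efficient discovery algorithms necessary on real datasets.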
Tags: basic_statistics, functional_dependencies, hpi, inclusion_dependencies, isg, metanome, multivalued_dependencies, order_dependencies, profiling, unique_column_combinations