Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Thorsten Papenbrock

Research Assistant, PhD Candidate

Hasso-Plattner-Institut
für Softwaresystemtechnik
Prof.-Dr.-Helmert-Straße 2-3
D-14482 Potsdam
Room: G-3.1.09

 

Phone: +49 331 5509 294
Email:  thorsten.papenbrock(a)hpi.de
Profiles: Xing
Research: Google Scholar, DBLP, ResearchGate


Projects

Metanome

Research Interests

Data Profiling:

Solving computationally complex tasks is the central challenge in data profiling. The main task is discovering metadata in datasets that are many gigabytes in size, so the algorithms developed for this purpose must be efficient and robust. Because data profiling offers a plethora of challenging, still unsolved tasks, I have chosen it as my primary research area. I am particularly interested in the discovery of data dependencies, such as inclusion dependencies, unique column combinations, functional dependencies, order dependencies, and matching dependencies.

Data Cleansing:

Data is one of the most important assets in any company, so it is crucial to ensure its quality and reliability. Data cleansing and data profiling are two essential tasks that, if performed correctly and frequently, help to guarantee data fitness. In this area, I am particularly interested in (semi-)automatic duplicate detection methods and normalization techniques, as well as their efficient implementation.

Parallel and Distributed Systems:

Many tasks in IT are so complex that a clever algorithm alone often cannot deliver a solution in time. In these cases, parallel and distributed systems are needed. Especially when facing ever larger datasets, i.e., big data, we need to consider technologies such as MapReduce-style dataflow engines (e.g., Spark and Flink), actors (e.g., Akka), and GPUs (e.g., CUDA and OpenCL) to make our solutions scalable.

Teaching

Lectures:

  • Database Systems I (2013, 2014, 2015, 2016, 2017)
  • Database Systems II (2013)
  • Data Profiling and Data Cleansing (2014)
  • Information Integration (2015)
  • Data Profiling (2017)
  • Distributed Data Analytics (2017)

Seminars:

  • Advanced Data Profiling (2013, 2017)
  • Proseminar Information Systems (2014)

Bachelor Projects:

  • Data Refinery - Scalable Offer Processing with Apache Spark (2015/2016)

Master Projects:

  • Joint Data Profiling - Holistic Discovery of INDs, FDs, and UCCs (2013)
  • Metadata Trawling - Interpreting Data Profiling Results (2014)
  • Approximate Data Profiling - Efficient Discovery of approximate INDs and FDs (2015)
  • Profiling Dynamic Data - Maintaining Metadata under Inserts, Updates, and Deletes (2016)

Master Theses:

  • Discovering Matching Dependencies (Andrina Mascher, 2013)
  • Discovery of Conditional Unique Column Combination (Jens Ehrlich, 2014)
  • Spinning a Web of Tables through Inclusion Dependencies (Fabian Tschirschnitz, 2014)
  • Multivalued Dependency Detection (Tim Draeger, 2016)
  • Discovery Algorithms for Conditional Functional Dependencies (Maximilian Grundke, 2017)
  • Discovering Matching Dependencies (Philipp Schirmer, 2017)

Online Courses:

  • Datenmanagement mit SQL (openHPI, 2013)

Publications

    • Tschirschnitz, F., Papenbrock, T., Naumann, F.: Detecting Inclusion Dependencies on Very Many Tables. ACM Transactions on Database Systems (TODS). 42, 18:1-18:29 (2017).
       
    • Kruse, S., Papenbrock, T., Dullweber, C., Finke, M., Hegner, M., Zabel, M., Zöllner, C., Naumann, F.: Fast Approximate Discovery of Inclusion Dependencies. Proceedings of the conference on Database Systems for Business, Technology, and Web (BTW). pp. 207-226 (2017).
       
    • Papenbrock, T., Naumann, F.: A Hybrid Approach for Efficient Unique Column Combination Discovery. Proceedings of the conference on Database Systems for Business, Technology, and Web (BTW). pp. 195-204 (2017).
       
    • Papenbrock, T., Naumann, F.: Data-driven Schema Normalization. Proceedings of the International Conference on Extending Database Technology (EDBT). pp. 342-353 (2017).
       
    • Ehrlich, J., Roick, M., Schulze, L., Zwiener, J., Papenbrock, T., Naumann, F.: Holistic Data Profiling: Simultaneous Discovery of Various Metadata. Proceedings of the conference on Database Systems for Business, Technology, and Web (BTW). pp. 305-316. OpenProceedings.org (2016).
       
    • Papenbrock, T., Naumann, F.: A Hybrid Approach to Functional Dependency Discovery. Proceedings of the International Conference on Management of Data (SIGMOD). pp. 821-833. ACM, New York, NY, USA (2016).
       
    • Bleifuß, T., Bülow, S., Frohnhofen, J., Risch, J., Wiese, G., Kruse, S., Papenbrock, T., Naumann, F.: Approximate Discovery of Functional Dependencies for Large Datasets. Proceedings of the International Conference on Information and Knowledge Management (CIKM). pp. 1803-1812. ACM, New York, NY, USA (2016).
       
    • Kruse, S., Papenbrock, T., Harmouch, H., Naumann, F.: Data Anamnesis: Admitting Raw Data into an Organization. IEEE Data Engineering Bulletin. 39, 8-20 (2016).
       
    • Kruse, S., Jentzsch, A., Papenbrock, T., Kaoudi, Z., Quiane-Ruiz, J.-A., Naumann, F.: RDFind: Scalable Conditional Inclusion Dependency Discovery in RDF Datasets. Proceedings of the International Conference on Management of Data (SIGMOD). pp. 953-967. ACM, New York, NY, USA (2016).
       
    • Kruse, S., Papenbrock, T., Naumann, F.: Scaling Out the Discovery of Inclusion Dependencies. Proceedings of the conference on Database Systems for Business, Technology, and Web (BTW). pp. 445-454 (2015).
       
    • Papenbrock, T., Heise, A., Naumann, F.: Progressive Duplicate Detection. IEEE Transactions on Knowledge and Data Engineering (TKDE). 27, 1316-1329 (2015).
       
    • Papenbrock, T., Bergmann, T., Finke, M., Zwiener, J., Naumann, F.: Data Profiling with Metanome. Proceedings of the VLDB Endowment. 8, 1860-1871 (2015).
       
    • Papenbrock, T., Ehrlich, J., Marten, J., Neubert, T., Rudolph, J.-P., Schönberg, M., Zwiener, J., Naumann, F.: Functional Dependency Discovery: An Experimental Evaluation of Seven Algorithms. Proceedings of the VLDB Endowment. 8, 1082-1093 (2015).
       
    • Papenbrock, T., Kruse, S., Quiane-Ruiz, J.-A., Naumann, F.: Divide & Conquer-based Inclusion Dependency Discovery. Proceedings of the VLDB Endowment. 8, 774-785 (2015).
       
    • Naumann, F., Jenders, M., Papenbrock, T.: Ein Datenbankkurs mit 6000 Teilnehmern - Erfahrungen auf der openHPI MOOC Plattform. Informatik-Spektrum. 37, 333-340 (2013).
       
    • Forchhammer, B., Papenbrock, T., Stening, T., Viehmeier, S., Draisbach, U., Naumann, F.: Duplicate Detection on GPUs. Proceedings of the conference on Database Systems for Business, Technology, and Web (BTW). pp. 165-184 (2013).
       
    • Lorey, J., Naumann, F., Forchhammer, B., Mascher, A., Retzlaff, P., ZamaniFarahani, A., Discher, S., Faehnrich, C., Lemme, S., Papenbrock, T., Peschel, R.C., Richter, S., Stening, T., Viehmeier, S.: Black Swan: Augmenting Statistics with Event Data. Proceedings of the 20th Conference on Information and Knowledge Management (CIKM). pp. 2517-2520. Glasgow, UK (2011).