Dr. Thorsten Papenbrock

Professor (at the University of Marburg)
Head of the Distributed Computing group

Hasso-Plattner-Institut
für Softwaresystemtechnik
Prof.-Dr.-Helmert-Straße 2-3
D-14482 Potsdam

Current affiliation: University of Marburg (Website)

Email: thorsten.papenbrock(a)hpi.de
Profiles: Xing, LinkedIn
Research: ORCID, GoogleScholar, DBLP, ResearchGate

Dissertation: Data Profiling - Efficient Discovery of Dependencies

Projects

Metanome

Research Interests

Complex data engineering problems
- e.g. data profiling, data cleaning, and data integration
Parallel and distributed computing challenges
- e.g. robustness, efficiency, and elasticity

Technology Interests

Data flow engines
- e.g. map-reduce derivatives (Spark and Flink)
Message passing systems
- e.g. actor model toolkits (Akka and Orleans) or message queues (Kafka)
Parallel hardware toolkits
- e.g. GPU programming libraries (CUDA and OpenCL)

Teaching

Lectures:

Distributed Data Management (2018, 2019, 2020, 2021)
Distributed Data Analytics (2017)
Data Profiling (2017)
Information Integration (2015)
Data Profiling and Data Cleansing (2014)
Database Systems I (2013, 2014, 2015, 2016, 2017)
Database Systems II (2013)

Seminars:

Sustainable Machine Learning on Edge Device Clusters (2020)
Machine Learning for Data Streams (2019)
Reliable Distributed Systems Engineering (2019)
Mining Streaming Data (2019)
Actor Database Systems (2018)
Proseminar Information Systems (2014)
Advanced Data Profiling (2013, 2017)

Bachelor Projects:

UltraMine - Scalable Analytics on Time Series Data (2020/2021)
DataRefinery - Scalable Offer Processing with Apache Spark (2015/2016)

Master Projects:

Profiling Dynamic Data - Maintaining Metadata under Inserts, Updates, and Deletes (2016)
Approximate Data Profiling - Efficient Discovery of approximate INDs and FDs (2015)
Metadata Trawling - Interpreting Data Profiling Results (2014)
Joint Data Profiling - Holistic Discovery of INDs, FDs, and UCCs (2013)

Master Theses:

Distributed Duplicate Detection on Streaming Data (Jakob Köhler, 2021)
Distributed Graph Based Approximate Nearest Neighbor Search (Juliane Waack, 2020)
A2DB: A Reactive Database for Theta-Joins (Julian Weise, 2020)
Distributed Detection of Sequential Anomalies in Time Related Sequences (Johannes Schneider, 2020)
Efficient Distributed Discovery of Bidirectional Order Dependencies (Sebastian Schmidl, 2020)
Distributed Unique Column Combination Discovery (Benjamin Feldmann, 2019)
Reactive Inclusion Dependency Discovery (Frederic Schneider, 2019)
Inclusion Dependency Discovery on Streaming Data (Alexander Preuss, 2019)
Generating Data for Functional Dependency Profiling (Jennifer Stamm, 2018)
Efficient Detection of Genuine Approximate Functional Dependencies (Moritz Finke, 2018)
Efficient Discovery of Matching Dependencies (Philipp Schirmer, 2017)
Discovering Interesting Conditional Functional Dependencies (Maximilian Grundke, 2017)
Multivalued Dependency Detection (Tim Draeger, 2016)
Spinning a Web of Tables through Inclusion Dependencies (Fabian Tschirschnitz, 2014)
Discovery of Conditional Unique Column Combination (Jens Ehrlich, 2014)
Discovering Matching Dependencies (Andrina Mascher, 2013)

Online Courses:

Datenmanagement mit SQL (openHPI, 2013)

Selected Talks

Data Profiling at Scale (HPI 2019)
A Hybrid Approach to Functional Dependency Discovery (SIGMOD 2016)
Holistic Data Profiling: Simultaneous Discovery of Various Metadata (EDBT 2016)
Functional Dependency Discovery: An Experimental Evaluation of Seven Algorithms (VLDB 2015)
Divide & Conquer-based Inclusion Dependency Discovery (VLDB 2015)

Publications

Schmidl, S., Papenbrock, T.: Efficient Distributed Discovery of Bidirectional Order Dependencies. The VLDB Journal. (2021).

Schneider, J., Wenig, P., Papenbrock, T.: Distributed detection of sequential anomalies in univariate time series. The VLDB Journal. (2021).

Weise, J., Schmidl, S., Papenbrock, T.: Optimized Theta-Join Processing. In: Sattler, K.-U., Herschel, M., and Lehner, W. (eds.) Proceedings of the Conference on Database Systems for Business, Technology, and Web (BTW). pp. 59–78. Gesellschaft für Informatik, Bonn (2021).

Harmouch, H., Papenbrock, T., Naumann, F.: Relational Header Discovery using Similarity Search in a Table Corpus. IEEE International Conference on Data Engineering (ICDE). 444–455 (2021).

Kossmann, J., Papenbrock, T., Naumann, F.: Data dependencies for query optimization: a survey. The VLDB Journal. (2021).

Koumarelas, I., Papenbrock, T., Naumann, F.: MDedup: Duplicate Detection with Matching Dependencies. Proceedings of the VLDB Endowment (PVLDB). 13, 712–725 (2020).

Birnick, J., Bläsius, T., Friedrich, T., Naumann, F., Papenbrock, T., Schirneck, M.: Hitting Set Enumeration with Partial Information for Unique Column Combination Discovery. Proceedings of the VLDB Endowment. 13, 2270–2283 (2020).

Schirmer, P., Papenbrock, T., Koumarelas, I., Naumann, F.: Efficient Discovery of Matching Dependencies. ACM Transactions on Database Systems (TODS). 45, 1–33 (2020).

Schirmer, P., Papenbrock, T., Kruse, S., Naumann, F., Hempfing, D., Mayer, T., Neuschäfer-Rube, D.: DynFD: Functional Dependency Discovery in Dynamic Datasets. Proceedings of the International Conference on Extending Database Technology (EDBT). pp. 253–264 (2019).

Dürsch, F., Stebner, A., Windheuser, F., Fischer, M., Friedrich, T., Strelow, N., Bleifuß, T., Harmouch, H., Jiang, L., Papenbrock, T., Naumann, F.: Inclusion Dependency Discovery: An Experimental Evaluation of Thirteen Algorithms. Proceedings of the International Conference on Information and Knowledge Management (CIKM). pp. 219–228 (2019).

Schmidl, S., Schneider, F., Papenbrock, T.: An Actor Database System for Akka. Proceedings of the Conference on Database Systems for Business, Technology, and Web (BTW) - Workshopband. pp. 225–234 (2019).

Abedjan, Z., Golab, L., Naumann, F., Papenbrock, T.: Data Profiling. Morgan & Claypool Publishers (2018).

Kruse, S., Papenbrock, T., Dullweber, C., Finke, M., Hegner, M., Zabel, M., Zöllner, C., Naumann, F.: Fast Approximate Discovery of Inclusion Dependencies. Proceedings of the Conference on Database Systems for Business, Technology, and Web (BTW). pp. 207–226 (2017).

Tschirschnitz, F., Papenbrock, T., Naumann, F.: Detecting Inclusion Dependencies on Very Many Tables. ACM Transactions on Database Systems (TODS). 42, 18:1–18:29 (2017).

Papenbrock, T., Naumann, F.: A Hybrid Approach for Efficient Unique Column Combination Discovery. Proceedings of the Conference on Database Systems for Business, Technology, and Web (BTW). pp. 195–204 (2017).

Papenbrock, T., Naumann, F.: Data-driven Schema Normalization. Proceedings of the International Conference on Extending Database Technology (EDBT). pp. 342–353 (2017).

Kruse, S., Jentzsch, A., Papenbrock, T., Kaoudi, Z., Quiane-Ruiz, J.-A., Naumann, F.: RDFind: Scalable Conditional Inclusion Dependency Discovery in RDF Datasets. Proceedings of the International Conference on Management of Data (SIGMOD). pp. 953–967. ACM, New York, NY, USA (2016).

Kruse, S., Papenbrock, T., Harmouch, H., Naumann, F.: Data Anamnesis: Admitting Raw Data into an Organization. IEEE Data Engineering Bulletin. 39, 8–20 (2016).

Papenbrock, T., Naumann, F.: A Hybrid Approach to Functional Dependency Discovery. Proceedings of the International Conference on Management of Data (SIGMOD). pp. 821–833. ACM, New York, NY, USA (2016).

Bleifuß, T., Bülow, S., Frohnhofen, J., Risch, J., Wiese, G., Kruse, S., Papenbrock, T., Naumann, F.: Approximate Discovery of Functional Dependencies for Large Datasets. Proceedings of the International Conference on Information and Knowledge Management (CIKM). pp. 1803–1812. ACM, New York, NY, USA (2016).

Ehrlich, J., Roick, M., Schulze, L., Zwiener, J., Papenbrock, T., Naumann, F.: Holistic Data Profiling: Simultaneous Discovery of Various Metadata. Proceedings of the International Conference on Extending Database Technology (EDBT). pp. 305–316. OpenProceedings.org (2016).

Papenbrock, T., Bergmann, T., Finke, M., Zwiener, J., Naumann, F.: Data Profiling with Metanome. Proceedings of the VLDB Endowment. 8, 1860–1871 (2015).

Papenbrock, T., Ehrlich, J., Marten, J., Neubert, T., Rudolph, J.-P., Schönberg, M., Zwiener, J., Naumann, F.: Functional Dependency Discovery: An Experimental Evaluation of Seven Algorithms. Proceedings of the VLDB Endowment. 8, 1082–1093 (2015).

Papenbrock, T., Heise, A., Naumann, F.: Progressive Duplicate Detection. IEEE Transactions on Knowledge and Data Engineering (TKDE). 27, 1316–1329 (2015).

Kruse, S., Papenbrock, T., Naumann, F.: Scaling Out the Discovery of Inclusion Dependencies. Proceedings of the Conference on Database Systems for Business, Technology, and Web (BTW). pp. 445–454 (2015).

Papenbrock, T., Kruse, S., Quiane-Ruiz, J.-A., Naumann, F.: Divide & Conquer-based Inclusion Dependency Discovery. Proceedings of the VLDB Endowment. 8, 774–785 (2015).

Naumann, F., Jenders, M., Papenbrock, T.: Ein Datenbankkurs mit 6000 Teilnehmern - Erfahrungen auf der openHPI MOOC Plattform. Informatik-Spektrum. 37, 333–340 (2013).

Forchhammer, B., Papenbrock, T., Stening, T., Viehmeier, S., Draisbach, U., Naumann, F.: Duplicate Detection on GPUs. Proceedings of the Conference on Database Systems for Business, Technology, and Web (BTW). pp. 165–184 (2013).

Lorey, J., Naumann, F., Forchhammer, B., Mascher, A., Retzlaff, P., ZamaniFarahani, A., Discher, S., Faehnrich, C., Lemme, S., Papenbrock, T., Peschel, R.C., Richter, S., Stening, T., Viehmeier, S.: Black Swan: Augmenting Statistics with Event Data. Proceedings of the 20th Conference on Information and Knowledge Management (CIKM). pp. 2517–2520. Glasgow, UK (2011).
