Advanced Data Profiling
Prof. Dr. Felix Naumann, Sebastian Schmidl, Youri Kaminsky, and Daniel Lindner
Introduction
If you are interested in participating, please attend the initial session and reach out to sebastian.schmidl@hpi.de until November 8 (EOD). You can withdraw from this seminar until November 9 (when we give out the team topics) without consequences.
Please do not hesitate to contact us if you are interested, but the current time slot does not fit your schedule. In this case, please include a note that the current time does not fit you well. We would try to reschedule our meetings to allow more students to participate.
We have our first meeting on Thursday, 19th of October between 11:00 and 12:30 in F-E.06 (HPI Campus II). You can find the schedule below.
Description
Data profiling is the process of extracting metadata from datasets. One important task is the discovery of order dependencies (ODs), which capture the order relationship among attributes in a relational table. There are two prominent ways to express ODs: The list-based form and the set-based canonical form. Current state-of-the-art algorithms for the automatic discovery of order dependencies use the set-based form to benefit from the increased efficiency of a smaller search space. However, most OD usage scenarios require ODs in their list-based form. One example for the application of ODs is query optimization: If a user requests a relation to be ordered by multiple columns, the optimizer can reduce the number of performed sort operations if an OD holds. Notice that the SQL ORDER BY-statement uses lists of attributes. While the discovery algorithms output a complete set of minimal set-based ODs, we need to know if a certain, potentially non-minimal, list-based OD holds to perform the query rewrite. How do we efficiently check whether a given list-based OD can be derived from the set of minimal set-based ODs?
Finding a solution to the task is non-trivial due to the following three technical challenges:
- the complex transformation between list-based and set-based forms (factorial complexity)
- implementation of the known OD inference axioms for a membership test algorithm
- requirement of an efficient data structure to access potentially large collection of valid ODs (hundreds of thousands)
Goals
- Learn about the research area of data profiling
- Read and understand scientific papers
- Craft a novel solution to the problem of efficient OD membership tests
- Run experiments and evaluate results
- Present results in written and oral form
Organization
General
- Project seminar for master students
- Language: English
- Maximum number of participants: 8
We will form teams of two in the first session and each team develops their own approach.
Requirements
- Prior knowledge in data profiling (preferably completed Data Profiling lecture)
- Good programming skills in a major programming language
Grading
In the seminar, each team will develop an approach and write a short report. The final grade consists of the following three parts:
- Approach (35%)
- Written report (35%)
- Midterm and final presentations (30%)
Modules
IT-Systems Engineering MA
- HPI-OSIS-(K/T/S)
Data Engineering MA
- HPI-DANA-(K/T/S)
- HPI-CODS-(K/T/S)
Software Systems Engineering MA
- HPI-DSYS-(C/T/S)
Time Table
| Date | Topic |
| 2023-10-19 (in F-E.06) | Seminar introduction [Slides] |
| 2023-10-26 (in F-E.06) | Intro to order dependencies [Slides] |
| 2023-11-02 (in F-E.06) | Intro to FD-Membership tests [Slides] |
| 2023-11-10 - 2023-12-08 (in F-2.11) | Weekly meetings and progress reports (Fridays from 1pm) |
| 2023-12-14 (L-1.02) | Presentation 1 (Goal & Ideas) |
| 2023-12-15 (F-2.11) | Weekly meeting and presentation feedback |
| 2023-12-22 | (no meeting) |
| 2023-12-29 | Christmas break |
| 2024-01-05 - 2024-02-09 (F-2.11) | Weekly meetings and progress reports |
| 2024-02-16 (F-E.06) | Presentation 2 (Implementation & Experiments) |
| 2024-02-21 | Presentation feedback |
| 2024-03-01 - 2024-03-22 (F-2.11) | Weekly meetings |
| 2024-03-22, WYA | Submission deadline |
Literature
- Data Profiling - Synthesis Lectures on Data Management Ziawasch Abedjan, Lukasz Golab, Felix Naumann, Thorsten Papenbrock, Morgan Claypool, 2019.
- Fundamentals of order dependencies. Jaroslaw Szlichta, Parke Godfrey, Jarek Gryz (PVLDB 2012) https://doi.org/10.14778/2350229.2350241.
- Effective and complete discovery of bidirectional order dependencies via set-based axioms. Jaroslaw Szlichta, Parke Godfrey, Lukasz Golab, Mehdi Karger, Divesh Srivastava (The VLDB Journal 2018) https://doi.org/10.1007/s00778-018-0510-0.
- Efficient distributed discovery of bidirectional order dependencies. Sebastian Schmidl, Thorsten Papenbrock (The VLDB Journal 2022) https://doi.org/10.1007/s00778-021-00683-4.
- An algorithmic approach to normalization of relational database schemas. Philip A. Bernstein, Catriel Beeri (Technical Report CSRG-73, University of Toronto 1976) https://ia800105.us.archive.org/24/items/technicalreportc73univ/technicalreportc73univ.pdf.
- Computational Problems Related to the Design of Normal Form Relational Schemas. Catriel Beeri, Philip A. Bernstein (ACM Transactions on Database Systems 1979) https://doi.org/10.1145/320064.320066.