Advanced Data Profiling

Data Dependency Discovery on Dynamic Data

Sebastian Schmidl, Youri Kaminsky, Daniel Lindner, and Prof. Dr. Felix Naumann

Description

Data Profiling for Dynamic Data

Data profiling is the process of extracting metadata from datasets [1]. Researchers have proposed plenty of profiling algorithms for all different kinds of data dependencies, such as Unique Column Combinations (UCCs), Functional Dependencies (FDs), Inclusion Dependencies (INDs), or Order Dependencies (ODs), on static data in a batch process. However, many real-world datasets are constantly changing. These changes, which are inserts, updates, and deletes, also change the datasets’ metadata, making it necessary to frequently re-profile the data. Unfortunately, executing the static profiling algorithms on every dataset change is excessively expense — even infeasible — because the static approaches do not leverage the knowledge about an earlier state of the dataset and its dependencies. This calls for novel incremental discovery algorithms that re-use existing profiling results to efficiently maintain data dependencies for dynamic datasets.

We will start with existing solutions to this problem for the following dependency types (depending on the number of students) and then improve upon them:

UCCs: SWAN [2]
FDs: DynFD [3], DHSFD [4]
INDs: Shaabani's algorithm [5]
ODs: list-based: IncOD [6], pointwise: IncPOD [7]

Seminar Organization

We will form teams of two students each. Every team works on one kind of data dependency. First, the teams become familiar with related work as an inspiration. Afterward, each student team develops their own ideas to profile their dependency type.

The students turn their ideas into working algorithms. There are two main goals for each algorithm:
1) The complete set of minimal or maximal dependencies must be maintained.
2) The runtime of the algorithm is to be optimized.
Datasets for benchmarking are provided to the students. Finally, the students present their approaches and write a short report.

Goals

Learn about the research area of data profiling
Read and understand scientific papers
Craft a novel solution to the problem of data dependency discovery on dynamic data
Run experiments and evaluate results
Present results in written and oral form

Organization

General

Project seminar for master students
Language: English
Maximum number of participants: 8

We will form teams of two in the first session and each team develops their own approach.

Requirements

Prior knowledge in data profiling (preferably completed Data Profiling or Data Integration lecture)
Good programming skills in a major programming language

Grading

In the seminar, each team will develop an approach and write a short report. The final grade consists of the following three parts:

Paper presentation (passed / not passed)
Approach (35%)
Written report (35%)
Midterm and final presentations (30%)

Modules

Computer Science MA

HPI-CS-CR
HPI-CS-DID, HPI-CS-DIS (DATA & AI Track)
HPI-CS-DAD, HPI-CS-DAS (Systems Track)
HPI-CS-DAD (Security Engnineering Track)

IT-Systems Engineering MA

HPI-OSIS-(K/T/S)

Data Engineering MA

HPI-DANA-(K/T/S)
HPI-DASY-(K/T/S)

Software Systems Engineering MA

HPI-DSYS-(C/T/S)

Literature

[1] Data Profiling - Synthesis Lectures on Data Management Ziawasch Abedjan, Lukasz Golab, Felix Naumann, Thorsten Papenbrock, Morgan Claypool, 2019.

[2] Z. Abedjan, J. -A. Quiané-Ruiz and F. Naumann. "Detecting unique column combinations on dynamic data." Proceedings of the International Conference on Data Engineering (ICDE), 2014, pp. 1036-1047, doi: 10.1109/ICDE.2014.6816721.

[3] P. Schirmer, T. Papenbrock, S. Kruse, D. Hempfing, T. Meyer, D. Neuschäfer-Rube and F. Naumann. "DynFD: Functional Dependency Discovery in Dynamic Datasets." Proceedings of the International Conference on Extending Database Technology (EDBT), 2019.

[4] R. Xiao, Y. Yuan, Z. Tan, S. Ma and W. Wang. "Dynamic Functional Dependency Discovery with Dynamic Hitting Set Enumeration." Proceedings of the International Conference on Data Engineering (ICDE), 2022, pp. 286-298, doi: 10.1109/ICDE53745.2022.00026.

[5] Nuhad Shaabani and Christoph Meinel. "Incremental discovery of inclusion dependencies." Proceedings of the International Conference on Scientific and Statistical Database Management (SSDBM), 2017. doi: 10.1145/3085504.3085506

[6] L. Zhu, X. Sun, Z. Tan, K. Yang, W. Yang, X. Zhou and Y. Tian. "Incremental Discovery of Order Dependencies on Tuple Insertions". Proceedings of the International Conference on Database Systems for Advanced Applications, 2019, pp. 157 - 174. doi: 10.1007/978-3-030-18576-3_10

[7] Z. Tan, A. Ran, S. Ma and S. Qin. "Fast incremental discovery of pointwise order dependencies". Proceedings of the VLDB Endowment (VLDB) 13 (10), 2020, pp. 1669–1681. doi: 10.14778/3401960.3401965