Advanced Data Profiling: Classifying genuineness
Youri Kaminsky, Lukas Laskowski, and Prof. Dr. Felix Naumann
Description
Data profiling is the process of extracting metadata from datasets. Researchers have proposed a plethora of profiling algorithms for many kinds of data dependencies, such as Unique Column Combinations (UCCs), Functional Dependencies (FDs), Inclusion Dependencies (INDs), or Order Dependencies (ODs).
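To make two of these dependency types concrete, the following minimal sketch checks whether a UCC or an FD holds on a small relation. The table, its column names, and the helper functions `holds_ucc` and `holds_fd` are hypothetical illustrations and not part of any profiling library; the naive checks only validate a given candidate and do not reflect how scalable discovery algorithms work.

```python
from typing import Sequence

Row = dict[str, object]


def holds_ucc(rows: Sequence[Row], columns: Sequence[str]) -> bool:
    """A column combination is unique if no two rows agree on all its columns."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def holds_fd(rows: Sequence[Row], lhs: Sequence[str], rhs: str) -> bool:
    """An FD lhs -> rhs holds if rows that agree on lhs also agree on rhs."""
    mapping = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        if key in mapping and mapping[key] != row[rhs]:
            return False
        mapping.setdefault(key, row[rhs])
    return True


# Hypothetical example relation (assumed data, for illustration only).
employees = [
    {"id": 1, "zip": "14482", "city": "Potsdam", "age": 31},
    {"id": 2, "zip": "14482", "city": "Potsdam", "age": 45},
    {"id": 3, "zip": "10115", "city": "Berlin", "age": 28},
]

print(holds_ucc(employees, ["id"]))          # every id occurs once
print(holds_ucc(employees, ["zip"]))         # the zip 14482 repeats
print(holds_fd(employees, ["zip"], "city"))  # equal zips share a city
```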
While these algorithms are powerful in uncovering potential relationships, they often produce an overwhelmingly large result set. Many of the discovered dependencies are spurious or irrelevant: they are valid on the given data but carry no real-world meaning. Such noise complicates data analysis and makes it difficult for analysts and data scientists to focus on the most meaningful insights.
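The effect can be seen even on a tiny table. The sketch below, which uses a hypothetical relation and a hypothetical helper `fd_holds`, enumerates every FD candidate with a single column on each side and keeps those that are valid on the data. It is a brute-force illustration under these assumptions, not a discovery algorithm from the literature.

```python
from itertools import permutations

# Hypothetical relation (assumed data, for illustration only).
rows = [
    {"id": 1, "zip": "14482", "city": "Potsdam", "age": 31},
    {"id": 2, "zip": "14482", "city": "Potsdam", "age": 45},
    {"id": 3, "zip": "10115", "city": "Berlin", "age": 28},
]


def fd_holds(rows, lhs, rhs):
    """Naive check: equal lhs values must imply equal rhs values."""
    seen = {}
    for row in rows:
        # setdefault returns the rhs value first stored for this lhs value.
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


columns = list(rows[0])
valid_fds = [(a, b) for a, b in permutations(columns, 2) if fd_holds(rows, a, b)]

for lhs, rhs in valid_fds:
    print(f"{lhs} -> {rhs}")
```

On this toy table, eight of the twelve candidates hold. Among them is `age -> city`, which is valid only because every age happens to be distinct in three rows, whereas `zip -> city` reflects a relationship one would expect to hold in general.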
In this seminar, we aim to address this challenge by developing modern methods to classify the genuineness of data dependencies. Our goal is to automatically distinguish between dependencies that are truly meaningful and those that are artifacts of randomness or data quirks. By doing so, we hope to enhance the accuracy and effectiveness of data profiling for downstream tasks, such as data cleaning.
Goals
- Learn about the research areas of data profiling and machine learning
- Read and understand scientific papers
- Craft a novel solution to the problem of classifying the genuineness of data dependencies
- Run experiments and evaluate results
- Present results in written and oral form
Organization
General
- Project seminar for master's students
- Language: English
- Maximum number of participants: 6 (ideally, 3 teams of 2 students each)
Requirements
- Prior knowledge of data profiling (preferably having completed the Data Profiling or Data Integration lecture)
- Prior experience with machine learning or deep learning (preferably having completed a related course at HPI)
- Good programming skills in a major programming language
Grading
In the seminar, each team will develop an approach and write a short report. The final grade consists of the following:
- Quality of approach (35%)
- Written report (25%)
- Midterm presentation (10%)
- Final presentation (30%)
You can withdraw from the seminar without consequences until October 27.
Modules
TBD
Schedule
Our regular sessions take place Tuesdays from 15:15 to 16:45 in F-2.11.
| Date | Topic |
| --- | --- |
| October 14 | Seminar introduction |
| October 21 | How to read a research paper + paper assignment |
| October 28 | Paper presentation and research idea discussion |
| November 4 | HPI teaching day—no session |
| November 11 | Brief paper presentation |
| November 18–February 3 | Weekly meetings to discuss ongoing progress |
| TBD | Final presentation |
| TBD | Report & artifact submission due date |