Advanced data profiling: Classifying genuineness

Youri Kaminsky, Lukas Laskowski, and Prof. Dr. Felix Naumann

Description

Data profiling is the process of extracting metadata from datasets. Researchers have proposed many profiling algorithms for different kinds of data dependencies, such as Unique Column Combinations (UCCs), Functional Dependencies (FDs), Inclusion Dependencies (INDs), and Order Dependencies (ODs).

While these algorithms are powerful in uncovering potential relationships, they often produce an overwhelmingly large result set. Many of the discovered dependencies are spurious or irrelevant: they hold on the given data but carry no real-world meaning. Such noise complicates data analysis and makes it difficult for analysts and data scientists to focus on the most meaningful insights.
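To make the problem concrete, here is a minimal sketch of how valid but meaningless dependencies arise. The toy table, its column names, and the helper functions `holds_fd`, `is_ucc`, and `single_column_fds` are invented for illustration; they use a naive check and are not one of the profiling algorithms referred to above.

```python
# Hypothetical toy relation; columns and values are made up for illustration.
ROWS = [
    {"id": 1, "zip": "14482", "city": "Potsdam", "shoe_size": 42},
    {"id": 2, "zip": "14482", "city": "Potsdam", "shoe_size": 42},
    {"id": 3, "zip": "10115", "city": "Berlin", "shoe_size": 38},
    {"id": 4, "zip": "10117", "city": "Berlin", "shoe_size": 38},
]


def holds_fd(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = row[rhs]
        # Two rows that agree on lhs must also agree on rhs.
        if seen.setdefault(key, value) != value:
            return False
    return True


def is_ucc(rows, columns):
    """Return True if the column combination contains no duplicate value tuples."""
    projected = [tuple(row[c] for c in columns) for row in rows]
    return len(set(projected)) == len(projected)


def single_column_fds(rows):
    """Naively enumerate all valid FDs with a single left-hand-side column."""
    columns = list(rows[0])
    return [
        (a, b)
        for a in columns
        for b in columns
        if a != b and holds_fd(rows, [a], b)
    ]


if __name__ == "__main__":
    for lhs, rhs in single_column_fds(ROWS):
        print(f"{lhs} -> {rhs}")
```

On this table, `zip -> city` holds and plausibly reflects a real-world rule, whereas `city -> shoe_size` also holds but only because the sample is tiny. Both are equally valid on the data, so validity alone cannot tell them apart.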

In this seminar, we aim to address this challenge by developing modern methods to classify the genuineness of data dependencies. Our goal is to automatically distinguish between dependencies that are truly meaningful and those that are artifacts of randomness or data quirks. By doing so, we hope to enhance the accuracy and effectiveness of data profiling for downstream tasks, such as data cleaning.

If you are interested in participating, please reach out to lukas.laskowski(at)hpi.de and youri.kaminsky(at)hpi.de by October 19 (EOD). If you already know your preferred team partner and topic, please include them.

Initial Meeting Slides

Goals

Learn about the research areas of data profiling and machine learning
Read and understand scientific papers
Craft a novel solution to the problem of classifying the genuineness of data dependencies
Run experiments and evaluate results
Present results in written and oral form

Organization

General

Project seminar for master's students
Language: English
Maximum number of participants: 6 (ideally, 3 teams of 2 students each)

Requirements

Prior knowledge in data profiling (preferably having completed the Data Profiling or Data Integration lecture)
Prior experience with machine learning or deep learning (preferably having completed a related course at HPI)
Good programming skills in a major programming language

Grading

In the seminar, each team will develop an approach and write a short report. The final grade consists of the following:

Quality of approach (35%)
Written report (25%)
Midterm presentation (10%)
Final presentation (30%)

You can withdraw from the seminar without consequences until October 27.

Modules

TBD

Schedule

Our regular sessions take place Tuesdays from 15:15 to 16:45 in F-2.11.

Date	Topic
October 14	Seminar introduction
October 21	How to read a research paper + paper assignment
October 28	Paper presentation and research idea discussion
November 4	HPI teaching day (no session)
November 11	Brief paper presentation
November 18–February 3	Weekly meetings to discuss ongoing progress
February 10, 13:30	Final Presentation
March 17 EOD	Report & artifact submission due date