Data profiling is the process of extracting metadata from datasets. Researchers have proposed many profiling algorithms for different kinds of data dependencies, such as Unique Column Combinations (UCCs), Functional Dependencies (FDs), Inclusion Dependencies (INDs), and Order Dependencies (ODs).
While these algorithms are powerful at uncovering potential relationships, they often produce an overwhelmingly large result set. Many of the discovered dependencies are spurious or irrelevant: they hold on the given data but carry no real-world meaning. Such noise complicates data analysis and makes it difficult for analysts and data scientists to focus on the most meaningful insights.
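To make the problem concrete, here is a minimal sketch of how valid but meaningless dependencies arise. The toy table, its column names, and the helper `fd_holds` are invented for illustration, and the check covers only single-column FDs; real discovery algorithms also consider multi-column left-hand sides.

```python
from itertools import permutations

# Hypothetical toy table; all column names and values are invented.
rows = [
    {"zip": "14482", "city": "Potsdam", "age": 31, "shoe_size": 42},
    {"zip": "14482", "city": "Potsdam", "age": 45, "shoe_size": 39},
    {"zip": "10115", "city": "Berlin", "age": 27, "shoe_size": 44},
    {"zip": "10117", "city": "Berlin", "age": 52, "shoe_size": 41},
]


def fd_holds(rows, lhs, rhs):
    """Return True if the FD lhs -> rhs holds: equal lhs values imply equal rhs values."""
    seen = {}
    for row in rows:
        key = row[lhs]
        if key in seen and seen[key] != row[rhs]:
            return False  # same lhs value maps to two different rhs values
        seen[key] = row[rhs]
    return True


# Naively test every ordered pair of distinct columns.
columns = list(rows[0])
valid_fds = [(a, b) for a, b in permutations(columns, 2) if fd_holds(rows, a, b)]

for lhs, rhs in valid_fds:
    print(f"{lhs} -> {rhs}")
```

On this table, `zip -> city` holds and is plausibly genuine, but so does `shoe_size -> city`, simply because every shoe size happens to be unique in four rows. Telling these two cases apart automatically is the kind of classification problem the seminar targets.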
In this seminar, we aim to address this challenge by developing modern methods to classify the genuineness of data dependencies. Our goal is to automatically distinguish between dependencies that are truly meaningful and those that are artifacts of randomness or data quirks. By doing so, we hope to enhance the accuracy and effectiveness of data profiling for downstream tasks, such as data cleaning.
If you are interested in participating, please reach out to lukas.laskowski(at)hpi.de and youri.kaminsky(at)hpi.de by October 19 (EOD). If you already know them, include your preferred team partner and topic.
Initial Meeting Slides