Approximate Data Profiling

Prof. Dr. Felix Naumann, Tobias Bleifuß, Leon Bornemann and Youri Kaminsky

Introduction

These are the introductory slides of the seminar.

If you are interested in participating, please reach out to youri.kaminsky@hpi.de. Please include a note if the current time slot does not fit your schedule; we will try to reschedule our meetings so that more students can participate.

Description

Data profiling is the process of extracting metadata from datasets. One important aspect is the discovery of data dependencies, such as Functional Dependencies (FDs), Inclusion Dependencies (INDs) and Unique Column Combinations (UCCs). However, the increasing size of datasets poses a challenge to traditional data profiling approaches. Therefore, this seminar focuses on sampling-based methods for approximate data profiling.
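
To make the notion of a dependency concrete, the following Java sketch checks whether a functional dependency holds on a uniform random sample of rows. The class and method names, the in-memory row representation, and the example data are illustrative assumptions, not part of the seminar material. A violation found in the sample also violates the dependency on the full dataset, whereas a dependency that holds on the sample may still be violated by rows that were not drawn.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical helper class; all names are assumptions for illustration.
final class SampledFdCheck {

    /** Draws up to sampleSize rows uniformly at random without replacement. */
    static List<String[]> sampleRows(List<String[]> rows, int sampleSize, long seed) {
        List<String[]> shuffled = new ArrayList<>(rows);
        Collections.shuffle(shuffled, new Random(seed));
        return new ArrayList<>(shuffled.subList(0, Math.min(sampleSize, shuffled.size())));
    }

    /**
     * Returns true if no two rows agree on all lhs columns but differ in the rhs column.
     * Assumes that the data contains no null values.
     */
    static boolean fdHolds(List<String[]> rows, int[] lhs, int rhs) {
        Map<List<String>, String> seen = new HashMap<>();
        for (String[] row : rows) {
            List<String> key = new ArrayList<>();
            for (int col : lhs) {
                key.add(row[col]);
            }
            // putIfAbsent returns the rhs value stored earlier for this lhs key, if any.
            String previous = seen.putIfAbsent(key, row[rhs]);
            if (previous != null && !previous.equals(row[rhs])) {
                return false; // same lhs values, different rhs value: FD violated
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up example relation with columns: zip (0), city (1), name (2).
        List<String[]> rows = Arrays.asList(
                new String[] {"14482", "Potsdam", "Ada"},
                new String[] {"14482", "Potsdam", "Bob"},
                new String[] {"10115", "Berlin", "Cy"},
                new String[] {"10115", "Berlin", "Ada"});

        List<String[]> sample = sampleRows(rows, 2, 42L);
        System.out.println("zip -> city holds on sample:  " + fdHolds(sample, new int[] {0}, 1));
        System.out.println("name -> city holds on sample: " + fdHolds(sample, new int[] {2}, 1));
        System.out.println("name -> city holds on all rows: " + fdHolds(rows, new int[] {2}, 1));
    }
}
```

Depending on which two rows are drawn, name -> city may appear to hold on the sample even though it is violated on the full relation; this is the kind of error an approximate approach has to keep small.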

First, the students familiarize themselves with related work as inspiration. Afterwards, each student team develops its own ideas. These can concern the sampling process itself, the actual discovery on the sample, or both. The students turn their ideas into working algorithms. There are two main goals for each algorithm:
1) Find a set of dependencies that is close to the actual solution.
2) Minimize the required runtime.
Datasets for benchmarking are provided to the students.
Finally, the students present their approaches and write a short report.
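
One possible way to quantify the first goal is to compare the discovered dependencies with the exact result using precision, recall, and F1 score. The seminar description does not prescribe a metric, so this choice, the encoding of dependencies as strings, and all names below are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical scoring helper; the metric and all names are assumptions.
final class DependencyScore {

    /** Number of dependencies that appear in both sets. */
    static int overlap(Set<String> found, Set<String> truth) {
        Set<String> common = new HashSet<>(found);
        common.retainAll(truth);
        return common.size();
    }

    /** Fraction of discovered dependencies that are correct (0 if nothing was discovered). */
    static double precision(Set<String> found, Set<String> truth) {
        return found.isEmpty() ? 0.0 : (double) overlap(found, truth) / found.size();
    }

    /** Fraction of true dependencies that were discovered (0 if there are no true dependencies). */
    static double recall(Set<String> found, Set<String> truth) {
        return truth.isEmpty() ? 0.0 : (double) overlap(found, truth) / truth.size();
    }

    /** Harmonic mean of precision and recall (0 if both are 0). */
    static double f1(Set<String> found, Set<String> truth) {
        double p = precision(found, truth);
        double r = recall(found, truth);
        return (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Made-up example: exact result versus the output of an approximate algorithm.
        Set<String> truth = new HashSet<>(Arrays.asList("zip->city", "id->name", "id->zip"));
        Set<String> found = new HashSet<>(Arrays.asList("zip->city", "id->name", "name->city"));

        System.out.printf("precision=%.3f recall=%.3f f1=%.3f%n",
                precision(found, truth), recall(found, truth), f1(found, truth));
    }
}
```

The second goal could be measured alongside this score, for example by wrapping the discovery call in two `System.nanoTime()` readings and reporting the difference.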

Literature

Ziawasch Abedjan, Lukasz Golab, Felix Naumann and Thorsten Papenbrock. Data Profiling. Synthesis Lectures on Data Management, Morgan & Claypool, 2019.
Zhicheng Liu and Aoqian Zhang. Sampling for Big Data Profiling: A Survey. IEEE Access, 2020.

Time Table

Date	Room	Topic
19.04.2022	F-E.06	Seminar introduction
10.05.2022	F-E.06	Present 1 paper of related work
14.06.2022	F-E.06	Midterm presentation
19.07.2022	F-E.06	Final presentation
29.07.2022		Submission deadline

Goals

Learn about the research area of data profiling
Read papers and understand them
Craft a novel solution to the problem of sample-based profiling
Run experiments and evaluate results
Present results in written and oral form

Organization

General

Seminar for master's students
Language of instruction: English
Maximum number of participants: 12

Topics will be presented in the first session (Tuesday, April 19, 2022, at 13:30 in room F-E.06). To be assigned to a group, each participant has to email us individually.

Requirements

We do not require any prior knowledge about data profiling.

However, there are some requirements for participating in the course:

Interest in the topic
Interest in working with large data sets
Java (at least basic skills)

Grading

In the seminar, each participant will develop an approach in the research area of sampling-based data profiling and write a short report. The final grade consists of the following three parts:

Approach (35%)
Written report (35%)
Presentations and discussions in the seminar (30%)