Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Approximate Data Profiling

Introduction

These are the introductory slides of the seminar.

If you are interested in participating, please contact tobias.bleifuss(at)hpi.de by October 25.

If you are interested but the current time slot does not fit your schedule, please contact us anyway and mention this in your email. We will then try to reschedule the meetings so that more students can participate.

Description

Data profiling is the process of extracting metadata from datasets. One important aspect is the discovery of data dependencies, such as Functional Dependencies (FDs), Inclusion Dependencies (INDs) and Unique Column Combinations (UCCs). However, the increasing size of datasets presents a challenge to traditional approaches of data profiling. Therefore, this seminar focuses on sampling-based methods for approximate data profiling.
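
To make the three dependency types concrete, the following is a minimal sketch of naive validity checks on a small in-memory table. The class name `DependencyCheck`, its method names, and the toy data are hypothetical and not part of the seminar material or of Metanome. The sketch assumes rows are `String[]` arrays without null values. It also shows why profiling on a sample is only approximate: a dependency that holds on the sample may be violated by rows outside it.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class for illustration only; assumes non-null cell values.
final class DependencyCheck {

    // Projects a row onto the given columns; the resulting List has value-based equals/hashCode.
    static List<String> project(String[] row, int[] cols) {
        String[] key = new String[cols.length];
        for (int i = 0; i < cols.length; i++) {
            key[i] = row[cols[i]];
        }
        return Arrays.asList(key);
    }

    // UCC check: no two rows agree on all of the given columns.
    static boolean isUnique(List<String[]> rows, int... cols) {
        Set<List<String>> seen = new HashSet<>();
        for (String[] row : rows) {
            if (!seen.add(project(row, cols))) {
                return false;
            }
        }
        return true;
    }

    // FD check lhs -> rhs: rows that agree on lhs must also agree on rhs.
    static boolean holdsFd(List<String[]> rows, int[] lhs, int rhs) {
        Map<List<String>, String> rhsByLhs = new HashMap<>();
        for (String[] row : rows) {
            String previous = rhsByLhs.putIfAbsent(project(row, lhs), row[rhs]);
            if (previous != null && !previous.equals(row[rhs])) {
                return false;
            }
        }
        return true;
    }

    // Unary IND check: every value in depCol of depRows also occurs in refCol of refRows.
    static boolean holdsInd(List<String[]> depRows, int depCol, List<String[]> refRows, int refCol) {
        Set<String> referenced = new HashSet<>();
        for (String[] row : refRows) {
            referenced.add(row[refCol]);
        }
        for (String[] row : depRows) {
            if (!referenced.contains(row[depCol])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Toy table with columns: id, city, zip.
        List<String[]> full = Arrays.asList(
                new String[] {"1", "Potsdam", "14482"},
                new String[] {"2", "Potsdam", "14482"},
                new String[] {"3", "Berlin", "10115"},
                new String[] {"4", "Berlin", "10117"});
        List<String[]> sample = full.subList(0, 3); // the first three rows act as the sample

        System.out.println(isUnique(full, 0));                 // true: id is a UCC
        System.out.println(holdsFd(sample, new int[] {1}, 2)); // true: city -> zip holds on the sample
        System.out.println(holdsFd(full, new int[] {1}, 2));   // false: row 4 violates city -> zip
    }
}
```

These checks only validate one given candidate. Discovery algorithms additionally have to search the space of candidate column combinations, which is where most of the runtime is spent.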

First, the students familiarize themselves with related work as inspiration. Afterwards, each student team develops its own ideas. These can concern either the sampling process itself or the actual discovery in the sample.
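
As one possible starting point for the sampling process, the sketch below draws a uniform random sample of `k` rows in a single pass using reservoir sampling (Algorithm R). This is a standard technique chosen here for illustration, not a method prescribed by the seminar. The class name `ReservoirSampler` and its signature are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration: uniform row sampling in one pass over the data.
final class ReservoirSampler {

    // Returns up to k rows; each input row ends up in the sample with equal probability.
    static <T> List<T> sample(Iterable<T> rows, int k, Random random) {
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0;
        for (T row : rows) {
            seen++;
            if (reservoir.size() < k) {
                // Fill phase: the first k rows are always kept.
                reservoir.add(row);
            } else {
                // Replacement phase: keep the new row with probability k / seen.
                int slot = random.nextInt(seen);
                if (slot < k) {
                    reservoir.set(slot, row);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> rowIds = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rowIds.add(i);
        }
        // A fixed seed makes the run reproducible, which helps when benchmarking.
        System.out.println(sample(rowIds, 10, new Random(42)));
    }
}
```

Uniform sampling is only a baseline. Because a single violating pair of rows invalidates an FD or UCC, sampling schemes that favor rows likely to expose violations are one direction a team could explore.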

The students turn their ideas into working algorithms. There are two main goals for each algorithm:
1) Find a set of dependencies that is close to the actual solution.
2) Minimize the required runtime.
Datasets for benchmarking are provided to the students.
Finally, the students present their approaches and write a short report.
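
The first goal, closeness to the actual solution, needs a concrete measure. The seminar page does not name one, so the sketch below assumes precision, recall, and F1 over sets of dependencies, with each dependency encoded as a string such as `"A->B"`. The class name `DiscoveryQuality` and this encoding are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration: compares discovered dependencies with the exact result.
final class DiscoveryQuality {

    // Number of discovered dependencies that are also in the exact result.
    static <T> int overlap(Set<T> found, Set<T> truth) {
        Set<T> common = new HashSet<>(found);
        common.retainAll(truth);
        return common.size();
    }

    // Fraction of discovered dependencies that are correct (1.0 by convention if none were reported).
    static <T> double precision(Set<T> found, Set<T> truth) {
        return found.isEmpty() ? 1.0 : (double) overlap(found, truth) / found.size();
    }

    // Fraction of true dependencies that were discovered (1.0 by convention if there are none).
    static <T> double recall(Set<T> found, Set<T> truth) {
        return truth.isEmpty() ? 1.0 : (double) overlap(found, truth) / truth.size();
    }

    // Harmonic mean of precision and recall.
    static <T> double f1(Set<T> found, Set<T> truth) {
        double p = precision(found, truth);
        double r = recall(found, truth);
        return (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        Set<String> truth = new HashSet<>(Arrays.asList("A->B", "B->C", "A->C"));
        Set<String> found = new HashSet<>(Arrays.asList("A->B", "A->C", "C->D"));
        System.out.printf("precision=%.3f recall=%.3f f1=%.3f%n",
                precision(found, truth), recall(found, truth), f1(found, truth));
    }
}
```

Exact set overlap treats a near miss, for example a non-minimal FD, the same as an unrelated wrong result. Whether that is appropriate depends on the evaluation metrics agreed on in the seminar.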

Literature

Time Table

Date                   Room       Topic
October 20, 1:30pm     F-E.06     Seminar introduction
October 27, 1:30pm     F-2.10     Intro data profiling + Metanome
November 3, 1:30pm     F-2.10     Exact discovery algorithms on a sample
November 10, 1:30pm    F-2.10     Exact discovery algorithms on a sample (2)
November 17, 1:30pm    F-2.10     Approximate discovery algorithms and evaluation metrics
November 24, 1:30pm    F-2.10     Progress reports
December 1, 1:30pm     F-2.10     Progress reports
December 8, 12:45pm    F-2.10     Midterm presentations (overview of the exploration results and decision on one approach)
December 15, 1:30pm    F-2.10     Weekly meeting
January 5, 1:30pm      F-2.10     Weekly meeting
January 12, 1:30pm     F-2.10     Weekly meeting
January 19, 1:30pm     F-2.10     Weekly meeting
January 26, 1:30pm     F-E.0.6    Weekly meeting
January 27, 1:30pm     F-2.10     Optional session: Giving Scientific Presentations
February 2, 1:30pm     F-E.06     Final presentations
February 9, 1:30pm     F-E.06     Discuss paper-style submission
March 17, 2023                    Submission deadline
  

Goals

  • Learn about the research area of data profiling
  • Read papers and understand them
  • Craft a novel solution to the problem of sample-based profiling
  • Run experiments and evaluate results
  • Present results in written and oral form

Organization

General

  • Seminar for master's students
  • Language of instruction: English
  • Maximum number of participants: 6

Topics will be presented in the first session (October 20, 2022, 1:30pm, F-E.06). To be assigned to a group, each participant has to send us an individual email.

Requirements

We do not require any prior knowledge about data profiling.

However, there are some requirements for participating in the course:

  • Interest in the topic
  • Interest in working with large datasets
  • Java (at least basic skills)

Grading

In the seminar, each participant will develop an approach in the research area of sampling-based data profiling and write a short report. The final grade consists of the following three parts:

  • Approach (35%)
  • Written report (35%)
  • Presentations and discussions in the seminar (30%)