Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Results

  • Algorithms and Datasets
  • This seminar was the basis for a successful submission to the VLDB 2015 Experiments & Analysis Track.

Description

Data profiling is the process of automatically analyzing a given dataset to extract its metadata. The different techniques used in this process reveal intra-column properties, inter-column dependencies, and various table-wide characteristics. Once determined, this metadata enables the data owner to detect errors, integrate other sources, normalize schemata, or define additional attribute properties.

The information systems group is currently developing a profiling platform called Metanome, which incorporates various algorithms for the discovery of Inclusion Dependencies, Functional Dependencies, Unique Column Combinations, and various other metrics. In this seminar, we join the Metanome project and design advanced profiling algorithms to be used in practice. More specifically, we examine different algorithms for the identification of Functional Dependencies, improve their performance, and finally integrate them into Metanome.
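To make the notion of a Functional Dependency concrete: an FD X → A holds on a table if any two rows that agree on all columns in X also agree on column A. The following minimal sketch validates one given FD candidate by a single pass over the rows. It is only an illustration and not one of the seminar algorithms; it does not use the Metanome interface, and the class name `NaiveFdCheck`, the method `holds`, and the sample data are hypothetical. It also assumes that the table contains no null values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Metanome.
public class NaiveFdCheck {

    // Returns true if the FD lhs -> rhs holds on the given rows.
    // lhs and rhs are column indexes; values are assumed to be non-null.
    static boolean holds(List<String[]> rows, int[] lhs, int rhs) {
        // Maps each combination of left-hand-side values to the
        // right-hand-side value that was first seen with it.
        Map<List<String>, String> seen = new HashMap<>();
        for (String[] row : rows) {
            List<String> key = new ArrayList<>();
            for (int column : lhs) {
                key.add(row[column]);
            }
            String previous = seen.putIfAbsent(key, row[rhs]);
            if (previous != null && !previous.equals(row[rhs])) {
                return false; // same lhs values, different rhs value
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up sample table with the columns zip, city, name.
        List<String[]> rows = List.of(
            new String[] {"14482", "Potsdam", "Alice"},
            new String[] {"14482", "Potsdam", "Bob"},
            new String[] {"10115", "Berlin", "Alice"});

        System.out.println(holds(rows, new int[] {0}, 1)); // zip -> city
        System.out.println(holds(rows, new int[] {2}, 1)); // name -> city
    }
}
```

On the sample table, zip → city holds, whereas name → city is violated by the two rows for Alice. Checking a single candidate is cheap; the difficulty that discovery algorithms address is that the number of candidate left-hand sides grows exponentially with the number of columns.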

Literature

The algorithms that we will look at in this seminar are the following:

The following paper also focusses on FD_Mine, but since FD_Mine is the most recent approach for the identification of FDs, it nicely compares all previous algorithms:

Organization

Students form teams of two members. Each team is assigned one profiling algorithm and the corresponding publication. After studying this (and further) literature, the teams should implement and evaluate their algorithms. All implementations need to implement the Metanome interface to be compatible with this project. Furthermore, we expect each team to find or produce its own dataset to evaluate its implementation. These datasets should also be passed on to the other teams so that we can compare the different approaches in the end. To present the baseline algorithms and the results of the first phase to the whole group, all teams will give short mid-term presentations.

In the second half of the seminar, each team tries to enhance its algorithm in at least one direction. Possible enhancements may enable the conditional, heuristic, incremental, or scalable identification of FDs. Finally, the team members should report on the effects and the quality of their individual approaches in an end-term presentation. To conclude the seminar, each team needs to prepare a paper-style submission of 4 pages.

Time schedule

To participate in this seminar, please join our first meeting on October 14, 2013 in A-2.2. We will present an overview of possible topics, from which you can choose. The seminar is restricted to 6 participants, who will be selected randomly.

  • To register, send an email to Thorsten Papenbrock
  • The deadline for registration is October 20, 2013

Grading

The final grade carries a weight of 6 LP (credit points) and considers the following:

  • Active participation in meetings and discussions
  • Implementation of the baseline algorithm using the Metanome interface
  • Implementation of (at least one) algorithmic enhancement
  • Mid-term presentation
  • End-term presentation
  • Final paper-style submission

Prerequisites

  • Java programming skills are required, since Metanome is written in Java
  • Knowledge of data profiling, and of functional dependencies in particular (e.g., from the Data Profiling and Data Cleansing lecture), is helpful but not required; it can also be acquired in this course.