Advisors

Results

Algorithms and Datasets
This seminar was the basis for a successful submission to the VLDB 2015 Experiments & Analysis Track.

Description

Data profiling is the process of automatically analyzing a given dataset to extract its metadata. The different techniques used in this process reveal intra-column properties, inter-column dependencies, and various table-wide characteristics. Once determined, this metadata enables the data owner to detect errors, integrate other sources, normalize schemata, or define additional attribute properties.
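To make the notion of intra-column properties concrete, the following minimal Java sketch computes three simple column statistics for a small in-memory table. The class name ColumnProfile, its methods, and the sample rows are hypothetical and chosen only for illustration; they are not part of Metanome or of any algorithm covered in this seminar.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Metanome.
final class ColumnProfile {

    /** Counts the distinct non-null values in the given column. */
    static int distinctCount(List<String[]> rows, int column) {
        Set<String> seen = new HashSet<>();
        for (String[] row : rows) {
            if (row[column] != null) {
                seen.add(row[column]);
            }
        }
        return seen.size();
    }

    /** Counts the null values in the given column. */
    static int nullCount(List<String[]> rows, int column) {
        int nulls = 0;
        for (String[] row : rows) {
            if (row[column] == null) {
                nulls++;
            }
        }
        return nulls;
    }

    /** A column is treated as unique here if it has no nulls and no repeated value. */
    static boolean isUnique(List<String[]> rows, int column) {
        return distinctCount(rows, column) == rows.size();
    }

    public static void main(String[] args) {
        // Made-up sample data: zip code, city, name.
        List<String[]> rows = Arrays.asList(
                new String[] {"10115", "Berlin", "Alice"},
                new String[] {"10115", "Berlin", "Bob"},
                new String[] {"14482", "Potsdam", null});

        System.out.println("distinct zip codes: " + distinctCount(rows, 0));
        System.out.println("nulls in name:      " + nullCount(rows, 2));
        System.out.println("zip code unique:    " + isUnique(rows, 0));
    }
}
```

Real profiling tools compute many more statistics and must scale to large inputs; the sketch only shows the kind of metadata meant by an intra-column property.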

The information systems group is currently developing a profiling platform called Metanome, which incorporates various algorithms for the discovery of Inclusion Dependencies, Functional Dependencies, Unique Column Combinations, and various other metrics. In this seminar, we join the Metanome project and design advanced profiling algorithms to be used in practice. More specifically, we examine different algorithms for the identification of Functional Dependencies, improve their performance, and finally integrate them into Metanome.
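As a reminder of what a Functional Dependency is, the following minimal Java sketch checks whether a single candidate dependency X -> A holds in a small in-memory table: the dependency holds if any two rows that agree on all columns in X also agree on column A. The class FdCheck, the method holds, and the sample rows are hypothetical; the sketch is a naive validity check, not one of the discovery algorithms studied in this seminar, and it does not use the Metanome interface. Treating two null values as equal is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical helper for illustration only; not part of Metanome.
final class FdCheck {

    /**
     * Returns true if the columns in lhs functionally determine column rhs,
     * i.e. no two rows agree on all lhs columns but differ in rhs.
     * Assumption of this sketch: null is treated as an ordinary value.
     */
    static boolean holds(List<String[]> rows, int[] lhs, int rhs) {
        // Maps each combination of lhs values to the rhs value first seen with it.
        Map<List<String>, String> seen = new HashMap<>();
        for (String[] row : rows) {
            List<String> key = new ArrayList<>(lhs.length);
            for (int column : lhs) {
                key.add(row[column]);
            }
            if (seen.containsKey(key)) {
                if (!Objects.equals(seen.get(key), row[rhs])) {
                    return false; // same lhs values, different rhs value
                }
            } else {
                seen.put(key, row[rhs]);
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up sample data: zip code, city, name.
        List<String[]> rows = Arrays.asList(
                new String[] {"10115", "Berlin", "Alice"},
                new String[] {"10115", "Berlin", "Bob"},
                new String[] {"14482", "Potsdam", "Carol"},
                new String[] {"14467", "Potsdam", "Dave"});

        System.out.println("zip -> city: " + holds(rows, new int[] {0}, 1)); // true
        System.out.println("city -> zip: " + holds(rows, new int[] {1}, 0)); // false
    }
}
```

Checking one candidate in this way takes a single pass over the table. The difficulty of FD discovery lies in the number of candidates, which grows exponentially with the number of columns, and pruning that search space is what the algorithms in this seminar address.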

Literature

The algorithms that we will look at in this seminar are the following:

TANE: Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen, TANE: An efficient algorithm for discovering functional and approximate dependencies, The Computer Journal, vol. 42, no. 2, pp. 100-111, 1999.
fdep: P. A. Flach and I. Savnik, Database Dependency Discovery: A Machine Learning Approach, AI Communications, vol. 12, no. 3, pp. 139-160, 1999.
Dep-Miner: S. Lopes, J.-M. Petit, and L. Lakhal, Efficient Discovery of Functional Dependencies and Armstrong Relations, in Proceedings of the International Conference on Extending Database Technology (EDBT), 2000.
FastFDs: C. M. Wyss, C. Giannella, and E. L. Robertson, FastFDs: A Heuristic-Driven, Depth-First Algorithm for Mining Functional Dependencies from Relation Instances, in Proceedings of the International Conference on Data Warehousing and Knowledge Discovery (DaWaK), 2001.
FUN: N. Novelli and R. Cicchetti, FUN: An Efficient Algorithm for Mining Functional and Embedded Dependencies, in Proceedings of the International Conference on Database Theory (ICDT), 2001.
FD_Mine: H. Yao, H. J. Hamilton, and C. J. Butz, FD_Mine: Discovering functional dependencies in a database using equivalences, in Proceedings of the IEEE International Conference on Data Mining (ICDM), 2002.

The following paper also focuses on FD_Mine. Because FD_Mine is the most recent of these approaches to FD discovery, the paper includes a useful comparison of all the earlier algorithms:

H. Yao and H. J. Hamilton, Mining functional dependencies from data, Data Mining and Knowledge Discovery, vol. 16, no. 2, pp. 197-219, 2008.

Organization

Students form teams of two members. Each team is assigned one profiling algorithm and the corresponding publication. After studying this (and further) literature, the teams implement and evaluate their algorithms. All implementations need to implement the Metanome interface to be compatible with the project. Furthermore, we expect each team to find or produce its own dataset to evaluate its implementation. These datasets should also be shared with the other teams so that we can compare the different approaches in the end. To present the baseline algorithms and the results of the first phase to the whole group, all teams will give short mid-term presentations.

In the second half of the seminar, each team tries to enhance its algorithm in at least one direction. Possible enhancements include the conditional, heuristic, incremental, or scalable identification of FDs. Finally, the team members report on the effects and the quality of their individual approaches in an end-term presentation. To conclude the seminar, each team needs to prepare a paper-style submission of 4 pages.

Time schedule

To participate in this seminar, please join our first meeting on October 14, 2013, in A-2.2. We will present an overview of the possible topics, from which you can choose. The seminar is restricted to 6 participants, who will be selected randomly.

To register, send an email to Thorsten Papenbrock.
The deadline for registration is October 20, 2013.

Slides

Grading

The final grade is weighted with 6 LP (credit points) and takes the following into account:

Active participation in meetings and discussions
Implementation of the baseline algorithm using the Metanome interface
Implementation of (at least one) algorithmic enhancement
Mid-term presentation
End-term presentation
Final paper-style submission

Prerequisites

Java programming skills are required, since Metanome is written in Java.
Knowledge of data profiling, and of functional dependencies in particular (e.g., from the Data Profiling and Data Cleansing lecture), is helpful but not a prerequisite; it can also be acquired in this course.