Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Description

Data profiling is the process of automatically analyzing a given dataset for metadata. The techniques used in this process reveal intra-column properties, inter-column dependencies, and various table-wide characteristics. Once determined, this metadata enables the data owner to detect errors, integrate other sources, normalize schemata, or define additional attribute properties.

The information systems group has developed a profiling platform called Metanome, which incorporates various algorithms for the discovery of Inclusion Dependencies, Functional Dependencies, Unique Column Combinations, and other kinds of metadata. In this seminar, we join the Metanome project and design advanced profiling algorithms to be used in practice. More specifically, we examine different algorithms for the identification of Inclusion Dependencies, improve their performance, and finally integrate them into Metanome.
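For illustration: a unary inclusion dependency (IND) A ⊆ B states that every value appearing in column A also appears in column B. The following is a minimal, hypothetical Java sketch of how a single such IND could be checked naively on in-memory columns; the class and method names are purely illustrative and not part of Metanome. The algorithms listed under Literature tackle exactly this task, but at scale and for all candidate column combinations at once.

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration: a unary IND A ⊆ B holds if every (non-null)
// value of column A also appears in column B.
public class NaiveIndCheck {

    static boolean isIncludedIn(List<String> dependent, List<String> referenced) {
        // Build the set of referenced values once ...
        Set<String> referencedValues = new HashSet<>(referenced);
        // ... then probe it with every dependent value.
        for (String value : dependent) {
            if (value != null && !referencedValues.contains(value)) {
                return false; // counterexample found: the IND is violated
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> ordersCustomerId = Arrays.asList("1", "2", "2", "3");
        List<String> customerId       = Arrays.asList("1", "2", "3", "4");

        // orders.customer_id ⊆ customer.id -> true
        System.out.println(isIncludedIn(ordersCustomerId, customerId));
        // customer.id ⊆ orders.customer_id -> false ("4" has no match)
        System.out.println(isIncludedIn(customerId, ordersCustomerId));
    }
}

Checking every column pair this way quickly becomes infeasible for wide schemas and n-ary INDs, which is why the algorithms studied in this seminar rely on techniques such as sorting, partitioning, and candidate pruning instead.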

Schedule

The seminar takes place on Mondays from 1:30 pm to 3:00 pm at Campus III, room G-3.E.15/16.

Date               | Topic                                                                                | Slides
Mo 16.10.          | Introduction                                                                         | pdf
Mo 23.10.          | Data profiling / Efficient Java code                                                 | pdf
Mo 30.10.          | Group allocation                                                                     | pdf
06.11. – 10.11.    | Individual meetings                                                                  |
Mo 13.11.          | Algorithm presentations                                                              |
Mo 20.11. – 18.12. | Weekly meetings to discuss the implementation                                        |
                   | Christmas vacation                                                                   |
Mo 08.01.2018      | Intermediate presentation (implementation)                                           |
Mo 15.01. – 29.01. | Weekly meetings to discuss the repeatability algorithms                              |
Mo 05.02.          | Deadline to finalize the repeatability algorithms (start of the optimization phase)  |
Mo 12.02.          | Deadline to start the main experiments (start of the write-up)                       |
Mo 19.02.          | Start of individual experiments                                                      |
Fr 23.02.          | Deadline to finish the main experiments                                              |
Mo 26.02.          | Final presentation                                                                   |

Literature

The algorithms that we will look at in this seminar are the following:

Bell&Brockhausen: Bell, Siegfried, and Peter Brockhausen. 1995. “Discovery of Data Dependencies in Relational Databases.” Statistics, Machine Learning and Knowledge Discovery in Databases, ML–Net Familiarization Workshop, 53–58.

Zigzag: Marchi, Fabien De, and Jean-Marc Petit. 2003. “Zigzag: A New Algorithm for Mining Large Inclusion Dependencies in Databases.” In Proceedings of the International Conference on Data Mining (ICDM), 27–34.

FIND2: Koeller, Andreas, and Elke A. Rundensteiner. 2003. “Discovery of High-Dimensional Inclusion Dependencies.” In Proceedings of the International Conference on Data Engineering (ICDE), 683–685.

SPIDER: Bauckmann, Jana, Ulf Leser, Felix Naumann, and Veronique Tietz. 2007. “Efficiently Detecting Inclusion Dependencies.” In Proceedings of the International Conference on Data Engineering (ICDE), 1448–1450.

deMarchi/MIND: Marchi, Fabien De, Stéphane Lopes, and Jean-Marc Petit. 2009. “Unary and N-Ary Inclusion Dependency Discovery in Relational Databases.” Journal of Intelligent Information Systems 32 (1): 53–73.

BINDER: Papenbrock, Thorsten, Sebastian Kruse, Jorge-Arnulfo Quiané-Ruiz, and Felix Naumann. 2015. “Divide & Conquer-Based Inclusion Dependency Discovery.” Proceedings of the VLDB Endowment 8: 774–785.

S-INDD: Shaabani, Nuhad, and Christoph Meinel. 2015. “Scalable Inclusion Dependency Discovery.” In Proceedings of the International Conference on Database Systems for Advanced Applications (DASFAA), 425–440.

SINDY: Kruse, Sebastian, Thorsten Papenbrock, and Felix Naumann. 2015. “Scaling Out the Discovery of Inclusion Dependencies.” In Proceedings of the Conference Database Systems for Business, Technology and Web (BTW), 445–454.

MIND2: Shaabani, Nuhad, and Christoph Meinel. 2016. “Detecting Maximum Inclusion Dependencies without Candidate Generation.” In Proceedings of the International Conference on Database and Expert Systems Applications (DEXA), 118–133.

MANY: Tschirschnitz, Fabian, Thorsten Papenbrock, and Felix Naumann. 2017. “Detecting Inclusion Dependencies on Very Many Tables.” ACM Transactions on Database Systems (TODS) 1 (1): 1–30.

 

 

Approximate/Incremental/Partial discovery algorithms:

  • Dasu, Tamraparni, Theodore Johnson, S. Muthukrishnan, and Vladislav Shkapenyuk. 2002. “Mining Database Structure; Or, How to Build a Data Quality Browser.” In Proceedings of the International Conference on Management of Data (SIGMOD), 240–251.
  • Zhang, Meihui, Marios Hadjieleftheriou, Beng Chin Ooi, Cecilia M. Procopiuc, and Divesh Srivastava. 2010. “On Multi-Column Foreign Key Discovery.” Proceedings of the VLDB Endowment 3 (1–2): 805–814.

  • Shaabani, Nuhad, and Christoph Meinel. 2017. “Incremental Discovery of Inclusion Dependencies.” In Proceedings of the International Conference on Scientific and Statistical Database Management (SSDBM), 1–12.
  • Papenbrock, Thorsten, Christian Dullweber, Moritz Finke, Sebastian Kruse, Manuel Hegner, Martin Zabel, Christian Zöllner, and Felix Naumann. 2017. “Fast Approximate Discovery of Inclusion Dependencies.” In Proceedings of the Conference Database Systems for Business, Technology and Web (BTW), 207–226.

Other IND-related publications:

  • Lopes, Stéphane, Jean-Marc Petit, and Farouk Toumani. 2002. “Discovering Interesting Inclusion Dependencies: Application to Logical Database Tuning.” Information Systems 27 (1): 1–19.
  • Brown, Paul G., and Peter J. Haas. 2003. “BHUNT: Automatic Discovery of Fuzzy Algebraic Constraints in Relational Data.” In Proceedings of the International Conference on Very Large Data Bases (VLDB), 668–679.
  • Marchi, Fabien De. 2011. “CLIM: Closed Inclusion Dependency Mining in Databases.” In Proceedings of the International Conference on Data Mining Workshops (ICDMW), 1098–1103.
  • Bauckmann, Jana, Ziawasch Abedjan, Ulf Leser, Heiko Müller, and Felix Naumann. 2012. “Discovering Conditional Inclusion Dependencies.” In Proceedings of the International Conference on Information and Knowledge Management (CIKM), 2094–2098.

See the following two articles for an overview of data profiling:

Organization

Students form teams of two members. Each team is assigned a set of IND algorithms and the corresponding publications. After studying this (and further) literature, the teams present their algorithms and, in parallel, implement them. All implementations need to implement the Metanome interface to be compatible with this project (see the sketch below).
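For orientation, the sketch below indicates the rough shape of such an algorithm class. It is only a sketch: the package, interface, and method names (InclusionDependencyAlgorithm, InclusionDependencyResultReceiver, ConfigurationRequirement, etc.) are assumptions based on Metanome's algorithm_integration module and must be verified against the Metanome version used in the seminar.

// Sketch only: names and signatures are assumptions based on Metanome's
// algorithm_integration module; verify them against the version used in the seminar.
import java.util.ArrayList;

import de.metanome.algorithm_integration.AlgorithmExecutionException;
import de.metanome.algorithm_integration.algorithm_types.InclusionDependencyAlgorithm;
import de.metanome.algorithm_integration.configuration.ConfigurationRequirement;
import de.metanome.algorithm_integration.result_receiver.InclusionDependencyResultReceiver;

public class MyIndAlgorithm implements InclusionDependencyAlgorithm {

  private InclusionDependencyResultReceiver resultReceiver;

  @Override
  public void setResultReceiver(InclusionDependencyResultReceiver resultReceiver) {
    // Metanome injects the receiver; every discovered IND is reported through it.
    this.resultReceiver = resultReceiver;
  }

  @Override
  public ArrayList<ConfigurationRequirement<?>> getConfigurationRequirements() {
    // Declare the inputs the algorithm needs (e.g., the relational input files).
    return new ArrayList<>();
  }

  @Override
  public void execute() throws AlgorithmExecutionException {
    // 1. Read the configured relational inputs.
    // 2. Run the assigned discovery algorithm (e.g., SPIDER, BINDER, ...).
    // 3. Report each discovered IND via resultReceiver.receiveResult(...).
  }

  @Override
  public String getAuthors() {
    return "Team members";
  }

  @Override
  public String getDescription() {
    return "Inclusion dependency discovery (seminar implementation)";
  }
}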

Furthermore, we expect each team to design an experiment that can serve as a benchmark of the algorithms in a certain aspect. To this end, they need to find or produce a dataset of their own to evaluate the implementation. These datasets should also be passed to the other teams so that we can compare the different approaches in the end. To present the baseline algorithms and the results of the first phase to the whole group, all teams will give a mid-term presentation.

In the second half of the seminar, each team tries to enhance another team's implementations. The team members then report on their improvements in a final presentation. To conclude the seminar, each team prepares a paper-style submission.

Grading

The seminar is worth 6 LP. The final grade considers the following:

  • Active participation in meetings and discussions
  • Implementation of the baseline algorithm using the Metanome interface
  • Implementation of (at least one) algorithmic enhancement
  • Mid-term presentation
  • End-term presentation
  • Final paper-style submission

Prerequisites

  • Knowledge of Java programming is needed, since Metanome is written in Java
  • Knowledge of data profiling and, in particular, inclusion dependencies (e.g., from the Data Profiling and Data Cleansing lecture) is a nice-to-have, but it is not a prerequisite and can also be acquired in this course.