Advanced Data Profiling (Wintersemester 2017/2018)

Dozent: Prof. Dr. Felix Naumann (Information Systems) , Tobias Bleifuß (Information Systems) , Lan Jiang (Information Systems)
Website zum Kurs: https://hpi.de/naumann/teaching/teaching/ws-1718/advanced-data-profiling-ps-master.html

Allgemeine Information

Semesterwochenstunden: 4
ECTS: 6
Benotet: Ja
Einschreibefrist: 27.10.2017
Lehrform: Projektseminar
Belegungsart: Wahlpflichtmodul
Maximale Teilnehmerzahl: 12

Studiengänge, Modulgruppen & Module

IT-Systems Engineering MA

IT-Systems Engineering
- HPI-ITSE-A Analyse
IT-Systems Engineering
- HPI-ITSE-E Entwurf
IT-Systems Engineering
- HPI-ITSE-K Konstruktion
IT-Systems Engineering
- HPI-ITSE-M Maintenance
OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-K Konzepte und Methoden
OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-S Spezialisierung
OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-T Techniken und Werkzeuge

Beschreibung

Data profiling constitutes the process of automatically analyzing a given dataset for metadata. The different techniques used in this process reveal intra-column properties, inter-column dependencies and various table-wide characteristics. Once determined, this metadata enables the data owner to detect errors, integrate other sources, normalize schemata, or define additional attribute properties.

The information systems group has developed a profiling platform called Metanome, which incorporates various algorithms for the discovery of Inclusion Dependencies, Functional Dependencies, Unique Column Combinations, and various other metrics. In this seminar, we join the Metanome project and design advanced profiling algorithms to be used in practice. More specifically, we examine different algorithms for the identification of Inclusion Dependencies, improve their performance, and finally integrate them into Metanome.

Voraussetzungen

Knowledge in programming Java is needed, since Metanome is written in Java
Knowledge in data profiling and in particular inclusion dependencies (e.g., from the DataProfiling and Data Cleansing lecture) is a nice-to-have, but it is not a prerequisite and can alsobe obtained in this course.

Literatur

The algorithms that we will look at in this seminar are the following:

Bell&Brockhausen: Bell, Siegfried, and Peter Brockhausen. 1995. “Discovery of Data Dependencies in Relational Databases.” Statistics, Machine Learning and Knowledge Discovery in Databases, ML–Net Familiarization Workshop, 53–58.

Zigzag: Marchi, Fabien De, and Jean-Marc Petit. 2003. “Zigzag: A New Algorithm for Mining Large Inclusion Dependencies in Databases.” In Proceedings of the International Conference on Data Mining (ICDM), 27–34.

FIND2: Koeller, Andreas, and Elke. A. Rundensteiner. 2003. “Discovery of High-Dimensional Inclusion Dependencies.” In Proceedings of the International Conference on Data Engineering (ICDE), 683–685.

SPIDER: Bauckmann, Jana, Ulf Leser, Felix Naumann, and Veronique Tietz. 2007. “Efficiently Detecting Inclusion Dependencies.” In Proceedings of the International Conference on Data Engineering (ICDE), 1448–1450.

deMarchi/MIND: Marchi, Fabien De, Stéphane Lopes, and Jean Marc Petit. 2009. “Unary and N-Ary Inclusion Dependency Discovery in Relational Databases.” Journal of Intelligent Information Systems 32 (1): 53–73.

BINDER: Papenbrock, Thorsten, Sebastian Kruse, Jorge-Arnulfo Quiané-Ruiz, and Felix Naumann. 2015. “Divide & Conquer-Based Inclusion Dependency Discovery.” In Proceedings of the VLDB Endowment, 8:774–785.

S-INDD: Shaabani, Nuhad, and Christoph Meinel. 2015. “Scalable Inclusion Dependency Discovery.” In Proceedings of the International Conference on Database Systems for Advanced Applications (DASFAA), 425–440.

SINDY: Kruse, Sebastian, Thorsten Papenbrock, and Felix Naumann. 2015. “Scaling Out the Discovery of Inclusion Dependencies.” In Proceedings of the Conference Database Systems for Business, Technology and Web (BTW), 445–454.

MIND2: Shaabani, Nuhad, and Christoph Meinel. 2016. “Detecting Maximum Inclusion Dependencies without Candidate Generation.” Proceedings of the Conference International Conference on Database and Expert (DEXA), 118–133.

MANY: Tschirschnitz, Fabian, Thorsten Papenbrock, and Felix Naumann. 2017. “Detecting Inclusion Dependencies on Very Many Tables.” ACM Transactions on Database Systems (TODS) 1 (1): 1–30.

Approximate/Incremental/Partial discovery algorithms:

Dasu, Tamraparni, Theodore Johnson, S. Muthukrishnan, and Vladislav Shkapenyuk. 2002. “Mining Database Structure; Or, How to Build a Data Quality Browser.” In Proceedings of the International Conference on Management of Data (SIGMOD), 240–251.
Zhang, Meihui, Marios Hadjieleftheriou, Beng Chin Ooi, Cecilia M. Procopiuc, and Divesh Srivastava. 2010. “On Multi-Column Foreign Key Discovery.” Proceedings of the VLDB Endowment 3 (1–2): 805–814.
Shaabani, Nuhad, and Christoph Meinel. 2017. “Incremental Discovery of Inclusion Dependencies.” In Proceedings of the International Conference on Scientific and Statistical Database Management (SSDBM), 1–12.
Papenbrock, Thorsten, Christian Dullweber, Moritz Finke, Sebastian Kruse, Manuel Hegner, Martin Zabel, Christian Zöllner, and Felix Naumann. 2017. “Fast Approximate Discovery of Inclusion Dependencies.” In Proceedings of the Conference Database Systems for Business, Technology and Web (BTW), 207–226.

Other IND-related publications:

Lopes, Stéphane, Jean-Marc Petit, and Farouk Toumani. 2002. “Discovering Interesting Inclusion Dependencies: Application to Logical Database Tuning.” Information Systems 27 (1): 1–19.
Brown, Paul G., and Peter J. Hass. 2003. “BHUNT: Automatic Discovery of Fuzzy Algebraic Constraints in Relational Data.” In Proceedings of the VLDB Endowment, 668–79. VLDB Endowment.
Marchi, Fabien De. 2011. “CLIM: Closed Inclusion Dependency Mining in Databases.” In Proceedings of the International Conference on Data Mining Workshops (ICDMW), 1098–1103.
Bauckmann, Jana, Ziawasch Abedjan, Ulf Leser, Heiko Müller, and Felix Naumann. 2012. “Discovering Conditional Inclusion Dependencies.” In Proceedings of the International Conference on Information and Knowledge Management (CIKM), 2094–2098.

See the following two articles for an overview on data profiling:

A short introductory article: Felix Naumann, Data Profiling Revisited, SIGMOD Record 2013
A survey: Ziawasch Abedjan, Lukasz Golab and Felix Naumann: Profiling relational data: a survey, VLDB Journal vol 24(4), pages 557 - 581, 2015

Leistungserfassung

The final grade is weighted by 6 LP and considers the following:

Active participation in meetings and discussions
Implementation of the baseline algorithm using the Metanome interface
Implementation of (at least one) algorithmic enhancement
Mid-term presentation
End-term presentation
Final paper-style submission

Zurück