Advanced Data Profiling (Wintersemester 2017/2018)
Lecturer: Prof. Dr. Felix Naumann
- Weekly Hours: 4
- Credits: 6
- Enrolment Deadline: 27.10.2017
- Teaching Form: Project seminar
- Enrolment Type: Compulsory Elective Module
- Maximum number of participants: 12
Programs & Modules
- OSIS-Konzepte und Methoden
- OSIS-Techniken und Werkzeuge
Data profiling constitutes the process of automatically analyzing a given dataset for metadata. The different techniques used in this process reveal intra-column properties, inter-column dependencies and various table-wide characteristics. Once determined, this metadata enables the data owner to detect errors, integrate other sources, normalize schemata, or define additional attribute properties.
The information systems group has developed a profiling platform called Metanome, which incorporates various algorithms for the discovery of Inclusion Dependencies, Functional Dependencies, Unique Column Combinations, and various other metrics. In this seminar, we join the Metanome project and design advanced profiling algorithms to be used in practice. More specifically, we examine different algorithms for the identification of Inclusion Dependencies, improve their performance, and finally integrate them into Metanome.
- Knowledge in programming Java is needed, since Metanome is written in Java
- Knowledge in data profiling and in particular inclusion dependencies (e.g., from the DataProfiling and Data Cleansing lecture) is a nice-to-have, but it is not a prerequisite and can alsobe obtained in this course.
The algorithms that we will look at in this seminar are the following:
Bell&Brockhausen: Bell, Siegfried, and Peter Brockhausen. 1995. “Discovery of Data Dependencies in Relational Databases.” Statistics, Machine Learning and Knowledge Discovery in Databases, ML–Net Familiarization Workshop, 53–58.
Zigzag: Marchi, Fabien De, and Jean-Marc Petit. 2003. “Zigzag: A New Algorithm for Mining Large Inclusion Dependencies in Databases.” In Proceedings of the International Conference on Data Mining (ICDM), 27–34.
FIND2: Koeller, Andreas, and Elke. A. Rundensteiner. 2003. “Discovery of High-Dimensional Inclusion Dependencies.” In Proceedings of the International Conference on Data Engineering (ICDE), 683–685.
SPIDER: Bauckmann, Jana, Ulf Leser, Felix Naumann, and Veronique Tietz. 2007. “Efficiently Detecting Inclusion Dependencies.” In Proceedings of the International Conference on Data Engineering (ICDE), 1448–1450.
deMarchi/MIND: Marchi, Fabien De, Stéphane Lopes, and Jean Marc Petit. 2009. “Unary and N-Ary Inclusion Dependency Discovery in Relational Databases.” Journal of Intelligent Information Systems 32 (1): 53–73.
BINDER: Papenbrock, Thorsten, Sebastian Kruse, Jorge-Arnulfo Quiané-Ruiz, and Felix Naumann. 2015. “Divide & Conquer-Based Inclusion Dependency Discovery.” In Proceedings of the VLDB Endowment, 8:774–785.
S-INDD: Shaabani, Nuhad, and Christoph Meinel. 2015. “Scalable Inclusion Dependency Discovery.” In Proceedings of the International Conference on Database Systems for Advanced Applications (DASFAA), 425–440.
SINDY: Kruse, Sebastian, Thorsten Papenbrock, and Felix Naumann. 2015. “Scaling Out the Discovery of Inclusion Dependencies.” In Proceedings of the Conference Database Systems for Business, Technology and Web (BTW), 445–454.
MIND2: Shaabani, Nuhad, and Christoph Meinel. 2016. “Detecting Maximum Inclusion Dependencies without Candidate Generation.” Proceedings of the Conference International Conference on Database and Expert (DEXA), 118–133.
MANY: Tschirschnitz, Fabian, Thorsten Papenbrock, and Felix Naumann. 2017. “Detecting Inclusion Dependencies on Very Many Tables.” ACM Transactions on Database Systems (TODS) 1 (1): 1–30.
Approximate/Incremental/Partial discovery algorithms:
- Dasu, Tamraparni, Theodore Johnson, S. Muthukrishnan, and Vladislav Shkapenyuk. 2002. “Mining Database Structure; Or, How to Build a Data Quality Browser.” In Proceedings of the International Conference on Management of Data (SIGMOD), 240–251.
Zhang, Meihui, Marios Hadjieleftheriou, Beng Chin Ooi, Cecilia M. Procopiuc, and Divesh Srivastava. 2010. “On Multi-Column Foreign Key Discovery.” Proceedings of the VLDB Endowment 3 (1–2): 805–814.
- Shaabani, Nuhad, and Christoph Meinel. 2017. “Incremental Discovery of Inclusion Dependencies.” In Proceedings of the International Conference on Scientific and Statistical Database Management (SSDBM), 1–12.
- Papenbrock, Thorsten, Christian Dullweber, Moritz Finke, Sebastian Kruse, Manuel Hegner, Martin Zabel, Christian Zöllner, and Felix Naumann. 2017. “Fast Approximate Discovery of Inclusion Dependencies.” In Proceedings of the Conference Database Systems for Business, Technology and Web (BTW), 207–226.
Other IND-related publications:
- Lopes, Stéphane, Jean-Marc Petit, and Farouk Toumani. 2002. “Discovering Interesting Inclusion Dependencies: Application to Logical Database Tuning.” Information Systems 27 (1): 1–19.
- Brown, Paul G., and Peter J. Hass. 2003. “BHUNT: Automatic Discovery of Fuzzy Algebraic Constraints in Relational Data.” In Proceedings of the VLDB Endowment, 668–79. VLDB Endowment.
- Marchi, Fabien De. 2011. “CLIM: Closed Inclusion Dependency Mining in Databases.” In Proceedings of the International Conference on Data Mining Workshops (ICDMW), 1098–1103.
- Bauckmann, Jana, Ziawasch Abedjan, Ulf Leser, Heiko Müller, and Felix Naumann. 2012. “Discovering Conditional Inclusion Dependencies.” In Proceedings of the International Conference on Information and Knowledge Management (CIKM), 2094–2098.
See the following two articles for an overview on data profiling:
The final grade is weighted by 6 LP and considers the following:
- Active participation in meetings and discussions
- Implementation of the baseline algorithm using the Metanome interface
- Implementation of (at least one) algorithmic enhancement
- Mid-term presentation
- End-term presentation
- Final paper-style submission