Advanced Data Profiling (Winter Semester 2023/2024)
Lecturer:
Prof. Dr. Felix Naumann (Information Systems), Sebastian Schmidl (Information Systems), Youri Kaminsky, Daniel Lindner (Information Systems)
Course Website:
https://hpi.de/en/naumann/teaching/current-courses/ws-24-25/advanced-data-profiling.html
General Information
- Weekly Hours: 4
- Credits: 6
- Graded: yes
- Enrolment Deadline: 01.10.2023 - 31.10.2023
- Teaching Form: Project seminar
- Enrolment Type: Compulsory Elective Module
- Course Language: English
- Maximum number of participants: 8
Programs, Module Groups & Modules
- OSIS: Operating Systems & Information Systems Technology
  - HPI-OSIS-K Concepts and Methods
  - HPI-OSIS-S Specialization
  - HPI-OSIS-T Techniques and Tools
- DANA: Data Analytics
  - HPI-DANA-K Concepts and Methods
  - HPI-DANA-T Techniques and Tools
  - HPI-DANA-S Specialization
- CODS: Complex Data Systems
  - HPI-CODS-K Concepts and Methods
  - HPI-CODS-T Techniques and Tools
  - HPI-CODS-S Specialization
- DSYS: Data-Driven Systems
  - HPI-DSYS-C Concepts and Methods
  - HPI-DSYS-T Technologies and Tools
  - HPI-DSYS-S Specialization
Description
Data profiling is the process of extracting metadata from datasets. One important task is the discovery of order dependencies (ODs), which capture order relationships among the attributes of a relational table. There are two prominent ways to express ODs: the list-based form and the set-based canonical form. Current state-of-the-art algorithms for automatic OD discovery use the set-based form because its smaller search space makes discovery more efficient. However, most OD usage scenarios require ODs in their list-based form.
One example of an application of ODs is query optimization: if a user requests a relation ordered by multiple columns, the optimizer can reduce the number of sort operations when a suitable OD holds. Note that the SQL ORDER BY clause takes a list of attributes. While the discovery algorithms output a complete set of minimal set-based ODs, performing the query rewrite requires knowing whether a specific, potentially non-minimal, list-based OD holds. How can we efficiently check whether a given list-based OD can be derived from the set of minimal set-based ODs?
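To make the notion of a list-based OD concrete, the following minimal sketch checks one directly against data. It assumes the common definition that a list of attributes X orders a list Y if, for every pair of tuples s and t, s <= t on X (lexicographically) implies s <= t on Y. The function name `od_holds`, the toy table, and its column names are hypothetical and chosen only for illustration; the quadratic pairwise check is a readable baseline, not an efficient validation method.

```python
def od_holds(rows, lhs, rhs):
    """Return True if the list-based OD lhs -> rhs holds on rows.

    rows: list of dicts (one per tuple); lhs, rhs: lists of attribute names.
    Python compares tuples lexicographically, which matches the ordering
    used by a multi-column ORDER BY.
    """
    def project(row, attrs):
        return tuple(row[a] for a in attrs)

    for s in rows:
        for t in rows:
            # Violation: s precedes (or ties with) t on lhs, but not on rhs.
            if project(s, lhs) <= project(t, lhs) and not (
                project(s, rhs) <= project(t, rhs)
            ):
                return False
    return True


# Hypothetical toy relation.
rows = [
    {"salary": 30000, "tax_rate": 10, "name": "Ana"},
    {"salary": 45000, "tax_rate": 15, "name": "Ben"},
    {"salary": 45000, "tax_rate": 15, "name": "Cem"},
    {"salary": 80000, "tax_rate": 25, "name": "Dua"},
]

print(od_holds(rows, ["salary"], ["tax_rate"]))  # True
print(od_holds(rows, ["salary"], ["name"]))      # False: equal salaries, different names
```

Because [salary] orders [tax_rate] in this toy table, a result sorted by salary is already sorted by (salary, tax_rate), so a query with ORDER BY salary, tax_rate could in principle be rewritten to sort on salary alone.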
Finding a solution to the task is non-trivial due to the following three technical challenges:
- the complex transformation between the list-based and set-based forms (factorial complexity)
- the implementation of the known OD inference axioms in a membership test algorithm
- the need for an efficient data structure to access a potentially large collection of valid ODs (hundreds of thousands)
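As a possible starting point for these challenges, the sketch below translates a queried list-based OD into set-based pieces and looks each piece up in an index of minimal ODs. It assumes the decomposition described in the set-based OD literature: X orders Y if every attribute of Y is constant within the context set(X), and each pair X_i, Y_j is order compatible within the context formed by the preceding prefixes. All names (`list_to_set_based`, `ODIndex`, `derivable`) and the tuple encoding are hypothetical. The lookup applies only a few rules that are sound (trivial ODs, context augmentation, and order compatibility following from constancy); it makes no claim to implement the full set of inference axioms, and the linear scan over contexts is a placeholder for a proper subset-query structure.

```python
from collections import defaultdict

# Hypothetical encoding of a canonical set-based OD: (context, kind, attrs)
#   kind "const":  context: [] -> A   (A is constant within each context class)
#   kind "compat": context: A ~ B     (A, B are order compatible within each class)


def list_to_set_based(lhs, rhs):
    """Decompose the list-based OD lhs -> rhs into set-based pieces."""
    parts = set()
    for a in rhs:
        parts.add((frozenset(lhs), "const", (a,)))
    for i, x in enumerate(lhs):
        for j, y in enumerate(rhs):
            context = frozenset(lhs[:i]) | frozenset(rhs[:j])
            parts.add((context, "compat", tuple(sorted((x, y)))))
    return parts


class ODIndex:
    """Index minimal set-based ODs by (kind, attrs) for context lookups."""

    def __init__(self, minimal_ods):
        self._contexts = defaultdict(list)
        for context, kind, attrs in minimal_ods:
            self._contexts[(kind, tuple(sorted(attrs)))].append(frozenset(context))

    def implied(self, od):
        context, kind, attrs = od
        # Trivial cases: an attribute already fixed by the context, or A ~ A.
        if any(a in context for a in attrs):
            return True
        if kind == "compat" and attrs[0] == attrs[1]:
            return True
        # Augmentation: a stored OD with a smaller context implies this one.
        if any(c <= context for c in self._contexts.get((kind, attrs), [])):
            return True
        # An attribute that is constant in the context is compatible with anything.
        if kind == "compat":
            return any(self.implied((context, "const", (a,))) for a in attrs)
        return False


def derivable(index, lhs, rhs):
    """Check whether every set-based piece of lhs -> rhs is implied by the index."""
    return all(index.implied(part) for part in list_to_set_based(lhs, rhs))


# Hypothetical minimal set-based ODs for a relation (salary, tax_rate, name).
index = ODIndex([
    (frozenset(), "compat", ("salary", "tax_rate")),
    (frozenset({"salary"}), "const", ("tax_rate",)),
])

print(derivable(index, ["salary"], ["tax_rate"]))          # True
print(derivable(index, ["salary", "name"], ["tax_rate"]))  # True (non-minimal)
print(derivable(index, ["tax_rate"], ["salary"]))          # False
```

Translating only the queried OD into set-based form keeps the number of lookups at roughly |X| * |Y| + |Y| per query in this sketch; whether such an approach is complete and fast enough for hundreds of thousands of stored ODs is exactly the kind of question the seminar task raises.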
Requirements
- Prior knowledge of data profiling (preferably a completed Data Profiling lecture)
- Good programming skills in a major programming language