Hasso-Plattner-Institut: 25 Years of HPI

Advanced Data Profiling (Winter Semester 2023/2024)

Lecturers: Prof. Dr. Felix Naumann (Information Systems), Sebastian Schmidl (Information Systems), Youri Kaminsky, Daniel Lindner (Information Systems)
Course website: https://hpi.de/naumann/teaching/current-courses/ws-24-25/advanced-data-profiling.html

General Information

  • Weekly hours per semester: 4
  • ECTS: 6
  • Graded: Yes
  • Enrollment period: 01.10.2023 - 31.10.2023
  • Course format: Project seminar
  • Enrollment type: Compulsory elective module
  • Teaching language: English
  • Maximum number of participants: 8

Degree Programs, Module Groups & Modules

IT-Systems Engineering MA
Data Engineering MA
Software Systems Engineering MA
  • DSYS: Data-Driven Systems
    • HPI-DSYS-C Concepts and Methods
  • DSYS: Data-Driven Systems
    • HPI-DSYS-T Technologies and Tools
  • DSYS: Data-Driven Systems
    • HPI-DSYS-S Specialization

Description

Data profiling is the process of extracting metadata from datasets. One important task is the discovery of order dependencies (ODs), which capture order relationships among the attributes of a relational table. ODs can be expressed in two prominent ways: the list-based form and the set-based canonical form. Current state-of-the-art discovery algorithms use the set-based form because its smaller search space makes discovery more efficient. However, most usage scenarios require ODs in their list-based form.

One example is query optimization: if a user requests a relation ordered by multiple columns and a suitable OD holds, the optimizer can reduce the number of sort operations it performs. Note that the SQL ORDER BY clause takes a list of attributes. The discovery algorithms output a complete set of minimal set-based ODs, whereas a query rewrite requires knowing whether a specific, potentially non-minimal, list-based OD holds. This raises the question: how can we efficiently check whether a given list-based OD can be derived from the set of minimal set-based ODs?
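To make the notion of a list-based OD concrete, the following minimal sketch checks one directly against data. It assumes the usual definition from the OD literature: the OD X ↦ Y holds if every pair of tuples that is ordered by the attribute list X is also ordered by the attribute list Y (lexicographically). The function name `holds_list_od` and the sample table are hypothetical and only serve as an illustration; the check is quadratic in the number of rows and is not one of the discovery algorithms mentioned above.

```python
def holds_list_od(rows, lhs, rhs):
    """Check the list-based OD lhs -> rhs on a list of dict rows.

    Assumed definition: for all rows s, t, if s <= t under the
    lexicographic order on lhs, then s <= t under the order on rhs.
    """
    def key(row, attrs):
        return tuple(row[a] for a in attrs)

    return all(
        key(s, rhs) <= key(t, rhs)
        for s in rows
        for t in rows
        if key(s, lhs) <= key(t, lhs)
    )


# Hypothetical sample relation.
rows = [
    {"month": 7, "quarter": 3, "revenue": 40},
    {"month": 1, "quarter": 1, "revenue": 90},
    {"month": 12, "quarter": 4, "revenue": 10},
    {"month": 2, "quarter": 1, "revenue": 55},
    {"month": 4, "quarter": 2, "revenue": 70},
]

month_orders_quarter = holds_list_od(rows, ["month"], ["quarter"])
quarter_orders_month = holds_list_od(rows, ["quarter"], ["month"])

# Because [month] -> [quarter] holds on this data, sorting by month alone
# already satisfies ORDER BY quarter, month.
by_quarter_month = sorted(rows, key=lambda r: (r["quarter"], r["month"]))
by_month = sorted(rows, key=lambda r: r["month"])
```

On this sample, `month_orders_quarter` is true and `quarter_orders_month` is false, and both sorted lists are identical. This is the kind of rewrite an optimizer could apply if it knew that the OD holds.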

Solving this task is non-trivial because of three technical challenges:

  • the complex transformation between the list-based and set-based forms (factorial complexity)
  • the implementation of the known OD inference axioms as a membership test algorithm
  • the need for an efficient data structure that provides access to a potentially large collection of valid ODs (hundreds of thousands)
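The three challenges can be illustrated with a small, simplified sketch. It assumes the mapping described in the set-based OD literature, under which a list-based OD X ↦ Y is equivalent to a collection of canonical ODs: a constancy OD "set(X): [] ↦ A" for every A in Y, and an order-compatibility OD "{X_1..X_(i-1), Y_1..Y_(j-1)}: X_i ~ Y_j" for every pair of positions i, j. The class `CanonicalODIndex`, its methods, and the function `list_od_follows` are hypothetical names. The sketch applies only three inference rules (triviality, augmentation of the context, and constancy implying compatibility) and scans contexts linearly, so it is a starting point under these assumptions and not a complete or efficient membership test.

```python
class CanonicalODIndex:
    """Hypothetical index over minimal set-based ODs, keyed by right-hand side."""

    def __init__(self):
        self._constant = {}    # attribute -> list of context sets
        self._compatible = {}  # frozenset({A, B}) -> list of context sets

    def add_constancy(self, context, attr):
        # Stores "context: [] -> attr".
        self._constant.setdefault(attr, []).append(frozenset(context))

    def add_compatibility(self, context, a, b):
        # Stores "context: a ~ b" (symmetric in a and b).
        key = frozenset((a, b))
        self._compatible.setdefault(key, []).append(frozenset(context))

    def is_constant(self, context, attr):
        context = frozenset(context)
        if attr in context:  # trivial: attr is fixed by its own context
            return True
        # Augmentation: a stored OD with a smaller context still holds.
        return any(c <= context for c in self._constant.get(attr, ()))

    def is_compatible(self, context, a, b):
        context = frozenset(context)
        # Trivial cases, and constancy implying compatibility.
        if a == b or self.is_constant(context, a) or self.is_constant(context, b):
            return True
        stored = self._compatible.get(frozenset((a, b)), ())
        return any(c <= context for c in stored)


def list_od_follows(index, lhs, rhs):
    """Check lhs -> rhs via the assumed list-to-set decomposition."""
    if not all(index.is_constant(lhs, a) for a in rhs):
        return False
    for i, x in enumerate(lhs):
        for j, y in enumerate(rhs):
            context = set(lhs[:i]) | set(rhs[:j])
            if not index.is_compatible(context, x, y):
                return False
    return True


# Hypothetical discovery result for a table with year, month, quarter.
index = CanonicalODIndex()
index.add_constancy({"month"}, "quarter")           # {month}: [] -> quarter
index.add_compatibility(set(), "month", "quarter")  # {}: month ~ quarter

minimal_case = list_od_follows(index, ["month"], ["quarter"])
non_minimal_case = list_od_follows(index, ["year", "month"], ["year", "quarter"])
reverse_case = list_od_follows(index, ["quarter"], ["month"])
```

With these two stored ODs, `minimal_case` and `non_minimal_case` evaluate to true and `reverse_case` to false. The linear scan over contexts in `is_constant` and `is_compatible` is the part that would need a more efficient structure, for example a trie over sorted context sets, once the collection grows to hundreds of thousands of ODs.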

Prerequisites

  • Prior knowledge of data profiling (preferably a completed Data Profiling lecture)
  • Good programming skills in a major programming language