Approximate Data Profiling (Winter Semester 2022/2023)
Lecturers: Prof. Dr. Felix Naumann (Information Systems), Tobias Bleifuß (Information Systems), Youri Kaminsky
Course website:
https://hpi.de/naumann/teaching/current-courses/ws-22-23/approximate-data-profiling.html
General Information
- Weekly hours per semester: 4
- ECTS: 6
- Graded: Yes
- Enrollment period: 01.10.2022 - 30.10.2022
- Examination date per §9 (4) BAMA-O: 08.12.2022
- Course type: Project seminar
- Enrollment type: Compulsory elective module
- Teaching language: English
- Maximum number of participants: 6
Degree Programs, Module Groups & Modules
- OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-K Concepts and Methods
- OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-S Specialization
- OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-T Techniques and Tools
- DANA: Data Analytics
- HPI-DANA-K Concepts and Methods
- DANA: Data Analytics
- HPI-DANA-T Techniques and Tools
- DANA: Data Analytics
- HPI-DANA-S Specialization
- CODS: Complex Data Systems
- HPI-CODS-K Concepts and Methods
- CODS: Complex Data Systems
- HPI-CODS-T Techniques and Tools
- CODS: Complex Data Systems
- HPI-CODS-S Specialization
- DSYS: Data-Driven Systems
- HPI-DSYS-C Concepts and Methods
- DSYS: Data-Driven Systems
- HPI-DSYS-T Technologies and Tools
- DSYS: Data-Driven Systems
- HPI-DSYS-S Specialization
Description
Data profiling is the process of extracting metadata from datasets. One important aspect is the discovery of data dependencies, such as Functional Dependencies (FDs), Inclusion Dependencies (INDs) and Unique Column Combinations (UCCs). However, the increasing size of datasets presents a challenge to traditional data profiling approaches. Therefore, this seminar focuses on sampling-based methods for approximate data profiling.
First, the students familiarize themselves with related work as inspiration. Afterwards, each student team develops its own ideas. These can concern either the sampling process itself or the actual discovery on the sample.
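To make the idea of discovery on a sample concrete, here is a minimal sketch, not part of the course material. It assumes a table given as a list of tuples and uses uniform row sampling without replacement; the helper names `sample_rows`, `fd_holds` and `is_unique` as well as the toy table are hypothetical. One property worth noting: an FD or UCC that is violated on a row sample is also violated on the full table, whereas a dependency that holds on the sample may still be invalid on the full data.

```python
import random


def sample_rows(rows, k, seed=0):
    """Uniform random sample of up to k rows without replacement (hypothetical helper)."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


def fd_holds(rows, lhs, rhs):
    """Return True if the FD lhs -> rhs holds on the given rows.

    rows: list of tuples, lhs: tuple of column indices, rhs: a column index.
    """
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in lhs)
        # Two rows that agree on lhs but differ on rhs violate the FD.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def is_unique(rows, cols):
    """Return True if the column combination cols is a UCC on the given rows."""
    keys = [tuple(row[i] for i in cols) for row in rows]
    return len(set(keys)) == len(keys)


# Toy table with columns (id, city, country); the values are illustrative only.
table = [
    ("1", "Berlin", "DE"),
    ("2", "Potsdam", "DE"),
    ("3", "Paris", "FR"),
    ("4", "Berlin", "DE"),
]
sample = sample_rows(table, 3)

# Candidates that hold on the sample are only *approximately* validated:
# they may still be violated by rows that were not sampled.
print(fd_holds(sample, (1,), 2))  # city -> country on the sample
print(is_unique(sample, (1,)))    # is city unique on the sample?
```

In this sketch, whether `city` appears unique on the sample depends on which rows were drawn, which illustrates how sampling can introduce false positives.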
The students turn their ideas into working algorithms. There are two main goals for each algorithm:
1) Find a set of dependencies that is close to the actual solution.
2) Minimize the required runtime.
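The two goals above could be measured as sketched below. The seminar description does not specify an evaluation metric, so the F1 score between the discovered and the exact dependency sets is an assumption made for illustration, and the names `f1_score` and `timed` as well as the example dependencies are hypothetical.

```python
import time


def f1_score(found, truth):
    """F1 score between a discovered dependency set and the exact result."""
    found, truth = set(found), set(truth)
    if not found and not truth:
        return 1.0  # nothing to find and nothing reported
    true_positives = len(found & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(found)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# FDs encoded as (lhs columns, rhs column); illustrative values only.
truth = {(("city",), "country"), (("id",), "city"), (("id",), "country")}
found = {(("city",), "country"), (("id",), "city"), (("country",), "city")}

score, seconds = timed(f1_score, found, truth)
print(f"F1 = {score:.3f}, measured in {seconds:.6f} s")
```

Here two of the three reported FDs are correct and two of the three true FDs are found, so precision and recall are both 2/3. In practice `timed` would wrap the whole sampling and discovery run rather than the metric itself.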
Benchmark datasets are provided to the students.
Finally, the students present their approaches and write a short report.
Literature
Learning and Teaching Methods
Project seminar with weekly meetings, talks, discussions, and report writing
Assessment
Presentation and report
Dates
See the course website.