Data Cleaning & Integration (Wintersemester 2023/2024)
Lecturer: Prof. Dr. Felix Naumann (Information Systems), Fabian Panse (Information Systems), Matteo Paganelli (Information Systems)
Course Website: https://hpi.de/en/naumann/teaching/current-courses/ws-23-24/data-cleaning-and-integration.html
General Information
- Weekly Hours: 2
- Credits: 3
- Graded: yes
- Enrolment Deadline: 01.10.2023 - 31.10.2023
- Examination time §9 (4) BAMA-O: 11.12.2023
- Teaching Form: Seminar
- Enrolment Type: Compulsory Elective Module
- Course Language: English
- Maximum number of participants: 8
Programs, Module Groups & Modules
- OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-K Concepts and Methods
- OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-S Specialization
- OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-T Techniques and Tools
- DANA: Data Analytics
- HPI-DANA-K Concepts and Methods
- DANA: Data Analytics
- HPI-DANA-T Techniques and Tools
- DANA: Data Analytics
- HPI-DANA-S Specialization
- CODS: Complex Data Systems
- HPI-CODS-K Concepts and Methods
- CODS: Complex Data Systems
- HPI-CODS-T Techniques and Tools
- CODS: Complex Data Systems
- HPI-CODS-S Specialization
- DSYS: Data-Driven Systems
- HPI-DSYS-C Concepts and Methods
- DSYS: Data-Driven Systems
- HPI-DSYS-T Technologies and Tools
- DSYS: Data-Driven Systems
- HPI-DSYS-S Specialization
Description
Data integration has been an active field of research for decades, and with increasing digitization it is becoming ever more important. Integrating data, however, means overcoming a range of problems, including schema matching, duplicate detection, and record fusion.
Data quality critically influences the results of data processing and thus directly affects important real-world processes, such as business decisions or patient treatment. Data cleansing is therefore an important aspect of data management. The goal of a data cleansing process is to detect and correct errors in data. Errors can affect single attribute values, but can also span multiple attribute values, records, or even tables. They include typos, phonetic errors, OCR errors, semantic errors, missing values or records, incorrect formatting, incorrect abstraction levels (e.g., district instead of city), obsolete values, incorrect foreign key references, and swaps between two attribute values. Errors can be detected in several ways, for example with statistical methods or with the help of integrity constraints.
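As a rough illustration of the two detection approaches named above, the following minimal Python sketch checks one integrity constraint and applies one statistical test to a handful of made-up records. The records, the attribute names, the assumed functional dependency (zip determines city), and the helper names fd_violations and mad_outliers are all hypothetical and are not taken from the course material. The statistical test uses the modified z-score based on the median absolute deviation, which is only one of many possible choices.

```python
from collections import defaultdict
from statistics import median

# Hypothetical toy records; attribute names and values are illustrative only.
records = [
    {"id": 1, "zip": "14482", "city": "Potsdam", "age": 34},
    {"id": 2, "zip": "14482", "city": "Potsdam", "age": 29},
    {"id": 3, "zip": "14482", "city": "Postdam", "age": 41},  # typo in city
    {"id": 4, "zip": "10115", "city": "Berlin", "age": 38},
    {"id": 5, "zip": "10115", "city": "Berlin", "age": 380},  # implausible age
]


def fd_violations(rows, lhs, rhs):
    """Integrity constraint check for an assumed functional dependency lhs -> rhs.

    Returns the ids of all rows whose lhs value occurs with more than one
    distinct rhs value. The check flags the whole conflicting group; it does
    not decide which of the conflicting values is the wrong one.
    """
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return [row["id"] for row in rows if len(seen[row[lhs]]) > 1]


def mad_outliers(rows, attr, threshold=3.5):
    """Statistical check: flag values with a large modified z-score.

    The score is 0.6745 * |x - median| / MAD, where MAD is the median
    absolute deviation. The threshold of 3.5 is a common rule of thumb.
    """
    values = [row[attr] for row in rows]
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread in the data, so the score is undefined
    return [
        row["id"]
        for row in rows
        if 0.6745 * abs(row[attr] - med) / mad > threshold
    ]


fd_suspects = fd_violations(records, "zip", "city")
age_suspects = mad_outliers(records, "age")
print("FD zip -> city violated by records:", fd_suspects)
print("Age outliers:", age_suspects)
```

On this toy data, the constraint check flags records 1 to 3, because the same zip code appears with two different city spellings, and the statistical check flags record 5. Correcting the flagged values, for instance choosing the majority spelling, is a separate step that the sketch does not attempt.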
Examination
Scientific presentation and scientific report