Data Integration
Data integration is the merging of heterogeneous information from various data sources into a homogeneous, clean dataset. Despite research and development over the past 40 years, collecting and integrating data from multiple sources remains an important and challenging task in any data-oriented or data science project. This lecture covers the basic technologies, such as distributed database architectures, techniques for virtual and materialized integration, data profiling, and data cleansing. It thus combines the previous foundational lectures on information integration and data profiling to lay a foundation for handling unknown data.
Further Information:
- Lectures will be given in English.
- Please enroll yourself in our Moodle course by April 18, because we will use it to coordinate the exercises.
Exercises:
- The exercises are led by Sebastian Schmidl.
- You must pass the exercises to be admitted to the exam.
- In the exercises, you will work in teams of two on four different assignments: two about data profiling (multivalued dependencies) and two about data cleaning (duplicate detection).
- We will probably use five of the normal lecture slots to introduce and discuss the exercises.
Schedule
The course takes place on Mondays at 13:30 and Thursdays at 13:30 in L-E.03. Some sessions will be held as exercise sessions.
| Date | Topic |
|---|---|
| Mon 8.4.2024 | Introduction |
| Thu 11.4.2024 | Introduction |
| Mon 15.4.2024 | Distribution, autonomy, and heterogeneity |
| Thu 18.4.2024 | Exercise session 1: Data Profiling - Validation of Multivalued Dependencies (Publication Sheet 1) |
| Mon 22.4.2024 | Adornments and Data Structures |
| Thu 25.4.2024 | Data Profiling Introduction |
| Mon 29.4.2024 | Unique Column Combinations and Keys |
| Wed 1.5.2024 (23:59) | Deadline Sheet 1! Publication Sheet 2 |
| Thu 2.5.2024 | Integration architectures |
| Mon 6.5.2024 | Exercise session 2: Data Profiling - Discovery of Multivalued Dependencies |
| Thu 9.5.2024 | Ascension |
| Mon 13.5.2024 | Integration architectures (Dr. Fabian Panse) |
| Thu 16.5.2024 | FD-Discovery and Normalization |
| Mon 20.5.2024 | Pentecost |
| Thu 23.5.2024 | FD-Discovery and Normalization |
| Sun 26.5.2024 (23:59) | Deadline Sheet 2, Publication Sheet 3 |
| Mon 27.5.2024 | Schema Mapping and Schema Matching |
| Thu 30.5.2024 | Schema Mapping and Schema Matching |
| Mon 3.6.2024 | Query planning |
| Thu 6.6.2024 | Query planning |
| Mon 10.6.2024 | Exercise session 3: Duplicate Detection - Matching |
| Thu 13.6.2024 | no lecture |
| Mon 17.6.2024 | Data quality and duplicate detection |
| Tue 18.6.2024 (23:59) | Deadline Sheet 3! Publication Sheet 4 |
| Thu 20.6.2024 - moved to L-1.06 | Data quality and duplicate detection |
| Mon 24.6.2024 | Exercise session 4: Duplicate Detection - Blocking |
| Thu 27.6.2024 | Data quality and duplicate detection |
| Mon 1.7.2024 | no lecture |
| Thu 4.7.2024 | Inclusion dependencies |
| Sun 7.7.2024 (23:59) | Deadline Sheet 4! |
| Mon 8.7.2024 | Inclusion dependencies |
| Thu 11.7.2024 | Inclusion dependencies |
| Mon 15.7.2024 | Exercise session 5: Results and exam preparation |
| Thu 18.7.2024 | no lecture |
The exam is scheduled for Monday, July 29, from 10:00 to 12:00 in HS 1.
Literature
- Ulf Leser and Felix Naumann: Informationsintegration, dpunkt Verlag, 2006 (free pdf). This book is available at the UP library and also, e.g., from Amazon.de.
- Doan, Halevy, and Ives: Principles of data integration, Morgan Kaufmann, 2012.
- Ziawasch Abedjan, Lukasz Golab, Felix Naumann, Thorsten Papenbrock: Data Profiling - Synthesis Lectures on Data Management
- Özsu and Valduriez: Principles of distributed database systems, Springer, 2020.
- Stefan Conrad: Föderierte Datenbanksysteme, Springer, 1997.
Throughout the lecture, I will refer to various scientific papers that serve as in-depth references.
Exam
The course grade is based entirely on the written exam (approx. 2 hours) held after the end of the teaching period. Requirements for exam admission are:
- "Passing" all four exercises
- At least one short presentation of an exercise solution