Data Integration
Felix Naumann, Sedir Mohammed
Lecture and exercises for master's students
Data integration is the merging of heterogeneous information from various data sources into a homogeneous, clean dataset. Despite research and development over the past 40 years, collecting and integrating data from multiple sources remains an important and challenging task in any data-oriented or data science project. This lecture covers the basic technologies, such as distributed database architectures, techniques for virtual and materialized integration, data profiling, and data cleansing. It thus combines the previous foundational lectures on information integration and data profiling to lay a foundation for handling unknown data.
Further Information:
- Lectures will be given in English.
- Please enroll yourself in our Moodle course: we will use it to coordinate the exercises.
- The lectures will not be recorded, but a recording from summer 2025 exists on tele-task.
Exercise/Project:
The exercise/project is led by Dr. Sedir Mohammed. The course-accompanying exercise for the lecture Data Integration undergoes a substantial redesign this semester. Instead of working on individual, mainly self-contained exercise tasks, students will participate in a semester-long project in which all participants collaboratively contribute to a larger data integration platform. The aim is to create a practical setting in which core lecture concepts are applied to a realistic end-to-end problem. At the same time, the format is intended to strengthen students’ experience with project work in larger teams, including both collaboration within teams and coordination across team boundaries.
The proposed application scenario is the integration of publication metadata associated with HPI researchers. Concretely, the project aims to collect, integrate, clean, store, and analyze publication information from multiple external sources, including DBLP, OpenAlex, Crossref and PubMed. This use case is particularly well-suited for the lecture since it naturally exposes students to heterogeneous sources, differences in schema and representation, duplicate records, data cleaning and fusion, integrated storage, and the analytical use of the resulting dataset. These are precisely the kinds of problems addressed in the lecture, including duplicate detection, schema mapping, materialized integration architectures, and the discovery of functional dependencies, whose results are then used as signals for schema refinement and normalization.
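To make these steps concrete, here is a minimal sketch of schema mapping followed by a simple duplicate detection pass. Everything in it is an assumption made for illustration: the two record layouts ("source A" and "source B") are hypothetical and do not reproduce the actual response formats of DBLP, OpenAlex, Crossref, or PubMed, the sample records and DOI are invented, and exact matching on a normalized key stands in for the similarity-based methods covered in the lecture.

```python
import re
from collections import defaultdict


def normalize_title(title):
    """Lowercase the title and replace every non-alphanumeric run with one space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def normalize_doi(doi):
    """Lowercase a DOI and drop a leading doi.org URL prefix; None stays None."""
    if not doi:
        return None
    return re.sub(r"^https?://(dx\.)?doi\.org/", "", doi.strip().lower())


def map_source_a(rec):
    """Schema mapping for hypothetical source A (flat fields)."""
    return {
        "title": rec["title"],
        "year": rec["year"],
        "doi": normalize_doi(rec.get("doi")),
        "source": "A",
    }


def map_source_b(rec):
    """Schema mapping for hypothetical source B (title as list, nested year)."""
    return {
        "title": rec["title"][0],
        "year": rec["issued"]["year"],
        "doi": normalize_doi(rec.get("DOI")),
        "source": "B",
    }


def find_duplicates(records):
    """Cluster records by DOI if present, otherwise by (normalized title, year)."""
    clusters = defaultdict(list)
    for rec in records:
        if rec["doi"]:
            key = ("doi", rec["doi"])
        else:
            key = ("title", normalize_title(rec["title"]), rec["year"])
        clusters[key].append(rec)
    # Only clusters with more than one record are duplicate candidates.
    return [group for group in clusters.values() if len(group) > 1]


# Invented sample records; the DOI is a placeholder, not a real publication.
source_a = [
    {"title": "Data Profiling: A Survey", "year": 2015,
     "doi": "https://doi.org/10.1000/EXAMPLE.1"},
    {"title": "An Unrelated Paper", "year": 2020, "doi": None},
]
source_b = [
    {"title": ["Data profiling - a survey"], "issued": {"year": 2015},
     "DOI": "10.1000/example.1"},
]

integrated = [map_source_a(r) for r in source_a] + [map_source_b(r) for r in source_b]
duplicates = find_duplicates(integrated)
```

This sketch has a deliberate limitation: a record with a DOI and a record without one never land in the same cluster, even if their titles agree. A realistic pipeline would compare candidate pairs with a similarity measure and then fuse the matched records into a single representation.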
You can find the detailed project description here
Schedule
The course takes place on Mondays at 11:00 and 13:30 in L-1.02. Some sessions will be held as exercises.
| Date | Topic |
|---|---|
| MO 13.04., 11:00, Location: HS 2 | Introduction, Organisation, and Motivation |
| MO 13.04., 13:30, Location: HS 2 | Introduction, Organisation, and Motivation |
| MO 20.04., 11:00 | Heterogeneity |
| MO 20.04., 13:30 | Heterogeneity |
| *MO 27.04., 11:00 | Guest workshop: SAP Data Architecture in Practice: Foundations for an AI-Ready Enterprise |
| *MO 27.04., 13:30 | Guest workshop (continued) |
| MO 04.05., 11:00 | Materialized vs. Virtual Integration |
| MO 04.05., 13:30 | Materialized vs. Virtual Integration |
| MO 11.05., 11:00 | Schema Mapping |
| MO 11.05., 13:30 | Project kickoff and introduction of Milestone 1 |
| MO 18.05., 11:00 | Schema Mapping |
| MO 18.05., 13:30 | Query Planning |
| MO 25.05., 11:00 | Whit Monday (Pfingstmontag) - no lecture |
| MO 25.05., 13:30 | Whit Monday (Pfingstmontag) - no lecture |
| TU 26.05., 17:00, Location: F-E.06 | Review of Milestone 1 and introduction of Milestone 2 |
| *MO 01.06., 11:00 | Duplicate detection - a primer by Sedir Mohammed |
| *MO 01.06., 13:30 | Guest lecture: Dr. Pablo Guerrero (data4life) “Integrating the World's Health Data: OHDSI and OMOP as a Real-World Case Study in Data Integration” |
| MO 08.06., 11:00 | Query Planning |
| MO 08.06., 13:30 | Review of Milestone 2 and introduction of Milestone 3 |
| MO 15.06., 11:00, Location: F-2.10 | Query Planning |
| MO 15.06., 13:30, Location: F-E.06 | Inclusion Dependencies |
| MO 22.06., 11:00, Location: F-E.06 | Inclusion Dependencies |
| MO 22.06., 13:30, Location: F-E.06 | Review of Milestone 3 and introduction of Milestone 4 |
| MO 29.06., 11:00 | |
| MO 29.06., 13:30 | |
| *MO 06.07., 11:00 | |
| MO 06.07., 13:30 | Review of Milestone 4 and introduction of Milestone 5 |
| MO 13.07., 11:00 | |
| MO 13.07., 13:30 | |
| MO 20.07., 11:00 | |
| MO 20.07., 13:30 | Final presentations and review of Milestone 5 |
| TU 28.07., 15:00 - 18:00, Location: HS 2 | Written exam |
Literature
- Ulf Leser and Felix Naumann: Informationsintegration, dpunkt Verlag, 2006 (free pdf). This book is available at the UP library and also, e.g., from Amazon.de.
- Doan, Halevy, and Ives: Principles of Data Integration, Morgan Kaufmann, 2012.
- Ziawasch Abedjan, Lukasz Golab, Felix Naumann, Thorsten Papenbrock: Data Profiling, Synthesis Lectures on Data Management.
- Özsu and Valduriez: Principles of Distributed Database Systems, Springer, 2020.
- Stefan Conrad: Föderierte Datenbanksysteme, Springer, 1997.
Throughout the lecture, I will refer to various scientific papers that serve as in-depth references.
Exam
The course grade is based entirely on the written exam (approx. 2 hours), held after the end of the teaching period. Requirements for exam admission are:
- "Passing" the project