Data Integration

Felix Naumann, Sedir Mohammed

Lecture and exercises for master's students

Data integration is the merging of heterogeneous information from various data sources into a homogeneous, clean dataset. Despite research and development over the past 40 years, collecting and integrating data from multiple sources remains an important and challenging task in any data-oriented or data science project. This lecture covers the basic technologies, such as distributed database architectures, techniques for virtual and materialized integration, data profiling, and data cleansing. It thus combines the previous foundational lectures on information integration and data profiling to lay a foundation for handling unknown data.

Further Information:

Lectures will be given in English.
Please enroll in our Moodle course; we will use it to coordinate the exercises.
The lectures will not be recorded, but a recording from summer 2025 is available on tele-TASK.

Exercise/Project:

The exercise/project is led by Dr. Sedir Mohammed. The exercise accompanying the Data Integration lecture has been substantially redesigned this semester. Instead of working on individual, mainly self-contained exercise tasks, students will participate in a semester-long project in which all participants collaboratively contribute to a larger data integration platform. The aim is to create a practical setting in which core lecture concepts are applied to a realistic end-to-end problem. At the same time, the format is intended to strengthen students’ experience with project work in larger teams, including both collaboration within teams and coordination across team boundaries.

The proposed application scenario is the integration of publication metadata associated with HPI researchers. Concretely, the project aims to collect, integrate, clean, store, and analyze publication information from multiple external sources, including DBLP, OpenAlex, Crossref, and PubMed. This use case is particularly well suited for the lecture since it naturally exposes students to heterogeneous sources, differences in schema and representation, duplicate records, data cleaning and fusion, integrated storage, and the analytical use of the resulting dataset. The lecture addresses precisely these kinds of problems through topics such as duplicate detection, schema mapping, materialized integration architectures, and the discovery of functional dependencies, whose results then serve as signals for schema refinement and normalization.
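To make the scenario more tangible, the following minimal Python sketch shows two of the steps named above: mapping records from two sources onto a common schema and then checking whether they describe the same publication. It is only an illustration under stated assumptions. The record shapes, field names (`name`, `published`, `DOI`), the function names, and the similarity threshold of 0.9 are all hypothetical and do not reflect the actual DBLP, OpenAlex, Crossref, or PubMed formats or the project's required design.

```python
import re
from difflib import SequenceMatcher


def map_source_a(rec):
    """Map a record from hypothetical source A (flat fields, year as string)."""
    return {"title": rec["title"], "year": int(rec["year"]), "doi": rec.get("doi")}


def map_source_b(rec):
    """Map a record from hypothetical source B (nested fields, title as list)."""
    return {
        "title": rec["name"][0],
        "year": rec["published"]["year"],
        "doi": rec.get("DOI"),
    }


def normalize_title(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())


def is_duplicate(a, b, threshold=0.9):
    """Decide whether two records in the common schema are duplicates.

    Assumption: a DOI, if present in both records, is decisive. Otherwise the
    years must match and the normalized titles must be sufficiently similar.
    """
    if a["doi"] and b["doi"]:
        return a["doi"].lower() == b["doi"].lower()
    if a["year"] != b["year"]:
        return False
    similarity = SequenceMatcher(
        None, normalize_title(a["title"]), normalize_title(b["title"])
    ).ratio()
    return similarity >= threshold


# Made-up example records, not real publications.
rec_a = map_source_a({"title": "An Example Study of Record Linkage.", "year": "2020"})
rec_b = map_source_b(
    {"name": ["An example study of record-linkage"], "published": {"year": 2020}}
)
print(is_duplicate(rec_a, rec_b))
```

Comparing every pair of records this way scales quadratically with the number of records, so a realistic pipeline would typically restrict comparisons to candidate pairs first, for example by grouping records by year.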

You can find the detailed project description here

Schedule

The course takes place on Mondays at 11:00 and 13:30 in L-1.02, unless a different location is listed below. Some sessions will take the form of exercises.

Date	Topic
MO 13.04., 11:00 Location: HS 2	Introduction, Organisation, and Motivation
MO 13.04., 13:30 Location: HS 2	Introduction, Organisation, and Motivation
MO 20.04., 11:00	Heterogeneity
MO 20.04., 13:30	Heterogeneity
*MO 27.04., 11:00	Guest workshop: SAP Data Architecture in Practice: Foundations for an AI-Ready Enterprise
*MO 27.04., 13:30	Guest workshop (continued)
MO 04.05., 11:00	Materialized vs. Virtual Integration
MO 04.05., 13:30	Materialized vs. Virtual Integration
MO 11.05., 11:00	Schema Mapping
MO 11.05., 13:30	Project kickoff and introduction of Milestone 1
MO 18.05., 11:00	Schema Mapping
MO 18.05., 13:30	Query Planning
MO 25.05., 11:00	Whit Monday (Pfingstmontag) - no lecture
MO 25.05., 13:30	Whit Monday (Pfingstmontag) - no lecture
TU 26.05., 17:00 Location: F-E.06	Review of Milestone 1 and introduction of Milestone 2
*MO 01.06., 11:00	Duplicate Detection - a primer by Sedir Mohammed
*MO 01.06., 13:30	Guest lecture: Dr. Pablo Guerrero (data4life) “Integrating the World's Health Data: OHDSI and OMOP as a Real-World Case Study in Data Integration”
MO 08.06., 11:00	Query Planning
MO 08.06., 13:30	Review of Milestone 2 and introduction of Milestone 3
MO 15.06., 11:00 Location: F-2.10	Query Planning
MO 15.06., 13:30 Location: F-E.06	Inclusion Dependencies
MO 22.06., 11:00 Location: F-E.06	Inclusion Dependency Discovery
MO 22.06., 13:30 Location: F-E.06	Review of Milestone 3 and introduction of Milestone 4
MO 29.06., 11:00	Inclusion Dependency Discovery
MO 29.06., 13:30	Inclusion Dependency Discovery
*MO 06.07., 11:00	Introduction to Data Profiling
MO 06.07., 13:30	Review of Milestone 4 and introduction of Milestone 5
MO 13.07., 11:00	Cancelled due to illness
MO 13.07., 13:30	Cancelled due to illness
MO 20.07., 11:00	(Unique Column Combination (UCC) Discovery)
MO 20.07., 13:30	Final presentations and review of Milestone 5
TU 28.07., 15:00 - 18:00 Location: HS 2	Written exam

Literature

Ulf Leser and Felix Naumann: Informationsintegration, dpunkt Verlag, 2006 (free PDF).
This book is available at the UP library and also, e.g., from Amazon.de.
Doan, Halevy, and Ives: Principles of Data Integration, Morgan Kaufmann, 2012.
Ziawasch Abedjan, Lukasz Golab, Felix Naumann, Thorsten Papenbrock: Data Profiling - Synthesis Lectures on Data Management
Özsu and Valduriez: Principles of Distributed Database Systems, Springer, 2020.
Stefan Conrad: Föderierte Datenbanksysteme, Springer, 1997.

Throughout the lecture, I will refer to various scientific papers that serve as in-depth references.

Exam

The course grade is based entirely on the written exam (approx. 2 hours) after the end of the teaching period. The requirement for exam admission is:

"Passing" the project