Data Integration

Felix Naumann, Sedir Mohammed

Lecture and exercises for master's students

Data integration is the merging of heterogeneous information from various data sources into a homogeneous, clean dataset. Despite research and development over the past 40 years, collecting and integrating data from multiple sources remains an important and challenging task in any data-oriented or data science project. This lecture covers the basic technologies, such as distributed database architectures, techniques for virtual and materialized integration, data profiling, and data cleansing. It thus combines the previous foundational lectures on information integration and data profiling to lay a foundation for handling unknown data.

Further Information:

  • Lectures will be given in English.
  • Please enroll in our Moodle course; we will use it to coordinate the exercises.
  • The lectures will not be recorded, but a recording from summer 2025 exists on tele-task.

Exercise/Project:

The exercise/project is led by Dr. Sedir Mohammed. The exercise accompanying the Data Integration lecture has been substantially redesigned this semester. Instead of working on individual, mainly self-contained exercise tasks, students will participate in a semester-long project in which all participants collaboratively contribute to a larger data integration platform. The aim is to create a practical setting in which core lecture concepts are applied to a realistic end-to-end problem. At the same time, the format is intended to strengthen students’ experience with project work in larger teams, including both collaboration within teams and coordination across team boundaries.

The proposed application scenario is the integration of publication metadata associated with HPI researchers. Concretely, the project aims to collect, integrate, clean, store, and analyze publication information from multiple external sources, including DBLP, OpenAlex, Crossref, and PubMed. This use case is particularly well suited to the lecture because it naturally exposes students to heterogeneous sources, differences in schema and representation, duplicate records, data cleaning and fusion, integrated storage, and the analytical use of the resulting dataset. These are precisely the kinds of problems addressed in the lecture, including duplicate detection, schema mapping, materialized integration architectures, and the discovery of functional dependencies, whose results are then used as signals for schema refinement and normalization.
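
To give a feel for two of these problems, the following minimal sketch maps records from two sources into one common schema and then flags likely duplicates. It is an illustration only, not the project's prescribed approach: the field names (`display_name`, `publication_year`, and so on), the sample records, the DOIs, and the 0.9 similarity threshold are all assumptions made up for this example and do not reflect the actual formats of DBLP, OpenAlex, Crossref, or PubMed.

```python
import re
from difflib import SequenceMatcher

# Hypothetical sample records; field names and values are invented.
source_a = [
    {"title": "An Example Paper on Record Linkage.", "year": "2020", "doi": None},
    {"title": "A Second, Unrelated Example", "year": "2021", "doi": "10.1234/example.2"},
]
source_b = [
    {"display_name": "An example paper on record linkage",
     "publication_year": 2020, "doi": "10.1234/example.1"},
    {"display_name": "Something Else Entirely",
     "publication_year": 2021, "doi": "10.1234/example.3"},
]


def map_source_a(rec):
    """Schema mapping: bring a source-A record into the common schema."""
    return {"title": rec["title"], "year": int(rec["year"]), "doi": rec.get("doi")}


def map_source_b(rec):
    """Schema mapping: rename source-B fields to the common schema."""
    return {"title": rec["display_name"], "year": rec["publication_year"],
            "doi": rec.get("doi")}


def normalize_title(title):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(cleaned.split())


def is_duplicate(a, b, threshold=0.9):
    """Compare DOIs if both exist; otherwise compare year and title similarity."""
    if a["doi"] and b["doi"]:
        return a["doi"].lower() == b["doi"].lower()
    if a["year"] != b["year"]:
        return False
    similarity = SequenceMatcher(
        None, normalize_title(a["title"]), normalize_title(b["title"])
    ).ratio()
    return similarity >= threshold


def find_duplicates(records_a, records_b):
    """Naive pairwise comparison; returns index pairs of likely duplicates."""
    return [(i, j)
            for i, a in enumerate(records_a)
            for j, b in enumerate(records_b)
            if is_duplicate(a, b)]


unified_a = [map_source_a(r) for r in source_a]
unified_b = [map_source_b(r) for r in source_b]
pairs = find_duplicates(unified_a, unified_b)
print(pairs)
```

The pairwise loop compares every record of one source with every record of the other, which is only feasible for small inputs; at realistic scale one would first restrict the comparisons to candidate pairs, for instance by grouping records on publication year.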

You can find the detailed project description here
 

Schedule

The course takes place on Mondays at 11:00 and 13:30 in L-1.02, unless a different location is listed below. Some sessions will take the form of exercises.

Date                   | Location | Topic
MO 13.04., 11:00       | HS 2     | Introduction, Organisation, and Motivation
MO 13.04., 13:30       | HS 2     | Introduction, Organisation, and Motivation
MO 20.04., 11:00       |          | Heterogeneity
MO 20.04., 13:30       |          | Heterogeneity
*MO 27.04., 11:00      |          | Guest workshop: SAP Data Architecture in Practice: Foundations for an AI-Ready Enterprise
*MO 27.04., 13:30      |          | Guest workshop (continued)
MO 04.05., 11:00       |          | Materialized vs. Virtual Integration
MO 04.05., 13:30       |          | Materialized vs. Virtual Integration
MO 11.05., 11:00       |          | Schema Mapping
MO 11.05., 13:30       |          | Project kickoff and introduction of Milestone 1
MO 18.05., 11:00       |          | Schema Mapping
MO 18.05., 13:30       |          | Query Planning
MO 25.05., 11:00       |          | Whit Monday (Pfingstmontag), no lecture
MO 25.05., 13:30       |          | Whit Monday (Pfingstmontag), no lecture
TU 26.05., 17:00       | F-E.06   | Review of Milestone 1 and introduction of Milestone 2
*MO 01.06., 11:00      |          | Duplicate detection - a primer by Sedir Mohammed
*MO 01.06., 13:30      |          | Guest lecture: Dr. Pablo Guerrero (data4life) “Integrating the World's Health Data: OHDSI and OMOP as a Real-World Case Study in Data Integration”
MO 08.06., 11:00       |          | Query Planning
MO 08.06., 13:30       |          | Review of Milestone 2 and introduction of Milestone 3
MO 15.06., 11:00       | F-2.10   | Query Planning
MO 15.06., 13:30       | F-E.06   | Inclusion Dependencies
MO 22.06., 11:00       | F-E.06   | Inclusion Dependencies
MO 22.06., 13:30       | F-E.06   | Review of Milestone 3 and introduction of Milestone 4
MO 29.06., 11:00       |          |
MO 29.06., 13:30       |          |
*MO 06.07., 11:00      |          |
MO 06.07., 13:30       |          | Review of Milestone 4 and introduction of Milestone 5
MO 13.07., 11:00       |          |
MO 13.07., 13:30       |          |
MO 20.07., 11:00       |          |
MO 20.07., 13:30       |          | Final presentations and review of Milestone 5
TU 28.07., 15:00-18:00 | HS 2     | Written exam

Literature

Throughout the lecture, I will refer to various scientific papers that serve as in-depth references.

Exam

The lecture grade is based entirely on the written exam (approx. 2 h), which takes place after the end of the teaching period. Requirements for exam admission are:

  • "Passing" the project