Data Integration

Data integration is the merging of heterogeneous information from various data sources into a homogeneous, clean dataset. Despite research and development over the past 40 years, collecting and integrating data from multiple sources remains an important and challenging task in any data-oriented or data science project. This lecture covers the basic technologies, such as distributed database architectures, techniques for virtual and materialized integration, data profiling, and data cleansing. It thus combines the previous foundational lectures on information integration and data profiling to lay a foundation for handling unknown data.

Further Information:

Lectures will be given in English.
Please enroll in our Moodle course by April 18; we will use it to coordinate the exercises.

Exercises:

The exercises are led by Sebastian Schmidl.
You must pass the exercises to be admitted to the exam.
In the exercises, you will work in teams of two on four different assignments: two about data profiling (multivalued dependencies) and two about data cleaning (duplicate detection).
We will probably use five of the normal lecture slots to introduce and discuss the exercises.

Schedule

The course will take place on Mondays and Thursdays at 13:30 in L-E.03. Some lecture slots will be used for exercise sessions.

Date	Topic
Mon 8.4.2024	Introduction
Thu 11.4.2024	Introduction
Mon 15.4.2024	Distribution, autonomy, and heterogeneity
Thu 18.4.2024	Exercise session 1: Data Profiling - Validation of Multivalued Dependencies (Publication Sheet 1)
Mon 22.4.2024	Adornments and Data Structures
Thu 25.4.2024	Data Profiling Introduction
Mon 29.4.2024	Unique Column Combinations and Keys
Wed 1.5.2024 (23:59)	Deadline Sheet 1! Publication Sheet 2
Thu 2.5.2024	Integration architectures
Mon 6.5.2024	Exercise session 2: Data Profiling - Discovery of Multivalued Dependencies
Thu 9.5.2024	Ascension
Mon 13.5.2024	Integration architectures (Dr. Fabian Panse)
Thu 16.5.2024	FD-Discovery and Normalization
Mon 20.5.2024	Pentecost
Thu 23.5.2024	FD-Discovery and Normalization
Sun 26.5.2024 (23:59)	Deadline Sheet 2, Publication Sheet 3
Mon 27.5.2024	Schema Mapping and Schema Matching
Thu 30.5.2024	Schema Mapping and Schema Matching
Mon 3.6.2024	Query planning
Thu 6.6.2024	Query planning
Mon 10.6.2024	Exercise session 3: Duplicate Detection - Matching
Thu 13.6.2024	no lecture
Mon 17.6.2024	Data quality and duplicate detection
Tue 18.6.2024 (23:59)	Deadline Sheet 3! Publication Sheet 4
Thu 20.6.2024 - moved to L-1.06	Data quality and duplicate detection
Mon 24.6.2024	Exercise session 4: Duplicate Detection - Blocking
Thu 27.6.2024	Data quality and duplicate detection
Mon 1.7.2024	no lecture
Thu 4.7.2024	Inclusion dependencies
Sun 7.7.2024 (23:59)	Deadline Sheet 4!
Mon 8.7.2024	Inclusion dependencies
Thu 11.7.2024	Inclusion dependencies
Mon 15.7.2024	Exercise session 5: Results and exam preparation
Thu 18.7.2024	no lecture

The exam is scheduled for Monday, July 29, from 10:00 to 12:00 in HS 1.

Literature

Ulf Leser and Felix Naumann: Informationsintegration, dpunkt Verlag, 2006 (free pdf).
This book is available at the UP library and also, e.g., from Amazon.de.
Doan, Halevy, and Ives: Principles of data integration, Morgan Kaufmann, 2012.
Ziawasch Abedjan, Lukasz Golab, Felix Naumann, Thorsten Papenbrock: Data Profiling - Synthesis Lectures on Data Management
Özsu and Valduriez: Principles of distributed database systems, Springer, 2020.
Stefan Conrad: Föderierte Datenbanksysteme, Springer, 1997.

Throughout the lecture, I will refer to various scientific papers that serve as in-depth references.

Exam

The grade is based entirely on the written exam (approx. 2 h) held after the end of the teaching period. Requirements for exam admission are:

"Passing" all four exercises
At least one short presentation of an exercise solution