Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Data Integration

Data integration is the merging of heterogeneous information from various data sources into a homogeneous, clean dataset. Despite research and development over the past 40 years, collecting and integrating data from multiple sources remains an important and challenging task in any data-oriented or data science project. This lecture covers the basic technologies, such as distributed database architectures, techniques for virtual and materialized integration, data profiling, and data cleansing. It thus combines the previous foundational lectures on information integration and data profiling to lay a foundation for handling unknown data.

Further Information:

  • Lectures will be given in English.
  • Please enroll in our Moodle course by April 18; we will use it to coordinate the exercises.

Exercises:

  • The exercises are led by Sebastian Schmidl.
  • You must pass the exercises to be admitted to the exam.
  • In the exercises, you will work in teams of two on four different assignments: two about data profiling (multivalued dependencies) and two about data cleaning (duplicate detection).
  • We will probably use five of the normal lecture slots to introduce and discuss the exercises.

Schedule

The course will take place on Mondays and Thursdays at 13:30 in L-E.03. Some of these slots will be used for exercise sessions.

Date | Topic
Mon 8.4.2024 | Introduction
Thu 11.4.2024 | Introduction
Mon 15.4.2024 | Distribution, autonomy, and heterogeneity
Thu 18.4.2024 | Exercise session 1: Data Profiling - Validation of Multivalued Dependencies (Publication Sheet 1)
Mon 22.4.2024 | Adornments and Data Structures
Thu 25.4.2024 | Data Profiling Introduction
Mon 29.4.2024 | Unique Column Combinations and Keys
Wed 1.5.2024 (23:59) | Deadline Sheet 1! Publication Sheet 2
Thu 2.5.2024 | Integration architectures
Mon 6.5.2024 | Exercise session 2: Data Profiling - Discovery of Multivalued Dependencies
Thu 9.5.2024 | Ascension
Mon 13.5.2024 | Integration architectures (Dr. Fabian Panse)
Thu 16.5.2024 | FD-Discovery and Normalization
Mon 20.5.2024 | Pentecost
Thu 23.5.2024 | FD-Discovery and Normalization
Sun 26.5.2024 (23:59) | Deadline Sheet 2! Publication Sheet 3
Mon 27.5.2024 | Schema Mapping and Schema Matching
Thu 30.5.2024 | Schema Mapping and Schema Matching
Mon 3.6.2024 | Query planning
Thu 6.6.2024 | Query planning
Mon 10.6.2024 | Exercise session 3: Duplicate Detection - Matching
Thu 13.6.2024 | no lecture
Mon 17.6.2024 | Data quality and duplicate detection
Tue 18.6.2024 (23:59) | Deadline Sheet 3! Publication Sheet 4
Thu 20.6.2024 (moved to L-1.06) | Data quality and duplicate detection
Mon 24.6.2024 | Exercise session 4: Duplicate Detection - Blocking
Thu 27.6.2024 | Data quality and duplicate detection
Mon 1.7.2024 | no lecture
Thu 4.7.2024 | Inclusion dependencies
Sun 7.7.2024 (23:59) | Deadline Sheet 4!
Mon 8.7.2024 | Inclusion dependencies
Thu 11.7.2024 | Inclusion dependencies
Mon 15.7.2024 | Exercise session 5: Results and exam preparation
Thu 18.7.2024 | no lecture

The exam is scheduled for Monday, July 29, 10:00-12:00 in HS 1.

Literature

Throughout the lecture, I will refer to various scientific papers that serve as in-depth references.

Exam

Lecture grading is based 100% on the written exam (approx. 2h) after the end of the teaching period. Requirements for the exam admission are:

  • "Passing" all four exercises
  • At least one short presentation of an exercise solution