Data Integration

Data integration is the merging of heterogeneous information from various data sources into a homogeneous, clean dataset. Despite research and development over the past 40 years, collecting and integrating data from multiple sources remains an important and challenging task in any data-oriented or data science project. This lecture covers the basic technologies, such as distributed database architectures, techniques for virtual and materialized integration, data profiling, and data cleansing technologies. It thus combines the previous foundational lectures on information integration and data profiling to lay a foundation for handling unknown data.

Further Information:

  • Lectures will be given in English.
  • Please enroll yourself in our Moodle course by April 18, because we will use it to coordinate the exercises.

Exercises:

  • The exercises are led by Sebastian Schmidl.
  • You must pass the exercises to be admitted to the exam.
  • In the exercises, you will work in teams of two on four different assignments: two about data profiling (multivalued dependencies) and two about data cleaning (duplicate detection).
  • We will probably use five of the normal lecture slots to introduce and discuss the exercises.

Schedule

The course takes place on Mondays and Thursdays at 13:30 in L-E.03. Some lecture slots will be used for exercise sessions.

Date Topic
Mon 8.4.2024 Introduction
Thu 11.4.2024 Introduction
Mon 15.4.2024 Distribution, autonomy, and heterogeneity
Thu 18.4.2024 Exercise session 1: Data Profiling - Validation of Multivalued Dependencies (Publication Sheet 1)
Mon 22.4.2024 Adornments and Data Structures
Thu 25.4.2024 Data Profiling Introduction
Mon 29.4.2024 Unique Column Combinations and Keys
Wed 1.5.2024 (23:59) Deadline Sheet 1! Publication Sheet 2
Thu 2.5.2024 Integration architectures
Mon 6.5.2024 Exercise session 2: Data Profiling - Discovery of Multivalued Dependencies
Thu 9.5.2024 Ascension
Mon 13.5.2024 Integration architectures (Dr. Fabian Panse)
Thu 16.5.2024 FD-Discovery and Normalization
Mon 20.5.2024 Pentecost
Thu 23.5.2024 FD-Discovery and Normalization
Sun 26.5.2024 (23:59) Deadline Sheet 2, Publication Sheet 3
Mon 27.5.2024 Schema Mapping and Schema Matching
Thu 30.5.2024 Schema Mapping and Schema Matching
Mon 3.6.2024 Query planning
Thu 6.6.2024 Query planning
Mon 10.6.2024 Exercise session 3: Duplicate Detection - Matching
Thu 13.6.2024 no lecture
Mon 17.6.2024 Data quality and duplicate detection
Tue 18.6.2024 (23:59) Deadline Sheet 3! Publication Sheet 4
Thu 20.6.2024 Data quality and duplicate detection (moved to L-1.06)
Mon 24.6.2024 Exercise session 4: Duplicate Detection - Blocking
Thu 27.6.2024 Data quality and duplicate detection
Mon 1.7.2024 no lecture
Thu 4.7.2024 Inclusion dependencies
Sun 7.7.2024 (23:59) Deadline Sheet 4!
Mon 8.7.2024 Inclusion dependencies
Thu 11.7.2024 Inclusion dependencies
Mon 15.7.2024 Exercise session 5: Results and exam preparation
Thu 18.7.2024 no lecture

The exam is scheduled for Monday, July 29, 10:00-12:00 in HS 1.

Literature

Throughout the lecture, I will refer to various scientific papers that serve as in-depth references.

Exam

The course grade is based entirely on the written exam (approx. 2 h), held after the end of the teaching period. Requirements for exam admission are:

  • "Passing" all four exercises
  • At least one short presentation of an exercise solution