Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Data Integration

Data integration is the merging of heterogeneous information from various data sources into a homogeneous, clean dataset. Despite 40 years of research and development, collecting and integrating data from multiple sources remains an important and challenging task in any data-oriented or data science project. This lecture covers the basic technologies, such as distributed database architectures, techniques for virtual and materialized integration, data profiling, and data cleansing. It thus combines the previous foundational lectures on information integration and data profiling to lay a foundation for handling unknown data.

Further Information:

  • Lectures will be given in English.
  • Please enroll in our Moodle course; we will use it to coordinate the exercises.

Exercises:

  • The exercises are led by Sedir Mohammed.
  • You must pass the exercises to be admitted to the exam.
  • In the exercises, you will work in teams of two on four different assignments: two about data profiling (multivalued dependencies) and two about data cleaning (duplicate detection).
  • We will use around five of the normal lecture slots to introduce and discuss the exercises.

Schedule

The course takes place on Mondays and Thursdays at 13:30 in L-E.03. Some lecture slots are used for exercise sessions.

Date | Topic
MO 07.04. | Introduction + Entity Resolution
TH 10.04. - moved to HS 1 | Entity Resolution
MO 14.04. | Entity Resolution + Exercise Session 1: Entity Resolution - Matching
TH 17.04. | Entity Resolution
MO 21.04. | Easter
TH 24.04. | Introduction cont.
MO 28.04. | Heterogeneity
TH 01.05. | Labor Day (Tag der Arbeit)
MO 05.05. * | no lecture
TH 08.05. - moved to L-1.02 | Exercise Session 2: (1) Entity Resolution - Matching; (2) Entity Matching - Blocking
MO 12.05. | Integration Architectures
TH 15.05. | Integration Architectures
MO 19.05. * | Introduction to Data Profiling
TH 22.05. * - moved to L-1.02 | Exercise Session 3: (1) Entity Resolution - Blocking; (2) Data Profiling - Dependency Validation
MO 26.05. | UCC Discovery
TH 29.05. | Ascension
MO 02.06. | UCC Discovery
TH 05.06. | FD Discovery
MO 09.06. | Pentecost
TH 12.06. | FD Discovery
MO 16.06. | Exercise Session 4: (1) Data Profiling - Dependency Validation; (2) Data Profiling - Dependency Discovery
TH 19.06. | Schema Matching
MO 23.06. * | Schema Mapping (see recorded lecture in tele-task)
TH 26.06. * | no lecture
MO 30.06. | Global-as-View Query Processing
TH 03.07. | Exercise Session 5: Data Profiling - Dependency Discovery; Lecture: Global-as-View Query Processing
MO 07.07. | Local-as-View Query Processing
TH 10.07. | IND Discovery
MO 14.07. | IND Discovery
TH 17.07. - moved to G3.E.15/16 | tbd
TU 29.07. | Exam in L-1.06

Literature

Throughout the lecture, I will refer to various scientific papers that serve as in-depth references.

Exam

The lecture grade is based 100% on the written exam (approx. 2h) after the end of the teaching period. The requirements for exam admission are:

  • "Passing" all four exercises
  • At least one short presentation of an exercise solution