Data Integration
Data integration is the merging of heterogeneous information from various data sources into a homogeneous, clean dataset. Despite research and development over the past 40 years, collecting and integrating data from multiple sources remains an important and challenging task in any data-oriented or data science project. This lecture covers the basic technologies, such as distributed database architectures, techniques for virtual and materialized integration, data profiling, and data cleansing. It thus combines the previous foundational lectures on information integration and data profiling to lay a foundation for handling unknown data.
Further Information:
- Lectures will be given in English.
- Please enroll yourself in our Moodle course: we will use it to coordinate the exercises.
Exercises:
- The exercises are led by Sedir Mohammed.
- You must pass the exercises to be admitted to the exam.
- In the exercises, you will work in teams of two on four different assignments: two about data profiling (multivalued dependencies) and two about data cleaning (duplicate detection).
- We will use around five of the normal lecture slots to introduce and discuss the exercises.
Schedule
The course takes place on Mondays and Thursdays at 13:30 in L-E.03. Some lecture slots will be used for exercise sessions.
| Date | Topic |
|---|---|
| MO 07.04. | Introduction + Entity Resolution |
| TH 10.04. - moved to HS 1 | Entity Resolution |
| MO 14.04. | Entity Resolution + Exercise Session 1: Entity Resolution - Matching |
| TH 17.04. | Entity Resolution |
| MO 21.04. | Easter |
| TH 24.04. | Introduction cont. |
| MO 28.04. | Heterogeneity |
| TH 01.05. | Tag der Arbeit |
| MO 05.05. * | no lecture |
| TH 08.05. - moved to L-1.02 | Exercise Session 2: (1) Entity Resolution - Matching, (2) Entity Matching - Blocking |
| MO 12.05. | Integration Architectures |
| TH 15.05. | Integration Architectures |
| MO 19.05. * | Introduction to Data Profiling |
| TH 22.05. * - moved to L-1.02 | Exercise Session 3: (1) Entity Resolution - Blocking, (2) Data Profiling - Dependency Validation |
| MO 26.05. | UCC Discovery |
| TH 29.05. | Ascension |
| MO 02.06. | UCC Discovery |
| TH 05.06. | FD Discovery |
| MO 09.06. | Pentecost |
| TH 12.06. | FD Discovery |
| MO 16.06. | Exercise Session 4: (1) Data Profiling - Dependency Validation, (2) Data Profiling - Dependency Discovery |
| TH 19.06. | Schema Matching |
| MO 23.06. * | Schema Mapping (see recorded lecture in tele-task) |
| TH 26.06. * | no lecture |
| MO 30.06. | Global-as-View Query Processing |
| TH 03.07. | Exercise Session 5: Data Profiling - Dependency Discovery; Lecture: Global-as-View Query Processing |
| MO 07.07. | Local-as-View Query Processing |
| TH 10.07. | IND Discovery |
| MO 14.07. | IND Discovery |
| TH 17.07. - moved to G3.E.15/16 | tbd |
| TU 29.07. | Exam in L-1.06 |
Literature
- Ulf Leser and Felix Naumann: Informationsintegration, dpunkt Verlag, 2006 (free pdf). This book is available at the UP library and also, e.g., from Amazon.de.
- Doan, Halevy, and Ives: Principles of data integration, Morgan Kaufmann, 2012.
- Ziawasch Abedjan, Lukasz Golab, Felix Naumann, Thorsten Papenbrock: Data Profiling - Synthesis Lectures on Data Management.
- Özsu and Valduriez: Principles of distributed database systems, Springer, 2020.
- Stefan Conrad: Föderierte Datenbanksysteme, Springer, 1997.
Throughout the lecture, I will refer to various scientific papers that serve as in-depth references.
Exam
The course grade is based entirely on the written exam (approx. 2 h), which takes place after the end of the teaching period. Requirements for exam admission are:
- "Passing" all four exercises
- At least one short presentation of an exercise solution