Information Integration
Information integration is the merging of heterogeneous information from various data sources into a homogeneous, clean dataset. This lecture introduces this ever-important topic. It will cover the basic technologies, such as distributed database architectures, techniques for virtual and materialized integration, and data cleansing technologies.
Further Information:
- Lectures can be given in English on request.
- Slides will be made available in the HPI-internal "Materials" folder: navigate to FG-Informationssysteme - VL-Informationsintegration - 2022.
- A recording of the previous edition (winter 2019/20) can be found on tele-task (in German).
Exercises:
- The exercises are led by Tobias Bleifuß in collaboration with bakdata.
- You must pass the exercises to be admitted to the exam.
- In the exercises, you will work in small teams to build an information platform for corporate data.
- We recommend using either Java or Python for your implementation.
- We will kick off the exercises on April 28, 2022, together with bakdata.
Schedule
The course will take place Tuesdays at 15:15 and Thursdays at 9:15 in L-E.03. Some sessions will be held as exercises.
If needed, we will stream the lecture via Zoom; you can find the link in our "Materials" folder (navigate to FG-Informationssysteme - VL-Informationsintegration - 2022).
| Date | Topic |
|---|---|
| Tue 19.4.2022 | Introduction |
| Thu 21.4.2022 | Introduction and Heterogeneity |
| Tue 26.4.2022 | Heterogeneity |
| Thu 28.4.2022 | Exercise with bakdata |
| Tue 03.5.2022 | Heterogeneity |
| Thu 05.5.2022 | Materialized and virtual integration |
| Tue 10.5.2022 | No lecture |
| Thu 12.5.2022 | Exercise in HS3 |
| Tue 17.5.2022 | Architectures |
| Thu 19.5.2022 | Multidatabase Query Languages |
| Tue 24.5.2022 | Schema Mapping / Schema Matching |
| Thu 26.5.2022 | Ascension Day (public holiday) |
| Tue 31.5.2022 | Schema Mapping / Schema Matching |
| Thu 02.6.2022 | Exercise |
| Tue 07.6.2022 | Schema Mapping / Schema Matching |
| Thu 09.6.2022 | Duplicate Detection |
| Tue 14.6.2022 | Exercise |
| Thu 16.6.2022 | Duplicate Detection |
| Tue 21.6.2022 | Duplicate Detection |
| Thu 23.6.2022 | No lecture |
| Tue 28.6.2022 | "Generating Test Data for Duplicate Detection: State of the Art and Open Challenges", Dr. Fabian Panse (Universität Hamburg) |
| Thu 30.6.2022 | Global-as-View Modelling |
| Tue 05.7.2022 | Local-as-View Modelling |
| Thu 07.7.2022 | Exercise in L-1.02 |
| Tue 12.7.2022 | Local-as-View Modelling |
| Thu 14.7.2022 | Bucket Algorithm |
| Tue 19.7.2022 | No lecture |
| Thu 21.7.2022 | Exercise |
| Tue 26.7.2022 | Exam preparation |
| Thu 28.7.2022 | Data Warehouses |
The exam is scheduled for Monday, August 1, 2022, from 13:00 to 15:00 in HS1.
Literature & Exam
- Ulf Leser and Felix Naumann: Informationsintegration, dpunkt Verlag, 2006 (free PDF). This book is available at the UP library and also, e.g., from Amazon.de.
- Doan, Halevy, and Ives: Principles of data integration, Morgan Kaufmann, 2012.
- Özsu and Valduriez: Principles of distributed database systems, Springer, 2011.
- Stefan Conrad: Föderierte Datenbanksysteme, Springer, 1997.
Throughout the lecture, I will refer to various scientific papers that serve as in-depth references.
Depending on the number of participants, we will conduct a written or an oral exam.