Information Integration

Information integration is the merging of heterogeneous information from various data sources into a homogeneous, clean dataset. This lecture introduces this ever-important topic. It covers the basic technologies, such as distributed database architectures, techniques for virtual and materialized integration, and data cleansing technologies.

Further Information:

  • Lectures can be given in English, on demand.
  • Slides will be made available in the HPI-internal "Materials" folder: navigate to FG-Informationssysteme - VL-Informationsintegration - 2022.
  • A recording of the previous edition (winter 2019/20) can be found on tele-task (in German).

Exercises:

  • The exercises are led by Tobias Bleifuß in collaboration with bakdata.
  • You must pass the exercises to be admitted to the exam.
  • In the exercises, you will work in small teams to build an information platform for corporate data.
  • We recommend using either Java or Python for your implementation.
  • We will kick off the exercises on April 28, 2022, together with bakdata.

Schedule

The course will take place Tuesdays at 15:15 and Thursdays at 9:15 in L-E.03. Some sessions will be held as exercises.

If needed, we will stream the lecture via Zoom; you can find the link in our "Materials" folder (navigate to FG-Informationssysteme - VL-Informationsintegration - 2022).

Date Topic
Tue 19.4.2022 Introduction
Thu 21.4.2022 Introduction and Heterogeneity
Tue 26.4.2022 Heterogeneity
Thu 28.4.2022 Exercise with bakdata
Tue 03.5.2022 Heterogeneity
Thu 05.5.2022 Materialized and virtual integration
Tue 10.5.2022 No lecture
Thu 12.5.2022 Exercise in HS3
Tue 17.5.2022 Architectures
Thu 19.5.2022 Multidatabase Query Languages
Tue 24.5.2022 Schema Mapping / Schema Matching
Thu 26.5.2022 Ascension Day (public holiday)
Tue 31.5.2022 Schema Mapping / Schema Matching
Thu 02.6.2022 Exercise
Tue 07.6.2022 Schema Mapping / Schema Matching
Thu 09.6.2022 Duplicate Detection
Tue 14.6.2022 Exercise
Thu 16.6.2022 Duplicate Detection
Tue 21.6.2022 Duplicate Detection
Thu 23.6.2022 No lecture
Tue 28.6.2022 "Generating Test Data for Duplicate Detection: State of the Art and Open Challenges", Dr. Fabian Panse (Universität Hamburg)
Thu 30.6.2022 Global-as-View Modelling
Tue 05.7.2022 Local-as-View Modelling
Thu 07.7.2022 Exercise in L-1.02
Tue 12.7.2022 Local-as-View Modelling
Thu 14.7.2022 Bucket Algorithm
Tue 19.7.2022 No lecture
Thu 21.7.2022 Exercise
Tue 26.7.2022 Exam preparation
Thu 28.7.2022 Data Warehouses

The exam is scheduled for Monday, August 1, 2022, from 13:00 to 15:00 in HS1.

Literature & Exam

  • Ulf Leser and Felix Naumann: Informationsintegration, dpunkt Verlag, 2006 (free pdf).
    This book is available at the UP library and also, e.g., from Amazon.de.
  • Doan, Halevy, and Ives: Principles of data integration, Morgan Kaufmann, 2012.
  • Özsu and Valduriez: Principles of distributed database systems, Springer, 2011.
  • Stefan Conrad: Föderierte Datenbanksysteme, Springer, 1997.

Throughout the lecture, I will refer to various scientific papers that serve as in-depth references.

Depending on the number of participants, we will conduct a written or an oral exam.