Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Information Integration

Information integration is the merging of heterogeneous information from various data sources into a homogeneous, clean dataset. This lecture introduces this ever-important topic. It will cover the basic technologies, such as distributed database architectures, techniques for virtual and materialized integration, and data cleansing technologies.

Further Information:

  • Lectures can be given in English, on demand.
  • Slides will be made available in the HPI-internal "Materials" folder: navigate to FG-Informationssysteme - VL-Informationsintegration - 2022.
  • A recording of the previous edition (winter 2019/20) can be found on tele-task (in German).

Exercises:

  • The exercises are led by Tobias Bleifuß in collaboration with bakdata.
  • You must pass the exercises to be admitted to the exam.
  • In the exercises, you will work in small teams to build an information platform for corporate data.
  • We recommend using either Java or Python for your implementation.
  • We will kick off the exercises on April 28, 2022, together with bakdata.

Schedule

The course will take place on Tuesdays at 15:15 and Thursdays at 9:15 in L-E.03. Some sessions will be held as exercises.

If needed, we will stream the lecture via Zoom; you can find the link in our "Materials" folder (navigate to FG-Informationssysteme - VL-Informationsintegration - 2022).

Date            Topic
Tue 19.4.2022   Introduction
Thu 21.4.2022   Introduction and Heterogeneity
Tue 26.4.2022   Heterogeneity
Thu 28.4.2022   Exercise with bakdata
Tue 03.5.2022   Heterogeneity
Thu 05.5.2022   Materialized and virtual integration
Tue 10.5.2022   No lecture
Thu 12.5.2022   Exercise in HS3
Tue 17.5.2022   Architectures
Thu 19.5.2022   Multidatabase Query Languages
Tue 24.5.2022   Schema Mapping / Schema Matching
Thu 26.5.2022   Ascension
Tue 31.5.2022   Schema Mapping / Schema Matching
Thu 02.6.2022   Exercise
Tue 07.6.2022   Schema Mapping / Schema Matching
Thu 09.6.2022   Duplicate Detection
Tue 14.6.2022   Exercise
Thu 16.6.2022   Duplicate Detection
Tue 21.6.2022   Duplicate Detection
Thu 23.6.2022   No lecture
Tue 28.6.2022   "Generating Test Data for Duplicate Detection: State of the Art and Open Challenges", Dr. Fabian Panse (Universität Hamburg)
Thu 30.6.2022   Global-as-View Modelling
Tue 05.7.2022   Local-as-View Modelling
Thu 07.7.2022   Exercise in L-1.02
Tue 12.7.2022   Local-as-View Modelling
Thu 14.7.2022   Bucket Algorithm
Tue 19.7.2022   No lecture
Thu 21.7.2022   Exercise
Tue 26.7.2022   Exam preparation
Thu 28.7.2022   Data Warehouses

The exam is scheduled for Monday, August 1, 2022, from 13:00 to 15:00 in HS1.

Literature & Exam

  • Ulf Leser and Felix Naumann: Informationsintegration, dpunkt Verlag, 2006 (free PDF).
    This book is available at the UP library and also, e.g., from Amazon.de.
  • Doan, Halevy, and Ives: Principles of data integration, Morgan Kaufmann, 2012.
  • Özsu and Valduriez: Principles of distributed database systems, Springer, 2011.
  • Stefan Conrad: Föderierte Datenbanksysteme, Springer, 1997.

Throughout the lecture, I will refer to various scientific papers that serve as in-depth references.

Depending on the number of participants, we will conduct a written or an oral exam.