Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Information Integration

Information integration is the merging of heterogeneous information from various data sources into a homogeneous, clean dataset. This lecture introduces this ever-important topic. It covers the basic technologies, such as distributed database architectures, techniques for virtual and materialized integration, and data cleansing technologies.

Further Information:

  • Lectures can be given in English on request.
  • Slides will be made available in the HPI-internal materials folder.
  • The lectures will be recorded by tele-task.
  • The exercises are led by Tim Repke.

Schedule

The course takes place on Mondays in F-E.06 and on Thursdays in H-2.57, at 09:15 AM. Some sessions will be held as exercises.

To download lecture slides, please click the links below.

Date                  Topic
MO 14.10.             No lecture - HPI Plenary Meeting
TH 17.10.             Exercise: Organization and Task Introduction
MO 21.10.             Introduction
TH 24.10.             Distribution, Autonomy, and Heterogeneity
MO 28.10. in F-E.06   Exercise: Extracted Datasets
Reformation Day       --
MO 04.11.             Distribution, Autonomy, and Heterogeneity
TH 07.11.             Materialized and Virtual Integration
MO 11.11.             Exercise: Project Task Definition (analytics question)
TH 14.11. in F-E.06   Web Table Research (Hazar Harmouch)
                      Architectures (Felix Naumann)
MO 18.11.             No lecture
TH 21.11.             No lecture
MO 25.11.             Architectures & SchemaSQL
TH 28.11.             SchemaSQL
MO 02.12.             Schema Matching
TH 05.12.             Exercise: Data Integration (schema matching, transformation, normalization)
MO 09.12.             Schema Mapping
TH 12.12.             Global-as-View
MO 16.12.             Local-as-View
TH 19.12.             Local-as-View
Christmas break       --
MO 06.01.             Bucket Algorithm (Dr. Armin Roth, Universität Tübingen)
TH 09.01.             Exercise: Data Cleansing (duplicate detection, linkage, data fusion)
MO 13.01.             Duplicate Detection
TH 16.01.             Duplicate Detection
MO 20.01.             Duplicate Detection
TH 23.01.             Information Quality
MO 27.01.             No lecture
TH 30.01.             Scalable Data Cleansing (Dr. Jorge Quiane-Ruiz, TU Berlin)
MO 03.02.             Exercise: Analytics (visualizations, etc. to answer the initial question)
TH 06.02.             Exam Preparation
Feb. 11 and 12        Oral exams

Office Hours

If you have any questions about the lecture or the exercises, feel free to contact Tim Repke or come by during office hours:

Every Monday, 14:00 - 15:00
Room F-2.07

Exceptions:

  • 09.11. (any other day that week on request)
  • 23.12. - 05.01.

Literature & Exam

  • Ulf Leser and Felix Naumann: Informationsintegration, dpunkt Verlag, 2006 (free PDF).
    This book is available at the UP library and also, e.g., from Amazon.de.
  • Doan, Halevy, and Ives: Principles of data integration, Morgan Kaufmann, 2012.
  • Özsu and Valduriez: Principles of distributed database systems, Springer, 2011.
  • Stefan Conrad: Föderierte Datenbanksysteme, Springer, 1997.

Throughout the lecture, I will refer to various scientific papers that serve as in-depth references.

Oral exams (30 min) will take place on February 11 and 12, 2020. Please contact Diana Stephan and check the Doodle poll for the schedule.