Data integration is the merging of heterogeneous information from various data sources into a single homogeneous, clean dataset. Despite research and development over the past 40 years, collecting and integrating data from multiple sources remains an important and challenging task in any data-oriented or data science project. This lecture covers the basic technologies, such as distributed database architectures, techniques for virtual and materialized integration, data profiling, and data cleansing. It thus combines the previous foundational lectures on information integration and data profiling to lay a foundation for handling unknown data.
Further Information:
- Lectures will be given in English.
- Please enroll yourself in our Moodle course by April 18, because we will use it to coordinate the exercises.
Exercises:
- The exercises are led by Sebastian Schmidl.
- Passing the exercises is required for admission to the exam.
- In the exercises, you will work in teams of two on four different assignments: two about data profiling (multivalued dependencies) and two about data cleaning (duplicate detection).
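To give a first impression of the duplicate detection task mentioned above, here is a minimal sketch in Python that flags likely duplicate records using token-based Jaccard similarity. The records and the 0.5 threshold are illustrative assumptions only, not material from the course:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def find_duplicates(records, threshold=0.5):
    """Return index pairs of records whose similarity reaches the threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if jaccard(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs

# Hypothetical example data: records 0 and 1 describe the same entity.
records = [
    "Hasso Plattner Institute Potsdam",
    "hasso plattner institute, Potsdam",
    "Humboldt University Berlin",
]
print(find_duplicates(records))  # → [(0, 1)]
```

Real duplicate detection systems add blocking to avoid the quadratic pairwise comparison and use more robust similarity measures, which is the kind of refinement the exercises explore.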
- We will probably use five of the normal lecture slots to introduce and discuss the exercises.