Data Profiling and Data Cleansing

Description

According to Wikipedia, data profiling is the process of examining the data available in an existing data source [...] and collecting statistics and information about that data. It encompasses a vast array of methods to examine data sets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute usually involve multiple columns, such as inclusion dependencies or functional dependencies between columns. More advanced techniques detect approximate properties or conditional properties of the data set at hand. The first part of the lecture examines efficient detection methods for these properties.
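As a rough illustration of the simpler profiling results named above, the following minimal sketch computes null counts, distinct counts, data types, and the most frequent value patterns for a single column, and naively checks whether a functional dependency holds. The helper names `profile_column` and `holds_fd` and the sample rows are hypothetical and chosen for illustration only. The efficient detection methods covered in the lecture are considerably more involved than this brute-force check.

```python
import re
from collections import Counter


def profile_column(values):
    """Collect simple single-column statistics; None marks a missing value."""
    present = [v for v in values if v is not None]

    def pattern(value):
        # Generalize a value: every letter becomes 'A', every digit becomes '9'.
        letters_masked = re.sub(r"[A-Za-z]", "A", str(value))
        return re.sub(r"[0-9]", "9", letters_masked)

    return {
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "types": sorted({type(v).__name__ for v in present}),
        "top_patterns": Counter(pattern(v) for v in present).most_common(2),
    }


def holds_fd(rows, lhs, rhs):
    """Naively check whether the functional dependency lhs -> rhs holds."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        val = tuple(row[c] for c in rhs)
        # The first occurrence of a key fixes its right-hand side.
        if seen.setdefault(key, val) != val:
            return False
    return True


# Hypothetical sample data, not taken from the lecture.
rows = [
    {"zip": "14482", "city": "Potsdam", "name": "Ada"},
    {"zip": "14482", "city": "Potsdam", "name": "Bob"},
    {"zip": "10115", "city": "Berlin", "name": None},
    {"zip": "1011A", "city": "Berlin", "name": "Ada"},
]

zip_profile = profile_column([r["zip"] for r in rows])
name_profile = profile_column([r["name"] for r in rows])
zip_determines_city = holds_fd(rows, ["zip"], ["city"])
city_determines_zip = holds_fd(rows, ["city"], ["zip"])
```

In this sample, the pattern statistics expose the malformed value `1011A`, and the dependency from `zip` to `city` holds while the reverse does not.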

Data profiling is relevant as a preparatory step to many use cases, such as query optimization, data mining, data integration, and data cleansing.

Many of the insights gained during data profiling point to deficiencies of the data. Profiling reveals data errors, such as inconsistent formatting within a column, missing values, or outliers. Profiling results can also be used to measure and monitor the general quality of a dataset, for instance by determining the number of records that do not conform to previously established constraints. The second part of the lecture examines various methods and algorithms to improve the quality of data, with an emphasis on the many existing duplicate detection approaches.
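To make the two ideas in this paragraph concrete, here is a minimal sketch that counts records violating previously established constraints and finds likely duplicates by comparing all record pairs. The function names, the sample records, the constraints, and the threshold of 0.85 are assumptions made for illustration. The similarity measure is the ratio from Python's standard `difflib.SequenceMatcher`, swapped in as a stand-in for the similarity measures discussed in the lecture, and the all-pairs comparison is the naive quadratic baseline rather than one of the more scalable approaches.

```python
from difflib import SequenceMatcher
from itertools import combinations


def count_violations(records, constraints):
    """Count records that violate at least one constraint (a predicate)."""
    return sum(1 for r in records if not all(check(r) for check in constraints))


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] using difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicates(records, key, threshold=0.85):
    """Naive duplicate detection: compare every pair of records on one field."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if similarity(a[key], b[key]) >= threshold
    ]


# Hypothetical sample data and constraints.
records = [
    {"name": "Jane Smith", "age": 34},
    {"name": "Jane Smyth", "age": 34},
    {"name": "John Doe", "age": -5},
]
constraints = [
    lambda r: r["age"] >= 0,             # ages must be non-negative
    lambda r: r["name"].strip() != "",   # names must not be empty
]

violations = count_violations(records, constraints)
duplicate_pairs = find_duplicates(records, "name")
```

The number of violating records can serve as a simple quality measure to monitor over time, while the returned index pairs are candidates for manual or automated merging.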

Additional information

Lectures are given in English.
Slides are available in the HPI-internal materials folder.
This lecture is a repeat of the lecture given in summer 2013.

Schedule

Schedule: Mondays at 15:15 and Thursdays at 11:00 in HS 2

The lecture is held in German this year. It is recorded on tele-task.

ATTENTION: The following schedule is subject to change.

Date	Topic	Slides
MO 13.10.	Big Data Introduction
TH 16.10.	Exercise: Metanome and UCCs	slides, task, demo
MO 20.10.	Data Profiling Introduction
TH 23.10.	Data profiling challenges and outlook
MO 27.10.	Guest lecture: Unique column combinations (Arvid Heise)
TH 30.10.	no lecture
MO 03.11.	no lecture
TH 06.11.	no lecture
MO 10.11.	Inclusion dependencies
TH 13.11.	Conditional inclusion dependencies
MO 17.11.	Exercise: UCCs and INDs	slides, task
TH 20.11.	Thorsten Papenbrock & Sebastian Kruse: Advanced IND Detection Methods
MO 24.11.	Functional dependencies
TH 27.11.	Guest lecture: Yannick Saillet (IBM)
MO 01.12.	Functional dependencies
TH 04.12.	Jens Ehrlich & Fabian Tschirschnitz: Conditional Uniques & IND Detection at Scale
MO 08.12.	Exercise: INDs and FDs	slides, task
TH 11.12.	Functional dependencies
MO 15.12.	Introduction to Data Quality
TH 18.12.	Duplicate Detection
	Christmas holidays
MO 05.01.	Similarity measures
TH 08.01.	Similarity measures
MO 12.01.	Exercise: FDs and Duplicate Detection	slides, task
TH 15.01.	Sorted Neighborhood Methods
MO 19.01.	Sorted Neighborhood Methods
TH 22.01.	no lecture
MO 26.01.	Generic Entity Resolution
TH 29.01.	Anja Jentzsch: Profiling Linked Open Data
MO 02.02.	Exercise: Duplicate Detection	slides
TH 05.02.	Exam preparation
WED 11.02. (10am - 12pm)	Exam in HS 1

Literature

The course does not follow a textbook. Each lecture references various scientific articles and other sources of information. Good places to find those articles are:

DBLP
ACM's Digital Library
Google Scholar
Authors' homepages

Below is a list of books that are of general interest to the lecture:

Data Profiling (mostly books about data mining)

A short introductory article: Felix Naumann, Data Profiling Revisited, SIGMOD Record 2013
Jiawei Han, Micheline Kamber, Jian Pei: Data Mining: Concepts and Techniques
Dorian Pyle: Data Preparation for Data Mining

Data Cleansing

Ulf Leser and Felix Naumann: Informationsintegration, dpunkt Verlag, 2006.
Many copies of the book are available in the library and at our chair. It is also available elsewhere, e.g. from Amazon.de.
Peter Christen: Data Matching, Springer
The library holds 10 copies of the book.
Alon Halevy et al.: Information Integration
Tamer Özsu and Patrick Valduriez: Distributed Database Systems
Stefan Conrad: Föderierte Datenbanksysteme

Exam

A written exam will take place on February 11 from 10am until noon in HS 1.