Data Profiling and Data Cleansing

Description

According to Wikipedia, data profiling is the process of examining the data available in an existing data source [...] and collecting statistics and information about that data. It encompasses a vast array of methods to examine data sets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute usually involve multiple columns, such as inclusion dependencies or functional dependencies between columns. More advanced techniques detect approximate properties or conditional properties of the data set at hand. The first part of the lecture examines efficient detection methods for these properties.
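
As a rough illustration of the simpler results mentioned above, the following Python sketch computes null counts, distinct counts, and the most frequent value patterns for one column, and checks whether a functional dependency holds between two columns. The function names (`value_pattern`, `profile_column`, `holds_fd`) and the sample rows are hypothetical and not taken from the lecture; this is a naive approach, not one of the efficient detection methods the lecture covers.

```python
from collections import Counter
import re


def value_pattern(value):
    """Abstract a value into a pattern: every digit -> 9, every letter -> A."""
    pattern = re.sub(r"[0-9]", "9", str(value))
    return re.sub(r"[A-Za-z]", "A", pattern)


def profile_column(rows, column):
    """Collect simple single-column statistics for a list of dict rows."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "top_patterns": Counter(value_pattern(v) for v in non_null).most_common(2),
    }


def holds_fd(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds exactly."""
    seen = {}
    for row in rows:
        key = row[lhs]
        # A second, different rhs value for the same lhs value violates the FD.
        if key in seen and seen[key] != row[rhs]:
            return False
        seen.setdefault(key, row[rhs])
    return True


# Hypothetical sample data, for illustration only.
rows = [
    {"zip": "14482", "city": "Potsdam", "phone": "0331-5509"},
    {"zip": "14482", "city": "Potsdam", "phone": None},
    {"zip": "10115", "city": "Berlin", "phone": "030 1234"},
]

phone_profile = profile_column(rows, "phone")
zip_determines_city = holds_fd(rows, "zip", "city")
```

The exact check in `holds_fd` needs a single pass over the rows for one candidate dependency. The harder problem, which this sketch does not address, is that the number of candidate column combinations grows exponentially with the number of columns.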

Data profiling is relevant as a preparatory step to many use cases, such as query optimization, data mining, data integration, and data cleansing.

Many of the insights gained during data profiling point to deficiencies of the data. Profiling reveals data errors, such as inconsistent formatting within a column, missing values, or outliers. Profiling results can also be used to measure and monitor the general quality of a dataset, for instance by determining the number of records that do not conform to previously established constraints. The second part of the lecture examines various methods and algorithms to improve the quality of data, with an emphasis on the many existing duplicate detection approaches.
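
To make the two ideas in this paragraph concrete, the sketch below counts records that violate a set of constraints and finds candidate duplicates by naive pairwise comparison. The constraints, the 0.8 threshold, the sample records, and the names `count_violations`, `similarity`, and `find_duplicates` are assumptions for illustration. `difflib.SequenceMatcher` from the Python standard library stands in for a similarity measure here and is not necessarily one of the measures treated in the lecture.

```python
from difflib import SequenceMatcher
from itertools import combinations


def count_violations(records, constraints):
    """Count records that fail at least one constraint (a predicate function)."""
    return sum(
        1 for record in records
        if not all(check(record) for check in constraints)
    )


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicates(records, key, threshold=0.8):
    """Compare all record pairs (quadratic cost) and return likely duplicates."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if similarity(a[key], b[key]) >= threshold
    ]


# Hypothetical sample data and constraints, for illustration only.
records = [
    {"name": "Jane Smith", "age": 34},
    {"name": "Jane Smyth", "age": 34},
    {"name": "Robert Jones", "age": -1},
]
constraints = [
    lambda r: r["name"].strip() != "",
    lambda r: 0 <= r["age"] <= 120,
]

violations = count_violations(records, constraints)
duplicate_pairs = find_duplicates(records, "name")
```

On this sample, one record violates the age constraint, and only the first two records are paired as likely duplicates. Comparing every pair costs quadratic time, which is why methods that reduce the number of comparisons matter for larger datasets.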

Additional information

Lectures are given in English.
Slides are available in the HPI-internal materials folder.

Schedule

Lectures take place on Tuesdays and Thursdays, 9:15 - 10:45, in HS 1.

ATTENTION: The following schedule is subject to change!

Date	Topic	Slides
Tue 9.4.2013	Introduction and motivation	pdf
Thu 11.4.2013	Introduction to data profiling
Tue 16.4.2013	Exercise 1: Uniqueness detection
Thu 18.4.2013	Data profiling challenges and vision	pdf
Tue 23.4.2013	Guest lecture Arvid Heise: Unique column combinations	pdf
Thu 25.4.2013	Inclusion dependencies	pdf
Tue 30.4.2013	Guest lecture Jana Bauckmann: Conditional inclusion dependencies	pdf
Thu 2.5.2013	Exercise 2: Inclusion dependencies
Tue 7.5.2013	Guest lecture Yannick Saillet: IBM Information Analyzer
Thu 9.5.2013	Ascension Day (Christi Himmelfahrt)
Tue 14.5.2013	no lecture
Thu 16.5.2013	Exercise 3: Functional dependencies
Tue 21.5.2013	Functional dependencies	pdf
Thu 23.5.2013	Guest lecture Niels Weigel: Data Profiling - Use Cases, Tools, and Solutions at SAP
Tue 28.5.2013	Guest lecture Anja Jentzsch: Profiling linked open data	pdf
Thu 30.5.2013	Exercise 4: Presentations functional dependencies
Tue 4.6.2013 (ATTENTION: room H-2.57)	Data quality	pdf
Thu 6.6.2013	Duplicate detection + Handout for duplicate detection exercise	pdf
Tue 11.6.2013	Similarity measures	pdf
Thu 13.6.2013	Similarity measures
Tue 18.6.2013	no lecture
Thu 20.6.2013	Generic Entity Resolution with Swoosh	pdf
Tue 25.6.2013	Exercise 5: Presentations duplicate detection
Thu 27.6.2013	Exercise 6: Presentations duplicate detection
Tue 2.7.2013 (ATTENTION: room H-E.51)	Sorted Neighborhood Methods	pdf
Thu 4.7.2013	no lecture
Tue 9.7.2013	Sorted Neighborhood Methods
Thu 11.7.2013	Joint Entity Resolution	pdf

Literature

The course does not follow a textbook. Each lecture references various scientific articles and other sources of information. Good places to find those articles are:

DBLP
ACM's Digital Library
Google Scholar
Authors' homepages

Below is a list of books that are of general interest to the lecture:

Data Profiling (mostly books about data mining)

Jiawei Han, Micheline Kamber, Jian Pei: Data Mining: Concepts and Techniques
Dorian Pyle: Data Preparation for Data Mining

Data Cleansing

Ulf Leser and Felix Naumann: Informationsintegration, dpunkt Verlag, 2006.
Multiple copies of the book are available in the library and at our chair. It is also available from, e.g., Amazon.de.
Peter Christen: Data Matching, Springer
Ten copies of the book are available in the library.
Alon Halevy et al.: Information Integration
Tamer Özsu and Patrick Valduriez: Distributed Database Systems
Stefan Conrad: Föderierte Datenbanksysteme

Exam

Tuesday, July 16th at 10 am in HS 1.