Hasso-Plattner-Institut
  
Prof. Dr. Felix Naumann
  
 

Description

According to Wikipedia, data profiling is the process of examining the data available in an existing data source [...] and collecting statistics and information about that data. It encompasses a vast array of methods to examine data sets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute usually involve multiple columns, such as inclusion dependencies or functional dependencies between columns. More advanced techniques detect approximate properties or conditional properties of the data set at hand. The first part of the lecture examines efficient detection methods for these properties.
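To make the kinds of metadata mentioned above more concrete, here is a minimal sketch in Python. It is not taken from the lecture material: the function names (`value_pattern`, `profile_column`, `holds_fd`), the pattern encoding (digits become `9`, letters become `A`), and the sample data are assumptions chosen for illustration. It computes simple single-column statistics and naively checks one functional dependency; the efficient detection methods covered in the lecture are considerably more involved.

```python
from collections import Counter
import re


def value_pattern(value: str) -> str:
    """Map a value to a coarse pattern: digits become 9, letters become A."""
    pattern = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", pattern)


def profile_column(values):
    """Single-column statistics for a list of values, where None marks a null."""
    non_null = [v for v in values if v is not None]
    patterns = Counter(value_pattern(str(v)) for v in non_null)
    return {
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "top_patterns": patterns.most_common(2),
    }


def holds_fd(rows, lhs, rhs):
    """Naive check whether the functional dependency lhs -> rhs holds.

    rows is a list of dicts; the dependency holds if equal lhs values
    never occur together with different rhs values.
    """
    seen = {}
    for row in rows:
        key = row[lhs]
        if key in seen and seen[key] != row[rhs]:
            return False
        seen.setdefault(key, row[rhs])
    return True


# Illustrative sample data (made up for this sketch).
zip_codes = ["14482", "10115", None, "D-14482"]
addresses = [
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "14469", "city": "Potsdam"},
    {"zip": "14482", "city": "Potsdam"},
]

print(profile_column(zip_codes))
print(holds_fd(addresses, "zip", "city"))
print(holds_fd(addresses, "city", "zip"))
```

The dependency check only tests a single candidate; discovering all dependencies in a table requires searching a large space of column combinations, which is why efficient algorithms matter.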

Data profiling is relevant as a preparatory step to many use cases, such as query optimization, data mining, data integration, and data cleansing.

Many of the insights gained during data profiling point to deficiencies of the data. Profiling reveals data errors, such as inconsistent formatting within a column, missing values, or outliers. Profiling results can also be used to measure and monitor the general quality of a dataset, for instance by determining the number of records that do not conform to previously established constraints. The second part of the lecture examines various methods and algorithms to improve the quality of data, with an emphasis on the many existing duplicate detection approaches.
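As a rough illustration of measuring quality by counting records that do not conform to previously established constraints, consider the following Python sketch. The function `violation_rate`, the constraint names, and the sample records are hypothetical and only serve as an example; real constraints would come from the application domain or from earlier profiling results.

```python
import re


def violation_rate(records, constraints):
    """Fraction of records that violate at least one constraint.

    constraints maps a descriptive name to a predicate that returns
    True when a record conforms.
    """
    if not records:
        return 0.0
    violating = sum(
        1
        for record in records
        if any(not check(record) for check in constraints.values())
    )
    return violating / len(records)


# Hypothetical constraints for an address table.
constraints = {
    "zip_is_5_digits": lambda r: r["zip"] is not None
    and re.fullmatch(r"[0-9]{5}", r["zip"]) is not None,
    "city_not_missing": lambda r: bool(r["city"]),
}

# Illustrative sample records (made up for this sketch).
records = [
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "1448", "city": "Potsdam"},   # malformed zip code
    {"zip": None, "city": ""},            # missing values
    {"zip": "10115", "city": "Berlin"},
]

print(violation_rate(records, constraints))
```

Tracking such a rate over time is one simple way to monitor a dataset; it says nothing about duplicates, which need the dedicated detection approaches discussed in the second part of the lecture.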

Additional information

  • Lectures are given in English.
  • Slides are available in the HPI-internal materials folder.

Schedule

Schedule: Tuesdays and Thursdays 9:15 - 10:45 in HS 1

ATTENTION: The following schedule is subject to change!

| Date | Topic | Slides |
|---|---|---|
| Tue 9.4.2013 | Introduction and motivation | pdf |
| Thu 11.4.2013 | Introduction to data profiling | |
| Tue 16.4.2013 | Exercise 1: Uniqueness detection | |
| Thu 18.4.2013 | Data profiling challenges and vision | pdf |
| Tue 23.4.2013 | Guest lecture Arvid Heise: Unique column combinations | pdf |
| Thu 25.4.2013 | Inclusion dependencies | pdf |
| Tue 30.4.2013 | Guest lecture Jana Bauckmann: Conditional inclusion dependencies | pdf |
| Thu 2.5.2013 | Exercise 2: Inclusion dependencies | |
| Tue 7.5.2013 | Guest lecture Yannick Saillet: IBM Information Analyzer | |
| Thu 9.5.2013 | Ascension Day (Christi Himmelfahrt) | |
| Tue 14.5.2013 | no lecture | |
| Thu 16.5.2013 | Exercise 3: Functional dependencies | |
| Tue 21.5.2013 | Functional dependencies | pdf |
| Thu 23.5.2013 | Guest lecture Niels Weigel: Data Profiling - Use Cases, Tools, and Solutions at SAP | |
| Tue 28.5.2013 | Guest lecture Anja Jentzsch: Profiling linked open data | pdf |
| Thu 30.5.2013 | Exercise 4: Presentations functional dependencies | |
| Tue 4.6.2013 (ATTENTION: H-2.57) | Data quality | pdf |
| Thu 6.6.2013 | Duplicate detection + Handout for duplicate detection exercise | pdf |
| Tue 11.6.2013 | Similarity measures | pdf |
| Thu 13.6.2013 | Similarity measures | |
| Tue 18.6.2013 | no lecture | |
| Thu 20.6.2013 | Generic Entity Resolution with Swoosh | pdf |
| Tue 25.6.2013 | Exercise 5: Presentations duplicate detection | |
| Thu 27.6.2013 | Exercise 6: Presentations duplicate detection | |
| Tue 2.7.2013 (ATTENTION: H-E.51) | Sorted Neighborhood Methods | pdf |
| Thu 4.7.2013 | no lecture | |
| Tue 9.7.2013 | Sorted Neighborhood Methods | |
| Thu 11.7.2013 | Joint Entity Resolution | pdf |

Literature

The course does not follow a textbook. Each lecture references various scientific articles and other sources of information. Good sources to find those articles are

 

Below is a list of books that are of general interest to the lecture.

Data Profiling (mostly books about data mining)

  • Jiawei Han, Micheline Kamber, Jian Pei: Data Mining: Concepts and Techniques
  • Dorian Pyle: Data Preparation for Data Mining

 

Data Cleansing

  • Ulf Leser and Felix Naumann: Informationsintegration, dpunkt Verlag, 2006.
    Many copies of the book are available in the library and at our chair. It is also available elsewhere, e.g., at Amazon.de.
  • Peter Christen: Data Matching, Springer
    The library holds 10 copies of the book.
  • Alon Halevy et al.: Information Integration
  • Tamer Özsu and Patrick Valduriez: Distributed Database Systems
  • Stefan Conrad: Föderierte Datenbanksysteme

Exam

Tuesday, July 16th at 10 am in HS 1.