Data Profiling and Data Cleansing
Description
According to Wikipedia, data profiling is the process of examining the data available in an existing data source [...] and collecting statistics and information about that data. It encompasses a vast array of methods to examine data sets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute usually involve multiple columns, such as inclusion dependencies or functional dependencies between columns. More advanced techniques detect approximate properties or conditional properties of the data set at hand. The first part of the lecture examines efficient detection methods for these properties.
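To make these kinds of metadata concrete, the following is a minimal Python sketch and is not taken from the lecture materials. It computes a few single-column statistics and performs naive checks for a functional dependency and an inclusion dependency. All function names (`profile_column`, `holds_fd`, `holds_ind`) and the sample data are hypothetical, and the checks are deliberately simple; they are not the efficient detection methods the lecture covers.

```python
import re
from collections import Counter


def profile_column(values):
    """Collect simple single-column statistics; None marks a null value."""
    non_null = [v for v in values if v is not None]
    # Generalize each value to a pattern: digits become 9, ASCII letters become A.
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v))) for v in non_null
    )
    return {
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "types": sorted({type(v).__name__ for v in non_null}),
        "top_patterns": patterns.most_common(2),
    }


def holds_fd(rows, lhs, rhs):
    """Naive check of the functional dependency lhs -> rhs over a list of dicts."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        val = tuple(row[c] for c in rhs)
        # Two rows that agree on lhs must also agree on rhs.
        if seen.setdefault(key, val) != val:
            return False
    return True


def holds_ind(dependent_values, referenced_values):
    """Naive check of an inclusion dependency: every non-null dependent
    value must also occur among the referenced values."""
    dependent = {v for v in dependent_values if v is not None}
    return dependent <= set(referenced_values)


# Hypothetical sample data for illustration only.
rows = [
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "10115", "city": "Berlin"},
    {"zip": "10117", "city": "Berlin"},
]
stats = profile_column(["10115", "14482", None, "1448A", "10115"])
zip_determines_city = holds_fd(rows, ["zip"], ["city"])
city_determines_zip = holds_fd(rows, ["city"], ["zip"])
```

In this sample, `zip -> city` holds while `city -> zip` does not, because Berlin appears with two different zip codes. The FD check makes one pass over the rows for a single candidate; the difficulty addressed in the lecture is that the number of candidate column combinations grows exponentially with the number of columns.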
Data profiling is relevant as a preparatory step to many use cases, such as query optimization, data mining, data integration, and data cleansing.
Many of the insights gained during data profiling point to deficiencies of the data. Profiling reveals data errors, such as inconsistent formatting within a column, missing values, or outliers. Profiling results can also be used to measure and monitor the general quality of a dataset, for instance by determining the number of records that do not conform to previously established constraints. The second part of the lecture examines various methods and algorithms to improve the quality of data, with an emphasis on the many existing duplicate detection approaches.
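As a small illustration of both ideas in this paragraph, the sketch below counts records that violate a given constraint and finds candidate duplicates by comparing all record pairs with a token-based Jaccard similarity. It is an assumption-laden example rather than a method from the lecture: the function names, the threshold of 0.5, and the sample records are all hypothetical, and the quadratic pairwise comparison is exactly what approaches such as the sorted neighborhood method try to avoid.

```python
from itertools import combinations


def count_violations(records, constraint):
    """Number of records for which the constraint (a predicate) is false."""
    return sum(1 for r in records if not constraint(r))


def jaccard(a, b):
    """Jaccard similarity of the lowercase word-token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def find_duplicates(records, key, threshold=0.5):
    """Compare all pairs of records on one attribute and return the index
    pairs whose similarity reaches the (hypothetical) threshold."""
    pairs = []
    for (i, r1), (j, r2) in combinations(enumerate(records), 2):
        if jaccard(r1[key], r2[key]) >= threshold:
            pairs.append((i, j))
    return pairs


# Hypothetical sample records for illustration only.
records = [
    {"name": "Anna Schmidt", "zip": "14482"},
    {"name": "Schmidt Anna", "zip": "14482"},
    {"name": "Peter Meier", "zip": None},
]


def has_valid_zip(record):
    """Example constraint: zip is present and exactly five characters long."""
    return record["zip"] is not None and len(record["zip"]) == 5


violations = count_violations(records, has_valid_zip)
duplicates = find_duplicates(records, "name")
```

Here the first two records are reported as a candidate duplicate pair because their names contain the same tokens in a different order, and one record violates the zip constraint.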
Additional information
- Lectures are given in English.
- Slides are available on the HPI-internal materials-folder.
- This lecture repeats the course held in summer 2013.
Schedule
Lectures take place on Mondays at 15:15 and Thursdays at 11:00 in HS 2.
The lecture is held in German this year. It is recorded on tele-task.
ATTENTION: The following schedule is subject to change.
| Date | Topic | Slides |
|---|---|---|
| MO 13.10. | Big Data Introduction | |
| TH 16.10. | Exercise: Metanome and UCCs | slides, task, demo |
| MO 20.10. | Data Profiling Introduction | |
| TH 23.10. | Data profiling challenges and outlook | |
| MO 27.10. | Guest lecture: Unique column combinations (Arvid Heise) | |
| TH 30.10. | no lecture | |
| MO 03.11. | no lecture | |
| TH 06.11. | no lecture | |
| MO 10.11. | Inclusion dependencies | |
| TH 13.11. | Conditional inclusion dependencies | |
| MO 17.11. | Exercise: UCCs and INDs | slides, task |
| TH 20.11. | Thorsten Papenbrock & Sebastian Kruse: Advanced IND Detection Methods | |
| MO 24.11. | Functional dependencies | |
| TH 27.11. | Guest lecture: Yannick Saillet (IBM) | |
| MO 01.12. | Functional dependencies | |
| TH 04.12. | Jens Ehrlich & Fabian Tschirschnitz: Conditional Uniques & IND Detection at Scale | |
| MO 08.12. | Exercise: INDs and FDs | slides, task |
| TH 11.12. | Functional dependencies | |
| MO 15.12. | Introduction to Data Quality | |
| TH 18.12. | Duplicate Detection | |
| Christmas holidays | | |
| MO 05.01. | Similarity measures | |
| TH 08.01. | Similarity measures | |
| MO 12.01. | Exercise: FDs and Duplicate Detection | slides, task |
| TH 15.01. | Sorted Neighborhood Methods | |
| MO 19.01. | Sorted Neighborhood Methods | |
| TH 22.01. | no lecture | |
| MO 26.01. | Generic Entity Resolution | |
| TH 29.01. | Anja Jentzsch: Profiling Linked Open Data | |
| MO 02.02. | Exercise: Duplicate Detection | slides |
| TH 05.02. | Exam preparation | |
| WED 11.02. (10:00 - 12:00) | Exam in HS 1 | |
Literature
The course does not follow a textbook. Each lecture references various scientific articles and other sources of information. Good places to find those articles are:
- DBLP
- ACM's Digital Library
- Google Scholar
- Authors' homepages
Below is a list of books that are of general interest to the lecture:
Data Profiling (mostly books about data mining)
- A short introductory article: Felix Naumann, Data Profiling Revisited, SIGMOD Record 2013
- Jiawei Han, Micheline Kamber, Jian Pei: Data Mining: Concepts and Techniques
- Dorian Pyle: Data Preparation for Data Mining
Data Cleansing
- Ulf Leser and Felix Naumann: Informationsintegration, dpunkt Verlag, 2006.
  Many copies of the book are available in the library and at our chair. It is also available elsewhere, e.g. at Amazon.de.
- Peter Christen: Data Matching, Springer
  The library holds 10 copies of the book.
- Alon Halevy et al. "Information Integration"
- Tamer Özsu and Patrick Valduriez "Distributed Database Systems"
- Stefan Conrad "Föderierte Datenbanksysteme"
Exam
A written exam will take place on February 11 from 10am until noon in HS 1.