Data Profiling and Data Cleansing
Description
According to Wikipedia, data profiling is the process of examining the data available in an existing data source [...] and collecting statistics and information about that data. It encompasses a vast array of methods to examine data sets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute usually involve multiple columns, such as inclusion dependencies or functional dependencies between columns. More advanced techniques detect approximate properties or conditional properties of the data set at hand. The first part of the lecture examines efficient detection methods for these properties.
Data profiling is relevant as a preparatory step to many use cases, such as query optimization, data mining, data integration, and data cleansing.
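To make the simpler statistics and the notion of a functional dependency more concrete, here is a minimal sketch in plain Python. It is not taken from the lecture material: the function names (`value_pattern`, `profile_column`, `holds_fd`), the pattern alphabet, and the sample rows are all assumptions chosen for illustration, and the naive dependency check is far from the efficient detection methods the lecture covers.

```python
import re
from collections import Counter


def value_pattern(value):
    """Generalize a string: uppercase -> 'A', lowercase -> 'a', digit -> '9'."""
    pattern = re.sub(r"[A-Z]", "A", value)
    pattern = re.sub(r"[a-z]", "a", pattern)
    return re.sub(r"[0-9]", "9", pattern)


def profile_column(values):
    """Collect simple single-column statistics (None is treated as null)."""
    non_null = [v for v in values if v is not None]
    patterns = Counter(value_pattern(str(v)) for v in non_null)
    return {
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "top_patterns": patterns.most_common(2),
    }


def holds_fd(rows, lhs, rhs):
    """Naively check whether the functional dependency lhs -> rhs holds."""
    seen = {}
    for row in rows:
        key = tuple(row[column] for column in lhs)
        # The first value seen for a key is stored; any later deviation
        # for the same key violates the dependency.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


# Hypothetical sample data, invented for this sketch.
rows = [
    {"zip": "14482", "city": "Potsdam", "phone": "0331-5509"},
    {"zip": "14482", "city": "Potsdam", "phone": None},
    {"zip": "10115", "city": "Berlin", "phone": "030 1234"},
    {"zip": "10117", "city": "Berlin", "phone": "030-9876"},
]

phone_profile = profile_column([row["phone"] for row in rows])
zip_determines_city = holds_fd(rows, ["zip"], "city")
city_determines_zip = holds_fd(rows, ["city"], "zip")
```

In this sample, `zip -> city` holds while `city -> zip` does not, because Berlin appears with two different zip codes. The check makes a single pass over the rows for one candidate dependency; discovering all dependencies of a table requires searching a much larger candidate space.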
Many of the insights gained during data profiling point to deficiencies of the data. Profiling reveals data errors, such as inconsistent formatting within a column, missing values, or outliers. Profiling results can also be used to measure and monitor the general quality of a dataset, for instance by determining the number of records that do not conform to previously established constraints. The second part of the lecture examines various methods and algorithms to improve the quality of data, with an emphasis on the many existing duplicate detection approaches.
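Measuring quality as the number of records that do not conform to previously established constraints can be sketched as follows. This is an illustrative example only: the constraint names, the plausibility bounds, and the sample records are assumptions, not part of the course.

```python
import re


def violations_per_constraint(records, constraints):
    """Count, for each named constraint, the records that violate it."""
    return {
        name: sum(1 for record in records if not check(record))
        for name, check in constraints.items()
    }


def violation_rate(records, constraints):
    """Return the fraction of records violating at least one constraint."""
    if not records:
        return 0.0
    violating = sum(
        1
        for record in records
        if not all(check(record) for check in constraints.values())
    )
    return violating / len(records)


# Hypothetical constraints: each maps a record to True if it conforms.
constraints = {
    "zip_is_five_digits": lambda r: re.fullmatch(r"\d{5}", r["zip"]) is not None,
    "age_is_plausible": lambda r: r["age"] is not None and 0 <= r["age"] <= 120,
}

# Hypothetical records containing a malformed zip code, a missing value,
# and an outlier.
records = [
    {"zip": "14482", "age": 34},
    {"zip": "1448", "age": 29},
    {"zip": "10115", "age": None},
    {"zip": "10117", "age": 230},
]

counts = violations_per_constraint(records, constraints)
rate = violation_rate(records, constraints)
```

Tracking such counts over time is one simple way to monitor the quality of a dataset; which constraints are meaningful depends entirely on the application.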
Additional information
- Lectures are given in English.
- Slides are available in the HPI-internal materials folder.
Schedule
Tuesdays and Thursdays, 9:15 - 10:45, in HS 1
ATTENTION: The following schedule is subject to change!
| Date | Topic | Slides |
|---|---|---|
| Tue 9.4.2013 | Introduction and motivation | |
| Thu 11.4.2013 | Introduction to data profiling | |
| Tue 16.4.2013 | Exercise 1: Uniqueness detection | |
| Thu 18.4.2013 | Data profiling challenges and vision | |
| Tue 23.4.2013 | Guest lecture Arvid Heise: Unique column combinations | |
| Thu 25.4.2013 | Inclusion Dependencies | |
| Tue 30.4.2013 | Guest lecture Jana Bauckmann: Conditional inclusion dependencies | |
| Thu 2.5.2013 | Exercise 2: Inclusion dependencies | |
| Tue 7.5.2013 | Guest lecture Yannick Saillet: IBM Information Analyzer | |
| Thu 9.5.2013 | Ascension Day (Christi Himmelfahrt) | |
| Tue 14.5.2013 | no lecture | |
| Thu 16.5.2013 | Exercise 3: Functional dependencies | |
| Tue 21.5.2013 | Functional dependencies | |
| Thu 23.5.2013 | Guest lecture Niels Weigel: Data Profiling - Use Cases, Tools, and Solutions at SAP | |
| Tue 28.5.2013 | Guest lecture Anja Jentzsch: Profiling linked open data | |
| Thu 30.5.2013 | Exercise 4: Presentations functional dependencies | |
| Tue 4.6.2013 ATTENTION: H-2.57 | Data quality | |
| Thu 6.6.2013 | Duplicate detection + Handout for duplicate detection exercise | |
| Tue 11.6.2013 | Similarity measures | |
| Thu 13.6.2013 | Similarity measures | |
| Tue 18.6.2013 | no lecture | |
| Thu 20.6.2013 | Generic Entity Resolution with Swoosh | |
| Tue 25.6.2013 | Exercise 5: Presentations Duplicate Detection | |
| Thu 27.6.2013 | Exercise 6: Presentations Duplicate Detection | |
| Tue 2.7.2013 ATTENTION: H-E.51 | Sorted Neighborhood Methods | |
| Thu 4.7.2013 | no lecture | |
| Tue 9.7.2013 | Sorted Neighborhood Methods | |
| Thu 11.7.2013 | Joint Entity Resolution | |
Literature
The course does not follow a textbook. Each lecture references various scientific articles and other sources of information. Good places to find those articles are:
- DBLP
- ACM's Digital Library
- Google Scholar
- Authors' homepages
Below is a list of books that are of general interest to the lecture:
Data Profiling (mostly books about data mining)
- Jiawei Han, Micheline Kamber, Jian Pei: Data Mining: Concepts and Techniques
- Dorian Pyle: Data Preparation for Data Mining
Data Cleansing
- Ulf Leser and Felix Naumann: Informationsintegration, dpunkt Verlag, 2006.
  Many copies of the book are available in the library and at our chair. It is also available from, e.g., Amazon.de.
- Peter Christen: Data Matching, Springer
  The library holds 10 copies of this book.
- Alon Halevy et al.: Information Integration
- Tamer Özsu and Patrick Valduriez: Distributed Database Systems
- Stefan Conrad: Föderierte Datenbanksysteme
Exam
Tuesday, July 16th, at 10 am in HS 1.