Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Introduction

If you are interested in participating, please attend the initial session and contact fabian.panse(at)hpi.de by October 22.

If you are interested but the current time slot does not fit your schedule, please do not hesitate to contact us and mention this in your message. We will then try to reschedule our meetings so that more students can participate.

Description

The integration of data has been a highly regarded field of research for decades, and with increasing digitization it is becoming more and more important. However, such integration has to overcome many hurdles, including schema matching, duplicate detection, and record fusion.

The quality of data has a critical influence on the result of its processing and thus directly affects important real-world processes, such as business decisions or patient treatment. Therefore, data cleansing is an important aspect of data management. The goal of a data cleansing process is to detect and correct errors in data. Errors can affect single attribute values, but can also span multiple attribute values, records, or even tables. Examples include typos, phonetic errors, OCR errors, semantic errors, missing values or records, incorrect formatting, incorrect abstraction levels (e.g., district instead of city), obsolete values, incorrect foreign key references, and swaps between two attribute values. Errors can be detected in several ways, for example with statistical methods or with the help of integrity constraints.
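To make the two detection approaches mentioned above more concrete, the following is a minimal sketch in Python. It is not taken from the seminar material: the sample records, the function names `fd_violations` and `mad_outliers`, the assumed integrity constraint (a zip code determines the city), and the threshold of 3.5 are all illustrative assumptions.

```python
from collections import defaultdict
from statistics import median


def fd_violations(records, lhs, rhs):
    """Constraint-based check for an assumed functional dependency lhs -> rhs.

    Returns every lhs value that occurs with more than one distinct rhs value.
    """
    groups = defaultdict(set)
    for record in records:
        groups[record[lhs]].add(record[rhs])
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}


def mad_outliers(values, threshold=3.5):
    """Statistical check: flag values with a large modified z-score.

    The score is based on the median absolute deviation (MAD), which is
    less sensitive to extreme values than mean and standard deviation.
    """
    values = list(values)
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread in the data, so nothing can be flagged
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical sample data containing a typo and an implausible age.
records = [
    {"zip": "14482", "city": "Potsdam", "age": 34},
    {"zip": "14482", "city": "Potsdm", "age": 29},
    {"zip": "10115", "city": "Berlin", "age": 41},
    {"zip": "10115", "city": "Berlin", "age": 38},
    {"zip": "20095", "city": "Hamburg", "age": 412},
]

print(fd_violations(records, "zip", "city"))
print(mad_outliers(r["age"] for r in records))
```

On this sample, the constraint check reports the zip code 14482 because it occurs with two different city spellings, and the statistical check flags the age 412. Neither check says which value is correct; repairing the error is a separate step.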

Procedure

In this seminar, we will read, discuss, evaluate, and write summaries of recent papers on various data cleaning and data integration topics. Among others, these topics include constraint-based error detection, schema matching, duplicate detection, and record fusion.

At the beginning, each participant will be assigned a specific topic along with basic literature. Based on this, the participant studies the topic, gives a presentation on it, and writes a report.