Data Cleaning & Integration

Prof. Dr. Felix Naumann, Dr. Fabian Panse, and Dr. Matteo Paganelli

Introduction

If you are interested in participating, please attend the initial session and reach out to fabian.panse(at)hpi.de by October 20.

Please do not hesitate to contact us if you are interested but the current time slot does not fit your schedule. In that case, please mention the conflict in your e-mail. We will try to reschedule our meetings to allow more students to participate.

Description

The integration of data has been a highly regarded field of research for decades, and with increasing digitization it is becoming more and more important. However, such integration has to overcome many hurdles and deal with a variety of problems, including schema matching, duplicate detection, and record fusion.

The quality of data has a critical influence on the result of its processing and thus directly affects important real-world processes, such as business decisions or patient treatment. Therefore, data cleansing is an important aspect of data management. The goal of a data cleansing process is to detect and correct errors in data. Errors can affect single attribute values, but can also span multiple attribute values, records, or even tables. These can be typos, phonetic errors, OCR errors, semantic errors, missing values or records, incorrect formatting, incorrect abstraction levels (e.g., district instead of city), obsolete values, incorrect foreign key references, or swaps between two attribute values. Errors can be detected in several ways, for example with statistical methods or with the help of integrity constraints.
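To make the two detection approaches named above more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not material from the seminar: the toy records, the attribute names (zip, city, age), the helper names fd_violations and mad_outliers, and the threshold of 3.5 are all hypothetical. The first function checks an integrity constraint in the form of a functional dependency (zip determines city); the second flags numeric outliers with the modified z-score, which is based on the median absolute deviation.

```python
from collections import defaultdict
from statistics import median


def fd_violations(records, lhs, rhs):
    """Return every lhs value that maps to more than one rhs value,
    i.e. the violations of the functional dependency lhs -> rhs."""
    seen = defaultdict(set)
    for rec in records:
        seen[rec[lhs]].add(rec[rhs])
    return {key: values for key, values in seen.items() if len(values) > 1}


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.
    The threshold of 3.5 is a common rule of thumb, assumed here."""
    values = list(values)
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: the score is undefined, flag nothing.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Made-up toy data: one typo in a city name and one implausible age.
records = [
    {"zip": "14482", "city": "Potsdam", "age": 34},
    {"zip": "14482", "city": "Potsdm", "age": 29},
    {"zip": "10115", "city": "Berlin", "age": 41},
    {"zip": "10115", "city": "Berlin", "age": 38},
    {"zip": "20095", "city": "Hamburg", "age": 350},
]

constraint_errors = fd_violations(records, "zip", "city")
statistical_errors = mad_outliers(rec["age"] for rec in records)

print(constraint_errors)   # zip codes associated with conflicting city names
print(statistical_errors)  # ages that deviate strongly from the median
```

Both checks only detect suspicious values; deciding which of the conflicting city names is correct, or what the true age should be, is the separate correction step of a cleansing process.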

Procedure

In this seminar, we will read, discuss, evaluate, and write summaries of recent papers on various data cleaning and data integration topics. Among others, these topics include constraint-based error detection, schema matching, and duplicate detection.

Each participant will be assigned a specific topic at the beginning along with basic literature. Based on this, the participant studies the topic, gives a presentation on it, and writes a report.

In addition, each participant will write a review for 2-3 reports written by some of the other participants.

Organization

Seminar for master's students
Language: English
Maximum number of participants: 8

At the first meeting on 16.10.2023 at 3:15 p.m. in F-E.06, we will give an introduction to the seminar and its topics. This session is open to everyone.
Afterwards, please submit a binding registration for this seminar by 20.10.2023, 11:59 a.m., by sending an informal e-mail to fabian.panse(at)hpi.de with the subject: "Registration to Data Cleaning and Integration seminar". The e-mail should include any prior knowledge that is relevant to this course (e.g., HPI courses in the data engineering, distributed computing, or machine learning area) and the topic you want to work on.
We will notify the selected participants on Friday, 20.10.2023, in the afternoon.

Grading

The grading will be based on the following parts:

Quality of seminar report (40%)
Quality of the reviews of other seminar reports (30%)
Quality of the seminar presentation (30%)

Time Table

Unless otherwise specified, the seminar takes place on Mondays at 3:15 p.m. in room F-E.06.

Date	Topic
2023-10-16	Seminar introduction
2023-10-20 11:59 a.m.	Participation feedback and topic requests (online)
2023-10-20 6:00 p.m.	Notification of participation (online)
2023-10-23	Topic assignments and first discussions
2023-10-30 - 2023-12-18	Weekly meetings and progress reports
2023-12-25	Christmas break
2024-01-01	New Year's break
2024-01-08	Submission of the seminar papers and review assignments
2024-01-15	Weekly meetings and progress reports
2024-01-22	Submission and discussion of paper reviews
2024-01-29	Seminar presentations
2024-02-05	Seminar presentations

Slides

Seminar_Data_Cleaning_and_Integration_Introduction.pdf