The Error Games: A Data Quality Challenge

Prof. Dr. Felix Naumann, Dr. Lisa Ehrlinger and Divya Bhadauria

Description

Error-free data sets are the foundation for the successful training of machine learning (ML) models and the use of artificial intelligence (AI). Much research has been done on classifying, describing, detecting, and cleaning data errors. The systematic generation of data errors, i.e., the pollution of data sets with known errors, is important to benchmark error detection and data cleaning algorithms. The aim of this seminar is to develop novel ideas for generating data errors that are as difficult as possible to detect.

This seminar constitutes an adversarial challenge in which teams of two students compete against each other. Each team needs to (1) first generate data errors that are as hard as possible to detect and then (2) detect difficult data errors generated by other teams. We will initially introduce the field of data quality, followed by a list of the most common types of data errors, various techniques for data pollution, such as data synthesis and perturbation, and techniques for error detection, such as statistical and ML methods. Together, we will select a number of interesting data errors to focus on in this competition.
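As a rough illustration of what a statistical detection method can look like, the following sketch flags numeric outliers with the modified z-score, which is based on the median and the median absolute deviation (MAD). The function name `mad_outliers`, the sample values, and the threshold of 3.5 (a common convention) are assumptions made for this example and are not prescribed by the seminar.

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds the threshold.

    Hypothetical helper for illustration; 0.6745 scales the MAD so that the
    score is comparable to a standard z-score for normally distributed data.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median: the score is undefined, flag nothing.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]

# An obvious error (35.0) versus a subtle one (20.6) in otherwise similar data.
obvious = [20.1, 19.8, 20.3, 20.0, 19.9, 35.0, 20.2]
subtle = [20.1, 19.8, 20.3, 20.0, 19.9, 20.6, 20.2]

print(mad_outliers(obvious))
print(mad_outliers(subtle))
```

The second list shows why small, plausible perturbations are hard to detect: a value that stays close to the rest of the column does not exceed the threshold of this simple detector.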

What are the goals of the seminar?

Learn about the research area of data quality and data errors
Read and understand scientific papers
Develop novel ideas on how to generate and detect data errors
Jointly discuss properties of “hard to detect” data errors (e.g., whether this correlates with the extent to which an error appears to be realistic)
Present results in written and oral form

The data error seminar will be organized in the following four phases:

Kickoff: We will provide clean data sets and jointly select the data error types that we will focus on for the competition.
Research: Each team will select a subset of the defined data error types, read related work, and prepare a presentation for the entire group that provides examples and describes what does and does not constitute this kind of error.
Data error challenge: The challenge itself will be carried out in two phases:
- Phase 1: Each team will pollute the input data sets with errors that appear to be realistic and should be as difficult to detect as possible. As a result, for each input data set, (1) a polluted data set, as well as (2) a set of labels with the polluted data errors, should be generated.
- Phase 2: Each team will receive the polluted data sets from the other teams. The goal is to detect as many data errors as possible.
Deliverable: At the end of the seminar, each team will prepare a presentation about (1) the data error generation strategy they used, (2) the error detection technique they used, and (3) the percentage of errors found by the respective other group.
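The two challenge phases and the reported percentage can be sketched as follows. This is a minimal example under stated assumptions, not the seminar's actual tooling: the function names `pollute` and `detection_rate`, the label format (a set of `(row_index, column)` cells), the toy data set, and the perturbation (a relative shift of 5 to 15 percent) are all hypothetical.

```python
import random

def pollute(rows, column, error_rate, seed=0):
    """Return (polluted_rows, labels) for a list of dict rows.

    labels is the set of (row_index, column) cells that were changed.
    Hypothetical sketch: shifts numeric values by 5-15% to keep them plausible.
    """
    rng = random.Random(seed)
    polluted = [dict(row) for row in rows]  # copy so the clean data stays intact
    n_errors = max(1, round(error_rate * len(rows)))
    labels = set()
    for i in rng.sample(range(len(rows)), n_errors):
        original = polluted[i][column]
        perturbed = round(original * rng.uniform(1.05, 1.15), 2)
        if perturbed != original:  # only label cells that really changed
            polluted[i][column] = perturbed
            labels.add((i, column))
    return polluted, labels

def detection_rate(labels, detected):
    """Fraction of injected errors that the detecting team found."""
    if not labels:
        return 0.0
    return len(labels & detected) / len(labels)

# Phase 1: pollute a toy data set and keep the labels.
clean = [{"id": i, "price": 10.0 + i} for i in range(20)]
polluted, labels = pollute(clean, "price", error_rate=0.1)

# Phase 2: stand-in for another team's detector output (assumed, not real results).
detected = {cell for cell in labels if cell[0] % 2 == 0}
print(f"Errors found: {detection_rate(labels, detected):.0%}")
```

Storing the labels as a set of cells keeps the scoring independent of how the detecting team found the errors; only the overlap with the injected cells matters.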

Prerequisites

For this seminar, participants need to be able to program fluently in Python and know how to use Jupyter notebooks.

Organization

The organizational details for this seminar are as follows:

Project seminar for master's students
Language of instruction: English
6 credit points, 4 SWS
At most 10 participants (ideally, 5 teams of 2 students each)

Time Table

Our meetings are currently scheduled for TBD. In our first meeting, we will discuss possible alternative times that are suitable for everyone.

The following timetable lists the main semester milestones and is still tentative.

Date	Time	Room	Topic	Slides
09.04.2025	13:30 - 15:00	F-E.06	Introduction	Slides
16.04.2025	13:30 - 15:00	F-E.06	Group allocation and data error selection	Slides
23.04.2025	13:30 - 15:00	F-E.06	1-on-1 meeting with each group for questions and guidance
30.04.2025	13:30 - 15:00	F-E.06	1-on-1 meeting with each group for questions and guidance
07.05.2025	13:30 - 15:00	F-E.06	1-on-1 meeting with each group for questions and guidance
14.05.2025	13:30 - 15:00	F-E.06	Mid-term presentation (Ideas & Approach)
21.05.2025	14:00 - 15:00	F-E.06	General Discussion
28.05.2025	13:30 - 15:00	F-E.06	Meeting and progress reports	Slides
04.06.2025	13:30 - 15:00	F-E.06	Meeting and progress reports	Slides
11.06.2025	13:30 - 15:00	F-E.06	Meeting and progress reports	Slides
18.06.2025	13:30 - 15:00	F-E.06	Meeting and progress reports	Slides
25.06.2025	13:30 - 15:00	F-E.06	No meeting
02.07.2025	13:30 - 15:00	F-E.06	1-on-1 meeting with each group for questions and guidance	Slides
09.07.2025	13:00 - 15:00	F-E.06	End-term presentation
16.07.2025	14:00 - 15:00	F-E.06	Feedback and award ceremony	Slides
08.08.2025	13:30 - 15:00	TBD	Final submission

Literature

For an introduction to data errors and their various types, you can start by reading the following literature, which you can find on dblp or Google Scholar:

Data errors literature

João Marcelo Borovina Josko, Marcio Katsumi Oikawa, and João Eduardo Ferreira, "A Formal Taxonomy to Improve Data Defect Description". DASFAA Workshops, 2016: 307-320.
Paulo Oliveira, Fátima Rodrigues, and Pedro Henriques, "A Formal Definition of Data Quality Problems". ICIQ, 2005.
Theresia Gschwandtner, Johannes Gärtner, Wolfgang Aigner, and Silvia Miksch, "A Taxonomy of Dirty Time-Oriented Data". CD-ARES, 2012: 58-72.

Grading

The final grade is weighted with 6 credit points (LP) and is composed as follows:

(15%) Active participation in meetings and discussions
(15%) Technical presentation of DQ dimension (existing research plus own idea)
(20%) Mid-term and end-term presentations
(20%) Quality of implementation and results
(30%) Final paper-style submission