The Error Games: A Data Quality Challenge
Prof. Dr. Felix Naumann, Dr. Lisa Ehrlinger and Divya Bhadauria
Description
Error-free data sets are the foundation for the successful training of machine learning (ML) models and the use of artificial intelligence (AI). Much research has been done on classifying, describing, detecting, and cleaning data errors. The systematic generation of data errors, i.e., the pollution of data sets with known errors, is important to benchmark error detection and data cleaning algorithms. The aim of this seminar is to develop novel ideas for generating data errors that are as difficult as possible to detect.
This seminar is an adversarial challenge in which teams of two students compete against each other. Each team needs to (1) generate data errors that are as hard as possible to detect and then (2) detect the difficult data errors generated by the other teams. We will first introduce the field of data quality, followed by a list of the most common types of data errors, various techniques for data pollution, such as data synthesis and perturbation, and techniques for error detection, such as statistical and ML methods. Together, we will select a number of interesting data errors to focus on in this competition.
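To make the idea of a "hard to detect" error concrete, here is a minimal sketch of one simple statistical detector, the modified z-score based on the median absolute deviation. The function name `mad_outliers`, the threshold of 3.5, and the sample values are assumptions chosen for illustration; they are not part of the seminar material.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All deviations collapse to zero, so the score is undefined; flag nothing.
        return []
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical "age" column with two injected errors.
ages = [34, 29, 41, 38, 27, 45, 33, 36, 30, 39]
polluted = list(ages)
polluted[2] = 410  # blatant error: an extra digit
polluted[7] = 63   # subtle error: swapped digits (36 -> 63), still a plausible age

flagged = mad_outliers(polluted)
print(flagged)
```

In this sketch only the blatant value is flagged. The swapped-digit value stays within the plausible range of the column, which is the kind of realistic error that a purely statistical check tends to miss.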
What are the goals of the seminar?
- Learn about the research area of data quality and data errors
- Read and understand scientific papers
- Develop novel ideas on how to generate and detect data errors
- Jointly discuss properties of “hard to detect” data errors (e.g., whether this correlates with the extent to which an error appears to be realistic)
- Present results in written and oral form
The data error seminar will be organized in the following four phases:
- Kickoff: We will provide clean data sets and jointly select the data error types that we will focus on for the competition.
- Research: Each team will select a subset of the defined data error types, read related work, and prepare a presentation for the entire group that provides examples and describes what does and does not constitute this kind of error.
- Data error challenge: The challenge itself will be carried out in two phases:
  - Phase 1: Each team will pollute the input data sets with errors that appear realistic and are as difficult to detect as possible. For each input data set, the team should produce (1) a polluted data set and (2) a set of labels marking the injected errors.
  - Phase 2: Each team will receive the polluted data sets from the other teams. The goal is to detect as many data errors as possible.
- Deliverable: At the end of the seminar, each team will prepare a presentation about (1) the data error generation strategy they used, (2) the error detection technique they used, and (3) the percentage of their errors that were found by the other teams.
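The two challenge phases and the reported percentage can be sketched as follows. This is only an illustration under stated assumptions: the function names `pollute` and `detection_rate`, the cell-level label format `(row_index, column)`, the adjacent-character swap as error type, and the sample rows are all hypothetical. The actual error types and label format will be agreed on in the seminar.

```python
import random


def pollute(rows, column, error_rate, seed=0):
    """Swap two adjacent characters in a share of the cells of one string column.

    Returns (polluted_rows, labels), where labels is a set of
    (row_index, column) pairs marking the cells that were changed.
    """
    rng = random.Random(seed)
    polluted = [dict(row) for row in rows]  # copy so the clean input stays intact
    n_errors = round(error_rate * len(rows))
    labels = set()
    for i in rng.sample(range(len(rows)), n_errors):
        value = polluted[i][column]
        if len(value) < 2:
            continue  # nothing to swap
        pos = rng.randrange(len(value) - 1)
        swapped = value[:pos] + value[pos + 1] + value[pos] + value[pos + 2:]
        if swapped != value:  # label only cells that actually changed
            polluted[i][column] = swapped
            labels.add((i, column))
    return polluted, labels


def detection_rate(labels, detected):
    """Share of labelled errors that appear among the detected cells."""
    if not labels:
        return 0.0
    return len(labels & set(detected)) / len(labels)


# Hypothetical clean input.
clean = [
    {"city": "Potsdam", "zip": "14482"},
    {"city": "Berlin", "zip": "10115"},
    {"city": "Hamburg", "zip": "20095"},
    {"city": "Munich", "zip": "80331"},
    {"city": "Cologne", "zip": "50667"},
]
polluted, labels = pollute(clean, "zip", error_rate=0.4)

# Pretend another team reported these cells as erroneous.
detected = {(0, "zip"), (3, "zip")}
print(f"{100 * detection_rate(labels, detected):.0f}% of the injected errors were found")
```

Keeping the labels at cell level makes it straightforward to compare them with whatever the detecting team reports. A real submission would likely also track false positives, which this sketch ignores.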
Prerequisites
For this seminar, participants need to be able to program fluently in Python and know how to use Jupyter notebooks.
Organization
The organizational details for this seminar are as follows:
- Project seminar for master students
- Language of instruction: English
- 6 credit points, 4 SWS
- At most 10 participants (ideally, 5 teams of 2 students each)
Time Table
Our meetings are currently scheduled for TBD. In our first meeting, we will discuss possible alternative times that are suitable for everyone.
The following timetable lists the main semester milestones and is still tentative.
| Date | Time | Room | Topic | Slides |
| 09.04.2025 | 13:30 - 15:00 | F-E.06 | Introduction | Slides |
| 16.04.2025 | 13:30 - 15:00 | F-E.06 | Group allocation and data error selection | Slides |
| 23.04.2025 | 13:30 - 15:00 | F-E.06 | 1-on-1 meeting with each group for questions and guidance | |
| 30.04.2025 | 13:30 - 15:00 | F-E.06 | 1-on-1 meeting with each group for questions and guidance | |
| 07.05.2025 | 13:30 - 15:00 | F-E.06 | 1-on-1 meeting with each group for questions and guidance | |
| 14.05.2025 | 13:30 - 15:00 | F-E.06 | Mid-term presentation (Ideas & Approach) | |
| 21.05.2025 | 14:00 - 15:00 | F-E.06 | General Discussion | |
| 28.05.2025 | 13:30 - 15:00 | F-E.06 | Meeting and progress reports | Slides |
| 04.06.2025 | 13:30 - 15:00 | F-E.06 | Meeting and progress reports | Slides |
| 11.06.2025 | 13:30 - 15:00 | F-E.06 | Meeting and progress reports | Slides |
| 18.06.2025 | 13:30 - 15:00 | F-E.06 | Meeting and progress reports | Slides |
| 25.06.2025 | 13:30 - 15:00 | F-E.06 | No meeting | |
| 02.07.2025 | 13:30 - 15:00 | F-E.06 | 1-on-1 meeting with each group for questions and guidance | Slides |
| 09.07.2025 | 13:00 - 15:00 | F-E.06 | End-term presentation | |
| 16.07.2025 | 14:00 - 15:00 | F-E.06 | Feedback and award ceremony | Slides |
| 08.08.2025 | 13:30 - 15:00 | TBD | Final submission | |
Literature
For an introduction to data errors and their various types, you can start by reading the following literature, which you can find on dblp or Google Scholar:
Data errors literature
- João Marcelo Borovina Josko, Marcio Katsumi Oikawa, and João Eduardo Ferreira, "A Formal Taxonomy to Improve Data Defect Description". DASFAA Workshops, 2016: 307-320.
- Paulo Oliveira, Fátima Rodrigues, and Pedro Henriques, "A Formal Definition of Data Quality Problems". ICIQ, 2005.
- Theresia Gschwandtner, Johannes Gärtner, Wolfgang Aigner, and Silvia Miksch, "A Taxonomy of Dirty Time-Oriented Data". CD-ARES, 2012: 58-72.
Grading
The final grade is weighted by 6 credit points (LP) and is composed as follows:
- (15%) Active participation in meetings and discussions
- (15%) Technical presentation of DQ dimension (existing research plus own idea)
- (20%) Mid- and End-term presentation
- (20%) Quality of implementation and results
- (30%) Final paper-style submission