The Error Games: A Data Quality Challenge

Prof. Dr. Felix Naumann, Dr. Lisa Ehrlinger and Divya Bhadauria

Description

Error-free data sets are the foundation for the successful training of machine learning (ML) models and the use of artificial intelligence (AI). Much research has been done on classifying, describing, detecting, and cleaning data errors. The systematic generation of data errors, i.e., the pollution of data sets with known errors, is important to benchmark error detection and data cleaning algorithms. The aim of this seminar is to develop novel ideas for generating data errors that are as difficult as possible to detect. 

This seminar constitutes an adversarial challenge in which teams of two students compete against each other. Each team first (1) generates data errors that are as hard as possible to detect and then (2) detects difficult data errors generated by other teams. We will initially introduce the field of data quality, followed by a list of the most common types of data errors, various techniques for polluting data, like data synthesis and perturbation, and techniques for error detection, like statistical and ML methods. Together, we will select a number of interesting data errors to focus on in this competition.
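To give a flavor of the statistical side of error detection, the following is a minimal sketch of one common technique, the modified z-score based on the median absolute deviation (MAD). The function name `mad_outliers`, the sample values, and the threshold of 3.5 are illustrative assumptions and not part of the seminar material. Such a detector only catches gross numeric outliers, which is exactly the kind of error a polluting team would try to avoid producing.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds `threshold`.

    Hypothetical helper for illustration only. The modified z-score is
    0.6745 * |x - median| / MAD, where MAD is the median absolute deviation.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half of the values are identical; the score is undefined,
        # so this sketch flags nothing.
        return set()
    return {
        i
        for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    }


# Made-up temperature readings with one obviously misplaced decimal point.
readings = [21.0, 22.5, 20.8, 21.7, 22.1, 210.0, 21.3]
suspicious = mad_outliers(readings)
print(suspicious)
```

The median and MAD are used instead of mean and standard deviation because a single extreme value barely shifts them, so the outlier cannot mask itself.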

What are the goals of the seminar?

  • Learn about the research area of data quality and data errors
  • Read and understand scientific papers
  • Develop novel ideas on how to generate and detect data errors
  • Jointly discuss properties of “hard to detect” data errors (e.g., whether this correlates with the extent to which an error appears to be realistic)
  • Present results in written and oral form

The data error seminar will be organized in the following four phases:

  • Kickoff: We will provide clean data sets and jointly select the data error types that we will focus on for the competition. 
  • Research: Each team will select a subset of the defined data error types, read related work, and prepare a presentation for the entire group that provides examples and describes what does and does not constitute this kind of error.
  • Data error challenge: The challenge itself will be carried out in two phases:
    • Phase 1: Each team will pollute the input data sets with errors that appear to be realistic and are as difficult to detect as possible. As a result, for each input data set, the team should produce (1) a polluted data set and (2) a set of labels identifying the injected errors.
    • Phase 2: Each team will receive the polluted data sets from the other teams. The goal is to detect as many data errors as possible.
  • Deliverable: At the end of the seminar, each team will prepare a presentation about (1) the data error generation strategy they used, (2) the error detection technique they used, and (3) the percentage of their errors found by the respective other group.
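The challenge workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table layout (a list of dictionaries), the function names `pollute` and `percentage_found`, the choice of adjacent-character swaps as the error type, and the label format of (row index, column) pairs are all hypothetical and not prescribed by the seminar. The score computed here is the percentage of injected errors that were found; it does not penalize false alarms.

```python
import random


def pollute(rows, column, error_rate, seed=0):
    """Inject typos into `column` and return (polluted_rows, labels).

    Hypothetical helper: swaps two adjacent, different characters in a random
    sample of cells. `labels` is a set of (row_index, column) pairs marking
    exactly the cells that were changed.
    """
    rng = random.Random(seed)  # fixed seed keeps the pollution reproducible
    polluted = [dict(row) for row in rows]  # leave the clean input untouched
    labels = set()
    n_errors = round(error_rate * len(rows))
    for i in rng.sample(range(len(rows)), n_errors):
        value = polluted[i][column]
        # Only positions where the swap actually changes the string.
        positions = [p for p in range(len(value) - 1) if value[p] != value[p + 1]]
        if not positions:
            continue  # cell too short or uniform; skip instead of mislabeling
        p = rng.choice(positions)
        polluted[i][column] = value[:p] + value[p + 1] + value[p] + value[p + 2:]
        labels.add((i, column))
    return polluted, labels


def percentage_found(labels, detected):
    """Percentage of labeled errors that appear in the detected set."""
    if not labels:
        return 0.0
    return 100.0 * len(labels & detected) / len(labels)


# Made-up clean input for Phase 1.
cities = ["Berlin", "Potsdam", "Hamburg", "Munich", "Cologne",
          "Dresden", "Leipzig", "Bremen", "Stuttgart", "Hanover"]
clean = [{"id": str(i), "city": name} for i, name in enumerate(cities)]
polluted, labels = pollute(clean, "city", error_rate=0.3, seed=42)

# Phase 2 stand-in: pretend the other team reported these cells.
detected = set(list(labels)[:2]) | {(99, "city")}
print(f"{percentage_found(labels, detected):.1f}% of injected errors found")
```

Keeping the labels as a separate set, rather than marking cells inside the polluted data, means the polluted data set can be handed to the other teams without revealing where the errors are.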

Prerequisites

For this seminar, participants need to be able to program fluently in Python and know how to use Jupyter notebooks.

Organization

The organizational details for this seminar are as follows:

  • Project seminar for master's students
  • Language of instruction: English
  • 6 credit points, 4 SWS
  • At most 10 participants (ideally, 5 teams of 2 students each)

Time Table

Our meetings are currently scheduled for TBD. In our first meeting, we will discuss possible alternative times that are suitable for everyone.

The following timetable lists the main semester milestones and is still tentative.

Date Time Room Topic Slides

09.04.2025 13:30 - 15:00 F-E.06 Introduction Slides
16.04.2025 13:30 - 15:00 F-E.06 Group allocation and data error selection Slides
23.04.2025 13:30 - 15:00 F-E.06 1-on-1 meeting with each group for questions and guidance  
30.04.2025 13:30 - 15:00 F-E.06 1-on-1 meeting with each group for questions and guidance  
07.05.2025 13:30 - 15:00 F-E.06 1-on-1 meeting with each group for questions and guidance  
14.05.2025 13:30 - 15:00 F-E.06 Mid-term presentation (Ideas & Approach)  
21.05.2025 14:00 - 15:00 F-E.06 General Discussion  
28.05.2025 13:30 - 15:00 F-E.06 Meeting and progress reports Slides
04.06.2025 13:30 - 15:00 F-E.06 Meeting and progress reports Slides
11.06.2025 13:30 - 15:00 F-E.06 Meeting and progress reports Slides
18.06.2025 13:30 - 15:00 F-E.06 Meeting and progress reports Slides
25.06.2025 13:30 - 15:00 F-E.06 No meeting  
02.07.2025 13:30 - 15:00 F-E.06 1-on-1 meeting with each group for questions and guidance Slides
09.07.2025 13:00 - 15:00 F-E.06 End-term presentation  
16.07.2025 14:00 - 15:00 F-E.06 Feedback and award ceremony Slides
08.08.2025 13:30 - 15:00 TBD Final submission   

 

 

Literature

For an introduction to data errors and their various types, you can start by reading the following literature, which you can find on dblp or Google Scholar:

Data errors literature

 

Grading

The final grade is weighted with 6 credit points (LP) and considers the following:

  • (15%) Active participation in meetings and discussions
  • (15%) Technical presentation of a data quality (DQ) dimension (existing research plus own idea)
  • (20%) Mid- and End-term presentation
  • (20%) Quality of implementation and results
  • (30%) Final paper-style submission