DQ4AI: Data Quality Assessment

Dr. Lisa Ehrlinger and Sedir Mohammed

Description

Many AI systems depend on large quantities of suitable training data. This dependency creates challenges not only concerning the availability of data but also regarding its quality. For example, incomplete, erroneous, inappropriate, or asymmetric training data leads to unreliable models and can ultimately result in poor decisions, which is often referred to as "garbage in, garbage out" (GIGO). GIGO expresses the idea that, in computing and other spheres, incorrect or poor-quality input will always produce faulty output [1]. High-performance AI applications therefore require high-quality training and test data.

Research has divided data quality (DQ) into various dimensions, such as accuracy, consistency, and reputation. Despite extensive research into DQ dimensions over the last decades, there is no consensus on how these dimensions should be measured.
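To illustrate what measuring a dimension can mean, here is a minimal sketch of one common way to quantify completeness: the share of values in a column that are not missing. The function name `completeness`, the choice of missing-value markers, and the convention for empty columns are assumptions made for this example; other definitions exist, which is exactly the lack of consensus mentioned above.

```python
from typing import Any, Iterable, Tuple


def completeness(values: Iterable[Any], missing: Tuple[Any, ...] = (None, "")) -> float:
    """Return the fraction of values that are not considered missing.

    Which markers count as missing (here: None and the empty string)
    is an assumption of this sketch and depends on the data set.
    """
    values = list(values)
    if not values:
        # Convention chosen for this sketch: an empty column is vacuously complete.
        return 1.0
    present = sum(1 for v in values if v not in missing)
    return present / len(values)


# Hypothetical example column with two missing entries out of four.
ages = [34, None, 27, ""]
print(completeness(ages))
```

A different team could reasonably treat placeholder values such as "N/A" or -1 as missing as well, which would change the score for the same data.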

What is the goal of the seminar?

In this seminar, we will (1) introduce you to the field of data quality, (2) jointly investigate methods to measure different DQ dimensions, and (3) develop a framework that implements these methods (e.g., metrics, algorithms) in practice.

To achieve that, we have the following plan:

  • Kickoff Phase: We will present an overview of the state of the art of DQ assessment and existing DQ dimensions. The entire group will discuss the envisioned architecture of the framework. Following this, each team (ideally 2 students) can select a DQ dimension (e.g., accuracy, completeness, representativity, relevancy) for implementation within the framework.
  • Research: Each team will read related work on their selected DQ dimension and prepare a presentation for the entire group. The team should also propose their idea on how to implement the selected DQ dimension in the targeted framework.
  • Deliverable: The teams will collaboratively write a paper-style technical report to present their developed DQ assessment framework, the selected DQ dimensions, and an experimental evaluation. The code of the framework itself and the experimental evaluation need to be provided as well.
  • Bonus: You will learn how to read and write scientific research papers.
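As a rough idea of how per-team DQ dimensions could plug into a shared framework, consider the following sketch. It is purely illustrative: the actual architecture will be discussed in the kickoff phase, and all names here (`DQDimension`, `Completeness`, `Uniqueness`, `assess`) are hypothetical. Uniqueness is used only as a second example dimension.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

Record = Dict[str, Any]


class DQDimension(ABC):
    """Hypothetical common interface that each team's dimension would implement."""

    name: str

    @abstractmethod
    def measure(self, records: List[Record]) -> float:
        """Return a score in [0, 1], where 1 means best quality."""


class Completeness(DQDimension):
    name = "completeness"

    def measure(self, records: List[Record]) -> float:
        # Share of cells that are not None, over all records and attributes.
        cells = [value for record in records for value in record.values()]
        if not cells:
            return 1.0
        return sum(value is not None for value in cells) / len(cells)


class Uniqueness(DQDimension):
    name = "uniqueness"

    def measure(self, records: List[Record]) -> float:
        # Share of distinct records; assumes all attribute values are hashable.
        if not records:
            return 1.0
        distinct = {tuple(sorted(record.items())) for record in records}
        return len(distinct) / len(records)


def assess(records: List[Record], dimensions: List[DQDimension]) -> Dict[str, float]:
    """Run every registered dimension and collect the scores by name."""
    return {dim.name: dim.measure(records) for dim in dimensions}


# Hypothetical toy data set: one duplicate record and two missing values.
data = [
    {"id": 1, "city": None},
    {"id": 1, "city": None},
    {"id": 2, "city": "Potsdam"},
]
report = assess(data, [Completeness(), Uniqueness()])
print(report)
```

With a shared interface like this, each team could add its dimension as one class without touching the others' code.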

Prerequisites

For this seminar, participants need to be able to program fluently in Python and know how to use Jupyter notebooks.

Organization

The organizational details for this seminar are as follows:

  • Project seminar for master's students
  • Language of instruction: English
  • 6 credit points, 4 SWS
  • At most 10 participants (ideally, 5 teams of 2 students each)

Time Table

Our meetings are currently scheduled for Mondays from 17.00 to 18.30 in Campus II, Building F, Room F-2.11. In our first meeting, we will discuss possible alternative times that are suitable for everyone.

The following timetable lists the main semester milestones and is still tentative.

Date        Time         Room    Topic                               Slides
14.10.2024  17.00-18.30  F-2.11  Introduction                        Slides
21.10.2024  17.00-18.30  F-2.11  Group allocation and topic selection
22.10.2024  15.15-16.45  F-E.06  "How to read a research paper"      Slides
04.11.2024  17.00-18.30  F-2.11  Report on progress and questions
18.11.2024  17.00-18.30  F-2.11  Report on progress and questions
25.11.2024  17.00-18.30  F-2.11  Report on progress and questions
02.12.2024  17.00-18.30  F-2.10  Mid-term presentation
09.12.2024  17.00-18.30  F-2.11  Report on progress and questions
16.12.2024  17.00-18.30  F-2.11  Report on progress and questions
06.01.2025  17.00-18.30  F-2.11  Report on progress and questions
20.01.2025  17.00-18.30  F-2.11  Report on progress and questions
14.02.2025  13.30-14.30  F-2.10  End-term presentation
01.03.2025  23.59                Final submission

Literature

For an introduction to data quality and its various dimensions, you can start by reading the following literature, which you can find on dblp or Google Scholar:

How to read a paper

Data quality literature

Grading

The final grade counts for 6 credit points (LP) and is composed of the following:

  • (15%) Active participation in meetings and discussions
  • (15%) Technical presentation of DQ dimension (existing research plus own idea)
  • (20%) Mid-term and end-term presentations
  • (20%) Quality of implementation and results
  • (30%) Final paper-style submission
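The weighting above amounts to a weighted average of the five components. The following sketch shows the arithmetic only; the component keys, the helper `final_score`, and the example scores are hypothetical, and it assumes that all components are scored on the same scale.

```python
from typing import Dict

# Weights in percent, taken from the grading list above.
WEIGHTS: Dict[str, int] = {
    "participation": 15,
    "technical_presentation": 15,
    "mid_and_end_term_presentation": 20,
    "implementation_and_results": 20,
    "final_report": 30,
}


def final_score(scores: Dict[str, float]) -> float:
    """Weighted average of component scores (all on the same, assumed scale)."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must contain exactly the graded components")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical component scores, for illustration only.
example = {
    "participation": 1.0,
    "technical_presentation": 2.0,
    "mid_and_end_term_presentation": 2.0,
    "implementation_and_results": 1.0,
    "final_report": 2.0,
}
print(final_score(example))
```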