Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

DQ4AI: Data Quality Assessment

Description

Many AI systems depend on large quantities of suitable training data. This dependency creates challenges not only concerning the availability of data but also regarding its quality. For example, incomplete, erroneous, inappropriate, or asymmetric training data leads to unreliable models and can ultimately lead to poor decisions, a phenomenon often referred to as "garbage in, garbage out" (GIGO). GIGO expresses the idea that, in computing and other spheres, incorrect or poor-quality input will always produce faulty output¹. High-performance AI applications therefore require high-quality training and test data.

Research has divided data quality (DQ) into various dimensions, such as accuracy, consistency, and reputation. Despite extensive research into DQ dimensions over the last decades, there is no consensus on how these dimensions should be measured.

What is the goal of the seminar?

In this seminar, we will (1) introduce you to the field of data quality, (2) jointly investigate methods to measure different DQ dimensions, and (3) develop a framework that implements these methods (e.g., metrics, algorithms) in practice.

To achieve that, we have the following plan:

  • Kickoff Phase: We will present an overview of the state of the art of DQ assessment and existing DQ dimensions. The entire group will discuss the envisioned architecture of the framework. Following this, each team (ideally two students) can select a DQ dimension (e.g., accuracy, completeness, representativity, relevancy) for implementation within the framework.
  • Research: Each team will read related work on their selected DQ dimension and prepare a presentation for the entire group. The team should also propose an idea for how to implement the selected DQ dimension in the targeted framework.
  • Deliverable: The teams will collaboratively write a paper-style technical report to present their developed DQ assessment framework, the selected DQ dimensions, and an experimental evaluation. The code of the framework itself and the experimental evaluation need to be provided as well.
  • Bonus: You will learn how to read and write scientific research papers.
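
To make the idea of "one DQ dimension per team, all plugged into one framework" more concrete, the following is a minimal sketch in Python. It is not the seminar's actual architecture, which will be decided in the kickoff phase. The names `DQMetric`, `Completeness`, and `assess` are hypothetical, and completeness is defined here as the fraction of non-missing cells, which is only one of several possible definitions.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional

# Assumption: a table is a list of rows, and None marks a missing value.
Row = Dict[str, Optional[Any]]


class DQMetric(ABC):
    """Hypothetical base class: one subclass per DQ dimension."""

    name: str

    @abstractmethod
    def score(self, rows: List[Row]) -> float:
        """Return a score in [0, 1]; higher means better quality."""


class Completeness(DQMetric):
    """Fraction of cells that are not missing (one possible definition)."""

    name = "completeness"

    def score(self, rows: List[Row]) -> float:
        cells = [value for row in rows for value in row.values()]
        if not cells:
            # Convention chosen for this sketch: an empty table has no missing cells.
            return 1.0
        return sum(value is not None for value in cells) / len(cells)


def assess(rows: List[Row], metrics: List[DQMetric]) -> Dict[str, float]:
    """Run every given metric and collect the scores by dimension name."""
    return {metric.name: metric.score(rows) for metric in metrics}


# Illustrative data only: 8 cells, 2 of them missing.
sample = [
    {"age": 34, "city": "Potsdam"},
    {"age": None, "city": "Berlin"},
    {"age": 51, "city": None},
    {"age": 28, "city": "Potsdam"},
]
report = assess(sample, [Completeness()])
print(report)  # {'completeness': 0.75}
```

With a shared interface like this, each team could add its dimension as a further subclass without touching the others; whether the real framework follows this design is open.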

Prerequisites

For this seminar, participants need to be able to program fluently in Python and know how to use Jupyter notebooks.

Organization

The organizational details for this seminar are as follows:

  • Project seminar for master's students
  • Language of instruction: English
  • 6 credit points, 4 SWS
  • At most 8 participants (ideally, 4 teams of 2 students each)

    Time Table

    Our bi-weekly meetings are currently scheduled for Mondays from 17:00 to 18:30 on Campus II, Building F, Room F-2.11. In our first meeting, we will discuss possible alternative times that are suitable for everyone.

    The following timetable lists the main semester milestones and is still tentative:

    Date        Room    Topic                                                     Slides
    14.10.2024  F-2.11  Introduction incl. group allocation and topic selection
    21.10.2024  F-2.11  Info session: "How to read a paper"
    28.10.2024  F-2.11  Discussion of framework
    04.11.2024  F-2.11  Report on progress and questions
    18.11.2024  F-2.11  Mid-term presentation of DQ dimensions
    09.12.2024  F-2.11  Report on progress and questions
    06.01.2025  F-2.11  Report on progress and questions
    13.01.2025  F-2.11  End-term presentation of DQ dimension implementation
    27.01.2025  F-2.11  Final submission

     

     

    Literature

    To get introduced to data quality and its various dimensions, you can start by reading the following literature, which you can find on dblp or Google Scholar:

    • How to read a paper
    • Data quality literature

    Grading

    The final grade is weighted with 6 credit points (LP) and is composed of the following components:

    • (15%) Active participation in meetings and discussions
    • (15%) Technical presentation of DQ dimension (existing research plus own idea)
    • (20%) Mid-term and end-term presentations
    • (20%) Quality of implementation and results
    • (30%) Final paper-style submission
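
As a rough illustration of how the weights above combine, here is a small Python sketch. It assumes that every component is scored on the same numeric scale and that the components are combined linearly; how the individual parts are actually scored and rounded is not specified here. The component names and the example scores are hypothetical.

```python
# Weights taken from the grading list above; the keys are shorthand names.
WEIGHTS = {
    "participation": 0.15,
    "technical_presentation": 0.15,
    "mid_and_end_term_presentations": 0.20,
    "implementation_and_results": 0.20,
    "final_report": 0.30,
}

# Sanity check: the listed percentages add up to 100%.
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9


def weighted_score(scores):
    """Linear combination of component scores (assumed aggregation rule)."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the graded components")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical component scores on a 0-100 scale, for illustration only.
example_scores = {
    "participation": 90,
    "technical_presentation": 80,
    "mid_and_end_term_presentations": 85,
    "implementation_and_results": 70,
    "final_report": 75,
}
total = round(weighted_score(example_scores), 2)
print(total)  # 79.0
```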