Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

DQ4AI: Data Quality Assessment

Description

Many AI systems depend on large quantities of suitable training data. This dependency creates challenges concerning not only the availability of data but also its quality. For example, incomplete, erroneous, inappropriate, or asymmetric training data leads to unreliable models and can ultimately lead to poor decisions, a phenomenon often referred to as "garbage in, garbage out" (GIGO). GIGO expresses the idea that, in computing and other spheres, incorrect or poor-quality input will always produce faulty output [1]. High-performance AI applications therefore require high-quality training and test data.

Research has divided data quality (DQ) into various dimensions, such as accuracy, consistency, and reputation. Despite extensive research into DQ dimensions over the last decades, there is no consensus on how these dimensions should be measured.
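To illustrate why measurement is not settled, consider completeness, a commonly discussed DQ dimension. The following minimal sketch shows two plausible definitions that yield different scores on the same data. The function names, the toy records, and the choice of treating None as "missing" are assumptions made for illustration only; they are not part of the seminar material or of any agreed-upon standard.

```python
from typing import Any, Optional

# A record is a mapping from column name to a value; None marks a missing value.
# (Treating None as "missing" is an assumption of this sketch.)
Record = dict[str, Optional[Any]]


def cell_completeness(records: list[Record], columns: list[str]) -> float:
    """Fraction of all cells (record x column) that are not missing."""
    total = len(records) * len(columns)
    if total == 0:
        return 1.0  # convention chosen here: empty data counts as complete
    filled = sum(1 for r in records for c in columns if r.get(c) is not None)
    return filled / total


def record_completeness(records: list[Record], columns: list[str]) -> float:
    """Fraction of records that have no missing value in any column."""
    if not records:
        return 1.0  # same convention as above
    complete = sum(
        1 for r in records if all(r.get(c) is not None for c in columns)
    )
    return complete / len(records)


# Invented toy data: two of twelve cells are missing, in two different records.
records = [
    {"name": "Ada", "age": 36, "city": "London"},
    {"name": "Bob", "age": None, "city": "Berlin"},
    {"name": "Cem", "age": 41, "city": None},
    {"name": "Dee", "age": 29, "city": "Potsdam"},
]
columns = ["name", "age", "city"]

cell_score = cell_completeness(records, columns)      # 10 of 12 cells filled
record_score = record_completeness(records, columns)  # 2 of 4 records complete

print(f"cell-level completeness:   {cell_score:.2f}")
print(f"record-level completeness: {record_score:.2f}")
```

Both functions are defensible readings of "completeness", yet the cell-level score is noticeably higher than the record-level score on this data. Which one is appropriate depends on how the data is used downstream.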

What is the goal of the seminar?

In this seminar, we will (1) introduce you to the field of data quality, (2) jointly investigate methods to measure different DQ dimensions, and (3) develop a framework that implements these methods (e.g., metrics, algorithms) in practice.

To achieve that, we have the following plan:

  • Kickoff Phase: We will present an overview of the state of the art of DQ assessment and existing DQ dimensions. The entire group will discuss the envisioned architecture of the framework. Following this, each team (ideally 2 students) can select a DQ dimension (e.g., accuracy, completeness, representativity, relevancy) for implementation within the framework.
  • Research: Each team will read related work on their selected DQ dimension and prepare a presentation for the entire group. The team should also propose their idea on how to implement the selected DQ dimension in the targeted framework.
  • Deliverable: The teams will collaboratively write a paper-style technical report to present their developed DQ assessment framework, the selected DQ dimensions, and an experimental evaluation. The code of the framework itself and the experimental evaluation need to be provided as well.
  • Bonus: You will learn how to read and write scientific research papers.
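To make the plan above more concrete, the following is a rough sketch of how a framework with one pluggable component per DQ dimension might be organized. The actual architecture will be decided by the group in the kickoff phase; every name here (DQDimension, Accuracy, Framework, assess) is hypothetical, and the accuracy measure shown (share of attribute values that match a trusted reference data set) is only one of several possible definitions.

```python
from abc import ABC, abstractmethod
from typing import Any

Record = dict[str, Any]


class DQDimension(ABC):
    """Hypothetical interface that each team's dimension could implement."""

    name: str

    @abstractmethod
    def assess(self, records: list[Record]) -> float:
        """Return a score in [0, 1]; higher means better quality."""


class Accuracy(DQDimension):
    """Share of attribute values that equal those in a trusted reference."""

    name = "accuracy"

    def __init__(self, reference: list[Record], key: str) -> None:
        self._key = key
        # Index the reference records by their key for fast lookup.
        self._reference = {r[key]: r for r in reference}

    def assess(self, records: list[Record]) -> float:
        checked = 0
        correct = 0
        for rec in records:
            ref = self._reference.get(rec[self._key])
            if ref is None:
                continue  # no reference available; skip this record
            for col, val in rec.items():
                if col == self._key:
                    continue
                checked += 1
                if val == ref.get(col):
                    correct += 1
        # Convention chosen here: nothing to check counts as fully accurate.
        return correct / checked if checked else 1.0


class Framework:
    """Runs all registered dimensions and collects their scores."""

    def __init__(self) -> None:
        self._dimensions: dict[str, DQDimension] = {}

    def register(self, dimension: DQDimension) -> None:
        self._dimensions[dimension.name] = dimension

    def assess(self, records: list[Record]) -> dict[str, float]:
        return {n: d.assess(records) for n, d in self._dimensions.items()}


# Invented toy data: one of four checked values differs from the reference.
reference = [
    {"id": 1, "city": "Potsdam", "zip": "14482"},
    {"id": 2, "city": "Berlin", "zip": "10115"},
]
data = [
    {"id": 1, "city": "Potsdam", "zip": "14482"},
    {"id": 2, "city": "Berlin", "zip": "10117"},
]

framework = Framework()
framework.register(Accuracy(reference, key="id"))
report = framework.assess(data)
print(report)
```

With a shared interface of this kind, each team could add its dimension by implementing one class and registering it, without touching the code of other teams.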

Prerequisites

For this seminar, participants need to be able to program fluently in Python and know how to use Jupyter notebooks.

Organization

The organizational details for this seminar are as follows:

  • Project seminar for master's students
  • Language of instruction: English
  • 6 credit points, 4 SWS
  • At most 10 participants (ideally, 5 teams of 2 students each)

    Timetable

    Our meetings are currently scheduled for Mondays from 17:00 to 18:30 on Campus II, Building F, Room F-2.11. In our first meeting, we will discuss possible alternative times that are suitable for everyone.

    The following timetable lists the main semester milestones and is still tentative.

    Date        Time         Room    Topic                                 Slides
    14.10.2024  17:00-18:30  F-2.11  Introduction                          Slides
    21.10.2024  17:00-18:30  F-2.11  Group allocation and topic selection  Slides
    22.10.2024  15:15-16:45  F-E.06  "how to read a research paper"
    04.11.2024  17:00-18:30  F-2.11  Report on progress and questions
    18.11.2024  17:00-18:30  F-2.11  Report on progress and questions
    25.11.2024  17:00-18:30  F-2.11  Report on progress and questions
    02.12.2024  17:00-18:30  F-2.10  Mid-term presentation
    09.12.2024  17:00-18:30  F-2.11  Report on progress and questions
    16.12.2024  17:00-18:30  F-2.11  Report on progress and questions
    06.01.2025  17:00-18:30  F-2.11  Report on progress and questions
    20.01.2025  17:00-18:30  F-2.11  Report on progress and questions
    03.02.2025  17:00-18:30  F-2.10  End-term presentation
    01.03.2025  23:59                Final submission

     

     

    Literature

    To get introduced to data quality and its various dimensions, you can start by reading the following literature, which you can find on dblp or Google Scholar:

    How to read a paper

    Data quality literature

    Grading

    The final grade counts for 6 credit points (LP) and is composed of the following:

    • (15%) Active participation in meetings and discussions
    • (15%) Technical presentation of DQ dimension (existing research plus own idea)
    • (20%) Mid-term and end-term presentations
    • (20%) Quality of implementation and results
    • (30%) Final paper-style submission