Many AI systems depend on large quantities of suitable training data. This dependency creates challenges not only concerning the availability of data but also regarding its quality. For example, incomplete, erroneous, inappropriate, or asymmetric training data leads to unreliable models and can ultimately result in poor decisions, a phenomenon often referred to as "garbage in, garbage out" (GIGO). GIGO expresses the idea that in computing and other spheres, incorrect or poor-quality input will always produce faulty output [1]. High-performance AI applications therefore require high-quality training and test data.
Researchers have divided data quality (DQ) into various dimensions, such as accuracy, consistency, and reputation. Despite extensive research into DQ dimensions over the past decades, there is no consensus on how these dimensions should be measured.
What is the goal of the seminar?
In this seminar, we will (1) introduce you to the field of data quality, (2) jointly investigate methods to measure different DQ dimensions, and (3) develop a framework that implements these methods (e.g., metrics, algorithms) in practice.
To achieve that, we have the following plan:
- Kickoff Phase: We will present an overview of the state of the art in DQ assessment and of existing DQ dimensions. The entire group will discuss the envisioned architecture of the framework. Following this, each team (ideally two students) selects a DQ dimension (e.g., accuracy, completeness, representativity, relevancy) to implement within the framework.
- Research: Each team will read work related to its selected DQ dimension and prepare a presentation for the entire group. The team should also propose how to implement the selected DQ dimension in the targeted framework.
- Deliverable: The teams will collaboratively write a paper-style technical report that presents the developed DQ assessment framework, the selected DQ dimensions, and an experimental evaluation. The code of the framework and of the experimental evaluation must be submitted as well.
- Bonus: You will learn how to read and write scientific research papers.
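
To give a rough idea of what implementing a single DQ dimension might look like, the following minimal Python sketch computes a simple completeness score: the share of non-missing values across selected fields. It is an illustration only and not the seminar's framework. The function name `completeness`, its parameters `records` and `fields`, and the convention that `None` and empty strings count as missing are all assumptions made for this example; the literature defines completeness in several other ways.

```python
from typing import Any, Mapping, Sequence


def completeness(records: Sequence[Mapping[str, Any]], fields: Sequence[str]) -> float:
    """Return the share of non-missing cells over the given fields.

    Assumption for this sketch: a cell is "missing" if the field is absent
    from the record, or if its value is None or the empty string.
    """
    total = len(records) * len(fields)
    if total == 0:
        # Convention chosen here: an empty dataset is treated as fully complete.
        return 1.0
    present = sum(
        1
        for record in records
        for field in fields
        if record.get(field) not in (None, "")
    )
    return present / total


# Hypothetical toy data: two of the six cells are missing.
rows = [
    {"age": 34, "city": "Berlin"},
    {"age": None, "city": "Potsdam"},
    {"age": 51, "city": ""},
]
score = completeness(rows, ["age", "city"])
print(f"completeness = {score:.2f}")
```

A score in the range 0 to 1 makes results comparable across datasets, but a single number like this hides which fields are affected; a per-field breakdown would be a natural extension.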