Many AI methods depend on large quantities of suitable training data. This creates challenges concerning not only the availability of data but also its quality. For example, incomplete, erroneous, inappropriate, or asymmetric training data leads to unreliable models and can ultimately lead to poor decisions, a problem often summarized as "garbage in, garbage out" (GIGO). GIGO expresses the idea that, in computing and other spheres, incorrect or poor-quality input will always produce faulty output [1]. High-performance AI applications require high-quality training and test data.
Why revisit data quality measures?
The traditional definition of data or information quality includes dimensions such as validity, accuracy, completeness, consistency, and uniformity. However, this long-established definition does not yet take modern AI systems and their requirements into account. Furthermore, little research has examined how the quality of the training and test data explains the behavior of machine learning models.
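To make two of these dimensions concrete, the following minimal sketch measures completeness and validity on a small table of records. The function names, the example records, and the age range used as a validity rule are illustrative assumptions, not definitions used in the seminar.

```python
def completeness(records, fields):
    """Share of cells (record x field) that are present and not None."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    filled = sum(1 for r in records for f in fields if r.get(f) is not None)
    return filled / total


def validity(records, field, is_valid):
    """Share of non-missing values of `field` that satisfy `is_valid`."""
    values = [r[field] for r in records if r.get(field) is not None]
    if not values:
        return 1.0
    return sum(1 for v in values if is_valid(v)) / len(values)


# Hypothetical example data: one missing age, one missing country,
# and one age that violates the assumed range rule.
records = [
    {"age": 34, "country": "DE"},
    {"age": None, "country": "FR"},
    {"age": -5, "country": "DE"},
    {"age": 51},
]

completeness_score = completeness(records, ["age", "country"])
age_validity_score = validity(records, "age", lambda v: 0 <= v <= 120)
```

Scores of this kind are defined independently of any model, which is one reason the traditional dimensions say little by themselves about how an AI model will react to a given defect.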
What is the goal of the seminar?
In this seminar, we will introduce you to the field of data quality and explore together the correlation between data quality and AI model performance. To achieve that, we have the following plan:
- Kickoff Phase: Each team ideally consists of two students and will be assigned a specific task, such as classification or regression. Your part is to choose one or more representative models to solve this task with the respective datasets (see datasets section).
- Research: Each team will explore how reducing the quality of the data along different quality dimensions affects the performance of the chosen models. We will provide you with state-of-the-art papers in the field of data quality for AI. More details about the dimensions and the experimental setup will be provided at the beginning of this phase.
- Deliverable: The outcome of the seminar is a paper-style technical report that the teams will write collaboratively to present the results of the conducted analysis, together with the code, the models, and the clean and polluted datasets that have been produced.
- Bonus: You will learn how to read/write a research paper and how to conduct scientific experiments and present the results in a paper.
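The kind of experiment described in the Research step can be sketched as follows. This is a minimal illustration under stated assumptions, not the seminar's actual setup: the synthetic two-cluster dataset, the choice of label flipping as the pollution method (targeting the accuracy dimension), the 1-nearest-neighbor classifier, and all function names are hypothetical and chosen only to keep the example self-contained.

```python
import random


def make_dataset(n, rng):
    """Synthetic 2-D data: class 0 around (0, 0), class 1 around (4, 4)."""
    data = []
    for i in range(n):
        label = i % 2
        center = 4.0 * label
        point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((point, label))
    return data


def flip_labels(data, rate, rng):
    """Return a polluted copy in which a fraction `rate` of labels is flipped."""
    polluted = list(data)
    n_flips = int(round(rate * len(data)))
    for idx in rng.sample(range(len(data)), n_flips):
        point, label = polluted[idx]
        polluted[idx] = (point, 1 - label)
    return polluted


def predict_1nn(train, point):
    """Predict the label of the closest training point (squared Euclidean distance)."""
    nearest = min(
        train,
        key=lambda row: (row[0][0] - point[0]) ** 2 + (row[0][1] - point[1]) ** 2,
    )
    return nearest[1]


def accuracy(train, test):
    """Fraction of test points that a 1-NN model built on `train` classifies correctly."""
    correct = sum(1 for point, label in test if predict_1nn(train, point) == label)
    return correct / len(test)


def run_experiment(rates=(0.0, 0.2, 0.4), seed=0):
    """Train on increasingly polluted data; always evaluate on the clean test set."""
    rng = random.Random(seed)
    train = make_dataset(200, rng)
    test = make_dataset(200, rng)
    return {rate: accuracy(flip_labels(train, rate, rng), test) for rate in rates}


if __name__ == "__main__":
    for rate, acc in run_experiment().items():
        print(f"label-flip rate {rate:.1f}: test accuracy {acc:.3f}")
```

Keeping the test set clean while polluting only the training set isolates the effect of training-data quality on the model; polluting the test data as well would be a separate experiment.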