Data Quality for AI
Advisors
Prof. Dr. Felix Naumann and Dr. Hazar Harmouch
Description
Many AI methods depend on large quantities of suitable training data. This creates challenges concerning not only the availability of data but also its quality. For example, incomplete, erroneous, inappropriate, or asymmetric training data leads to unreliable models and can ultimately lead to poor decisions. This is often summarized as "garbage in, garbage out" (GIGO): in computing and other spheres, incorrect or poor-quality input produces faulty output. High-performance AI applications require high-quality training and test data.
Why revisit data quality measures?
The traditional definition of data or information quality includes dimensions such as validity, accuracy, completeness, consistency, and uniformity. However, this long-established definition does not yet take modern AI systems and their requirements into account. Furthermore, there is little research on explaining the behavior of machine learning models in terms of the quality of their training and test data.
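As a small illustration of how one of these dimensions can be quantified, the following sketch measures completeness as the fraction of non-missing cells in a table. This is only one common reading of completeness, not the seminar's definition; the function name `completeness`, the example `table`, and the use of `None` as the missing-value marker are assumptions made for this example.

```python
from typing import Optional, Sequence


def completeness(rows: Sequence[Sequence[Optional[object]]]) -> float:
    """Return the fraction of cells that are not None (1.0 for an empty table)."""
    total = sum(len(row) for row in rows)
    if total == 0:
        return 1.0
    present = sum(1 for row in rows for cell in row if cell is not None)
    return present / total


# Hypothetical toy table: (age, city, income); None marks a missing value.
table = [
    (25, "Berlin", 52000.0),
    (None, "Potsdam", 48000.0),
    (31, None, None),
]

print(f"completeness = {completeness(table):.2f}")
```

Other dimensions, such as consistency or accuracy, usually need additional information (constraints or ground truth) and cannot be computed from the table alone.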
What is the goal of the seminar?
In this seminar, we will introduce you to the field of data quality and explore together the correlation between data quality and AI model performance. To achieve that, we have the following plan:
- Kickoff Phase: Each team ideally consists of 2 students and will be assigned a specific task: classification, regression, etc. Your part is to choose one or more representative models to solve this task with the respective datasets (see datasets section).
- Research: Each team will explore how reducing the quality of the data along different quality dimensions affects the performance of the chosen models. We will provide you with state-of-the-art papers in the field of data quality for AI. More details about the dimensions and the experimental setup will be provided at the beginning of this phase.
- Deliverable: The outcome of the seminar is a paper-style technical report that the teams write collaboratively to present the results of their analysis, together with the code, the models, and the clean and polluted datasets that have been produced.
- Bonus: You will learn how to read/write a research paper and how to conduct scientific experiments and present the results in a paper.
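The research phase described above can be pictured with a minimal sketch: take clean data, deliberately pollute it along one quality dimension, retrain a model, and compare its performance on clean test data. The example below is a toy under stated assumptions, not the seminar's experimental setup: the data is synthetic, the pollution is random label flipping, the "model" is a nearest-centroid classifier, and all function names (`make_data`, `flip_labels`, `fit_centroids`, `accuracy`, `run_experiment`) are hypothetical.

```python
import random
from statistics import mean


def make_data(n, rng):
    """Draw n points from two 1-D Gaussian classes centred at -1 and +1."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(2.0 * label - 1.0, 1.0)
        data.append((x, label))
    return data


def flip_labels(data, rate, rng):
    """Pollute the data: flip each binary label with probability `rate`."""
    return [(x, 1 - y if rng.random() < rate else y) for x, y in data]


def fit_centroids(train):
    """Toy nearest-centroid model: the mean feature value of each class."""
    return {c: mean(x for x, y in train if y == c) for c in (0, 1)}


def accuracy(centroids, test):
    """Share of test points whose nearest centroid matches their label."""
    hits = sum(
        1
        for x, y in test
        if min(centroids, key=lambda c: abs(x - centroids[c])) == y
    )
    return hits / len(test)


def run_experiment(rates, seed=0):
    """Train on increasingly polluted labels and evaluate on clean test data."""
    rng = random.Random(seed)
    train, test = make_data(2000, rng), make_data(1000, rng)
    results = {}
    for rate in rates:
        polluted = flip_labels(train, rate, rng)
        results[rate] = accuracy(fit_centroids(polluted), test)
    return results


if __name__ == "__main__":
    for rate, acc in run_experiment([0.0, 0.2, 0.4, 0.6, 0.9]).items():
        print(f"label-flip rate {rate:.1f} -> test accuracy {acc:.3f}")
```

In an actual project, the synthetic data would be replaced by one of the datasets listed below, the toy classifier by the chosen models, and label flipping by pollution along whichever quality dimension is under study.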
Prerequisites
For this seminar, participants need to be able to program fluently in Python and know how to use Jupyter notebooks. The seminar also requires basic knowledge of machine learning algorithms.
Organization
The organizational details for this seminar are as follows:
- Project seminar for master students
- Language of instruction: English
- 6 credit points, 4 SWS
- At most 6 participants (ideally, 3 teams of 2 students each)
- We plan to hold the course on-site. However, we will switch to hybrid or online mode if the regulations change.
Registration
After the introduction to the seminar on 26.10.2021 at 13:30 in HS3, please send an e-mail to hazar.harmouch@hpi.de with the subject: "Registration to Data Quality for AI" by Friday 29.10. We will notify the selected applicants by Monday the 1st of November.
If there are more than six registrations, we may need to select six participants at random. If you would like to join as a team, please mention that in the e-mail. The registered students will receive an e-mail with further details about the seminar. Please register with the Studienreferat after we have confirmed your participation in the seminar.
Time Table
When: Weekly on Tuesday at 13:30
Where: Campus II, Building F, Room 2.10 (instead of E.06).
The following timetable lists the main semester milestones and is still tentative.
| Date | Topic | Slides |
| 26.10.2021 | Introduction (open to all students; only on this date in HS3) | Download |
| 2.11.2021 | Group allocation and technical setup introduction | |
| 9.11.2021 | Basics of literature search and giving technical talks | |
| 14.12.2021 | Technical talk to present a research paper | |
| | Christmas break | |
| 11.01.2022 | Mid-term presentation | |
| 18.01.2022 | Guest talk: JENGA framework by Felix Biessmann and co. | |
| 25.01.2022 | Guest talk: Cedric Renggli | |
| 15.02.2022 | End-term presentation | |
| 11.03.2022 | Final submission | |
Literature
To get an introduction to data quality and a better feeling for the extent to which it affects AI, you can start by reading the following literature, which you can find on dblp or Google Scholar:
Traditional data quality
- R.Y. Wang and D.M. Strong. Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4):5–34, 1996.
- F. Naumann and C. Rolker. Assessment methods for information quality criteria. In Proceedings of the International Conference on Information Quality (ICIQ), 148–162, 2000.
- L.L. Pipino, Y.W. Lee, and R.Y. Wang. Data quality assessment. Communications of the ACM, 45(4):211–218, 2002.
- S. Sadiq and M. Indulska. Open data: Quality over quantity. International Journal of Information Management, 37:150–154, 2017.
Data quality and AI
- T. Makaba and E. Dogo. A Comparison of Strategies for Missing Values in Data on Machine Learning Classification Algorithms. International Multidisciplinary Information Technology and Engineering Conference (IMITEC), 1–7, 2019.
- B. Frenay and M. Verleysen. Classification in the Presence of Label Noise: A Survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845–869, 2014.
- F.R. Cordeiro and G. Carneiro. A Survey on Deep Learning with Noisy Labels: How to train your model when you cannot trust on the annotations? SIBGRAPI Conference on Graphics, Patterns and Images, 2020.
To be continued.
Datasets
Sources for datasets used for AI tasks include but are not limited to the following:
- Kaggle: https://www.kaggle.com/datasets
- OpenML: https://www.openml.org/
- Google Dataset Search: https://datasetsearch.research.google.com/
- UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.php
Grading
The final grade is weighted by 6 LP and considers the following:
- (15%) Active participation in meetings and discussions
- (15%) Technical presentation of a scientific paper
- (20%) Mid-term and end-term presentations
- (20%) Quality of implementation and results
- (30%) Final paper-style submission