Data Quality for AI

Advisors

Prof. Dr. Felix Naumann and Dr. Hazar Harmouch

Description

Many AI methods depend on large quantities of suitable training data. This creates challenges not only concerning the availability of data but also regarding its quality. For example, incomplete, erroneous, inappropriate, or asymmetric training data leads to unreliable models and can ultimately lead to poor decisions, a problem often summarized as "garbage in, garbage out" (GIGO). GIGO expresses the idea that, in computing and other spheres, incorrect or poor-quality input will always produce faulty output¹. High-performance AI applications require high-quality training and test data.

Why revisit data quality measures?

The traditional definition of data or information quality includes dimensions such as validity, accuracy, completeness, consistency, and uniformity. However, this long-established definition does not yet consider modern AI systems and their requirements. Furthermore, there is little research on explaining the behavior of machine learning models in terms of the quality of their training and test data.

What is the goal of the seminar?

In this seminar, we will introduce you to the field of data quality and explore together the correlation between data quality and AI model performance. To achieve that, we have the following plan:

Kickoff Phase: Each team ideally consists of 2 students and will be assigned a specific task (e.g., classification or regression). Your part is to choose one or more representative models to solve this task with the respective datasets (see the Datasets section).
Research: Each team will explore how reducing the quality of the data along different quality dimensions affects the performance of the chosen models. We will provide you with state-of-the-art papers in the field of data quality for AI. More details about the dimensions and the experimental setup will be provided at the beginning of this phase.
Deliverable: The outcome of the seminar is a paper-style technical report that the teams will write collaboratively to present the results of the conducted analysis, in addition to the code, models, and clean/polluted datasets that have been produced.
Bonus: You will learn how to read/write a research paper and how to conduct scientific experiments and present the results in a paper.
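To make the research phase more concrete, the following is a minimal sketch of one possible pollution experiment, not the seminar's actual setup. It assumes synthetic two-class data, label noise as the degraded quality dimension, and a 1-nearest-neighbour classifier as the model; all function names (make_blobs, flip_labels, predict_1nn, run_experiment) are hypothetical and chosen for illustration only.

```python
import random


def make_blobs(n, rng):
    """Generate n labelled 2-D points from two Gaussian blobs (synthetic stand-in data)."""
    data = []
    for i in range(n):
        label = i % 2
        center = 3.0 * label  # class 0 around (0, 0), class 1 around (3, 3)
        data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return data


def flip_labels(data, rate, rng):
    """Return a copy of data in which round(rate * len(data)) binary labels are flipped."""
    n_flip = round(rate * len(data))
    flip_idx = set(rng.sample(range(len(data)), n_flip))
    return [(x, (1 - y) if i in flip_idx else y) for i, (x, y) in enumerate(data)]


def predict_1nn(train, point):
    """Predict the label of the training point closest to `point` (1-nearest neighbour)."""
    nearest = min(
        train,
        key=lambda item: (item[0][0] - point[0]) ** 2 + (item[0][1] - point[1]) ** 2,
    )
    return nearest[1]


def accuracy(train, test):
    """Fraction of test points whose label is predicted correctly."""
    correct = sum(predict_1nn(train, x) == y for x, y in test)
    return correct / len(test)


def run_experiment(rates=(0.0, 0.2, 0.4), seed=0):
    """Train on increasingly polluted labels and evaluate on a clean test set."""
    rng = random.Random(seed)
    train = make_blobs(200, rng)
    test = make_blobs(200, rng)  # the test set is never polluted
    return {rate: accuracy(flip_labels(train, rate, rng), test) for rate in rates}


if __name__ == "__main__":
    for rate, acc in run_experiment().items():
        print(f"label-noise rate {rate:.1f}: test accuracy {acc:.3f}")
```

In this sketch only the training labels are polluted, while the test set stays clean, so any change in accuracy can be attributed to the injected noise. Other quality dimensions (for example, completeness via injected missing values) would need their own pollution function in place of flip_labels.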

Prerequisites

For this seminar, participants need to be able to program fluently in Python and know how to use Jupyter notebooks. The seminar also requires basic knowledge of machine learning algorithms.

Organization

The organizational details for this seminar are as follows:

Project seminar for master students
Language of instruction: English
6 credit points, 4 SWS
At most 6 participants (ideally, 3 teams of 2 students each)
We plan to hold the course on-site. However, we will switch to hybrid/online mode if the regulations change.

Registration

After the introduction to the seminar on 26.10.2021 at 13:30 in HS3, please send an e-mail to hazar.harmouch@hpi.de with the subject "Registration to Data Quality for AI" by Friday, 29.10. We will notify the selected applicants by Monday, the 1st of November.

In case of more than six registrations, we might need to choose up to six participants randomly. If you would like to join as a team, you can also mention that in the e-mail. The registered students will receive an e-mail with further details about the seminar. Please register with the Studienreferat after we have confirmed your seminar participation.

Time Table

When: Weekly on Tuesday at 13:30

Where: Campus II, Building F, Room 2.10 (instead of E.06).

The following timetable lists the main semester milestones and is still tentative:

Date	Topic	Slides
26.10.2021	Introduction (open to all students; on this date only, in HS3)	Download
2.11.2021	Group allocation and technical setup introduction
9.11.2021	Basics of literature search and giving technical talks
14.12.2021	Technical talk to present a research paper
	Christmas break
11.01.2022	Mid-term presentation
18.01.2022	Guest talk: JENGA framework by Felix Biessmann and co.
25.01.2022	Guest talk: Cedric Renggli
15.02.2022	End-term presentation
11.03.2022	Final submission

Literature

To get introduced to data quality and to get a better feeling for the extent to which data quality affects AI, you can start by reading the following literature, which you can find on dblp or Google Scholar:

Traditional data quality

R.Y. Wang and D.M. Strong. Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4):5–34, 1996.
F. Naumann and C. Rolker. Assessment methods for information quality criteria. In Proceedings of the International Conference on Information Quality (ICIQ), 148–162, 2000.
L.L. Pipino, Y.W. Lee, and R.Y. Wang. Data quality assessment. Communications of the ACM, 45(4):211–218, 2002.
S. Sadiq and M. Indulska. Open data: Quality over quantity. International Journal of Information Management, 37:150–154, 2017.

Data quality and AI

T. Makaba and E. Dogo. A Comparison of Strategies for Missing Values in Data on Machine Learning Classification Algorithms. International Multidisciplinary Information Technology and Engineering Conference (IMITEC), 1–7, 2019.
B. Frénay and M. Verleysen. Classification in the Presence of Label Noise: A Survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845–869, 2014.
F.R. Cordeiro and G. Carneiro. A Survey on Deep Learning with Noisy Labels: How to train your model when you cannot trust on the annotations? SIBGRAPI Conference on Graphics, Patterns and Images, 2020.

To be continued.

Datasets

Sources for datasets used for AI tasks include but are not limited to the following:

Kaggle: https://www.kaggle.com/datasets
OpenML: https://www.openml.org/
Google Dataset Search: https://datasetsearch.research.google.com/
UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.php

Grading

The final grade is weighted with 6 credit points (LP) and considers the following:

(15%) Active participation in meetings and discussions
(15%) Technical presentation of a scientific paper
(20%) Mid-term and end-term presentations
(20%) Quality of implementation and results
(30%) Final paper-style submission