Many AI systems depend on large quantities of suitable training data. This dependency creates challenges not only concerning the availability of data but also regarding its quality. Incomplete, erroneous, inappropriate, or biased training data leads to unreliable models and can ultimately lead to poor decisions, a phenomenon often referred to as "garbage in, garbage out" (GIGO): in computing and other spheres, incorrect or poor-quality input will always produce faulty output. High-performance AI applications therefore require high-quality training and test data.
This data could include personal information, sensitive financial details, and confidential business data. Privacy, however, is a fundamental human right, and protecting personal information is essential to maintain trust and a fair and just society. One common approach to addressing these concerns is to use anonymized data in machine learning algorithms. Differential privacy and k-anonymity are the most widely used families of anonymization techniques, yet there is little substantial research demonstrating the effect of anonymization on data quality and thus on the downstream ML application.
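To make the differential privacy idea above concrete, here is a minimal sketch of the Laplace mechanism, one standard way to release a statistic with epsilon-differential privacy. The function name, the example data, and the clipping range [0, 100] are illustrative assumptions, not part of the seminar material:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release a noisy statistic satisfying epsilon-differential privacy.

    Noise is drawn from Laplace(0, sensitivity / epsilon): a smaller
    epsilon means stronger privacy and therefore more noise.
    """
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(0.0, sensitivity / epsilon)

# Illustrative example: privately release the mean age of a tiny dataset.
ages = np.array([23, 35, 41, 29, 52, 38])
# For a mean over n records with ages clipped to [0, 100],
# one record can shift the mean by at most 100 / n (the sensitivity).
sensitivity = 100 / len(ages)
noisy_mean = laplace_mechanism(ages.mean(), sensitivity, epsilon=1.0)
```

The epsilon parameter is exactly the "degree of anonymization" knob whose effect on model performance the seminar will study.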
What is the goal of the seminar?
In this seminar, we will introduce you to the field of data quality and explore together the impact of anonymization techniques on data quality and AI model performance. To that end, we have the following plan:
- Kickoff Phase: Each team ideally consists of 2 students and will be assigned a specific task, e.g., classification or regression. Your job is to choose one or more representative models (e.g., SVM for classification) to solve this task on the respective datasets (see the datasets section). The datasets must contain protected attributes, such as age, that we will try to anonymize.
- Research: Each team will explore the effect of data anonymization on data quality with respect to the well-known data quality dimensions. This includes: (1) understanding and implementing the anonymization algorithms assigned to the team; (2) building an ML pipeline that trains the chosen models on anonymized data; and (3) reporting the performance of the chosen models as a function of the degree of anonymization and showing the trade-off. We will provide you with state-of-the-art papers in the fields of data quality, differential privacy, and k-anonymity. More details about the dimensions and the experimental setup will be provided at the beginning of this phase.
- Deliverable: The outcome of the seminar is a paper-style technical report, written collaboratively by the teams, that presents the results of the conducted analysis, together with the code, models, and datasets that have been produced.
- Bonus: You will learn how to read and write a research paper, how to conduct scientific experiments, and how to present the results in a paper.
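As a small taste of the k-anonymity side of the research phase, the sketch below shows one common generalization step (binning exact ages into ranges) and a check that every quasi-identifier combination occurs at least k times. All names, the bin width, and the toy records are illustrative assumptions, not the seminar's prescribed setup:

```python
from collections import Counter

def generalize_age(age, bin_width):
    """Coarsen an exact age into a range label, e.g. 34 -> '30-39'."""
    lo = (age // bin_width) * bin_width
    return f"{lo}-{lo + bin_width - 1}"

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values appears
    in at least k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

# Toy records with a protected attribute (age) and another quasi-identifier.
records = [
    {"age": 23, "zip": "12345", "diagnosis": "A"},
    {"age": 27, "zip": "12345", "diagnosis": "B"},
    {"age": 25, "zip": "12345", "diagnosis": "A"},
]

# Exact ages make every record unique, so the data is not 2-anonymous.
raw = [{"age": r["age"], "zip": r["zip"]} for r in records]
print(is_k_anonymous(raw, ["age", "zip"], k=2))     # False

# Generalizing age into decades makes the three records indistinguishable.
coarse = [{"age": generalize_age(r["age"], 10), "zip": r["zip"]} for r in records]
print(is_k_anonymous(coarse, ["age", "zip"], k=2))  # True
```

The bin width plays the same role here as epsilon does in differential privacy: widening the bins raises k (more privacy) while discarding detail the ML models could have used, which is precisely the trade-off the teams will measure.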