Data Quality for AI (Summer Semester 2023)

Lecturers: Prof. Dr. Felix Naumann (Information Systems), Dr. Hazar Harmouch (Information Systems), Sedir Mohammed
Course website: https://hpi.de/naumann/teaching/current-courses/ss-23/data-quality-for-ai.html

General Information

Weekly hours per semester: 4
ECTS: 6
Graded: Yes
Enrollment period: 01.04.2023 - 07.05.2023
Course type: Project / Seminar
Enrollment type: Compulsory elective module
Language of instruction: English
Maximum number of participants: 6

Degree Programs, Module Groups & Modules

IT-Systems Engineering MA

OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-K Concepts and Methods
OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-S Specialization
OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-T Technologies and Tools

Data Engineering MA

DANA: Data Analytics
- HPI-DANA-K Concepts and Methods
DANA: Data Analytics
- HPI-DANA-T Technologies and Tools
DANA: Data Analytics
- HPI-DANA-S Specialization
CODS: Complex Data Systems
- HPI-CODS-K Concepts and Methods
CODS: Complex Data Systems
- HPI-CODS-T Technologies and Tools
CODS: Complex Data Systems
- HPI-CODS-S Specialization

Software Systems Engineering MA

SSYS: Software Systems
- HPI-SSYS-C Concepts and Methods
SSYS: Software Systems
- HPI-SSYS-T Technologies and Tools
SSYS: Software Systems
- HPI-SSYS-S Specialization
DSYS: Data-Driven Systems
- HPI-DSYS-C Concepts and Methods
DSYS: Data-Driven Systems
- HPI-DSYS-T Technologies and Tools
DSYS: Data-Driven Systems
- HPI-DSYS-S Specialization

Description

Many AI systems depend on large quantities of suitable training data. This dependency creates challenges concerning not only the availability of data but also its quality. For example, incomplete, erroneous, inappropriate, or asymmetric training data leads to unreliable models and can ultimately lead to poor decisions, a problem often summarized as "garbage in, garbage out" (GIGO). GIGO expresses the idea that in computing and other spheres, incorrect or poor-quality input will always produce faulty output¹. High-performance AI applications therefore require high-quality training and test data.

This data can include personal information, sensitive financial details, and confidential business data. Privacy, however, is a fundamental human right, and protecting personal information is essential to ensure trust and maintain a fair and just society. One common approach to address these concerns is to use anonymized data in machine learning algorithms; differential privacy and k-anonymity are the most widely used families of anonymization techniques. So far, however, no substantial research demonstrates the effect of anonymization on data quality and thus on the downstream ML application.
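
To make the two families of techniques concrete, here is a minimal sketch in plain Python. It is not course material: the toy records, the column names (age, zip, income_high), and the helper functions are all hypothetical. The first part generalizes an age into a range and measures k as the size of the smallest group of records that share the same quasi-identifier values; the second part answers a counting query with Laplace noise, the standard mechanism for epsilon-differential privacy on queries with sensitivity 1.

```python
import random
from collections import Counter


def generalize_age(age, width):
    """Replace an exact age with a range label such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def k_of(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Noisy count; a counting query has sensitivity 1, so scale = 1 / epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy records; 'age' and 'zip' act as quasi-identifiers.
records = [
    {"age": 23, "zip": "14482", "income_high": 0},
    {"age": 27, "zip": "14482", "income_high": 0},
    {"age": 31, "zip": "14482", "income_high": 1},
    {"age": 38, "zip": "14482", "income_high": 1},
    {"age": 34, "zip": "14482", "income_high": 0},
    {"age": 29, "zip": "14482", "income_high": 1},
]

# k-anonymity: coarser age ranges make records indistinguishable from each other.
generalized = [{**r, "age": generalize_age(r["age"], 10)} for r in records]
print("k before generalization:", k_of(records, ["age", "zip"]))
print("k after generalization: ", k_of(generalized, ["age", "zip"]))

# Differential privacy: a smaller epsilon means more noise and stronger privacy.
rng = random.Random(0)
noisy = private_count(records, lambda r: r["income_high"] == 1, 1.0, rng)
print("noisy count of high-income records:", round(noisy, 2))
```

Both techniques trade information for privacy: generalization removes detail from the attribute values, while the Laplace mechanism perturbs the query result. This loss is the data quality effect the seminar sets out to measure.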

What is the goal of the seminar?

In this seminar, we will introduce you to the field of data quality and explore together the impact of anonymization techniques on data quality and AI model performance. To achieve that, we follow this plan:

Kickoff Phase: Each team ideally consists of 2 students and will be assigned a specific task, such as classification or regression. Your part is to choose one or more representative models (e.g., SVM for classification) to solve this task with the respective datasets (see datasets section). The datasets need to contain protected features, such as age, that we will try to anonymize.
Research: Each team will explore the effect of anonymizing the data on data quality with respect to the well-known data quality dimensions. This includes: (1) understanding and implementing the anonymization algorithms assigned to the team, (2) building an ML pipeline that uses the anonymized data to train the ML models the team has chosen, and (3) reporting the performance of the chosen models as a function of the degree of anonymization and showing the trade-off. We will provide you with state-of-the-art papers in the fields of data quality, differential privacy, and k-anonymity. More details about the dimensions and the experimental setup will be provided at the beginning of this phase.
Deliverable: The outcome of the seminar is a paper-style technical report that the teams write collaboratively to present the results of the conducted analysis, in addition to the code, models, and datasets that have been produced.
Bonus: You will learn how to read and write a research paper, how to conduct scientific experiments, and how to present the results in a paper.
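
As a rough illustration of the three research steps, the following sketch anonymizes a protected feature at several strengths, trains a model on the anonymized data, and reports the trade-off. Everything in it is an assumption made for illustration: the dataset is synthetic, the label rule (age of at least 43) is invented, and a one-feature threshold classifier stands in for whichever model a team actually chooses. The model is trained on generalized ages and evaluated on raw test ages, which is only one of several possible experimental designs.

```python
from collections import Counter


def generalize(age, width):
    """Anonymize an age by mapping it to the lower bound of its bin."""
    return (age // width) * width


def train_stump(xs, ys):
    """Pick the threshold t (predict 1 if x >= t) with the fewest training errors."""
    best_t, best_errors = None, len(xs) + 1
    for t in sorted(set(xs)):
        errors = sum((x >= t) != bool(y) for x, y in zip(xs, ys))
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t


def accuracy(threshold, xs, ys):
    """Fraction of samples for which the rule 'x >= threshold' matches the label."""
    correct = sum((x >= threshold) == bool(y) for x, y in zip(xs, ys))
    return correct / len(xs)


# Synthetic data (an assumption): ages cycle through 18..79, label is 1 from age 43.
ages = [18 + (i * 7) % 62 for i in range(434)]
labels = [int(a >= 43) for a in ages]
train_x, train_y = ages[:310], labels[:310]
test_x, test_y = ages[310:], labels[310:]

results = {}
for width in (1, 5, 20, 30):
    # Step 1: anonymize the protected feature in the training data.
    anonymized = [generalize(a, width) for a in train_x]
    # k is the size of the smallest group of identical anonymized values.
    k = min(Counter(anonymized).values())
    # Step 2: train the model on the anonymized data.
    threshold = train_stump(anonymized, train_y)
    # Step 3: record model performance on raw test data for this degree of anonymization.
    results[width] = (k, accuracy(threshold, test_x, test_y))

for width, (k, acc) in results.items():
    print(f"bin width {width:>2}: k = {k:>3}, test accuracy = {acc:.3f}")
```

Wider bins raise k, which means stronger anonymity, but they blur the boundary the model has to learn. A real study would replace the synthetic data, the stump, and the single accuracy score with the assigned datasets, the chosen models, and the data quality dimensions introduced in the seminar.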

Prerequisites

For this seminar, participants need to be able to program fluently in Python and know how to use Jupyter notebooks. The seminar also requires basic knowledge of machine learning algorithms.

Learning and Teaching Formats

Project seminar with weekly meetings

Assessment

Intermediate and final presentation
Demonstration and report of method implementation and its experimental results

Dates

See the website of the Information Systems chair.
