Data Quality for AI (Summer Semester 2023)

Lecturers: Prof. Dr. Felix Naumann (Information Systems), Dr. Hazar Harmouch (Information Systems), Sedir Mohammed
Course Website:

General Information

Weekly Hours: 4
Credits: 6
Graded: yes
Enrolment Deadline: 01.04.2023 - 07.05.2023
Teaching Form: Project / Seminar
Enrolment Type: Compulsory Elective Module
Course Language: English
Maximum number of participants: 6

Programs, Module Groups & Modules

IT-Systems Engineering MA

Data Engineering MA

DANA: Data Analytics
- HPI-DANA-K Konzepte und Methoden
DANA: Data Analytics
- HPI-DANA-T Techniken und Werkzeuge
DANA: Data Analytics
- HPI-DANA-S Spezialisierung
CODS: Complex Data Systems
- HPI-CODS-K Konzepte und Methoden
CODS: Complex Data Systems
- HPI-CODS-T Techniken und Werkzeuge
CODS: Complex Data Systems
- HPI-CODS-S Spezialisierung

Software Systems Engineering MA

SSYS: Software Systems
- HPI-SSYS-C Concepts and Methods
SSYS: Software Systems
- HPI-SSYS-T Technologies and Tools
SSYS: Software Systems
- HPI-SSYS-S Specialization
DSYS: Data-Driven Systems
- HPI-DSYS-C Concepts and Methods
DSYS: Data-Driven Systems
- HPI-DSYS-T Technologies and Tools
DSYS: Data-Driven Systems
- HPI-DSYS-S Specialization

Description

Many AI systems depend on large quantities of suitable training data. This dependency creates challenges concerning not only the availability of data but also its quality. For example, incomplete, erroneous, inappropriate, or asymmetric training data leads to unreliable models and can ultimately lead to poor decisions. This effect is often summarized as "garbage in, garbage out" (GIGO): the idea that, in computing and other spheres, incorrect or poor-quality input will always produce faulty output¹. High-performance AI applications therefore require high-quality training and test data.

Training data can include personal information, sensitive financial details, and confidential business data. Privacy is a fundamental human right, and protecting personal information is essential to ensure trust and maintain a fair and just society. One common approach to address these concerns is to use anonymized data in machine learning algorithms. Differential privacy and k-anonymity are the most widely used families of anonymization techniques. However, there is little research that demonstrates the effect of anonymization on data quality and thus on the downstream ML application.
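To make the idea of k-anonymity concrete, here is a minimal sketch in plain Python. It is not part of the seminar material: the records, the column names, the choice of quasi-identifiers, and the helper functions `generalize_age` and `k_of` are all hypothetical. A table is k-anonymous with respect to a set of quasi-identifiers if every combination of their values is shared by at least k records; generalizing exact ages into ranges is one simple way to raise k, at the cost of less precise data.

```python
from collections import Counter


def generalize_age(age, width=10):
    """Map an exact age to a range label such as '30-39' (hypothetical helper)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def k_of(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the
    same values on all quasi-identifier columns."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


# Invented toy records; 'age' and 'zip' are treated as quasi-identifiers.
records = [
    {"age": 31, "zip": "14482", "income": 52000},
    {"age": 35, "zip": "14482", "income": 48000},
    {"age": 38, "zip": "14482", "income": 61000},
    {"age": 42, "zip": "14467", "income": 39000},
    {"age": 47, "zip": "14467", "income": 75000},
]

# Replace each exact age with its ten-year range; other columns are unchanged.
anonymized = [{**r, "age": generalize_age(r["age"])} for r in records]

k_before = k_of(records, ["age", "zip"])     # every exact age is unique here
k_after = k_of(anonymized, ["age", "zip"])   # smallest group after generalization

print("k before generalization:", k_before)
print("k after generalization:", k_after)
```

The generalized table protects individuals better, but a model trained on it only sees age ranges. That loss of precision is the kind of data quality effect the seminar sets out to measure.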

What is the goal of the seminar?

In this seminar, we will introduce you to the field of data quality, and explore together the impact of anonymization techniques on data quality and AI model performance. To achieve that, we have the following plan:

Kickoff Phase: Each team, ideally consisting of two students, will be assigned a specific task, such as classification or regression. Your part is to choose one or more representative models (e.g., an SVM for classification) to solve this task with the respective datasets (see datasets section). The datasets need to contain protected features, such as age, that we will try to anonymize.
Research: Each team will explore the effect of anonymization on data quality with regard to the well-known data quality dimensions. This includes: (1) understanding and implementing the anonymization algorithms assigned to the team, (2) building an ML pipeline that uses the anonymized data to train the models the team has chosen, and (3) reporting the performance of the chosen models as a function of the degree of anonymization and showing the trade-off. We will provide you with state-of-the-art papers in the fields of data quality, differential privacy, and k-anonymity. More details about the dimensions and the experimental setup will be provided at the beginning of this phase.
Deliverable: The outcome of the seminar is a paper-style technical report, written collaboratively by the teams, that presents the results of the conducted analysis, together with the code, models, and datasets that have been produced.
Bonus: You will learn how to read and write a research paper, how to conduct scientific experiments, and how to present the results in a paper.
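The three research steps above (anonymize, train, report the trade-off) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the seminar's experimental setup: the toy data, the label rule, the 1D midpoint classifier, the clean held-out test set, and all function names are invented. The perturbation adds Laplace noise with scale `age_range / epsilon` to the age column, in the spirit of noise-based differential privacy, but it is not a vetted differential-privacy implementation.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def make_data(n, rng):
    # Hypothetical toy task: label is 1 when age >= 49, ages uniform in [18, 80].
    ages = [rng.uniform(18, 80) for _ in range(n)]
    return [(a, int(a >= 49)) for a in ages]


def perturb(data, epsilon, rng, age_range=62.0):
    # Smaller epsilon means larger noise, i.e. stronger anonymization.
    scale = age_range / epsilon
    return [(a + laplace_noise(scale, rng), y) for a, y in data]


def fit_threshold(train):
    # Simple 1D classifier: threshold at the midpoint of the two class means.
    ages0 = [a for a, y in train if y == 0]
    ages1 = [a for a, y in train if y == 1]
    return (sum(ages0) / len(ages0) + sum(ages1) / len(ages1)) / 2


def accuracy(threshold, test):
    return sum(int(a >= threshold) == y for a, y in test) / len(test)


rng = random.Random(0)
train = make_data(2000, rng)
test = make_data(1000, rng)  # assumption: evaluation on clean, non-anonymized data

# Train on data anonymized to different degrees and record test accuracy.
results = {}
for epsilon in [0.1, 1.0, 10.0, 1000.0]:
    noisy_train = perturb(train, epsilon, rng)
    results[epsilon] = accuracy(fit_threshold(noisy_train), test)

for epsilon, acc in results.items():
    print(f"epsilon={epsilon:>7}: accuracy={acc:.3f}")
```

Plotting accuracy against epsilon (or against k for k-anonymity) gives one view of the trade-off between the degree of anonymization and model performance. A real study would also measure the data quality dimensions directly and use the models and datasets chosen in the kickoff phase.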

Requirements

For this seminar, participants need to be able to program fluently in Python and know how to use Jupyter notebooks. The seminar also requires basic knowledge of machine learning algorithms.

Learning

Project seminar with weekly meetings

Examination

Intermediate and final presentation
Demonstration and report of method implementation and its experimental results

Dates

See the website of the Information Systems Chair.
