Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Data Quality Foundations (VL, MSc)

Dr. Lisa Ehrlinger

High-quality data is the basis for decision-making in enterprises, making data quality assessment a critical concern for any organization. A few years ago, decision makers were still able to manually assess and interpret the quality of the data at hand. However, with recent advances in digitalization and the deployment of artificial intelligence (AI) systems in practice, the amount of data being collected, stored, and consequently used for automated decision-making exceeds the capabilities of humans to process it. Hence, an urgent need for automated data quality assessment and improvement methods has emerged.

This lecture provides a comprehensive foundation in data quality assessment and improvement. Beginning with an overview of the field's development and various perspectives on data quality, we will explore each key data quality dimension in detail, including completeness, consistency, minimality, and diversity. For each dimension, you will learn the assessment methods, the measurement metrics, and the specific data error types associated with it. We will then examine different data quality tools and the error pollution techniques used to evaluate them. The final session focuses on methodological approaches for managing data quality within organizational contexts.

This lecture is essential for future data science professionals working in companies and handling test and training data for AI systems. We will go beyond simple data preprocessing to cover comprehensive methods for managing data quality enterprise-wide.

Literature

Carlo Batini and Monica Scannapieco: Data and Information Quality – Dimensions, Principles and Techniques, Springer International Publishing AG, 2016.

The book can be borrowed from our institute or ordered online (e.g., via Amazon). 

There are also other textbooks on data quality that are suitable for supporting the lecture. I can recommend the following, for example:

In addition, the following publications, which are available online, are recommended:

Lectures

The lecture will take place once a week on Tuesdays from 11:00 to 12:30 in FE.06. There will be no separate exercise session, but interactive elements such as practical tasks and quizzes will be integrated into the weekly lecture. Changes to the room or time will be announced promptly on this website.

 

Date           Time         Room   Topic
Tue, 14.10.25  11:00-12:30  FE.06  Introduction and overview
Tue, 21.10.25  11:00-12:30  FE.06  Data quality key concepts
Tue, 28.10.25  11:00-12:30  FE.06  Data errors and pollution
Tue, 04.11.25  11:00-12:30  FE.06  No lecture - "teaching day"
Tue, 11.11.25  11:00-12:30  FE.06  Data quality tools
Tue, 18.11.25  11:00-12:30  FE.06  Completeness
Tue, 25.11.25  11:00-12:30  FE.06  Accuracy and correctness
Tue, 02.12.25  11:00-12:30  FE.06  Consistency
Tue, 09.12.25  11:00-12:30  FE.06  Minimality
Tue, 16.12.25  11:00-12:30  FE.06  Readability and understandability
Tue, 23.12.25  11:00-12:30  FE.06  No lecture - Christmas break
Tue, 30.12.25  11:00-12:30  FE.06  No lecture - Christmas break
Tue, 06.01.26  11:00-12:30  FE.06  Diversity (Guest lecture by Dr. Sedir Mohammed)
Tue, 13.01.26  11:00-12:30  FE.06  Timeliness
Tue, 20.01.26  11:00-12:30  FE.06  Data quality management and governance
Tue, 03.02.26  11:00-12:30  FE.06  Exam preparation

Exam

The grade will be determined by a written exam. The date and room of the exam will be published in October 2025.