Data Quality Foundations (VL, MSc)

Dr. Lisa Ehrlinger

High-quality data is the basis for decision-making in enterprises, making data quality assessment a critical concern for any organization. A few years ago, decision makers were still able to manually assess and interpret the quality of the data at hand. However, with recent advances in digitalization and the deployment of artificial intelligence (AI) systems in practice, the amount of data being collected, stored, and consequently used for automated decision-making exceeds what humans can process. Hence, there is an urgent need for automated data quality assessment and improvement methods.

This lecture provides a comprehensive foundation in data quality assessment and improvement. Beginning with an overview of the field's development and various perspectives on data quality, we will explore each key data quality dimension in detail, including completeness, consistency, minimality, and diversity. For each dimension, you will learn assessment methods, measurement metrics, and the specific data error types associated with it. We will then examine different data quality tools and data pollution techniques used for evaluation purposes. The final session focuses on methodological approaches for managing data quality within organizational contexts.

This lecture is essential for future data science professionals working in companies and handling test and training data for AI systems. We will go beyond simple data preprocessing to cover comprehensive methods for managing data quality enterprise-wide.

Literature

Carlo Batini and Monica Scannapieco: Data and Information Quality – Dimensions, Principles and Techniques, Springer International Publishing AG, 2016.

The book can be borrowed from our institute or ordered online (e.g., via Amazon).

Other textbooks on data quality are also suitable to support the lecture. For example, I can recommend the following:

Sebastian-Coleman, L. (2012). Measuring Data Quality for Ongoing Improvement: A Data Quality Assessment Framework. Newnes.
Dasu, T., & Johnson, T. (2003). Exploratory Data Mining and Data Cleaning. John Wiley & Sons.

In addition, the following publications, which are available online, are recommended:

Madnick, S. E., Wang, R. Y., Lee, Y. W., & Zhu, H. (2009). Overview and Framework for Data and Information Quality Research. ACM Journal of Data and Information Quality (JDIQ), 1(1), 1-22.
Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data Quality Assessment. Communications of the ACM, 45(4), 211-218.

Lectures

The lecture will take place once a week on Tuesdays from 11:00-12:30 in FE.06. There will be no separate exercise session, but interactive elements such as practical tasks and quizzes will be integrated into the weekly lecture. Any changes to room or time will be announced promptly on this website.

Date	Time	Room	Topic
Tue, 14.10.25	11:00-12:30	FE.06	Introduction and overview
Tue, 21.10.25	11:00-12:30	FE.06	Data quality key concepts
Tue, 28.10.25	11:00-12:30	FE.06	Data errors and pollution
Tue, 04.11.25	11:00-12:30	FE.06	No lecture - "teaching day"
Tue, 11.11.25	11:00-12:30	FE.06	Data quality tools
Tue, 18.11.25	11:00-12:30	FE.06	Completeness
Tue, 25.11.25	11:00-12:30	FE.06	Accuracy and correctness
Tue, 02.12.25	11:00-12:30	FE.06	Consistency
Tue, 09.12.25	11:00-12:30	FE.06	Minimality
Tue, 16.12.25	11:00-12:30	FE.06	Readability and understandability
Tue, 23.12.25	11:00-12:30	FE.06	No lecture - Christmas break
Tue, 30.12.25	11:00-12:30	FE.06	No lecture - Christmas break
Tue, 06.01.26	11:00-12:30	FE.06	Diversity (Guest lecture by Dr. Sedir Mohammed)
Tue, 13.01.26	11:00-12:30	FE.06	Timeliness
Tue, 20.01.26	11:00-12:30	FE.06	Data quality management and governance
Tue, 03.02.26	11:00-12:30	FE.06	Exam preparation

Exam

The grade will be determined by a written exam. The date and room of the exam will be published in October 2025.