DQ4AI: Data Quality Assessment

Dr. Lisa Ehrlinger and Sedir Mohammed

Description

Many AI systems depend on large quantities of suitable training data. This dependency creates challenges not only concerning the availability of data but also regarding its quality. For example, incomplete, erroneous, inappropriate, or asymmetric training data leads to unreliable models and can ultimately lead to poor decisions, a phenomenon often referred to as "garbage in, garbage out" (GIGO). GIGO expresses the idea that in computing and other spheres, incorrect or poor-quality input will always produce faulty output¹. High-performance AI applications therefore require high-quality training and test data.

Research has divided data quality (DQ) into various dimensions, such as accuracy, consistency, and reputation. Despite extensive research into DQ dimensions over the last decades, there is no consensus on how these dimensions should be measured.
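To illustrate why the measurement of a single dimension is open to debate, here is a minimal sketch that computes completeness in two plausible ways. The function names, the toy table, and the convention that an empty table counts as fully complete are assumptions made for illustration; neither definition is taken from the literature listed below.

```python
from typing import Optional, Sequence

Row = Sequence[Optional[object]]


def cell_completeness(rows: Sequence[Row]) -> float:
    """Share of non-missing cells across the whole table."""
    cells = [value for row in rows for value in row]
    if not cells:
        return 1.0  # assumption: an empty table counts as complete
    return sum(value is not None for value in cells) / len(cells)


def row_completeness(rows: Sequence[Row]) -> float:
    """Share of rows that contain no missing value at all."""
    if not rows:
        return 1.0  # assumption: an empty table counts as complete
    return sum(all(value is not None for value in row) for row in rows) / len(rows)


# Hypothetical toy table; None marks a missing value.
rows = [
    ("Alice", 34, "Berlin"),
    ("Bob", None, "Potsdam"),
    ("Carol", 29, None),
    ("Dan", 41, "Hamburg"),
]

print(f"cell-level completeness: {cell_completeness(rows):.2f}")
print(f"row-level completeness:  {row_completeness(rows):.2f}")
```

On this table, 10 of 12 cells are filled but only 2 of 4 rows are complete, so the two definitions report noticeably different scores for the same data.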

What is the goal of the seminar?

In this seminar, we will (1) introduce you to the field of data quality, (2) jointly investigate methods to measure different DQ dimensions, and (3) develop a framework that implements these methods (e.g., metrics, algorithms) in practice.

To achieve that, we have the following plan:

Kickoff Phase: We will present an overview of the state of the art of DQ assessment and existing DQ dimensions. The entire group will discuss the envisioned architecture of the framework. Following this, each team (ideally 2 students) can select a DQ dimension (e.g., accuracy, completeness, representativity, relevancy) for implementation within the framework.
Research: Each team will read related work on their selected DQ dimension and prepare a presentation for the entire group. The team should also propose an idea for implementing the selected DQ dimension in the targeted framework.
Deliverable: The teams will collaboratively write a paper-style technical report to present their developed DQ assessment framework, the selected DQ dimensions, and an experimental evaluation. The code of the framework itself and the experimental evaluation need to be provided as well.
Bonus: You will learn how to read and write scientific research papers.
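As a starting point for the architecture discussion, one possible shape for such a framework is a common interface that each team implements for its dimension. The sketch below is only an assumption about how this could look: the names `DQMetric`, `Completeness`, `Accuracy`, and `assess` are hypothetical, and the actual design will be decided by the group in the kickoff phase.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Sequence

# A table is modeled as a list of records (column name -> value).
Table = Sequence[Dict[str, Optional[object]]]


class DQMetric(ABC):
    """Hypothetical base class: one subclass per DQ dimension."""

    name: str

    @abstractmethod
    def measure(self, table: Table) -> float:
        """Return a score in [0, 1], where 1 is best."""


class Completeness(DQMetric):
    name = "completeness"

    def measure(self, table: Table) -> float:
        # Share of cells that are not missing (None).
        cells = [value for record in table for value in record.values()]
        return sum(v is not None for v in cells) / len(cells) if cells else 1.0


class Accuracy(DQMetric):
    name = "accuracy"

    def __init__(self, reference: Table) -> None:
        # Assumption: a trusted reference table with the same row order exists.
        self.reference = reference

    def measure(self, table: Table) -> float:
        # Share of cells that equal the corresponding reference cell.
        pairs = [
            (record.get(column), expected[column])
            for record, expected in zip(table, self.reference)
            for column in expected
        ]
        return sum(a == b for a, b in pairs) / len(pairs) if pairs else 1.0


def assess(table: Table, metrics: Sequence[DQMetric]) -> Dict[str, float]:
    """Run every registered metric and collect the scores by dimension name."""
    return {metric.name: metric.measure(table) for metric in metrics}


# Hypothetical toy data.
table = [
    {"name": "Alice", "age": 34},
    {"name": "Bob", "age": None},
    {"name": "Carol", "age": 92},
]
reference = [
    {"name": "Alice", "age": 34},
    {"name": "Bob", "age": 27},
    {"name": "Carol", "age": 29},
]

report = assess(table, [Completeness(), Accuracy(reference)])
print(report)
```

With an interface of this kind, each team could add its dimension as a new subclass without touching the code of the other teams.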

Prerequisites

For this seminar, participants need to be able to program fluently in Python and know how to use Jupyter notebooks.

Organization

The organizational details for this seminar are as follows:

Project seminar for master's students
Language of instruction: English
6 credit points, 4 SWS
At most 10 participants (ideally, 5 teams of 2 students each)

Time Table

Our meetings are currently scheduled for Mondays from 17:00 to 18:30 in Campus II, Building F, Room F-2.11. In our first meeting, we will discuss possible alternative times that suit everyone.

The following timetable lists the main semester milestones and is still tentative.

Date	Time	Room	Topic	Slides
14.10.2024	17:00-18:30	F-2.11	Introduction	Slides
21.10.2024	17:00-18:30	F-2.11	Group allocation and topic selection
22.10.2024	15:15-16:45	F-E.06	"How to read a research paper"	Slides
04.11.2024	17:00-18:30	F-2.11	Report on progress and questions
18.11.2024	17:00-18:30	F-2.11	Report on progress and questions
25.11.2024	17:00-18:30	F-2.11	Report on progress and questions
02.12.2024	17:00-18:30	F-2.10	Mid-term presentation
09.12.2024	17:00-18:30	F-2.11	Report on progress and questions
16.12.2024	17:00-18:30	F-2.11	Report on progress and questions
06.01.2025	17:00-18:30	F-2.11	Report on progress and questions
20.01.2025	17:00-18:30	F-2.11	Report on progress and questions
14.02.2025	13:30-14:30	F-2.10	End-term presentation
01.03.2025	23:59		Final submission

Literature

To get an introduction to data quality and its various dimensions, you can start by reading the following literature, which you can find on dblp or Google Scholar:

How to read a paper

Keshav, S. (2007). How to read a paper. ACM SIGCOMM Computer Communication Review, 37(3), 83-84.

Data quality literature

Ehrlinger, L., Werth, B., & Wöß, W. (2018). Automated continuous data quality measurement with QuaIIe. International Journal on Advances in Software, 11(3), 400-417.
Heinrich, B., Hristova, D., Klier, M., Schiller, A., & Szubartowicz, M. (2018). Requirements for data quality metrics. Journal of Data and Information Quality (JDIQ), 9(2), 1-32.
Mohammed, S., Harmouch, H., Naumann, F., & Srivastava, D. (2024). Data Quality Assessment: Challenges and Opportunities. arXiv preprint arXiv:2403.00526.
Mohammed, S., Brandner, L., Hallensleben, S., Harmouch, H., Hauschke, A., Heesen, J., Hildebrandt, S., Hirsbrunner, S.D., Keselj, J., Mahlow, P., Naumann, F., Rostalski, F., Wilken, A., & Wölke, A. (2023). Ein Glossar zur Datenqualität (1.2). Zenodo. https://doi.org/10.5281/zenodo.7702426 (German)
Wang, R. Y., & Strong, D. M. (1996). Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4), 5-33.

Grading

The final grade is weighted with 6 credit points (LP) and is composed of the following:

(15%) Active participation in meetings and discussions
(15%) Technical presentation of DQ dimension (existing research plus own idea)
(20%) Mid-term and end-term presentations
(20%) Quality of implementation and results
(30%) Final paper-style submission
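The weighting above can be read as a weighted average, sketched below. The component grades are invented for illustration, and how the individual parts are actually graded, combined, and rounded is an assumption, not a statement of the official procedure.

```python
# Weights taken from the list above; they sum to 100%.
WEIGHTS = {
    "participation": 0.15,
    "technical_presentation": 0.15,
    "mid_and_end_term_presentations": 0.20,
    "implementation_and_results": 0.20,
    "final_report": 0.30,
}


def weighted_grade(component_grades: dict) -> float:
    """Combine per-component grades into one weighted average."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(WEIGHTS[part] * component_grades[part] for part in WEIGHTS)


# Hypothetical component grades (lower is better in this example).
example = {
    "participation": 1.0,
    "technical_presentation": 2.0,
    "mid_and_end_term_presentations": 1.7,
    "implementation_and_results": 1.3,
    "final_report": 2.0,
}

print(f"weighted grade: {weighted_grade(example):.2f}")
```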