Advanced Error Detection

Dr. Lisa Ehrlinger, Francesco Pugnaloni, Prof. Felix Naumann

Project seminar for master's students

Description

Data quality is the foundation for reliable analysis using artificial intelligence (AI) and for decision-making. Errors in datasets, such as typographical errors, duplicate records, noise, and functional dependency violations, can degrade the performance of downstream tasks, for example by causing AI models to make incorrect predictions [1,2]. Consequently, much research has been conducted on describing, classifying, detecting, and cleaning data errors on a general level. However, each error type presents unique challenges for its detection. Generalized error-detection tools such as Raha [3] perform well on average but sometimes fall short in detecting rare and underexplored error types, such as word transpositions. Moreover, errors can occur not only at the data level (“extension”) but also at the schema level (“intension”). For example, poor schema design leads to quality issues such as redundant tables representing the same concept or non-atomic attributes (e.g., a single address field).
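To make two of the error types named above concrete, here is a minimal Python sketch that checks a functional dependency and tests for a word transposition. The function names, the example attributes zip and city, and the restriction to swaps of two adjacent words are illustrative assumptions; this is not how Raha or any other cited tool works.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return every LHS value combination that maps to more than one RHS value.

    rows: list of dicts (one per record); lhs: list of attribute names;
    rhs: a single attribute name. All names are hypothetical examples.
    """
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        seen[key].add(row[rhs])
    # A functional dependency lhs -> rhs holds only if each key has one value.
    return {key: values for key, values in seen.items() if len(values) > 1}


def is_word_transposition(a, b):
    """True if b equals a with exactly two adjacent words swapped."""
    words_a, words_b = a.split(), b.split()
    if len(words_a) != len(words_b):
        return False
    diff = [i for i, (x, y) in enumerate(zip(words_a, words_b)) if x != y]
    return (
        len(diff) == 2
        and diff[1] == diff[0] + 1
        and words_a[diff[0]] == words_b[diff[1]]
        and words_a[diff[1]] == words_b[diff[0]]
    )


rows = [
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "14482", "city": "Berlin"},   # violates zip -> city
    {"zip": "10115", "city": "Berlin"},
]
violations = fd_violations(rows, ["zip"], "city")
swapped = is_word_transposition("Hasso Plattner Institute",
                                "Plattner Hasso Institute")
```

In this toy input, violations contains only the key ("14482",), because that zip code appears with two different cities. A check like this only flags the conflicting group; deciding which of the values is the erroneous one needs further evidence.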

In this seminar, teams of 2 or 3 students will select a specific topic and develop advanced detection and measurement methods that outperform the current state of the art. Example topics for data-level quality are “misfielded value detection”, “noise detection”, and “heterogeneous formatting error detection”. For schema quality, we offer a seminar topic on “embedding-based schema quality assessment”, where students will learn and apply table representation learning techniques (i.e., table embeddings) to automatically detect redundancies in database schemas.
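As a rough illustration of the schema-quality topic, the sketch below flags pairs of tables whose vector representations are highly similar. It assumes a stand-in representation: a bag of attribute-name tokens replaces a learned table embedding, and the 0.8 threshold, the function names, and the example schema are all hypothetical. In the seminar, the vectors would come from a table representation learning model instead.

```python
import math
import re
from collections import Counter
from itertools import combinations


def toy_embedding(columns):
    """Stand-in for a learned table embedding: counts of attribute-name tokens."""
    tokens = []
    for name in columns:
        tokens.extend(t for t in re.split(r"[^a-z0-9]+", name.lower()) if t)
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def redundant_pairs(schema, threshold=0.8):
    """Return table pairs whose similarity reaches the (assumed) threshold."""
    vectors = {table: toy_embedding(cols) for table, cols in schema.items()}
    pairs = []
    for a, b in combinations(sorted(vectors), 2):
        similarity = cosine(vectors[a], vectors[b])
        if similarity >= threshold:
            pairs.append((a, b, round(similarity, 3)))
    return pairs


schema = {
    "customer": ["id", "name", "email"],
    "client": ["id", "name", "email"],
    "orders": ["order_id", "total"],
}
candidates = redundant_pairs(schema)
```

Here only the pair (client, customer) is reported as a redundancy candidate. Name-token overlap misses synonyms and ignores cell values, which is one motivation for using learned embeddings in place of this toy representation.
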
If you have any questions about the seminar in advance, please reach out to lisa.ehrlinger(at)hpi.de and francesco.pugnaloni(at)hpi.de. 

Goals of the seminar

In this seminar, you will (1) learn about the wide landscape of data-level and schema-level errors, detection techniques, and table representation learning, and (2) develop your own solution that detects a specific data error or assesses schema quality and outperforms the state of the art. To achieve that, we have the following plan:

  • Kickoff: We will present an overview of the state of the art of data-level and schema-level error types, error detection [1,2,4], and table representation learning techniques.
  • Experimental setup: You will familiarize yourself with the provided datasets.
  • Research: Based on a “how to read a research paper” session, you will read related papers about data errors and error detection to develop your own approach to improving the detection of your selected error type.
  • Implement and benchmark your approach: You will implement your approach in Python. To show the effectiveness, efficiency, and scalability of your approach, you will plan and conduct experiments where you benchmark your approach against the state of the art (e.g., [3,4]).
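
Effectiveness in such a benchmark is commonly reported as precision, recall, and F1 over the cells flagged as erroneous. The following is a minimal sketch under that assumption; the (row index, attribute) cell representation and the function name are illustrative and not a prescribed evaluation protocol for the seminar.

```python
def detection_scores(detected, actual):
    """Compute precision, recall, and F1 for a set of flagged cells.

    detected: cells flagged by the detector; actual: ground-truth error cells.
    Cells are modeled as (row_index, attribute) tuples, which is an assumption.
    """
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


actual = {(0, "city"), (2, "zip"), (5, "name")}
detected = {(0, "city"), (2, "zip"), (3, "zip"), (4, "name")}
precision, recall, f1 = detection_scores(detected, actual)
```

With these toy sets, two of the four flagged cells are real errors and two of the three real errors are found. Running the same function on the output of a baseline detector over the same ground truth gives directly comparable numbers.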

Deliverable: The seminar participants will jointly write a paper-style technical report to present their developed approach for the detection of their selected data error type, along with the results of the experimental evaluation. The code for error detection and evaluation shall be provided as well.

Time Table

Our meetings are currently scheduled for Thursdays from 13.30 to 15.00 in Campus II, Building F, in Room F-E.06. 

Date | Room | Topic | Slides
16.04.2026 | F-E.06 | Introduction; background about error detection and table representation learning | Slides
23.04.2026 | F-E.06 | Group allocation and topic assignment + session "How to read a research paper" | Slides
30.04.2026 | F-E.06 | No meeting - paper reading |
07.05.2026 | F-E.06 | Weekly meeting and progress report |
14.05.2026 | F-E.06 | No meeting - public holiday |
21.05.2026 | F-2.11 | Weekly meeting and progress report |
28.05.2026 | F-2.11 | Weekly meeting and progress report |
04.06.2026 | F-2.11 | Weekly meeting and progress report |
11.06.2026 | F-2.11 | Mid-term presentation |
18.06.2026 | F-2.11 | Weekly meeting and progress report |
25.06.2026 | F-2.11 | Weekly meeting and progress report |
02.07.2026 | F-2.11 | Weekly meeting and progress report |
09.07.2026 | F-2.11 | Weekly meeting and progress report |
16.07.2026 | F-2.11 | Weekly meeting and progress report |
23.07.2026 | F-2.11 | End-term presentation & final submission |

 

 

Organization

General

  • Project seminar for master's students
  • Language: English
  • 6 credit points, 4 SWS

Requirements

  • Programming skills in Python 

Grading
In the seminar, each team will develop an approach and write a paper-style report. The final grade is weighted by 6 ECTS and consists of the following:

  • (30%) Quality of approach
  • (10%) Quality of implementation and results
  • (10%) Midterm presentation
  • (20%) Final presentation
  • (30%) Final paper-style submission
     

Literature

For an introduction to table embeddings, schema quality, and data errors, you can start by reading the following literature, which you can find on dblp, Google Scholar, or the ACM Digital Library.

Data errors and schema quality literature

[1] Paulo Oliveira, Fátima Rodrigues, Pedro Henriques, and Helena Galhardas. 2005. A Taxonomy of Data Quality Problems. In 2nd International Workshop on Data and Information Quality. 219–233.
[2] João Marcelo Borovina Josko. 2018. A Formal Taxonomy of Temporal Data Defects. In International Workshop on Data Quality and Trust (QUAT), Vol. 11235. Springer, 94–110.
[3] Mohammad Mahdavi, Ziawasch Abedjan, Raul Castro Fernandez, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang. 2019. Raha: A Configuration-Free Error Detection System. In Proceedings of the International Conference on Management of Data (SIGMOD). ACM, 865–882.
[4] Wei Ni, Xiaoye Miao, Xiangyu Zhao, Yangyang Wu, Shuwei Liang, and Jianwei Yin. 2024. Automatic Data Repair: Are We Ready to Deploy? Proceedings of the VLDB Endowment 17, 10 (2024), 2617–2630.
[5] Lisa Ehrlinger and Wolfram Wöß. A Novel Data Quality Metric for Minimality. QUAT@WISE 2018.
[6] T. Cong, M. Hulsebos, Z. Sun, P. Groth, and H. V. Jagadish. Observatory: Characterizing Embeddings of Relational Tables. VLDB 2024
[7] G. Badaro, M. Saeed, and P. Papotti. Transformers for Tabular Data Representation: A Survey of Models and Applications. Transactions of the Association for Computational Linguistics (TACL), 2023.

How to read a paper

[8] Keshav, S. (2007). How to read a paper. ACM SIGCOMM Computer Communication Review, 37(3), 83-84.