Advanced Error Detection
Dr. Lisa Ehrlinger, Francesco Pugnaloni, Prof. Felix Naumann
Project seminar for master's students
Description
Data quality is the foundation for reliable analysis using artificial intelligence (AI) and for decision-making. Errors in datasets, such as typographical errors, duplicate records, noise, and functional dependency violations, can degrade the performance of downstream tasks, for example by causing incorrect predictions of AI models [1,2]. Consequently, much research has been conducted on describing, classifying, detecting, and cleaning data errors on a general level. However, each error type presents unique challenges for its detection. Therefore, generalized error-detection tools, such as Raha [3], which perform well on average, sometimes fall short in detecting rare and underexplored error types, such as word transpositions. Moreover, errors can occur not only at the data level (“extension”), but also at the schema level (“intension”). For example, poor schema design leads to quality issues, such as redundant tables representing the same concept or non-atomic attributes (e.g., a single address field).
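To make one of the error types above concrete, the following is a minimal sketch of how functional dependency violations could be detected. The function name `fd_violations`, the dependency `zip -> city`, and the sample rows are made up for illustration; this is not the method used by Raha or any other tool cited here.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Find groups of rows that violate the functional dependency lhs -> rhs.

    rows: list of dicts (one dict per record)
    lhs:  list of attribute names on the left-hand side
    rhs:  single attribute name on the right-hand side
    Returns {lhs_values: {rhs_value: [row indices]}} for every lhs value
    that maps to more than one distinct rhs value.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for index, row in enumerate(rows):
        key = tuple(row[attribute] for attribute in lhs)
        seen[key][row[rhs]].append(index)
    return {
        key: dict(by_value)
        for key, by_value in seen.items()
        if len(by_value) > 1
    }


# Hypothetical sample data: row 2 contains a typo in the city name.
rows = [
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "14482", "city": "Postdam"},
    {"zip": "10115", "city": "Berlin"},
]

violations = fd_violations(rows, lhs=["zip"], rhs="city")
print(violations)
```

The sketch only reports which rows disagree. Deciding which of the conflicting values is the erroneous one is a separate problem.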
In this seminar, teams of 2 or 3 students will select a specific topic and develop advanced detection and measurement methods that outperform the current state of the art. Example topics for data-level quality are “misfielded value detection”, “noise detection”, or “heterogeneous formatting error detection”. For schema quality, we offer a seminar topic on “embedding-based schema quality assessment”, where students will learn and apply table representation learning techniques (i.e., table embeddings) to automatically detect redundancies in database schemas.
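The general idea behind embedding-based redundancy detection can be sketched as follows: represent each table as a vector and flag pairs of tables whose vectors are very similar. The sketch below is a toy stand-in, not a table representation learning technique. It assumes hand-built character trigram counts over column names instead of learned embeddings, and the schema, the function names, and the threshold of 0.5 are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def embed(columns, n=3):
    """Toy table vector: character n-gram counts over normalized column names."""
    grams = Counter()
    for name in columns:
        padded = f"#{name.lower().replace('_', '')}#"
        grams.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[gram] * b[gram] for gram in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def redundant_pairs(schema, threshold=0.5):
    """Return (table, table, similarity) for pairs at or above the threshold."""
    vectors = {table: embed(columns) for table, columns in schema.items()}
    pairs = []
    for left, right in combinations(sorted(schema), 2):
        similarity = cosine(vectors[left], vectors[right])
        if similarity >= threshold:
            pairs.append((left, right, similarity))
    return pairs


# Hypothetical schema: "customers" and "clients" model the same concept.
schema = {
    "customers": ["id", "name", "email", "phone"],
    "clients": ["id", "full_name", "email", "phone"],
    "orders": ["id", "order_date", "total"],
}

pairs = redundant_pairs(schema)
for left, right, similarity in pairs:
    print(f"{left} ~ {right}: {similarity:.2f}")
```

Learned table embeddings would replace `embed` and could also take cell values into account. The pairwise comparison and thresholding step would stay the same in spirit.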
If you have any questions about the seminar in advance, please reach out to lisa.ehrlinger(at)hpi.de and francesco.pugnaloni(at)hpi.de.
Goals of the seminar
In this seminar, you will (1) learn about the wide landscape of data-level and schema-level errors, detection techniques, and table representation learning, and (2) develop your own solution that detects a specific data error or assesses schema quality and outperforms the state of the art. To achieve that, we have the following plan:
- Kickoff: We will present an overview of the state of the art of data-level and schema-level error types, error detection [1,2,4], and table representation learning techniques.
- Experimental setup: You will familiarize yourself with the provided datasets.
- Research: Based on a “how to read a research paper” session, you will read related papers about data errors and error detection to develop your own approach on how to improve the detection of your selected error type.
- Implement and benchmark your approach: You will implement your approach in Python. To show the effectiveness, efficiency, and scalability of your approach, you will plan and conduct experiments in which you benchmark your approach against the state of the art (e.g., [3,4]).
Deliverable: The seminar participants will jointly write a paper-style technical report to present their developed approach for the detection of their selected data error type, along with the results of the experimental evaluation. The code for error detection and evaluation shall be provided as well.
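As an illustration of the benchmarking step, effectiveness of error detection is commonly reported as precision, recall, and F1 over the set of flagged cells. The following is a minimal sketch under that assumption. The function name, the cell identifiers in the form (row index, attribute), and the sample sets are hypothetical, and the seminar may prescribe a different evaluation protocol.

```python
def precision_recall_f1(detected, actual):
    """Compare detected erroneous cells against the ground-truth erroneous cells."""
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical ground truth and detector output, as (row index, attribute) pairs.
actual_errors = {(0, "city"), (2, "zip"), (5, "name")}
detected_errors = {(0, "city"), (2, "zip"), (3, "name"), (4, "city")}

precision, recall, f1 = precision_recall_f1(detected_errors, actual_errors)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Efficiency and scalability would be measured separately, for example by recording runtime over increasing dataset sizes.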
Time Table
Our meetings are currently scheduled for Thursdays from 13.30 to 15.00 in Campus II, Building F, in Room F-E.06.
| Date | Room | Topic | Slides |
| 16.04.2026 | F-E.06 | Introduction: background on error detection and table representation learning | Slides |
| 23.04.2026 | F-E.06 | Group allocation and topic assignment + Session "How to read a research paper" | Slides |
| 30.04.2026 | F-E.06 | No meeting - paper reading | |
| 07.05.2026 | F-E.06 | Weekly meeting and progress report | |
| 14.05.2026 | F-E.06 | No meeting – public holiday | |
| 21.05.2026 | F-2.11 | Weekly meeting and progress report | |
| 28.05.2026 | F-2.11 | Weekly meeting and progress report | |
| 04.06.2026 | F-2.11 | Weekly meeting and progress report | |
| 11.06.2026 | F-2.11 | Mid-term presentation | |
| 18.06.2026 | F-2.11 | Weekly meeting and progress report | |
| 25.06.2026 | F-2.11 | Weekly meeting and progress report | |
| 02.07.2026 | F-2.11 | Weekly meeting and progress report | |
| 09.07.2026 | F-2.11 | Weekly meeting and progress report | |
| 16.07.2026 | F-2.11 | Weekly meeting and progress report | |
| 23.07.2026 | F-2.11 | End-term presentation & final submission | |
Organization
General
- Project seminar for master's students
- Language: English
- 6 credit points, 4 SWS
Requirements
- Programming skills in Python
Grading
In the seminar, each team will develop an approach and write a paper-style report. The seminar counts for 6 ECTS, and the final grade consists of the following:
- (30%) Quality of approach
- (10%) Quality of implementation and results
- (10%) Midterm presentation
- (20%) Final presentation
- (30%) Final paper-style submission
Literature
For an introduction to table embeddings, schema quality, and data errors, you can start by reading the following literature, which you can find on dblp, Google Scholar, or the ACM Digital Library.
Data errors and schema quality literature
[1] Paulo Oliveira, Fátima Rodrigues, Pedro Henriques, and Helena Galhardas. 2005. A Taxonomy of Data Quality Problems. In 2nd International Workshop on Data and Information Quality. 219–233.
[2] João Marcelo Borovina Josko. 2018. A Formal Taxonomy of Temporal Data Defects. In International Workshop on Data Quality and Trust (QUAT), Vol. 11235. Springer, 94–110.
[3] Mohammad Mahdavi, Ziawasch Abedjan, Raul Castro Fernandez, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang. 2019. Raha: A Configuration-Free Error Detection System. In Proceedings of the International Conference on Management of Data (SIGMOD). ACM, 865–882.
[4] Wei Ni, Xiaoye Miao, Xiangyu Zhao, Yangyang Wu, Shuwei Liang, and Jianwei Yin. 2024. Automatic Data Repair: Are We Ready to Deploy? Proceedings of the VLDB Endowment 17, 10 (2024), 2617–2630.
[5] Lisa Ehrlinger and Wolfram Wöß. A Novel Data Quality Metric for Minimality. QUAT@WISE 2018.
[6] T. Cong, M. Hulsebos, Z. Sun, P. Groth, and H. V. Jagadish. Observatory: Characterizing Embeddings of Relational Tables. VLDB 2024.
[7] G. Badaro, M. Saeed, and P. Papotti. Transformers for Tabular Data Representation: A Survey of Models and Applications. ACL 2023.
How to read a paper
[8] S. Keshav. 2007. How to Read a Paper. ACM SIGCOMM Computer Communication Review 37, 3 (2007), 83–84.