Advanced Error Detection

Lecturer

Dr. Lisa Ehrlinger
Francesco Pugnaloni
Prof. Dr. Felix Naumann

General information

Semester: Summer term 2026
hrs/wk: 4
ECTS: 6
Registration Time: 01/04/2026 - 30/04/2026
Course type: Project seminar (PS)
Teaching Language: English

Study programs, module groups & modules

M.Sc. Computer Science
- Specialised Studies
  - II Track: Algorithms and Foundations
    - Deep Dive
      - HPI-CS-AAD: Applied Algorithms - Deep Dive
    - Specialization
      - HPI-CS-AAS: Applied Algorithms - Specialization
  - I Track: Data and AI
    - Deep Dive
      - HPI-CS-DID: Data Integration - Deep Dive
      - HPI-CS-AID: AI Applications - Deep Dive
    - Specialization
      - HPI-CS-AIS: AI Applications - Specialization
      - HPI-CS-DIS: Data Integration - Specialization
  - III Track: Systems
    - Specialization
      - HPI-CS-DAS: Data Systems - Specialization
    - Deep Dive
      - HPI-CS-DAD: Data Systems - Deep Dive
  - V Track: Security Engineering
    - Deep Dive
      - HPI-CS-DAD: Data Systems - Deep Dive
- Mandatory Modules
  - I Track: Data and AI
    - HPI-CS-DA-CR: Critical Reading and Discussion
  - VI Track: Open Track
    - HPI-CS-O-CR: Critical Reading and Discussion
M.Sc. IT-Systems Engineering
- Operating Systems and Information Systems Technology (OSIS)
  - Concepts and Methods (HPI-OSIS-K)
  - Specialization (HPI-OSIS-S)
  - Technologies and Tools (HPI-OSIS-T)
M.Sc. Data Engineering
- Data Analytics (DANA)
  - Concepts and Methods (HPI-DANA-K)
  - Specialization (HPI-DANA-S)
  - Techniques and Tools (HPI-DANA-T)
M.Sc. Software Systems Engineering
- Machine Learning and Analytics (MALA)
  - Technologies and Tools (HPI-MALA-T)
  - Concepts and Methods (HPI-MALA-C)
  - Specialization (HPI-MALA-S)
- Data-Driven Systems (DSYS)
  - Technologies and Tools (HPI-DSYS-T)
  - Concepts and Methods (HPI-DSYS-C)
  - Specialization (HPI-DSYS-S)

More information

Description

Data quality is the foundation for reliable analysis using artificial intelligence (AI) and for decision-making. Errors in datasets, such as typographical errors, duplicate records, noise, and functional dependency violations, can degrade the performance of downstream tasks, for example by causing AI models to make incorrect predictions [1,2]. Consequently, much research has been conducted on describing, classifying, detecting, and cleaning data errors at a general level. However, each error type presents unique challenges for its detection. Generalized error-detection tools such as Raha [3], which perform well on average, therefore sometimes fall short in detecting rare and underexplored error types, such as word transpositions. Moreover, errors can occur not only at the data level (“extension”) but also at the schema level (“intension”). For example, poor schema design leads to quality issues such as redundant tables representing the same concept or non-atomic attributes (e.g., a single address field that combines street, city, and postal code).
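To make one of the error types above concrete, the following is a minimal sketch of how functional dependency violations could be flagged. The function name `fd_violations`, the example rows, and the dependency zip -> city are illustrative assumptions and not part of the seminar material or of any existing tool.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return indices of rows that take part in a violation of the FD lhs -> rhs.

    rows: list of dicts (one dict per record)
    lhs:  list of attribute names on the left-hand side
    rhs:  single attribute name on the right-hand side
    """
    # Group row indices by left-hand-side value, then by right-hand-side value.
    groups = defaultdict(lambda: defaultdict(list))
    for i, row in enumerate(rows):
        key = tuple(row[a] for a in lhs)
        groups[key][row[rhs]].append(i)

    violating = []
    for rhs_values in groups.values():
        # The FD is violated if one lhs value maps to more than one rhs value.
        if len(rhs_values) > 1:
            for indices in rhs_values.values():
                violating.extend(indices)
    return sorted(violating)


# Hypothetical example data: the third record contains a typo in "city".
rows = [
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "14482", "city": "Postdam"},
    {"zip": "10115", "city": "Berlin"},
]
print(fd_violations(rows, ["zip"], "city"))  # [0, 1, 2]
```

The sketch only reports which records conflict. Deciding which of the conflicting records is actually erroneous is a separate step that it does not attempt.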

In this seminar, teams of two or three students will select a specific topic and develop advanced detection and measurement methods that outperform the current state of the art. Example topics for data-level quality are “misfielded value detection”, “noise detection”, and “heterogeneous formatting error detection”. For schema quality, we offer a seminar topic on “embedding-based schema quality assessment”, where students will learn and apply table representation learning techniques (i.e., table embeddings) to automatically detect redundancies in database schemas.
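The general idea behind embedding-based redundancy detection can be sketched as follows: represent each table as a vector and report pairs of tables whose vectors are highly similar. The sketch is only an illustration under strong assumptions. `toy_embedding` is a stand-in that counts character trigrams of column names; it is not a learned table embedding, and the threshold of 0.8 as well as the example schema are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def toy_embedding(columns):
    """Stand-in for a learned table embedding (assumption): trigram counts of column names."""
    counts = Counter()
    for name in columns:
        padded = f"  {name.lower()} "
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def redundancy_candidates(schema, threshold=0.8):
    """Return table pairs whose embeddings have a similarity of at least `threshold`."""
    embeddings = {table: toy_embedding(cols) for table, cols in schema.items()}
    candidates = []
    for a, b in combinations(sorted(embeddings), 2):
        similarity = cosine(embeddings[a], embeddings[b])
        if similarity >= threshold:
            candidates.append((a, b, round(similarity, 3)))
    return candidates


# Hypothetical schema in which "customer" and "client" represent the same concept.
schema = {
    "customer": ["id", "name", "email"],
    "client": ["id", "name", "email"],
    "orders": ["order_id", "amount", "date"],
}
print(redundancy_candidates(schema))
```

In a real setting, the stand-in would be replaced by embeddings from a table representation model, and the threshold would have to be chosen empirically.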

If you have any questions about the seminar in advance, please reach out to lisa.ehrlinger(at)hpi.de and francesco.pugnaloni(at)hpi.de. Updated information on the course schedule can be found on our website: https://hpi.de/en/database-group/teaching/summer-term-2026/advanced-error-detection

Prerequisites

Programming in Python

Literature References

In this seminar, you will (1) learn about the wide landscape of data-level and schema-level errors, detection techniques, and table representation learning, and (2) develop your own solution for detecting a specific data error or assessing schema quality that outperforms the state of the art. To achieve this, we follow this plan:

Kickoff: We will present an overview of the state of the art of data-level and schema-level error types, error detection [1,2,4], and table representation learning techniques.
Experimental setup: You will familiarize yourself with the provided datasets.

Research: Following a “how to read a research paper” session, you will read related papers about data errors and error detection to develop your own approach to improving the detection of your selected error type.
Implement and benchmark your approach: You will implement your approach in Python. To show the effectiveness, efficiency, and scalability of your approach, you will plan and conduct experiments in which you benchmark your approach against the state of the art (e.g., [3,4]).

Deliverable: The seminar participants will jointly write a paper-style technical report to present their developed approach for the detection of their selected data error type, along with the results of the experimental evaluation. The code for error detection and evaluation shall be provided as well.
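
For the benchmarking step in the plan above, effectiveness is commonly reported as precision, recall, and F1 score over the detected cells. The following is a minimal sketch, assuming that both the detected errors and the ground truth are given as sets of (row index, attribute) pairs. The function name and the example data are hypothetical and not taken from the provided datasets.

```python
def detection_scores(detected, actual):
    """Compute precision, recall, and F1 for cell-level error detection.

    detected: iterable of (row_index, attribute) pairs flagged by the approach
    actual:   iterable of (row_index, attribute) pairs that are truly erroneous
    """
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 2 of 4 flagged cells are real errors, 1 real error is missed.
actual = {(0, "city"), (2, "city"), (5, "zip")}
detected = {(2, "city"), (5, "zip"), (7, "name"), (8, "name")}
precision, recall, f1 = detection_scores(detected, actual)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Efficiency and scalability need separate measurements, for example runtime over growing dataset sizes, which this sketch does not cover.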

[1] Paulo Oliveira, Fátima Rodrigues, Pedro Henriques, and Helena Galhardas. 2005. A taxonomy of data quality problems. In 2nd International Workshop on Data and Information Quality. 219–233.

[2] João Marcelo Borovina Josko. 2018. A Formal Taxonomy of Temporal Data Defects. In International Workshop on Data Quality and Trust (QUAT), Vol. 11235. Springer, 94–110.

[3] Mohammad Mahdavi, Ziawasch Abedjan, Raul Castro Fernandez, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang. 2019. Raha: A Configuration-Free Error Detection System. In Proceedings of the International Conference on Management of Data (SIGMOD). ACM, 865–882.

[4] Wei Ni, Xiaoye Miao, Xiangyu Zhao, Yangyang Wu, Shuwei Liang, and Jianwei Yin. 2024. Automatic Data Repair: Are We Ready to Deploy? Proceedings of the VLDB Endowment 17, 10 (2024), 2617–2630.

[5] Lisa Ehrlinger and Wolfram Wöß. 2018. A Novel Data Quality Metric for Minimality. In QUAT@WISE.

[6] T. Cong, M. Hulsebos, Z. Sun, P. Groth, and H. V. Jagadish. 2024. Observatory: Characterizing Embeddings of Relational Tables. In VLDB.

[7] G. Badaro, M. Saeed, and P. Papotti. 2023. Transformers for Tabular Data Representation: A Survey of Models and Applications. In ACL.