Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Metis - Data Quality Assessment

In Greek mythology, Metis is an Oceanid and the first wife of Zeus. She embodies prudence and wisdom.

People

  • Carolina Cortes
  • Divya Bhadauria
  • Lisa Ehrlinger
  • Lorena Etcheverry
  • Hazar Harmouch
  • Philipp Hildebrandt
  • Sedir Mohammed
  • Felix Naumann
  • Divesh Srivastava

Projects

  • QuAHD (Quality Assessment for Health Data)
    • Funding program: Digital Health Partnership (Hasso Plattner Institute for Digital Health at Mount Sinai) 
    • Involved institutions: Hasso Plattner Institute, Mount Sinai Health System New York
    • Project duration: 01.03.2025-28.02.2028 (3 years) 
    • Abstract: The large-scale analysis of health data for research allows answering questions that cannot be examined with traditional clinical trials. The platform AIR·MS (AI-Ready Mount Sinai) gives researchers access to a vast amount of health data to support their data science and analytics activities using machine learning (ML) and artificial intelligence (AI). While much data has been integrated into the AIR·MS platform, the curation and verification of data quality (DQ) is an ongoing challenge. We plan to conduct research on the systematic assessment of health data in the context of AI-based analysis. We will reach beyond the low-hanging fruit of DQ assessment, which simply counts rule violations. Rather, we will investigate the automatic assessment of two pertinent DQ dimensions for electronic health records (EHRs): completeness and diversity. If both dimensions are insufficiently covered, the data lacks "information content," i.e., it does not tell the full story for downstream tasks and may lead to biased or unfair results when training AI models. The overall goal of QuAHD is to design a model for automatically assessing "information content" in data. The model should be able to deal with changes in the data, since both the measured and the target (ground-truth) levels of completeness or diversity may change over time (e.g., considering annual medical checks).
  • QuanTD (Quantifying Trustworthiness of Data)
    • Funding program: Austrian Research Promotion Agency (FFG)
    • Involved institutions: Software Competence Center Hagenberg GmbH, TU Vienna, WU Vienna, Robert Bosch AG, Österreichische Post AG, Hasso Plattner Institute, Johannes Kepler University Linz
    • Project duration: 01.12.2022-30.11.2025 (3 years)
    • Abstract: Missing, incorrect, and especially duplicate, inconsistent data cause many problems and high costs, which makes data quality (DQ) an important issue. Many different quality metrics exist for several DQ dimensions, but in practice organizations want a single combined DQ score. To measure the trustworthiness of data, we propose to develop such combined DQ scores in innovative ways, including the representation and propagation of uncertainties in metric values based on probability theory, and a machine learning approach for combining them automatically. For explainable data quality analytics, a visualization component can decompose the single DQ score into the aggregated metrics. We will evaluate our new approaches in two different organizations and validate combined DQ scores (with uncertainties assigned) in different use cases, e.g., master data management and governance, and development process data.

 

Publications

  • The Five Facets of Data Quality Assessment
    Sedir Mohammed, Lisa Ehrlinger, Hazar Harmouch, Felix Naumann, Divesh Srivastava
    SIGMOD Record (2025) (to appear)
  • The Effects of Data Quality on Machine Learning Performance on Tabular Data
    Sedir Mohammed, Lukas Budach, Moritz Feuerpfeil, Nina Ihde, Andrea Nathansen, Nele Noack, Hendrik Patzlaff, Felix Naumann, Hazar Harmouch
    Information Systems (2025)
    [Project Page]   [DOI:10.1016/j.is.2025.102549]
  • Step-by-Step Data Cleaning Recommendations to Improve ML Prediction Accuracy
    Sedir Mohammed, Felix Naumann, Hazar Harmouch
    Proceedings of the 28th International Conference on Extending Database Technology (EDBT), 2025
    [Project Page]   [DOI:10.48786/EDBT.2025.43]
  • Icewafl: A Configurable Data Stream Polluter
    Christoph Schinninger, Fabian Panse, Constantin Kühne, Lisa Ehrlinger
    Proceedings of the 28th International Conference on Extending Database Technology (EDBT), 2025
    [Project Page]   [DOI:10.48786/EDBT.2025.64]
  • A Data Quality Dashboard for (Security) Knowledge Graphs
    Davyd Pizhuk, Lisa Ehrlinger, Gandalf Denk, Verena Geist
    Database Systems for Business, Technology and the Web (BTW), 2025.
    German Informatics Society, Bonn. EISSN: 2944-7682. pp. 803-810. Demo track. Bamberg. March 3-7, 2025.
    [DOI:10.18420/BTW2025-45]