Open theses

The database group is always looking for good master's students to advise on their theses. If you are interested in any of our research topics, please contact a faculty member or any of the researchers in our team directly to arrange a meeting. There, we can discuss any of the topics listed below or find new ones, or you can suggest a topic of your own. Please note that the list below is only a small sample of possible thesis topics and ideas.

For more information about writing a master's thesis in our group, please see here.

The Effect of Underrepresented Patient Patterns on ML Performance for Remote Patient Monitoring

Master's Thesis Proposal at the Chair for Information Systems (Naumann) and Digital Health, Economics and Policy (Stern)

Real-world health datasets derived from electronic health records (EHRs) are increasingly used to develop machine learning (ML) models for clinical decision support and patient stratification. Despite their large scale and heterogeneity, such datasets may still contain underrepresented patient subgroups or combinations of characteristics that are insufficiently covered in the data. This can lead to biased ML model behavior, reduced generalizability, and poorer performance for specific patient populations.

Mount Sinai Hospital in New York City provides access to a highly diverse EHR dataset, offering a unique opportunity to investigate representation gaps in real-world clinical data. Building on the concept of Maximal Uncovered Patterns (MUPs), this thesis aims to identify underrepresented patient patterns within a hypertension remote patient monitoring (RPM) cohort and evaluate how these gaps influence downstream machine learning models designed to predict patient benefit from RPM interventions.

First, we will jointly define a clinically relevant prediction target for RPM-related outcomes. Subsequently, a feature set for the intended prediction task will be developed, and baseline ML models will be trained.

The thesis will then apply an existing MUP-detection algorithm to the Mount Sinai EHR dataset to identify underrepresented patient subgroups. Based on these patterns, an experimental framework will be developed to systematically evaluate the impact of representation gaps on ML performance. This includes (1) stress-testing baseline ML models on uncovered patient patterns (MUPs), (2) evaluating the effect of different MUP coverage thresholds, (3) comparing the robustness of different prediction models, such as logistic regression, random forests, and multilayer perceptrons, (4) identifying clusters of related MUPs to analyze whether specific types of representation gaps disproportionately affect ML model performance, and (5) developing and testing initial diversity measures for patient pattern coverage.
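To make the notion of a MUP concrete, the following is a minimal brute-force sketch, not the existing MUP-detection algorithm the thesis will apply. It assumes the usual definition: a pattern fixes some categorical attributes and leaves the others as wildcards, it is uncovered if fewer than `threshold` records match it, and it is maximal if every generalization obtained by replacing one fixed value with a wildcard is covered. The attribute names, the toy cohort, and the `"X"` wildcard symbol are illustrative assumptions.

```python
from itertools import product

WILDCARD = "X"  # assumed placeholder meaning "any value"; must not be a real attribute value


def coverage(pattern, rows):
    """Count the rows that match a pattern (wildcards match everything)."""
    return sum(
        all(p == WILDCARD or p == v for p, v in zip(pattern, row))
        for row in rows
    )


def parents(pattern):
    """Generalizations obtained by replacing one fixed value with a wildcard."""
    for i, p in enumerate(pattern):
        if p != WILDCARD:
            yield pattern[:i] + (WILDCARD,) + pattern[i + 1:]


def find_mups(rows, domains, threshold):
    """Brute force: keep uncovered patterns whose parents are all covered."""
    mups = []
    for pattern in product(*[list(d) + [WILDCARD] for d in domains]):
        if coverage(pattern, rows) >= threshold:
            continue  # the pattern itself is covered
        if all(coverage(par, rows) >= threshold for par in parents(pattern)):
            mups.append(pattern)
    return mups


# Toy cohort with hypothetical attributes (sex, age group, diabetes status).
rows = [
    ("f", "old", "yes"),
    ("f", "old", "no"),
    ("f", "young", "no"),
    ("m", "old", "no"),
    ("m", "young", "no"),
    ("m", "young", "no"),
]
domains = [("f", "m"), ("old", "young"), ("no", "yes")]
mups = find_mups(rows, domains, threshold=1)
print(mups)
```

In this toy cohort, no male patient and no young patient has diabetes, so the search reports the two patterns (m, X, yes) and (X, young, yes). The enumeration is exponential in the number of attributes, which is why a dedicated algorithm is needed on real EHR data.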

Objectives

Identify MUPs in the Mount Sinai RPM EHR cohort
Develop an ML model to predict patient response to RPM
Assess the impact of underrepresented patient patterns on model performance
Develop and test diversity measures

Methodology

Apply existing MUP detection algorithms to structured electronic health record data
Train and evaluate predictive machine learning models for RPM-related outcomes
Compare model performance across well-represented and underrepresented patient subgroups
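The subgroup comparison in the last step could be sketched as follows. This is only an illustration under assumptions: the patterns, the test records, and the predictions are made up, accuracy stands in for whichever metric is eventually chosen, and the helper names are hypothetical.

```python
from collections import defaultdict

WILDCARD = "X"  # assumed wildcard symbol used in patterns


def matches(pattern, row):
    """True if the row agrees with every fixed value of the pattern."""
    return all(p == WILDCARD or p == v for p, v in zip(pattern, row))


def accuracy_by_group(rows, y_true, y_pred, patterns):
    """Accuracy on rows matching any given pattern versus all other rows."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row, truth, pred in zip(rows, y_true, y_pred):
        if any(matches(pat, row) for pat in patterns):
            group = "underrepresented"
        else:
            group = "well_represented"
        totals[group] += 1
        hits[group] += int(truth == pred)
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical test records (sex, age group, diabetes status) and predictions.
test_rows = [
    ("m", "young", "yes"),
    ("f", "young", "yes"),
    ("f", "old", "no"),
    ("m", "old", "no"),
    ("m", "old", "yes"),
]
y_true = [1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0]
patterns = [("m", "X", "yes"), ("X", "young", "yes")]

result = accuracy_by_group(test_rows, y_true, y_pred, patterns)
print(result)
```

A gap between the two accuracies would indicate that the model performs worse on records falling into underrepresented patterns; with real data, the comparison would need enough matching test records to be meaningful.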

Your profile (as a Master's student)

Strong background in computer science, particularly coding in Python
Experience in working with large datasets and databases
Interest in ML and healthcare settings

About the supervisor and the chair

This thesis is jointly supervised by Dr. Sedir Mohammed and Linea Schmidt. Dr. Sedir Mohammed is currently pursuing a PhD at the Information Systems chair at the Hasso Plattner Institute. His research interests include diversity assessment of datasets and its influence on downstream ML tasks.
Linea Schmidt is currently pursuing a PhD at the Digital Health, Economics and Policy chair at the Hasso Plattner Institute. Her research interests include novel care models, such as remote patient monitoring and disease management programs, as well as digital health entrepreneurship.

Assessing Information Completeness through Imputation Certainty

Data quality is a multidimensional concept, with completeness being one of its central dimensions. Column completeness quantifies the proportion of desired data present in a table [1] and is conventionally computed as the fraction of non-NULL values. Counting NULL values, however, does not capture the amount of information that is genuinely missing. Because relational tables frequently contain redundancy, a table may have missing values without losing any information. For example, consider a table with the columns date, year, month, and day: the date column may contain arbitrarily many missing values without information loss, provided that the remaining three columns are complete (or vice versa). Here, the missing values can be imputed with full certainty, whereas in other cases imputation is possible only with reduced certainty.
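The date example can be illustrated with a small sketch. The table contents and helper names are made up for illustration; the point is only that the conventional completeness of the date column is 0.5, although every missing date can be reconstructed from year, month, and day.

```python
from datetime import date

# Hypothetical table: two of four date values are NULL (None).
rows = [
    {"date": date(2024, 5, 17), "year": 2024, "month": 5, "day": 17},
    {"date": None, "year": 2023, "month": 11, "day": 2},
    {"date": None, "year": 2022, "month": 1, "day": 30},
    {"date": date(2021, 8, 9), "year": 2021, "month": 8, "day": 9},
]


def column_completeness(rows, column):
    """Conventional completeness: fraction of non-NULL values in a column."""
    present = sum(row[column] is not None for row in rows)
    return present / len(rows)


def recover_date(row):
    """Rebuild a missing date from year, month, and day if all are present."""
    if row["date"] is not None:
        return row["date"]
    parts = (row["year"], row["month"], row["day"])
    if any(part is None for part in parts):
        return None  # not recoverable from the redundant columns
    return date(*parts)


naive = column_completeness(rows, "date")
recovered = [dict(row, date=recover_date(row)) for row in rows]
after = column_completeness(recovered, "date")
print(naive, after)
```

The NULL count suggests that half of the date column is missing, while no information is actually lost.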

Arenas and Libkin [2] introduce an information-theoretic measure of a cell's information content with respect to constraints such as functional and multivalued dependencies, demonstrating that well-designed relational schemas maximize per-cell information. Building on this perspective, we interpret an imputation model's certainty score as an empirical, model-based proxy for a cell's information content under the dependencies learned from the data. Given an imputation model that outputs a certainty score c alongside the imputed value, 1 - c estimates the remaining uncertainty about that cell given the remainder of the table. A promising foundation for this estimation is conformal prediction [3], a model-agnostic framework that constructs a prediction set containing the true value with probability at least p; the cardinality of this set reflects the model's uncertainty.
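As a rough illustration of how the size of a conformal prediction set reflects uncertainty, the sketch below follows the split conformal procedure for classification described in [3]: the nonconformity score is one minus the predicted probability of the true label, and the threshold is the ceil((n + 1)(1 - alpha))-th smallest calibration score, where 1 - alpha plays the role of p. The calibration data, the label names, and the function names are assumptions; in the thesis, the probabilities would come from an imputation model.

```python
import math


def conformal_threshold(calibration, alpha):
    """Threshold on scores s = 1 - p(true label) from held-out calibration cells."""
    scores = sorted(1.0 - probs[label] for probs, label in calibration)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank (1-based)
    if rank > n:
        return float("inf")  # too few calibration points: sets contain all labels
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """All labels whose score does not exceed the calibrated threshold."""
    return {label for label, p in probs.items() if 1.0 - p <= threshold}


# Hypothetical calibration cells: (predicted probabilities, true value).
calibration = [
    ({"A": 0.875, "B": 0.125}, "A"),
    ({"A": 0.25, "B": 0.75}, "B"),
    ({"A": 0.75, "B": 0.25}, "A"),
    ({"A": 0.625, "B": 0.375}, "A"),
    ({"A": 0.5, "B": 0.5}, "B"),
    ({"A": 0.5, "B": 0.5}, "A"),
    ({"A": 0.75, "B": 0.25}, "B"),
]
threshold = conformal_threshold(calibration, alpha=0.25)

confident_cell = prediction_set({"A": 0.875, "B": 0.125}, threshold)
uncertain_cell = prediction_set({"A": 0.5, "B": 0.5}, threshold)
print(threshold, confident_cell, uncertain_cell)
```

The confident cell receives a singleton set, while the uncertain cell receives both candidate values. How set sizes should be mapped to a certainty score c is left open here and is part of the thesis.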

The goals of this master's thesis are to:

Investigate imputation models and conformal prediction [3] to quantify per-cell uncertainty in tabular data.
Develop a method that aggregates per-cell certainty scores into a table-level measure of missing information, moving beyond the counting of null values.
Extend our data quality tool Metis with this completeness metric.
Conduct a three-fold evaluation of the metric against traditional completeness measures: (1) on polluted data with known ground truth, (2) on dirty real-world data, and (3) in a runtime performance analysis.
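One conceivable aggregation, shown only as a starting point and not as the method to be developed, counts every observed cell as 1 and every missing cell with its certainty score c, then averages over all cells. The table, the certainty values, and the function names below are hypothetical, and nothing here reflects the actual interface of Metis.

```python
def null_count_completeness(table):
    """Traditional completeness: fraction of non-NULL cells."""
    cells = [value for row in table for value in row]
    return sum(value is not None for value in cells) / len(cells)


def certainty_weighted_completeness(table, certainty):
    """Observed cells count 1, missing cells count their certainty score.

    `certainty` maps (row index, column index) of a missing cell to a score
    in [0, 1]; missing cells without a score count as 0.
    """
    total = 0.0
    n_cells = 0
    for i, row in enumerate(table):
        for j, value in enumerate(row):
            n_cells += 1
            total += 1.0 if value is not None else certainty.get((i, j), 0.0)
    return total / n_cells


# Hypothetical table with columns date, year, month, day, city.
table = [
    [None, 2023, 11, 2, "Potsdam"],
    ["2022-01-30", 2022, 1, 30, None],
]
# Assumed scores: the date is fully derivable, the city only with c = 0.5.
certainty = {(0, 0): 1.0, (1, 4): 0.5}

baseline = null_count_completeness(table)
weighted = certainty_weighted_completeness(table, certainty)
print(baseline, weighted)
```

Here the NULL count yields 0.8, whereas the weighted measure yields 0.95, because one of the two missing cells carries no missing information. Other aggregations, such as per-column scores or entropy-based variants, are equally plausible.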

The following challenges should be considered:

The missingness mechanism (MCAR, MAR, MNAR) may substantially influence the results.
Conformal prediction assumes exchangeability, an assumption that may not hold for all data (e.g., time-series data with missing values).
A functional dependency yields either perfect imputation or complete uncertainty: for X→Y, if all values of Y are absent for a given value of X, the dependency provides no information.
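The all-or-nothing behavior of a functional dependency can be sketched as follows. The example assumes that the dependency zip -> city holds exactly in the data; the column names, the values, and the function name are illustrative.

```python
def impute_with_fd(rows, lhs, rhs):
    """Impute missing rhs values via the dependency lhs -> rhs.

    Returns one (value, certainty) pair per row. Assumes the dependency
    holds exactly, so any observed pair determines the rhs value.
    """
    lookup = {}
    for row in rows:
        if row[lhs] is not None and row[rhs] is not None:
            lookup[row[lhs]] = row[rhs]

    result = []
    for row in rows:
        if row[rhs] is not None:
            result.append((row[rhs], 1.0))  # observed value
        elif row[lhs] in lookup:
            result.append((lookup[row[lhs]], 1.0))  # determined by the dependency
        else:
            result.append((None, 0.0))  # no witness for this lhs value
    return result


# Hypothetical rows for the dependency zip -> city.
rows = [
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "14482", "city": None},
    {"zip": "10115", "city": None},
]
imputed = impute_with_fd(rows, "zip", "city")
print(imputed)
```

The second row is imputed with full certainty because another row with the same zip value carries the city, whereas the third row stays completely uncertain. A learned imputation model would typically fall between these two extremes.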

[1] Naumann, F., Freytag, J. C., & Leser, U. (2004). Completeness of integrated information sources. Information Systems, 583–615. https://doi.org/10.1016/J.IS.2003.12.005
[2] Arenas, M., & Libkin, L. (2003). An information-theoretic approach to normal forms for relational and XML data. In Proceedings of PODS'03, 15–26. ACM.
[3] Angelopoulos, A. N., & Bates, S. (2021). A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification. https://arxiv.org/abs/2107.07511

For more information, please contact Philipp Hildebrandt or Lisa Ehrlinger. Supervision will be in cooperation with Antoon Bronselaer from Ghent University.

Open theses

The Effect of Underrepresented Patient Patterns on ML Performance for Remote Patient Monitoring

Assessing and Monitoring the Consistency of Data

Assessing Information Completeness through Imputation Certainty