Open theses
The database group is always looking for good master's students to advise on their master's theses. If you are interested in any of our research topics, please contact a faculty member or any of the researchers in our team directly to arrange a meeting. There, we can discuss any of the topics listed below or find new ones, or you can suggest a topic of your own. Please note that the list below is only a small sample of possible thesis topics and ideas.
For more information about writing a master's thesis in our group, please see here.
Master's Thesis Proposal at the Chair for Information Systems (Naumann) and Digital Health, Economics and Policy (Stern)
Real-world health datasets derived from electronic health records (EHRs) are increasingly used to develop machine learning (ML) models for clinical decision support and patient stratification. Despite their large scale and heterogeneity, such datasets may still contain underrepresented patient subgroups or combinations of characteristics that are insufficiently covered in the data. This can lead to biased ML model behavior, reduced generalizability, and poorer performance for specific patient populations.
Mount Sinai Hospital in New York City provides access to a highly diverse EHR dataset, offering a unique opportunity to investigate representation gaps in real-world clinical data. Building on the concept of Maximal Uncovered Patterns (MUPs), this thesis aims to identify underrepresented patient patterns within a hypertension remote patient monitoring (RPM) cohort and evaluate how these gaps influence downstream machine learning models designed to predict patient benefit from RPM interventions.
First, we will jointly define a clinically relevant prediction target for RPM-related outcomes. Subsequently, a feature set for the intended prediction task will be defined, and baseline ML models will be trained.
The thesis will then apply an existing MUP-detection algorithm to the Mount Sinai EHR dataset to identify underrepresented patient subgroups. Based on these patterns, an experimental framework will be developed to systematically evaluate the impact of representation gaps on ML performance. This includes (1) stress-testing baseline ML models on uncovered patient patterns (MUPs), (2) evaluating the effect of different MUP coverage thresholds, (3) comparing the robustness of different prediction models, such as logistic regression, random forests, and multilayer perceptrons, (4) identifying clusters of related MUPs to analyze whether specific types of representation gaps disproportionately affect ML model performance, and (5) developing and testing initial diversity measures for patient pattern coverage.
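To make the notion of an uncovered pattern concrete, the following is a minimal brute-force sketch and not the existing MUP-detection algorithm the thesis will use. It assumes the common definition in which a pattern fixes values for some categorical attributes and leaves the others as wildcards, a pattern is uncovered if fewer than `threshold` rows match it, and it is maximal if every pattern obtained by generalizing one fixed value is covered. The attribute values and all function names are hypothetical.

```python
from itertools import product

WILDCARD = None  # stands for "any value" at this attribute position


def coverage(pattern, rows):
    """Count the rows that agree with the pattern on every fixed position."""
    return sum(
        all(p is WILDCARD or p == v for p, v in zip(pattern, row))
        for row in rows
    )


def parents(pattern):
    """Yield generalizations that replace exactly one fixed value by a wildcard."""
    for i, p in enumerate(pattern):
        if p is not WILDCARD:
            yield pattern[:i] + (WILDCARD,) + pattern[i + 1:]


def naive_mups(rows, domains, threshold):
    """Enumerate all patterns and keep uncovered ones whose parents are all covered.

    Exponential in the number of attributes, so only suitable for toy data.
    """
    mups = []
    for pattern in product(*[[WILDCARD] + list(d) for d in domains]):
        if coverage(pattern, rows) >= threshold:
            continue  # pattern is covered
        if all(coverage(q, rows) >= threshold for q in parents(pattern)):
            mups.append(pattern)
    return mups


# Hypothetical toy cohort with attributes (sex, age group, diabetic).
rows = [
    ("f", "young", "no"),
    ("f", "old", "no"),
    ("m", "young", "no"),
    ("m", "old", "yes"),
    ("f", "old", "yes"),
]
domains = [["f", "m"], ["young", "old"], ["no", "yes"]]
print(naive_mups(rows, domains, threshold=1))
```

In this toy cohort, no young diabetic patient exists at all, so the more specific patterns for young diabetic women or men are uncovered but not maximal.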
Objectives
- Identify MUPs in the Mount Sinai RPM EHR cohort
- Develop an ML model to predict patient response to RPM
- Assess the impact of underrepresented patient patterns on model performance
- Develop and test diversity measures
Methodology
- Apply existing MUP detection algorithms to structured electronic health record data
- Train and evaluate predictive machine learning models for RPM-related outcomes
- Compare model performance across well-represented and underrepresented patient subgroups
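The subgroup comparison in the last step could be sketched as follows. This is an illustrative assumption about how such an evaluation might be organized, not a prescribed protocol: test rows are split by whether they match any uncovered pattern, and accuracy is reported separately for both parts. The pattern encoding (a tuple with `None` as wildcard) and the function names are hypothetical.

```python
WILDCARD = None  # stands for "any value" at this attribute position


def matches(pattern, row):
    """True if the row agrees with the pattern on every fixed position."""
    return all(p is WILDCARD or p == v for p, v in zip(pattern, row))


def accuracy_by_coverage(rows, y_true, y_pred, uncovered_patterns):
    """Report accuracy separately for rows inside and outside uncovered patterns."""
    hits = {"uncovered": [], "covered": []}
    for row, truth, pred in zip(rows, y_true, y_pred):
        in_gap = any(matches(m, row) for m in uncovered_patterns)
        hits["uncovered" if in_gap else "covered"].append(truth == pred)
    # None signals that a part of the test set is empty.
    return {k: (sum(v) / len(v) if v else None) for k, v in hits.items()}


# Hypothetical test set with attributes (sex, age group) and binary labels.
rows = [("f", "young"), ("m", "old"), ("m", "old"), ("f", "old")]
result = accuracy_by_coverage(
    rows,
    y_true=[1, 0, 1, 1],
    y_pred=[0, 0, 1, 0],
    uncovered_patterns=[(WILDCARD, "young")],
)
print(result)
```

Accuracy is used here only for brevity; a real evaluation would likely use metrics suited to the clinical prediction target.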
Your profile (as a Master's student)
- Strong background in computer science, particularly coding in Python
- Experienced in working with big datasets and databases
- Interest in ML and healthcare settings
About the supervisors and the chairs
This thesis is jointly supervised by Dr. Sedir Mohammed and Linea Schmidt. Dr. Sedir Mohammed is currently pursuing a PhD at the Information Systems chair at the Hasso Plattner Institute. His research interests include diversity assessment of datasets and its influence on downstream ML tasks.
Linea Schmidt is currently pursuing a PhD at the Digital Health, Economics and Policy chair at the Hasso Plattner Institute. Her research interests include novel care models, such as remote patient monitoring and disease management programs, as well as digital health entrepreneurship.
Data quality is a multidimensional concept that is characterized by different dimensions, such as accuracy, completeness, or consistency. Consistency captures the violation of semantic rules defined over data [1]. Especially when data is stored not in a relational database but in CSV or other file formats, semantic rules such as functional dependencies are not always enforced. Here, mining partial functional dependencies can help to detect violations as potential data errors that reduce the overall consistency of a dataset.
The goal of this master's thesis is to:
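As a small illustration of how a partial functional dependency exposes potential errors, the sketch below computes the g3 error of a candidate dependency, a commonly used measure giving the minimum fraction of rows that would have to be removed for the dependency to hold exactly. Whether the thesis uses g3 or another measure is an assumption here, and the function names and the example table are hypothetical.

```python
from collections import Counter, defaultdict


def _group_rhs_counts(rows, lhs, rhs):
    """Count right-hand-side values per combination of left-hand-side values."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[tuple(row[a] for a in lhs)][row[rhs]] += 1
    return groups


def g3_error(rows, lhs, rhs):
    """Fraction of rows to remove so that lhs -> rhs holds exactly."""
    if not rows:
        return 0.0
    groups = _group_rhs_counts(rows, lhs, rhs)
    kept = sum(max(counts.values()) for counts in groups.values())
    return 1 - kept / len(rows)


def violating_rows(rows, lhs, rhs):
    """Indices of rows that disagree with the majority value of their group."""
    groups = _group_rhs_counts(rows, lhs, rhs)
    majority = {key: c.most_common(1)[0][0] for key, c in groups.items()}
    return [
        i for i, row in enumerate(rows)
        if row[rhs] != majority[tuple(row[a] for a in lhs)]
    ]


# Hypothetical table in which zip -> city is violated by one row.
rows = [
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "14482", "city": "Berlin"},
    {"zip": "10115", "city": "Berlin"},
]
print(g3_error(rows, ["zip"], "city"))        # 0.25
print(violating_rows(rows, ["zip"], "city"))  # [2]
```

A value such as `1 - g3_error(...)` could serve as a first, very simple consistency score for a single dependency; how to combine many dependencies into one metric is left open here.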
- Extend the data profiling algorithm HyFD to efficiently mine partial FDs over time, based on [2] (see temporal inclusion dependencies [3] for the time aspect)
- Create a DQ metric for the consistency dimension (based on the results of the partial FD mining algorithm) and implement this within our existing data quality tool Metis
- Evaluate the efficiency of the mining approach and the accuracy of your consistency assessment results
[1] Batini, C., & Scannapieco, M. (2016). Data and information quality. Cham, Switzerland: Springer International Publishing, 63.
[2] Seeger, M., Papenbrock, T. (2025). Profiling of Partial Data Dependencies for Data Cleaning. (in preparation)
[3] Bornemann, L., Bleifuß, T., Kalashnikov, D. V., Nargesian, F., Naumann, F., & Srivastava, D. (2024, March). Efficient Discovery of Temporal Inclusion Dependencies in Wikipedia Tables. In Advances in database technology. OpenProceedings.org.
For more information please contact Lisa Ehrlinger. Supervision could be in cooperation with Marcian Seeger from Marburg University, Germany.
Data quality is a multidimensional concept, with completeness being one of its central dimensions. Column completeness quantifies the proportion of desired data present in a table [1] and is conventionally computed as the fraction of non-NULL values. Counting NULL values, however, does not capture the amount of information that is genuinely missing. Because relational tables frequently contain redundancy, a table may have missing values without losing any information. For example, consider a table with the columns date, year, month, and day: the date column may contain arbitrarily many missing values without information loss, provided that the remaining three columns are complete (or vice versa). Here, the missing values can be imputed with full certainty, whereas in other cases imputation is possible only with reduced certainty.
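The date example can be made concrete with a short sketch. It computes the conventional completeness of the date column and then shows that every missing date is recoverable from year, month, and day. The table contents and function names are hypothetical.

```python
from datetime import date


def column_completeness(rows, column):
    """Conventional completeness: fraction of non-NULL values in a column."""
    return sum(row[column] is not None for row in rows) / len(rows)


def recover_date(row):
    """Rebuild a missing date from year, month, and day when all three are present."""
    parts = (row["year"], row["month"], row["day"])
    if row["date"] is None and None not in parts:
        return {**row, "date": date(*parts)}
    return row


# Hypothetical table: half of the date values are NULL, nothing else is missing.
rows = [
    {"date": date(2024, 3, 1), "year": 2024, "month": 3, "day": 1},
    {"date": None, "year": 2024, "month": 3, "day": 2},
    {"date": None, "year": 2025, "month": 1, "day": 15},
    {"date": date(2025, 6, 30), "year": 2025, "month": 6, "day": 30},
]

print(column_completeness(rows, "date"))       # 0.5
recovered = [recover_date(r) for r in rows]
print(column_completeness(recovered, "date"))  # 1.0
```

The conventional measure reports 50% completeness for the date column although no information is missing from the table.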
Arenas and Libkin [2] introduce an information-theoretic measure of a cell's information content with respect to constraints such as functional and multivalued dependencies, demonstrating that well-designed relational schemas maximize per-cell information. Building on this perspective, we interpret an imputation model's certainty score as an empirical, model-based proxy for a cell's information content under the dependencies learned from the data. Given an imputation model that outputs a certainty score c alongside the imputed value, 1-c can estimate the remaining uncertainty about that cell given the remainder of the table. A promising foundation for this estimation is conformal prediction [3], a model-agnostic framework that constructs a prediction set containing the true value with probability at least p; the cardinality of this set reflects the model's uncertainty.
The goal of this master's thesis is to:
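A minimal sketch of split conformal prediction for a categorical cell, following the general recipe described in [3], may help to illustrate the idea. It assumes that some imputation model already provides class probabilities for a cell; the score `1 - probability of the true value`, the calibration numbers, and the function names are illustrative assumptions, and `p` denotes the target coverage probability from the text.

```python
import math


def conformal_quantile(calibration_scores, p):
    """Score threshold such that prediction sets cover the truth with probability >= p.

    Uses the finite-sample rank ceil((n + 1) * p) on n held-out calibration scores.
    """
    n = len(calibration_scores)
    k = math.ceil((n + 1) * p)
    if k > n:
        return float("inf")  # too few calibration cells: every value must be included
    return sorted(calibration_scores)[k - 1]


def prediction_set(probabilities, threshold):
    """All candidate values whose nonconformity score 1 - prob is within the threshold."""
    return {value for value, prob in probabilities.items() if 1 - prob <= threshold}


# Hypothetical calibration scores: 1 - (model probability of the known true value),
# computed on cells that were masked on purpose.
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.90]
threshold = conformal_quantile(scores, p=0.75)

confident = {"Potsdam": 0.70, "Berlin": 0.25, "Bonn": 0.05}
ambiguous = {"Potsdam": 0.45, "Berlin": 0.45, "Bonn": 0.10}
print(prediction_set(confident, threshold))
print(prediction_set(ambiguous, threshold))
```

With these numbers, the confident cell receives a set with one value and the ambiguous cell a set with two values, so the set size tracks the remaining uncertainty about the cell.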
- Investigate imputation models and conformal prediction [3] to quantify per-cell uncertainty in tabular data.
- Develop a method that aggregates per-cell certainty scores into a table-level measure of missing information, moving beyond the counting of null values.
- Extend our data quality tool Metis with this completeness metric.
- Conduct a three-fold evaluation of the metric, benchmarking it against traditional completeness measures in each: (1) on polluted data with known ground truth, (2) on dirty real-world data, and (3) a runtime performance analysis.
The following challenges should be considered:
- The missingness mechanism (MCAR, MAR, MNAR) may substantially influence the results.
- Conformal prediction assumes exchangeability, an assumption that may not hold for all data (e.g., time-series data with missing values).
- A functional dependency yields either perfect imputation or complete uncertainty: for X→Y, if all values of Y are absent for a given value of X, the dependency provides no information.
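The last challenge can be illustrated with a short sketch: under an exact functional dependency X→Y, a missing Y value is either recoverable with certainty 1 (another row with the same X value has Y) or not recoverable at all (certainty 0). The sketch assumes the dependency holds on all observed values and that X is never NULL. The mean of the per-cell certainties is used as one deliberately simple table-level aggregate; this choice is an assumption for illustration, not the method the thesis is expected to develop. All names and data are hypothetical.

```python
def fd_impute(rows, lhs, rhs):
    """Return a (value, certainty) pair per row for column rhs under lhs -> rhs."""
    known = {}
    for row in rows:
        if row[rhs] is not None:
            known[row[lhs]] = row[rhs]  # assumes the dependency holds exactly

    result = []
    for row in rows:
        if row[rhs] is not None:
            result.append((row[rhs], 1.0))         # value is present
        elif row[lhs] in known:
            result.append((known[row[lhs]], 1.0))  # recoverable with full certainty
        else:
            result.append((None, 0.0))             # dependency gives no information
    return result


def mean_certainty(imputed):
    """Simplest possible aggregate: average certainty over all cells of the column."""
    return sum(certainty for _, certainty in imputed) / len(imputed)


# Hypothetical table with the dependency zip -> city.
rows = [
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "14482", "city": None},
    {"zip": "10115", "city": None},
    {"zip": "10115", "city": None},
]
imputed = fd_impute(rows, "zip", "city")
print(imputed)
print(mean_certainty(imputed))  # 0.5, whereas the non-NULL fraction is 0.25
```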
[1] Naumann, F., Freytag, J. C., & Leser, U. (2004). Completeness of integrated information sources. Information Systems, 583–615. https://doi.org/10.1016/J.IS.2003.12.005
[2] Arenas, M., & Libkin, L. (2003). An information-theoretic approach to normal forms for relational and XML data. In Proceedings of PODS'03, 15–26. ACM.
[3] Angelopoulos, A. N., & Bates, S. (2021). A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification. https://arxiv.org/abs/2107.07511
For more information please contact Philipp Hildebrandt or Lisa Ehrlinger. Supervision will be in cooperation with Antoon Bronselaer from Ghent University.