Customizing the Reference for Genome Data Analysis
Severe illnesses such as cancer have their root causes in mutations at particular positions in the genetic code. To identify these for a specific patient, a DNA sample consisting of millions of unordered short DNA sequences must be analyzed in two steps: first, the sequences are reordered to form a whole human genome (alignment); then, the genetic differences to the reference are identified (variant calling). Both steps require a reference genome. Unfortunately, population-specific reference genomes do not exist; there is only a single reference, which is a mix of multiple individuals and does not cover all population groups. However, there may be population-specific differences in the genetic material that are relevant for the analysis. Your task is to equip the reference genome with the genetic variants of one specific population. You will then run the alignment with the adapted reference and evaluate, together with your supervisor, the quantitative and qualitative differences between the results.
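As a rough illustration of what "equipping the reference with variants" could mean in the simplest case, the following sketch substitutes single-nucleotide variants into a reference sequence. The function name `apply_snvs`, the 1-based position convention, and the toy sequence are assumptions made for this example only; a real project would work with full reference files and would also have to handle insertions and deletions.

```python
from typing import Dict, Tuple


def apply_snvs(reference: str, snvs: Dict[int, Tuple[str, str]]) -> str:
    """Return a copy of `reference` with single-nucleotide variants applied.

    `snvs` maps a 1-based position to a (reference_base, alternate_base) pair.
    This is a hypothetical helper for illustration, not part of any tool.
    """
    bases = list(reference)
    for pos, (ref_base, alt_base) in snvs.items():
        # Guard against variants that do not match the given reference.
        if bases[pos - 1] != ref_base:
            raise ValueError(
                f"Position {pos}: expected {ref_base}, found {bases[pos - 1]}"
            )
        bases[pos - 1] = alt_base
    return "".join(bases)


# Toy example: two population-specific variants on a short sequence.
adapted = apply_snvs("ACGTACGT", {2: ("C", "T"), 8: ("T", "A")})
```

Checking the expected reference base before substituting catches variants that were reported against a different reference version.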
Evaluating Variant Calling Results with In-Memory Technology
Severe illnesses such as cancer have their root causes in mutations at particular positions in the genetic code. To identify these for a specific patient, a DNA sample is analyzed and compared to a reference in variant calling. Several tools exist for this step, each applying a different underlying statistical model, which can lead to differences in the result sets. Your task is to develop a framework that compares two or more variant calling result sets with each other using an in-memory database. These comparisons can be simple matchings and intersections between the result sets, but should also assess the quality of those genetic variants that appear in only one of the result sets, e.g. by including knowledge from external data sources.
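The simple matchings and intersections mentioned above could look roughly like the following sketch. It uses Python's built-in `sqlite3` module with an in-memory database purely as a stand-in for the in-memory database used in the project; the table names, the four-column variant layout, and the helper functions are assumptions made for illustration.

```python
import sqlite3

COLUMNS = "chrom, pos, ref, alt"


def load_calls(conn, table, calls):
    """Create a table for one caller's results and fill it (hypothetical schema)."""
    conn.execute(
        f"CREATE TABLE {table} (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)"
    )
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?, ?)", calls)


def compare(conn, left, right):
    """Return (shared, only_left, only_right) as sets of variant tuples."""
    def run(operator, first, second):
        query = (
            f"SELECT {COLUMNS} FROM {first} {operator} "
            f"SELECT {COLUMNS} FROM {second}"
        )
        return set(conn.execute(query).fetchall())

    shared = run("INTERSECT", left, right)
    only_left = run("EXCEPT", left, right)
    only_right = run("EXCEPT", right, left)
    return shared, only_left, only_right


conn = sqlite3.connect(":memory:")
# Toy result sets of two hypothetical variant callers.
load_calls(conn, "caller_a", [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")])
load_calls(conn, "caller_b", [("chr1", 100, "A", "G"), ("chr2", 50, "G", "A")])

shared, only_a, only_b = compare(conn, "caller_a", "caller_b")
```

The variants in `only_a` and `only_b` are the candidates for the further quality assessment described above. Table names are interpolated directly into the SQL here, which is acceptable only because they are fixed in the example.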
Variant Calling within an In-Memory Database
Severe illnesses such as cancer have their root causes in mutations at particular positions in the genetic code. To identify these for a specific patient, a DNA sample is analyzed and compared to a reference in variant calling. However, simply comparing the data to a single reference leads to inaccurate results, because the data is error-prone and there are many genetic differences between populations that should be considered in the calculations. Your task is to adapt variant calling so that it takes additional aspects into account, such as known variants. You will discuss and identify relevant data sources and aspects with your supervisor, make them available within an in-memory database, and run variant detection directly inside the database. For this project, you will build on an existing implementation for variant detection and refine its underlying statistical model.
Linking Medical Knowledge to Improve Precision Medicine
Huge amounts of medical insights are created nowadays, e.g. publications or medical guidelines. The challenge for doctors is to find the right piece of the medical puzzle at the right time. Together with our cooperation partner, you will gather new requirements to extend the Medical Knowledge Cockpit for a concrete use case taken from oncology. As a result, you will contribute to improving the treatment of cancer patients by giving oncologists interactive access to relevant information.
Interactive Data Explorer for the TCGA
The Cancer Genome Atlas (TCGA) provides pseudonymized data of cancer patients to researchers. However, exploring the existing data requires manual download and import into the relevant tools. You will explore existing real patient data together with our cooperation partner and define requirements for an interactive analysis tool. You can build on existing functionality created by HPI students to create an always up-to-date interactive exploration tool for TCGA data using in-memory technology. As a result, you will provide oncology researchers with a powerful tool for data analysis.
Integration and Harmonization of Medical Data
In the course of a clinical study, various data sources are generated. In cooperation with our partner German Heart Institute Berlin, you will explore these data sources and propose a harmonized database model that enables the combination and interactive analysis of the acquired data. You will provide exemplary analyses using existing tools or tools specifically designed for data exploration.
Analysis of Longitudinal Data
Healthcare insurance data forms a longitudinal database of patient-specific events, e.g. treatments or medications. Identifying patterns and similar patient cases can help to improve treatment guidelines. Together with our cooperation partner, you will analyze selected insurance data and apply pattern recognition algorithms, e.g. from machine learning.