In-Memory Databases: Applications in Healthcare

General Information

Overall responsibility: Prof. Dr. h.c. Hasso Plattner
Teaching staff: Dr. Matthieu-P. Schapranow, Cindy Perscheid, Dr. Mariana Neves, Dr. Matthias Uflacker
Location: D.E-9/10, Campus II
Time: Tue and Thu 9.15 - 10.45 a.m.
First course: Apr 21, 2015
4 semester hours per week (SWS), 6 ECTS (graded)
Enrollment: Apr 24, 2015
Send your three favorite topics to cindy.faehnrich(at)hpi.de by Wednesday, Apr 22, 11.59 PM
Receive your assigned topic from us by Thursday, Apr 23, 12 PM (noon)

Vision

In this seminar, you will improve your ability to familiarize yourself with a specific research topic on your own. We will give a brief introduction to the topics and then coach you throughout the semester while you work on your assigned topic. Several presentations ensure that your presentation skills improve, too. You can state your preferences from a list of topics, which will be presented during the first seminar meeting.

Grading

The final grade is composed of the following individual parts, each of which must be passed (the percentages are preliminary and may still change):

Research article (40%),
Seminar results and their presentations, e.g. mid-term and final (40%), and
Individual commitment (20%).

Materials

Kick-off session (topics etc.): Slides
Introduction to applying in-memory technology in life sciences: Slides

Selected Seminar Topics

Customizing the Reference for Genome Data Analysis

Severe illnesses such as cancer have their root causes in mutations at particular positions in the genetic code. To identify these for a specific patient, a DNA sample, consisting of millions of unordered short DNA sequences, must be analyzed in two steps: first, the DNA sequences are reordered to form a whole human genome, i.e. alignment, and then the genetic differences to a reference are identified, i.e. variant calling. Both steps require a reference genome for processing. Unfortunately, there are no population-specific reference genomes, only a single reference that is a mix of multiple individuals and does not cover all population groups. However, there might be population-specific differences in the genetic material that are relevant for analysis. Your task is to equip the reference genome with population-specific genetic variants for one specific population. You will then execute the alignment with the adapted reference and evaluate, together with your supervisor, the quantitative and qualitative differences between the results.
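Equipping a reference with known variants can be pictured with a small Python sketch. It is only an illustration under simplifying assumptions: the function name apply_snvs is hypothetical, only single-nucleotide substitutions are handled (insertions and deletions would shift coordinates and need extra bookkeeping), and the reference is a short toy string rather than a real genome.

```python
from typing import Dict, Tuple


def apply_snvs(reference: str, snvs: Dict[int, Tuple[str, str]]) -> str:
    """Return a copy of `reference` with single-nucleotide variants applied.

    `snvs` maps a 1-based position to (expected_reference_base, alternate_base).
    This is a toy model: real pipelines work on FASTA and VCF files.
    """
    bases = list(reference)
    for pos, (ref_base, alt_base) in snvs.items():
        # Guard against applying a variant to the wrong coordinate.
        if bases[pos - 1] != ref_base:
            raise ValueError(
                f"position {pos}: expected {ref_base}, found {bases[pos - 1]}"
            )
        bases[pos - 1] = alt_base
    return "".join(bases)


# Hypothetical toy data: two population-specific substitutions.
adapted = apply_snvs("ACGTACGT", {2: ("C", "T"), 5: ("A", "G")})
print(adapted)  # prints ATGTGCGT
```

The check against the expected reference base is a deliberate design choice: it catches variants whose coordinates refer to a different reference version before they silently corrupt the adapted sequence.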

Evaluating Variant Calling Results with In-Memory Technology

Severe illnesses such as cancer have their root causes in mutations at particular positions in the genetic code. To identify these for a specific patient, a DNA sample is analyzed and compared to a reference in a step called variant calling. Several tools exist for this step, each applying a different underlying statistical model, which can lead to differences in the result sets. Your task will be to develop a framework that uses an in-memory database to compare two or more variant calling result sets with each other. These evaluations can be simple matchings and intersections between the compared result sets, but should also assess the quality of those genetic variants that appear in only one of the result sets, e.g. by including knowledge from external data sources.
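The simple matchings and intersections mentioned above could look like the following Python sketch. It is an illustration rather than the seminar's framework: the Variant type, the function names, and the toy call sets are assumptions, and two calls count as identical only if chromosome, position, reference allele, and alternate allele all match.

```python
from typing import Dict, Iterable, NamedTuple, Set


class Variant(NamedTuple):
    chrom: str
    pos: int   # 1-based position
    ref: str   # reference allele
    alt: str   # alternate allele


def compare_call_sets(
    a: Iterable[Variant], b: Iterable[Variant]
) -> Dict[str, Set[Variant]]:
    """Split two call sets into shared calls and calls unique to each set."""
    set_a, set_b = set(a), set(b)
    return {
        "shared": set_a & set_b,
        "only_a": set_a - set_b,
        "only_b": set_b - set_a,
    }


def jaccard(a: Iterable[Variant], b: Iterable[Variant]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


# Hypothetical results of two variant callers on the same sample.
caller_a = [
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 200, "C", "T"),
    Variant("chr2", 50, "G", "A"),
]
caller_b = [
    Variant("chr1", 100, "A", "G"),
    Variant("chr2", 50, "G", "C"),  # same position, different alternate allele
    Variant("chr3", 10, "T", "C"),
]

result = compare_call_sets(caller_a, caller_b)
print(len(result["shared"]), len(result["only_a"]), len(result["only_b"]))
print(jaccard(caller_a, caller_b))
```

Inside a relational in-memory database, the same comparison would typically be expressed with SQL set operations such as INTERSECT and EXCEPT on the variant tables. The calls in only_a and only_b are the ones whose quality would then be assessed further.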

Variant Calling within an In-Memory Database

Severe illnesses such as cancer have their root causes in mutations at particular positions in the genetic code. To identify these for a specific patient, a DNA sample is analyzed and compared to a reference in a step called variant calling. However, simply comparing the data to a single reference leads to inaccurate results, because the data is error-prone and there are many genetic differences between populations that should be considered in the calculations. Your task is to adapt variant calling to include additional aspects such as known variants. You will discuss and identify relevant data sources and aspects with your supervisor, make them available within an in-memory database, and run variant detection directly inside the database. For this project, you will build on an existing implementation for variant detection and refine its underlying statistical model.
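One way to picture how known variants could enter a statistical model is a simple Bayesian update, sketched below in Python. This is not the model of the existing implementation mentioned above. It is a toy under stated assumptions: a heterozygous variant is expected to show the alternate base in about half of the reads, sequencing errors produce it at a fixed rate, and a site listed as a known variant simply receives a higher prior probability. All names and numbers are hypothetical.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def variant_posterior(
    alt_reads: int, total_reads: int, prior: float, error_rate: float = 0.01
) -> float:
    """Posterior probability that a site carries a heterozygous variant.

    Toy model: alternate reads follow Binomial(n, 0.5) if a variant is
    present and Binomial(n, error_rate) if they are sequencing errors.
    """
    like_variant = binom_pmf(alt_reads, total_reads, 0.5)
    like_error = binom_pmf(alt_reads, total_reads, error_rate)
    evidence = prior * like_variant + (1 - prior) * like_error
    return prior * like_variant / evidence


# Same weak read evidence (3 of 20 reads), different prior knowledge.
novel_site = variant_posterior(3, 20, prior=0.001)  # assumed prior, unknown site
known_site = variant_posterior(3, 20, prior=0.1)    # assumed prior, known variant
print(novel_site, known_site)
```

With identical read evidence, the site with the higher prior ends up with the higher posterior, which is the effect that including known variants is meant to achieve.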

Linking Medical Knowledge to Improve Precision Medicine

Huge amounts of medical insights are created nowadays, e.g. in publications or medical guidelines. The challenge for doctors is to find the right piece of the medical puzzle at the right time. Together with our cooperation partner, you will gather new requirements to extend the Medical Knowledge Cockpit for a concrete use case taken from oncology. As a result, you contribute to improving the treatment of cancer patients by giving oncologists interactive access to relevant information.

Interactive Data Explorer for the TCGA

The Cancer Genome Atlas (TCGA) provides data of cancer patients to researchers in a pseudonymized way. However, exploring the existing data requires manual download and import into the relevant tools. You will explore existing real patient data together with our cooperation partner and define requirements for an interactive analysis tool. You can build on existing functionality created by HPI students to create an always up-to-date interactive exploration tool for TCGA data using in-memory technology. As a result, you provide oncology researchers with a powerful tool for data analysis.

Integration and Harmonization of Medical Data

In the course of a clinical study, data from various sources is generated. In cooperation with our partner, the German Heart Institute Berlin, you will explore these data sources and propose a harmonized database model that enables the combination and interactive analysis of the acquired data. You will provide exemplary analyses using existing tools or tools specifically designed for data exploration.

Analysis of Longitudinal Data

Healthcare insurance data forms a longitudinal database of patient-specific events, e.g. treatments or medications. Identifying patterns and similar patient cases can help to improve treatment guidelines. Together with our cooperation partner, you will analyze selected insurance data and apply pattern recognition algorithms, e.g. from machine learning.

Gene-based text summarization

Biologists usually deal with long lists of genes derived from microarray experiments, and they frequently need to search the vast scientific literature to learn more about these entities. Automatic gene summarization systems have the potential to help biologists better translate their findings into clinical benefits by providing short and useful descriptions of each of these genes. For instance, summaries can include information about the functions of the corresponding proteins, known interactions with other genes, and associations with diseases.

Sentiment analysis for controversial topics

Different studies frequently come to conflicting conclusions on the effect of food, medications, and treatments on human health. For instance, people face uncertainty about whether they may eat eggs daily, how many cups of coffee and glasses of wine are advisable, and whether homeopathy can cure any disease at all. Text mining and sentiment analysis have the potential to automatically analyze the scientific literature to understand how the assessment of these controversial topics has changed over recent years and the reasons for the changes.

Text mining on the gut microbiota and the human health
