Hasso-Plattner-Institut
Prof. Dr. h.c. mult. Hasso Plattner
 

Bachelor Project: Real-time Analysis of Genome Data

This Bachelor project focuses on the real-time analysis of genome data and related information, such as known mutations, associated diseases and medications, and findings from scientific papers. The project aims to apply in-memory technology to scientific data management. If you are interested in this project, you can download the detailed project description.

Motivation

The vision of the Human Genome Project was born in the early 1980s. One decade later, in 1990, it was officially started in the U.S. Another decade later, in 2000, a first draft of the human genome was announced. Over the same period, computer hardware costs dropped and the capacities of main memory and storage systems grew exponentially.

Today, DNA sequencing and genome analysis have become reality. For example, malignant tissue from tumor patients is analyzed to derive concrete treatment decisions in the course of personalized medicine. Suspects are identified by DNA profiling of traces found at crime scenes. Optimized crops are selected based on the results of their genetic analysis to improve harvests in agriculture worldwide. These examples have one thing in common: genome data is huge, and its analysis takes days to weeks. For example, the human genome consists of ~3.2 billion base pairs (about 3.2 GB at one byte per base) distributed across 23 chromosomes, forming 20,000 to 30,000 genes that code for 50,000 to 300,000 proteins.

Genome data is a specific subset of scientific data. Data management for scientific data comes with various challenges: storage requirements are huge, traditional scanning algorithms rely on reading sequences of characters from files, the processing of operational data in databases is only rarely considered, and processing needs to be parallelized.
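The storage figure quoted above can be checked with a short calculation. The following Python sketch assumes one 8-bit character per base, which yields the 3.2 GB mentioned in the text. For comparison it also shows a 2-bit packed encoding, which is an illustrative assumption and not part of the project description. The function names are hypothetical.

```python
# Approximate number of base pairs in the human genome, as stated in the text.
BASE_PAIRS = 3_200_000_000


def storage_bytes(base_pairs: int, bits_per_base: int) -> int:
    """Return the bytes needed to store the bases, rounded up to whole bytes."""
    total_bits = base_pairs * bits_per_base
    return (total_bits + 7) // 8


def to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10^9 bytes)."""
    return num_bytes / 1e9


if __name__ == "__main__":
    encodings = [
        ("one character per base (8 bits)", 8),
        # Assumption for illustration: only A, C, G, T occur, so 2 bits suffice.
        ("packed encoding (2 bits)", 2),
    ]
    for label, bits in encodings:
        size = to_gb(storage_bytes(BASE_PAIRS, bits))
        print(f"{label}: {size:.1f} GB")
```

Even in the packed form, a single genome takes up a sizeable amount of main memory, and real sequencing output is typically far larger than the assembled genome because each position is read many times.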

Goal

Building on our long-standing experience in applying in-memory technology to selected enterprise challenges, we also focus on processing and analyzing scientific data sets in real time. In particular, the applicability of in-memory technology to the analysis of genome data will be evaluated. Proof-of-concept prototypes will be engineered and shown to real-world users in the course of this project.