Hasso-Plattner-Institut
Prof. Dr. h.c. mult. Hasso Plattner
 

Trends in Bioinformatics

General Information

Aim of the Seminar

This seminar introduces you to the broad area of bioinformatics: data types, standard analysis procedures, and applied tools. We want you to conduct research on a particular topic to identify and discuss the advantages and drawbacks of the standard procedures, propose a solution (depending on the topic), and evaluate your approach by comparing it with state-of-the-art bioinformatics tools.

We expect you to present your findings in seminar presentations and in a research paper. We will guide you throughout the whole semester and support you in improving your research, writing, and presentation skills.

Grading

The final grade is determined by the following individual parts, each of which must be passed:

  • Intermediate presentation, final presentation and abstract (40%)
  • Research article (40%)
  • Individual commitment (20%)

Schedule and Slides

  • Kick-Off: Oct 17, 9.15 AM, D-E.9/10, Campus II (Slides)
  • Topic Selection: Oct 25, 11.59 PM
  • Topic Assignment Notification: Oct 26, 1 PM
  • Intermediate Presentations: Dec 12, 9.15 AM, D-E.9/10, Campus II (A2, A6, C1, C2)
  • Abstracts Deadline: Jan 30, 11.59 PM (A2, A6, C1, C2)
  • Final Presentations: Feb 6, 9.15 AM, D-E.9/10, Campus II (Intro, C1, C2) and Feb 9, 1.30 PM, V-1.15, Campus II (A2, A6)
  • Excursions: Feb 23, 9.00 AM - 1.00 PM, Gläsernes Labor Berlin Buch
  • Introduction to Scientific Writing: Jan 23, 9.15 AM, D-E.9/10, Campus II (Slides)
  • Report deadline: Mar 31, 11.59 PM (IEEE LaTeX Template)

Topics - Overview

A. Data Mining on Gene Expression Data

  1. An Interestingness Measure for Gene Expression Associations
  2. Bi-Clustering with Biological Context Information
  3. Causal Inference of Gene Expression Data
  4. Verification of Gene Expression Patterns in Public Knowledge Bases
  5. Optimize Calling of Genetic Variants from RNAseq Data
  6. Clinical Interpretation of Omics Clustering Results
  7. Statistical Basis of Differential Gene Expression (DGE) Analysis

B. Text Mining for Biomedicine

  1. Extracting Scientific Entities and Relations from Publications to Support Searching for Alternative Methods to Animal Experiments

C. Prediction of Patient-Level Outcomes

  1. Prediction of Patient Outcomes after Renal Replacement Therapy (RRT) in the ICU
  2. Prediction of Acute Kidney Injury following Heart Surgery

A. Data Mining on Gene Expression Data

Gene expression is the cellular process by which information from specific sections of the DNA, i.e. genes, is used to synthesize functional products, i.e. proteins, which catalyze the metabolic processes in our cells. RNA-Seq data comprises the expression levels of all genes that are expressed in a particular cell. Analyzing this kind of data can help researchers better understand regulatory processes in a cell and characterize genes and their functions. For example, if different genes are expressed similarly in a cell, the hypothesis is that they are involved in the same regulatory process. However, those patterns first need to be found in the data.

A1. An Interestingness Measure for Gene Expression Associations (Supervisor: Cindy Perscheid)

Association Rule Mining, or Itemset Mining, is applied to gene expression data to identify correlations between the expression levels of different genes. A derived rule has the form GeneA (up) -> GeneB (up), meaning that if GeneA is upregulated, then GeneB is typically upregulated as well. This information helps researchers derive unknown gene functions and better understand regulatory processes in cells for different disease types. The large number of rules resulting from those analyses is typically filtered with standard interestingness measures, e.g. support and confidence. These measures are driven by statistical analyses of the data sets. However, the interestingness of a rule for gene expression data should also take into account its biological relevance, which can only be derived from external sources. Multiple publicly available knowledge bases provide curated information on gene-gene and gene-disease associations.
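To make the two standard measures concrete, the following minimal sketch computes support and confidence for a single rule. The function name `rule_metrics`, the item labels such as `GeneA_up`, and the toy samples are hypothetical and chosen only for illustration; it assumes expression levels have already been discretized into up/down items per sample.

```python
def rule_metrics(samples, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    samples: list of sets, each holding the discretized items of one sample.
    support    = fraction of samples containing both items
    confidence = fraction of antecedent-containing samples that also
                 contain the consequent
    """
    n = len(samples)
    both = sum(1 for s in samples if antecedent in s and consequent in s)
    ante = sum(1 for s in samples if antecedent in s)
    support = both / n if n else 0.0
    confidence = both / ante if ante else 0.0
    return support, confidence


# Toy data, made up for illustration: one set of items per sample.
samples = [
    {"GeneA_up", "GeneB_up"},
    {"GeneA_up", "GeneB_up", "GeneC_down"},
    {"GeneA_up", "GeneC_down"},
    {"GeneB_up"},
]

support, confidence = rule_metrics(samples, "GeneA_up", "GeneB_up")
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

Both values depend only on co-occurrence counts in the data set, which is why they cannot capture biological relevance on their own.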

Your task will be to define an interestingness measure for gene expression data taking into consideration the biological relevance of a rule, e.g. refining an existing measure and including knowledge from an external data source. You will implement and test your approach with publicly available real-world data and compare the results to state-of-the-art approaches. 

A2. Bi-Clustering with Biological Context Information (Supervisor: Cindy Perscheid)

Besides Differential Expression Analysis, clustering is currently the method of choice for analyzing gene expression data. In a clustering, expression profiles of genes are grouped together if they are similar, which can provide insights into (formerly unknown) gene functions under the assumption that genes with similar expression participate in the same molecular process. Genes are typically involved in multiple processes of a cell, which is not well reflected by a strict separation of genes into distinct clusters. Bi-clustering, or subspace clustering, addresses this problem by trying to find (overlapping) subspaces in the data. However, the identification of subspaces relies entirely on analyzing the data set, although it would be reasonable to include biological context in the analysis.
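As an illustration of what a coherent subspace can look like, the sketch below scores a candidate bi-cluster (a subset of genes and a subset of samples) with the mean squared residue, a coherence score used by Cheng-and-Church-style bi-clustering. This measure is not prescribed by the topic; it is one possible choice, and the matrix values are made up. A score near zero means the selected rows differ only by additive shifts across the selected columns.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix given by rows x cols.

    residue(i, j) = a_ij - row_mean_i - col_mean_j + overall_mean
    A perfectly additive (coherent) bi-cluster has a score of 0.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_r, n_c = len(rows), len(cols)
    row_means = [sum(r) / n_c for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_r)) / n_r for j in range(n_c)]
    overall = sum(row_means) / n_r
    total = 0.0
    for i in range(n_r):
        for j in range(n_c):
            total += (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
    return total / (n_r * n_c)


# Toy expression matrix (genes x samples), values made up for illustration.
expression = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 3.0, 4.0, 1.0],
    [4.0, 5.0, 6.0, 5.0],
    [7.0, 1.0, 8.0, 2.0],
]

# Genes 0-2 follow the same pattern over samples 0-2, shifted by a constant.
coherent = mean_squared_residue(expression, rows=[0, 1, 2], cols=[0, 1, 2])
# The full matrix is far less coherent.
noisy = mean_squared_residue(expression, rows=[0, 1, 2, 3], cols=[0, 1, 2, 3])
print(coherent, noisy)
```

A score like this is computed from the data alone; biological context such as pathway membership would have to enter as an additional term or constraint.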

Your task will be to apply a bi-clustering algorithm to gene expression data and extend it to incorporate knowledge from external resources, e.g. pathway information. This way, the algorithm should be able to assess better, and potentially faster, which genes fall into which clusters.

A3. Causal Inference of Gene Expression Data (Supervisor: Cindy Perscheid, Johannes Hügle)

The expression of a gene is regulated by specific regions in the genetic code. Genetic variants in those regions can thus have a strong influence on expression levels. Finding these (possibly previously unknown) causal relationships allows researchers to derive the function of a whole genetic region and helps them better understand the human genetic code.

In recent years, causality has grown from a nebulous concept into a mathematical theory. While this emerging research area currently concentrates mainly on performance improvements, there is an enormous need to expand the flexibility of the corresponding algorithms. Therefore, your task will be to extend an existing algorithm for causal inference to deal with multiple types of data: gene expression levels and genetic variants. Each data type requires a different statistical approach, which you will apply to the data with our help. The challenge here is to find an efficient computing strategy that incorporates the flexibility needed to infer causal relationships in genetic data.

A4. Verification of Gene Expression Patterns in Public Knowledge Bases (Supervisor: Cindy Perscheid)

Data mining strategies are increasingly applied to gene expression data. However, results are typically validated manually, either by literature research (e.g. results from studies on the same data set) or by keyword searches in knowledge bases for the identified gene-gene or gene-disease correlations.

Your task will be to identify suitable resources for the validation of gene-gene/gene-disease correlations and implement a framework for the automatic validation of a given correlation. You will define ranking/evaluation criteria - e.g. reliability or support - to assess the credibility of researched resources and identified correlations. If time allows, this framework can be used by your fellow students to assess their analysis results.

A5. Optimize Calling of Genetic Variants from RNAseq Data (Supervisor: Milena Kraus)

RNA-seq is primarily considered a method of gene expression analysis but it can also be used to detect DNA variants in expressed regions of the genome. However, current variant callers do not generally behave well with RNA-seq data.

Therefore, the current gold standard "GATK Best Practices" pipeline can be optimized in terms of preprocessing (Opossum approach), pipeline performance (HalvadeRNA), and subsequent filtering steps (e.g. from SNPiR).

Up to three students can work on this topic, which will include a thorough understanding of the underlying biological context, of the algorithms used and their shortcomings, and an implementation of a new best practice strategy to call variants from RNAseq data.

We will provide at least a working implementation of the GATK Best Practices. To quantify and evaluate improvements, all approaches need to be assessed on the new benchmark data set "Genome in a Bottle".
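One common way to quantify such improvements is to compare a pipeline's call set against the benchmark truth set using precision, recall, and F1. The sketch below assumes variants are represented as hypothetical `(chrom, pos, ref, alt)` tuples and uses made-up toy data. A real comparison would also have to deal with issues such as differing variant representations and restricting the evaluation to expressed regions, which this sketch ignores.

```python
def evaluate_calls(called, truth):
    """Compare a set of called variants with a truth set.

    Both arguments are sets of (chrom, pos, ref, alt) tuples.
    Returns counts plus precision, recall, and F1.
    """
    tp = len(called & truth)   # called and present in the truth set
    fp = len(called - truth)   # called but absent from the truth set
    fn = len(truth - called)   # in the truth set but not called
    precision = tp / (tp + fp) if called else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy call sets, made up for illustration.
truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 40, "G", "A")}
called = {("chr1", 100, "A", "G"), ("chr2", 40, "G", "A"), ("chr2", 75, "T", "C")}

result = evaluate_calls(called, truth)
print(result)
```

Running the same evaluation for each pipeline variant on the same truth set makes the effect of a changed preprocessing or filtering step directly comparable.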

A6. Clinical Interpretation of Omics Clustering Results (Supervisor: Milena Kraus)

Omics data, e.g. gene expression, are usually analysed using various clustering techniques. In an experimental setting, samples should cluster according to the wetlab perturbation, e.g. treated with a chemical or drug vs. untreated. However, in purely observational patient studies, clusters describe disease subgroups, and the perturbation causing the clustering result is not obvious. Instead, all clinical and environmental factors potentially contribute to the patient subgroups. In order to find contributing factors in the patient data, clinicians and researchers currently scan all parameters manually.

One possible solution is to use decision trees, which can be trained on patient data with a label derived from the omics clustering result. They provide a measure of importance for any given patient attribute. Relevant attributes, e.g. a previous medication or an elevated hormone level, thus provide detailed insight into a disease.

Your task on this topic will include thorough research on decision trees or other applicable methods and their application. We will provide a data set of clinical parameters (ongoing SMART study) and at least one clustering result from omics data. Our data set is rich, new, and unique in its kind. Thus, your results will be of actual value for our clinical partners in terms of time saved and medical insights.
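The core idea behind attribute importance in a decision tree can be sketched with information gain, one common split criterion: an attribute is informative if splitting the patients by its values yields groups with purer cluster labels. The attribute names, values, and cluster labels below are made up for illustration and are not taken from the SMART study; a full decision tree implementation would apply this criterion recursively.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in label entropy obtained by splitting on an attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy patient records, made up for illustration; "cluster" is the label
# derived from an omics clustering result.
patients = [
    {"prior_medication": "yes", "hormone_level": "elevated", "cluster": 1},
    {"prior_medication": "yes", "hormone_level": "normal", "cluster": 1},
    {"prior_medication": "no", "hormone_level": "elevated", "cluster": 2},
    {"prior_medication": "no", "hormone_level": "normal", "cluster": 2},
]

labels = [p["cluster"] for p in patients]
attributes = ("prior_medication", "hormone_level")
gains = {a: information_gain([p[a] for p in patients], labels) for a in attributes}
ranked = sorted(gains, key=gains.get, reverse=True)
print(gains, ranked)
```

In this toy data, the cluster label is fully determined by the medication attribute, so that attribute ranks first while the hormone level carries no information about the clusters.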

A7. Statistical Basis of Differential Gene Expression (DGE) Analysis (Supervisor: Cindy Perscheid and Milena Kraus)

DGE analysis algorithms have developed rapidly over the past couple of years. The major difference between the algorithms is the statistical distribution they assume to underlie the expression data. While the first approaches assumed a Poisson distribution of reads (e.g. PoissonSeq), many subsequent and commonly used algorithms use a negative binomial distribution (e.g. DESeq). Two very recent papers in Nature journals introduce kallisto and sleuth as algorithms that differ fundamentally in their preprocessing of RNAseq data and in the statistical assumptions underlying the subsequent DGE analysis.
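The practical difference between the two distributional assumptions is how variance relates to the mean: under a Poisson model the variance equals the mean, whereas a negative binomial model allows extra variance, commonly parameterized as variance = mean + alpha * mean^2 with a dispersion alpha. The sketch below uses a simple method-of-moments estimate of alpha on made-up read counts; the gene names and counts are hypothetical, and the tools named above estimate dispersion in more elaborate ways.

```python
from statistics import mean, variance


def dispersion_estimate(counts):
    """Method-of-moments estimate of the negative binomial dispersion.

    Solves variance = mean + alpha * mean**2 for alpha and clips at 0,
    where 0 corresponds to the Poisson case (variance equal to the mean).
    """
    m = mean(counts)
    v = variance(counts)  # sample variance across replicates
    if m == 0:
        return 0.0
    return max(0.0, (v - m) / m ** 2)


# Toy read counts per gene across five replicates, made up for illustration.
counts = {
    "geneX": [10, 12, 9, 11, 8],   # variance below the mean
    "geneY": [5, 40, 2, 60, 13],   # variance far above the mean
}

dispersion = {gene: dispersion_estimate(c) for gene, c in counts.items()}
print(dispersion)
```

A gene such as the hypothetical `geneY`, whose variance far exceeds its mean, is poorly described by a Poisson model, which motivates the negative binomial assumption.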

Your task will include a thorough understanding of statistical methods in DGE analysis and a comparison of kallisto+sleuth and any other two methods. The evaluation of results is supposed to highlight advantages and shortcomings of all three methods.

B. Text Mining for Biomedicine

B1. Extracting Scientific Entities and Relations from Publications to Support Searching for Alternative Methods to Animal Experiments (Supervisor: Mariana Neves)

Before performing experiments with animals, researchers are required to carefully search the biomedical literature for alternative methods to animal experiments, e.g., in vitro instead of in vivo methods. At the Bundesinstitut für Risikobewertung (BfR), we investigate the development of novel alternative methods to animal experiments and mine the scientific literature for such methods. In this scenario, potentially relevant publications should address the same research goal as proposed in the in vivo publication but should describe an in vitro method. Current search engines (e.g., PubMed) mostly rely on keyword-based searches and do not identify the various elements (and their relationships) in a publication, i.e., the research goal, method, etc.

Your task will consist of identifying the elements in a scientific abstract (e.g., methods, material or process) and/or classifying the relationships between these. You will experiment with supervised learning algorithms for the named-entity recognition and/or relation extraction task. You will rely on available training and test data to support the machine learning experiments. The supervisor will provide assistance on the topic and on the implementation of the project.

C. Prediction of Patient-Level Outcomes

C1. Prediction of Patient Outcomes after Renal Replacement Therapy (RRT) in the ICU (Supervisor: Harry F. da Cruz)

AKI (Acute Kidney Injury) is a common occurrence for ICU (Intensive Care Unit) patients and is associated with increased mortality and complication rates. In such cases, renal replacement therapy (RRT) is needed to ensure patient survival. Predicting outcomes of RRT is therefore key to identifying patients at increased risk for mortality and/or end-stage renal disease (ESRD), as well as those that will likely best respond to therapy.

To this end, influencing factors such as RRT modality, time of RRT initiation and/or cessation, treatment parameters, comorbidities, ICU scores, lab values, and urine output shall be analyzed. In particular, patient outcomes of interest are, among others, mortality (90-day and in-hospital), time to recovery of renal function, development of ESRD, hemodynamic stability, and length of ICU stay.

Your task therefore is to develop and evaluate a clinical prediction model (CPM) based on factors mediating patient outcomes following RRT in the ICU by means of machine learning methods. To achieve this purpose, you will use established ML toolkits and have the opportunity to go through the whole process of developing a CPM: data extraction, feature engineering, algorithm selection, model development, and validation. A model developed in this way can be used as the foundation for a Clinical Decision Support System to aid in the therapy planning of kidney patients.

C2. Prediction of Acute Kidney Injury following Heart Surgery (Supervisor: Harry F. da Cruz)

Kidneys and heart are deeply connected and are jointly responsible for critical systemic functions in the human body. As such, whenever one of the organs suffers an injury, the other will also be ultimately affected, a phenomenon termed by experts as “cardiorenal syndrome”. Heart patients often require surgery as treatment, for example valve replacement or bypass surgery. These surgical interventions also place a significant burden on the patient’s kidneys, which may lead to acute kidney injury (AKI), a condition associated with complications and poor patient outcomes. AKI following heart surgery is a relatively common occurrence in the ICU (intensive care unit), affecting 3 to 30% of patients.

Identifying patients who are at increased risk for AKI at the time of surgery is vital to enact kidney-protective measures before surgery and to monitor patients more closely after the intervention has taken place. Analogous to topic C1, your task here is to develop a clinical prediction model (CPM) to predict the risk of a patient developing AKI using established machine learning toolkits, taking into account a wide spectrum of clinical parameters.