Hasso-Plattner-Institut
Prof. Dr. h.c. Hasso Plattner
  
 

Trends in Bioinformatics

General Information

Aim of the Seminar

This seminar introduces you to the broad area of bioinformatics: data types, standard analysis procedures, and applied tools. We want you to conduct research on a particular topic, identify and discuss the advantages and drawbacks of the standard procedures, propose a solution (depending on the topic), and evaluate your approach by comparing it with the state of the art.

We expect you to present your findings in seminar presentations and in a research paper. We will guide you throughout the whole semester and support you in improving your research, writing, and presentation skills.

Grading

The final grade is determined by the following individual parts, each of which must be passed:

  • Intermediate presentation, final presentation and abstract (40%)
  • Research article (40%)
  • Individual commitment (20%)

Schedule and Slides

  • Kick-Off: Oct 16, 9.15 AM, D-E.9/10, Campus II (Slides)
  • Topic Selection: Oct 24, 11.59 PM
  • Topic Assignment Notification: Oct 25, 1 PM
  • Intermediate Presentations: Nov 27, 9.15 AM, D-E.9/10, Campus II, AND Nov 29, 9.00 AM, V-2.16, Campus II (A1, A2, A5, B4, B6, C)
  • Final Presentations: Jan 22, 9.15-11.15 AM, D-E.9/10, Campus II, AND Jan 23, 1.30-3.00 PM, D-E.9/10, Campus II
  • Introduction to Scientific Writing: Jan 29, 9.15 AM, D-E.9/10, Campus II
  • Paper Submission Deadline: Mar 10, 11.59 PM
  • Notification of Reject or Accept w/o (Minor) Revisions: Mar 18
  • Submission of Camera-ready Version: Mar 31, 11.59 PM (IEEE LaTeX Template)
  • OPTIONAL: Excursions, concrete dates and locations tbd

Topics - Overview

A. Analysis of RNAseq Data

Gene expression is the cellular process by which information from specific sections of the DNA, i.e. genes, is used to synthesize functional products, e.g. proteins, which catalyze the metabolic processes in our cells. RNA-Seq data comprise the expression levels of all genes that are expressed in a particular cell. Analyzing this kind of data can help researchers better understand regulatory processes in a cell and characterize genes and their functions. For example, if different genes are expressed similarly in a cell, the hypothesis is that they are involved in the same regulatory process. However, those patterns first need to be found in the data.
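As a minimal sketch of how such patterns might be searched for, the following Python snippet computes pairwise Pearson correlations between gene expression profiles and lists the gene pairs whose correlation exceeds a threshold. The expression matrix, the gene names, and the threshold of 0.9 are invented for illustration; they are not seminar data, and correlation is only one of many possible similarity measures.

```python
import numpy as np

def coexpressed_pairs(expr, genes, threshold=0.9):
    """Return (gene_a, gene_b, r) for gene pairs with Pearson r >= threshold.

    expr: 2-D array, rows = genes, columns = samples (assumed layout).
    genes: list of gene names, one per row of expr.
    """
    corr = np.corrcoef(expr)  # gene-by-gene correlation matrix
    pairs = []
    for i in range(len(genes)):
        for j in range(i + 1, len(genes)):
            if corr[i, j] >= threshold:
                pairs.append((genes[i], genes[j], float(corr[i, j])))
    return pairs

# Toy data: geneA and geneB rise together, geneC moves in the opposite direction.
genes = ["geneA", "geneB", "geneC"]
expr = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 3.9, 6.2, 8.0, 9.9],
    [5.0, 4.0, 3.0, 2.0, 1.0],
])
pairs = coexpressed_pairs(expr, genes)
print(pairs)
```

On real RNA-Seq data, the counts would typically be normalized first, and the number of gene pairs grows quadratically with the number of genes, which is one reason why the more structured approaches listed under the topics are of interest.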

Possible topics:

  1. Integrative Gene Selection 
  2. Association Rule Mining
  3. Integrative Gene Selection vs. Integrative Clustering
  4. Biological Evaluation of Marker Genes

Relevant Approaches and Reviews:

External Knowledge Integration for RNAseq Analysis

Pasquier, Nicolas, et al. "Mining gene expression data using domain knowledge." International Journal of Software and Informatics (IJSI) 2.2 (2008): 215-231. https://hal.archives-ouvertes.fr/file/index/docid/361427/filename/Mining_Gene_Expression_Data_using_Domain_Knowledge_IJSI_2008.pdf

Integrative Gene Selection

Cun, Yupeng, and Holger Fröhlich. "Biomarker gene signature discovery integrating network knowledge." Biology 1.1 (2012): 5-17. https://www.mdpi.com/2079-7737/1/1/5/htm

Hira, Zena M., and Duncan F. Gillies. "A review of feature selection and feature extraction methods applied on microarray data." Advances in bioinformatics 2015 (2015). https://www.hindawi.com/journals/abi/2015/198363/abs/

Ang, Jun Chin, et al. "Supervised, unsupervised, and semi-supervised feature selection: a review on gene selection." IEEE/ACM transactions on computational biology and bioinformatics 13.5 (2016): 971-989. https://www.semanticscholar.org/paper/Supervised%2C-Unsupervised%2C-and-Semi-Supervised-A-on-Ang-Mirzal/a548edd4151e40a7a416e9921b5439bc0b937451

Guo, Zheng, et al. "Towards precise classification of cancers based on robust gene functional expression profiles." BMC bioinformatics 6.1 (2005): 58. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-6-58

Association Rule Mining on Gene Expression Data

Naulaerts, Stefan, et al. "A primer to frequent itemset mining for bioinformatics." Briefings in bioinformatics 16.2 (2013): 216-231. https://academic.oup.com/bib/article/16/2/216/245744/A-primer-to-frequent-itemset-mining-for

Chen, Shu-Chuan, et al. "Dynamic association rules for gene expression data analysis." BMC genomics 16.1 (2015): 786. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4606551/pdf/12864_2015_Article_1970.pdf

Manda, Prashanti, et al. "Information Theoretic Interestingness Measures for Cross-Ontology Data Mining in the Mouse Anatomy Ontology and the Gene Ontology." America (2015). https://pdfs.semanticscholar.org/8d48/cf8af46710a933fed268334b65b8a27a0416.pdf

Clustering of Gene Expression Data

Bellazzi, Riccardo, and Blaž Zupan. "Towards knowledge-based gene expression data mining." Journal of biomedical informatics 40.6 (2007): 787-802. https://www.sciencedirect.com/science/article/pii/S1532046407000536

Cheng, Jill, et al. "A knowledge-based clustering algorithm driven by gene ontology." Journal of biopharmaceutical statistics 14.3 (2004): 687-700. https://www.tandfonline.com/doi/abs/10.1081/BIP-200025659

Evaluation

Subramanian, Aravind, et al. "Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles." Proceedings of the National Academy of Sciences 102.43 (2005): 15545-15550. http://www.pnas.org/content/102/43/15545

B. Multi-level Data Integration in Systems Medicine of Heart Failure

The Systems Medicine Approach of Heart Failure (SMART) project has resulted in several levels of patient information: The DNA was sequenced to identify specific mutations and variations of the genome (SNPs). RNA and protein measurements of heart tissue gave rise to expression data that describe the molecular makeup of the heart tissue. In clinical screenings, e.g. magnetic resonance imaging, exercise testing, electrocardiography and many more, the patients’ phenotypes were assessed thoroughly. In combination, all datasets (clinicome, proteome, transcriptome, genome) comprise more than 3 million variables per patient and could, when analyzed together, lead to a holistic understanding of heart failure.

Several approaches exist to link two or more levels of information to derive mechanisms of the disease. The SMART data is ready to be analyzed and you may choose the approach from the following topics:

  1. Calculate and validate expression Quantitative Trait Loci (genome+transcriptome)
  2. Calculate and validate protein Quantitative Trait Loci (genome+proteome)
  3. Assess the feasibility of expressed QTLs (transcriptome+proteome)
  4. Bayesian Clustering of Multi-Omics (genome+transcriptome+proteome)
  5. Similarity Network Fusion on Multi-Omics (all data sets)
  6. Acceptance of the DEAME application for clinical research (transcriptome + clinicome)
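To give a rough idea of what linking two levels can look like, here is a minimal sketch of the core calculation behind an expression QTL: regressing the expression of one gene on the genotype of one SNP. The function name and the toy values are invented for illustration and do not come from the SMART data. A real analysis would also test for significance, correct for multiple testing, and adjust for covariates.

```python
import numpy as np

def eqtl_effect(genotype, expression):
    """Fit expression = intercept + slope * genotype by least squares.

    genotype: minor-allele counts per patient (0, 1, or 2).
    expression: expression level of one gene in the same patients.
    Returns (slope, r_squared).
    """
    genotype = np.asarray(genotype, dtype=float)
    expression = np.asarray(expression, dtype=float)
    slope, intercept = np.polyfit(genotype, expression, deg=1)
    predicted = intercept + slope * genotype
    ss_res = np.sum((expression - predicted) ** 2)
    ss_tot = np.sum((expression - expression.mean()) ** 2)
    return float(slope), float(1.0 - ss_res / ss_tot)

# Toy data for one SNP-gene pair (values invented for illustration).
genotype = [0, 0, 1, 1, 2, 2]
expression = [5.1, 4.9, 6.0, 6.2, 7.1, 6.9]
slope, r2 = eqtl_effect(genotype, expression)
print(f"slope={slope:.2f}, r^2={r2:.2f}")
```

The slope estimates how much the expression changes per additional copy of the minor allele. The same regression applies to protein QTLs when protein abundance replaces gene expression.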

General information and conceptual reviews: 

Trans-Omics: How To Reconstruct Biochemical Networks Across Multiple ‘Omic’ Layers, Yugi, K., (2016), Cell Press

The role of regulatory variation in complex traits and disease, Albert, F. W., (2015), Nature Reviews Genetics

Enabling precision cardiology through multiscale biology and systems medicine, Johnson, K., (2017), JACC: Basic to Translational Science 2.3

Multi-omics clustering:

iClusterBayes

Similarity Network Fusion

DEAME application

 

C. Interpretability Approaches applied to Clinical Predictive Modeling

The field of machine learning has witnessed many advances in the last decades, especially regarding recent developments in deep learning. However, this progress has not been translated into practice for application domains that require the possibility to understand 'why' specific predictions have been made, such as medicine. Different approaches exist that aim to lend intelligibility to sophisticated machine learning algorithms. In fact, practitioners recommend using different techniques in combination in order to obtain a more complete picture of the inner workings of specific algorithms.

Your task will consist in implementing, comparing and validating different interpretability approaches together with medical experts in the context of a given clinical question. Specifically, you will:

  1. Develop a clinical prediction model (CPM) in the context of Nephrology using Python sklearn
  2. Conduct literature research on applicable interpretability approaches in the clinical context
  3. Implement the identified methods using the CPM developed in step 1)
  4. Evaluate the chosen methods with respect to different criteria, such as a) computational complexity, b) medical feedback and c) interpretability desiderata
  5. Identify key areas of improvement for current algorithms regarding interpretability approaches as applied in the clinical domain.
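The first and third steps above can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic rather than clinical, the feature names are hypothetical, and logistic regression with permutation importance is just one possible pairing of model and interpretability method, not the approach prescribed for the seminar.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for clinical data: only the first feature drives the outcome.
rng = np.random.default_rng(0)
feature_names = ["feature_0", "feature_1", "feature_2"]  # hypothetical names
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Step 1: fit a simple prediction model.
model = LogisticRegression().fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

# Step 3: apply one model-agnostic interpretability method.
# Permutation importance measures the drop in score when a feature is shuffled.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda item: item[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

Because permutation importance only needs predictions and a score, the same call works for other sklearn estimators, which makes it a convenient baseline when comparing interpretability approaches.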
     

To get a good overview of the topic, refer to Lipton's work [1]. To get an idea of what a practical implementation using mimic learning looks like, refer to Che and colleagues [2]. For a more comprehensive view of the available approaches (a great start for the literature review), take a look at Hall & Gill [3]. Finally, for a more formal treatment of the concept of interpretability, refer to Doshi-Velez & Kim [4].

[1] Lipton, Z. C. The Mythos of Model Interpretability (2016). Available at http://arxiv.org/abs/1606.03490

[2] Zhengping Che, Sanjay Purushotham, Robinder Khemani and Yan Liu: Interpretable Deep Models for ICU Outcome Prediction (2017). Available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5333206/

[3] Hall, P., & Gill, N. An Introduction to Machine Learning Interpretability: An Applied Perspective on Fairness, Accountability, Transparency, and Explainable AI. O’Reilly Media (2018). Available at http://www.oreilly.com/data/free/an-introduction-to-machine-learning-interpretability.csp

[4] Doshi-Velez, F., & Kim, B. Towards A Rigorous Science of Interpretable Machine Learning (2017). Available at https://arxiv.org/abs/1702.08608