Trends in Bioinformatics

General Information

UPDATE: Submit your papers by Mar 10, 2019 (11.59 PM) via the EasyChair submission system: https://easychair.org/conferences/?conf=tib2019
KickOff: Tue Oct 16, 9.15 AM, D-E.9/10, Campus II
Please send your top 3 topics by Wed Oct 24, 11.59 PM to cindy.perscheid(at)hpi.de
You will be informed on your topic assignment by Thu Oct 25, 1PM
Teaching staff: Cindy Perscheid, Milena Kraus, Harry Freitas Da Cruz, Dr. Matthias Uflacker
Credits: 6 ECTS (graded), 4 Semesterwochenstunden (SWS)
Contact Email: cindy.perscheid(at)hpi.de, milena.kraus(at)hpi.de, harry.freitasdacruz(at)hpi.de

Aim of the Seminar

This seminar introduces you to the broad area of bioinformatics: data types, standard analysis procedures, and applied tools. We want you to conduct research on a particular topic to identify and discuss the advantages and drawbacks of the standard procedures, propose a solution (depending on the topic), and evaluate your approach by comparing it with the state of the art.

We expect you to present your findings in seminar presentations and in a research paper. We will guide you throughout the whole semester and support you in improving your research, writing, and presentation skills.

Grading

The final grade is determined by the following individual parts, each of which must at least be passed:

Intermediate presentation, final presentation (40%)
Research article (40%)
Individual commitment (20%)

Schedule and Slides

Kick-Off	Oct 16, 9.15 AM, D-E.9/10, Campus II	Slides
Topic Selection	Oct 24, 11.59 PM
Topic Assignment Notification	Oct 25, 1 PM
Intermediate Presentations	Nov 27, 9.15 AM, D-E.9/10, Campus II AND Nov 29, 9.00 AM, V-2.16, Campus II	A1, A2, A5, B4, B6, C
Final Presentations	Jan 22, 9.15-11.15 AM, D-E.9/10, Campus II AND Jan 23, 1.30-3.00 PM, D-E.9/10, Campus II	A1, A2, A5, B4, C
Introduction to Scientific Writing	Jan 29, 9.15 AM, D-E.9/10, Campus II	Slides
Paper Submission Deadline	Mar 10, 11.59 PM	IEEE Latex Template
Notification of Reject or Accept w/o (Minor) Revisions	Mar 18
Submission of Camera-ready Version	Mar 31, 11.59 PM

Topics - Overview

A. Analysis of RNAseq Data

Gene expression is the cellular process by which information from specific sections of the DNA, i.e. genes, is used to synthesize functional products, i.e. proteins, which catalyze the metabolic processes in our cells. RNA-Seq data comprises the expression levels of all genes that are expressed in a particular cell. Analyzing this kind of data can help researchers better understand regulatory processes in a cell and characterize genes and their functions. For example, if different genes are expressed similarly in a cell, the hypothesis is that they are involved in the same regulatory process. However, those patterns first need to be found in the data.
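As a minimal sketch of what "finding similarly expressed genes" can mean in practice, the following Python snippet computes pairwise Pearson correlations and reports gene pairs above a threshold. The gene names, expression values, and the threshold of 0.9 are hypothetical illustration values, not seminar data; real analyses typically work on normalized counts and many more samples.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy data: expression level of each gene across five samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.8],  # rises together with geneA
    "geneC": [5.0, 3.0, 4.0, 1.0, 2.0],  # tends to fall as geneA rises
}


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coexpressed_pairs(data, threshold=0.9):
    """Return (gene1, gene2, r) for all pairs with correlation >= threshold."""
    pairs = []
    for g1, g2 in combinations(sorted(data), 2):
        r = pearson(data[g1], data[g2])
        if r >= threshold:
            pairs.append((g1, g2, round(r, 3)))
    return pairs


pairs = coexpressed_pairs(expression)
print(pairs)
```

Correlation is only one possible similarity measure; the clustering and association rule mining approaches listed below search for such patterns in more elaborate ways.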

Possible topics:

Integrative Gene Selection
Association Rule Mining
Integrative Gene Selection vs. Integrative Clustering
Biological Evaluation of Marker Genes

Relevant Approaches and Reviews:

External Knowledge Integration for RNAseq Analysis

Pasquier, Nicolas, et al. "Mining gene expression data using domain knowledge." International Journal of Software and Informatics (IJSI) 2.2 (2008): 215-231. https://hal.archives-ouvertes.fr/file/index/docid/361427/filename/Mining_Gene_Expression_Data_using_Domain_Knowledge_IJSI_2008.pdf

Integrative Gene Selection

Cun, Yupeng, and Holger Fröhlich. "Biomarker gene signature discovery integrating network knowledge." Biology 1.1 (2012): 5-17. https://www.mdpi.com/2079-7737/1/1/5/htm

Hira, Zena M., and Duncan F. Gillies. "A review of feature selection and feature extraction methods applied on microarray data." Advances in bioinformatics 2015 (2015). https://www.hindawi.com/journals/abi/2015/198363/abs/

Ang, Jun Chin, et al. "Supervised, unsupervised, and semi-supervised feature selection: a review on gene selection." IEEE/ACM transactions on computational biology and bioinformatics 13.5 (2016): 971-989. https://www.semanticscholar.org/paper/Supervised%2C-Unsupervised%2C-and-Semi-Supervised-A-on-Ang-Mirzal/a548edd4151e40a7a416e9921b5439bc0b937451

Guo, Zheng, et al. "Towards precise classification of cancers based on robust gene functional expression profiles." BMC bioinformatics 6.1 (2005): 58. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-6-58

Association Rule Mining on Gene Expression Data

Naulaerts, Stefan, et al. "A primer to frequent itemset mining for bioinformatics." Briefings in bioinformatics 16.2 (2013): 216-231. https://academic.oup.com/bib/article/16/2/216/245744/A-primer-to-frequent-itemset-mining-for

Chen, Shu-Chuan, et al. "Dynamic association rules for gene expression data analysis." BMC genomics 16.1 (2015): 786. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4606551/pdf/12864_2015_Article_1970.pdf

Manda, Prashanti, et al. "Information Theoretic Interestingness Measures for Cross-Ontology Data Mining in the Mouse Anatomy Ontology and the Gene Ontology." America (2015). https://pdfs.semanticscholar.org/8d48/cf8af46710a933fed268334b65b8a27a0416.pdf

Clustering of Gene Expression Data

Bellazzi, Riccardo, and Blaž Zupan. "Towards knowledge-based gene expression data mining." Journal of biomedical informatics 40.6 (2007): 787-802. https://www.sciencedirect.com/science/article/pii/S1532046407000536

Cheng, Jill, et al. "A knowledge-based clustering algorithm driven by gene ontology." Journal of biopharmaceutical statistics 14.3 (2004): 687-700. https://www.tandfonline.com/doi/abs/10.1081/BIP-200025659

Evaluation

Subramanian, Aravind, et al. "Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles." Proceedings of the National Academy of Sciences 102.43 (2005): 15545-15550. http://www.pnas.org/content/102/43/15545

B. Multi-level Data Integration in Systems Medicine of Heart Failure

The Systems Medicine Approach of Heart Failure (SMART) project has resulted in several levels of patient information: The DNA was sequenced to identify specific mutations and variations of the genome (SNPs). RNA and protein measurements of heart tissue gave rise to expression data that describe the molecular makeup of the heart tissue. In clinical screenings, e.g. magnetic resonance imaging, exercise testing, electrocardiography and many more, the patients’ phenotypes were assessed thoroughly. In combination, all datasets (clinicome, proteome, transcriptome, genome) comprise more than 3 million variables per patient and could, when analyzed together, result in a holistic understanding of heart failure.

Several approaches exist to link two or more levels of information to derive mechanisms of the disease. The SMART data is ready to be analyzed and you may choose the approach from the following topics:

Calculate and validate expression Quantitative Trait Loci (genome+transcriptome)
Calculate and validate protein Quantitative Trait Loci (genome+proteome)
Assess the feasibility of expressed QTLs (transcriptome+proteome)
Bayesian Clustering of Multi-Omics (genome+transcriptome+proteome)
Similarity Network Fusion on Multi-Omics (all data sets)
Acceptance of the DEAME application for clinical research (transcriptome + clinicome)
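To give a rough idea of what calculating an expression QTL involves, the sketch below regresses the expression of one gene on the genotype dosage at one SNP using ordinary least squares. The data values are hypothetical and not taken from the SMART project; a real eQTL analysis would test many SNP-gene pairs, include covariates, and correct for multiple testing.

```python
# Hypothetical toy data: genotype dosage (0, 1 or 2 copies of the minor
# allele) at one SNP and the expression of one gene for eight patients.
genotype = [0, 0, 1, 1, 1, 2, 2, 2]
expression = [5.1, 4.9, 6.0, 6.2, 5.8, 7.1, 6.9, 7.0]


def linear_fit(x, y):
    """Least squares fit y = intercept + slope * x.

    Returns (slope, intercept, r2), where r2 is the share of variance
    in y explained by x.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = sxy ** 2 / (sxx * syy)
    return slope, intercept, r2


slope, intercept, r2 = linear_fit(genotype, expression)
print(f"effect per allele: {slope:.2f}, R^2: {r2:.2f}")
```

The slope is the estimated change in expression per additional copy of the allele; the same scheme applies to protein QTLs when protein abundance replaces gene expression.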

General information and conceptual reviews:

Trans-Omics: How To Reconstruct Biochemical Networks Across Multiple ‘Omic’ Layers, Yugi, K., (2016), Cell Press

The role of regulatory variation in complex traits and disease, Albert, F. W., (2015), Nature Reviews Genetics

Enabling precision cardiology through multiscale biology and systems medicine, Johnson, K., (2017), JACC: Basic to Translational Science 2.3

Multi-omics clustering:

iClusterBayes

Similarity Network Fusion

DEAME application

C. Interpretability Approaches applied to Clinical Predictive Modeling

The field of machine learning has witnessed many advances in the last decades, especially regarding recent developments in deep learning. However, this progress has not been translated into practice for application domains that require the possibility to understand 'why' specific predictions have been made, such as medicine. Different approaches exist that aim at lending intelligibility to sophisticated machine learning algorithms. In fact, practitioners recommend using different techniques in combination in order to obtain a more complete picture of the inner workings of specific algorithms.

Your task will consist in implementing, comparing and validating different interpretability approaches together with medical experts in the context of a given clinical question. Specifically, you will:

1) Develop a clinical prediction model (CPM) in the context of nephrology using Python sklearn
2) Conduct literature research on applicable interpretability approaches in the clinical context
3) Implement the identified methods using the CPM developed in step 1)
4) Evaluate the chosen methods w.r.t. different criteria, such as a) computational complexity, b) medical feedback and c) interpretability desiderata
5) Identify key areas of improvement for current algorithms regarding interpretability approaches as applied in the clinical domain.
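The first steps above can be sketched as follows, assuming scikit-learn is installed. The snippet trains a simple classifier and applies permutation importance as one example of a model-agnostic interpretability method. The data is synthetic and the feature names are hypothetical stand-ins for clinical variables; the actual seminar data, model, and methods may differ.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a clinical dataset; feature names are hypothetical.
# With shuffle=False the first three columns are informative, the rest noise.
feature_names = ["age", "creatinine", "blood_pressure", "noise_1", "noise_2"]
X, y = make_classification(
    n_samples=500,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# A simple prediction model standing in for the CPM.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

# Permutation importance: how much does the test score drop when the
# values of a single feature are shuffled?
result = permutation_importance(
    model, X_test, y_test, n_repeats=20, random_state=0
)
ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda item: item[1],
    reverse=True,
)

print(f"test accuracy: {accuracy:.2f}")
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

Permutation importance is chosen here only because it works with any fitted estimator; other interpretability approaches can be evaluated against the same model in the same way.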

To get a good overview of the topic, refer to Lipton's work [1]. To get an idea of what a practical implementation using mimic learning looks like, refer to Che and colleagues [2]. For a more comprehensive view of the available approaches (this is a great start for the literature review), take a look at Hall & Gill [3]. Finally, for a more formal treatment of the concept of interpretability, refer to Doshi-Velez & Kim [4].

[1] Lipton, Z. C. The Mythos of Model Interpretability (2016). Available at http://arxiv.org/abs/1606.03490

[2] Zhengping Che, Sanjay Purushotham, Robinder Khemani and Yan Liu: Interpretable Deep Models for ICU Outcome Prediction (2017). Available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5333206/

[3] Hall, P., & Gill, N. An Introduction to Machine Learning Interpretability: An Applied Perspective on Fairness, Accountability, Transparency, and Explainable AI. O’Reilly Media (2018). Available at http://www.oreilly.com/data/free/an-introduction-to-machine-learning-interpretability.csp

[4] Doshi-Velez, F., & Kim, B. Towards A Rigorous Science of Interpretable Machine Learning (2017). Available at https://arxiv.org/abs/1702.08608
