A. Analysis of RNAseq Data
Gene expression is the cellular process by which information from specific sections of the DNA, i.e. genes, is used to synthesize functional products, i.e. proteins, which catalyze the metabolic processes in our cells. RNA-Seq data comprise the expression levels of all genes that are expressed in a particular cell. Analyzing this kind of data can help researchers better understand regulatory processes in a cell and characterize genes and their functions. For example, if different genes are expressed similarly in a cell, one hypothesis is that they are involved in the same regulatory process. However, such patterns first need to be found in the data.
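To make the idea of "similarly expressed genes" concrete, here is a minimal sketch that flags gene pairs whose expression profiles are strongly correlated across samples. The gene names, the toy values and the threshold of 0.9 are assumptions chosen for illustration only; a real analysis would work on normalized counts and far more genes.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long profiles.

    Assumes neither profile is constant (non-zero variance).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coexpressed_pairs(expression, threshold=0.9):
    """Return (gene1, gene2, r) for all pairs with correlation >= threshold."""
    pairs = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if r >= threshold:
            pairs.append((g1, g2, round(r, 3)))
    return pairs


# Hypothetical toy data: expression of three genes across five samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],
}

pairs = coexpressed_pairs(expression)
print(pairs)
```

In this toy example only geneA and geneB rise together across the samples, so they form the single candidate pair for a shared regulatory process.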
Possible topics:
- Integrative Gene Selection
- Association Rule Mining
- Integrative Gene Selection vs. Integrative Clustering
- Biological Evaluation of Marker Genes
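For the association rule mining topic, the two basic quantities are the support and the confidence of a rule. The sketch below computes both for a toy set of "transactions", where each transaction is assumed to be the set of genes called up-regulated in one sample. The gene names and the discretization into up-regulated sets are hypothetical; how to discretize expression values is itself a design decision discussed in the literature listed below.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical toy data: genes called up-regulated in each of four samples.
transactions = [
    {"geneA", "geneB"},
    {"geneA", "geneB", "geneC"},
    {"geneA"},
    {"geneC"},
]

# Rule under test: geneA up-regulated => geneB up-regulated.
rule_support = support(transactions, {"geneA", "geneB"})
rule_confidence = confidence(transactions, {"geneA"}, {"geneB"})
print(rule_support, rule_confidence)
```

Here the rule holds in two of the four samples (support 0.5) and in two of the three samples where geneA is up-regulated (confidence 2/3).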
Relevant Approaches and Reviews:
External Knowledge Integration for RNAseq Analysis
Pasquier, Nicolas, et al. "Mining gene expression data using domain knowledge." International Journal of Software and Informatics (IJSI) 2.2 (2008): 215-231. https://hal.archives-ouvertes.fr/file/index/docid/361427/filename/Mining_Gene_Expression_Data_using_Domain_Knowledge_IJSI_2008.pdf
Integrative Gene Selection
Cun, Yupeng, and Holger Fröhlich. "Biomarker gene signature discovery integrating network knowledge." Biology 1.1 (2012): 5-17. https://www.mdpi.com/2079-7737/1/1/5/htm
Hira, Zena M., and Duncan F. Gillies. "A review of feature selection and feature extraction methods applied on microarray data." Advances in bioinformatics 2015 (2015). https://www.hindawi.com/journals/abi/2015/198363/abs/
Ang, Jun Chin, et al. "Supervised, unsupervised, and semi-supervised feature selection: a review on gene selection." IEEE/ACM transactions on computational biology and bioinformatics 13.5 (2016): 971-989. https://www.semanticscholar.org/paper/Supervised%2C-Unsupervised%2C-and-Semi-Supervised-A-on-Ang-Mirzal/a548edd4151e40a7a416e9921b5439bc0b937451
Guo, Zheng, et al. "Towards precise classification of cancers based on robust gene functional expression profiles." BMC bioinformatics 6.1 (2005): 58. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-6-58
Association Rule Mining on Gene Expression Data
Naulaerts, Stefan, et al. "A primer to frequent itemset mining for bioinformatics." Briefings in bioinformatics 16.2 (2013): 216-231. https://academic.oup.com/bib/article/16/2/216/245744/A-primer-to-frequent-itemset-mining-for
Chen, Shu-Chuan, et al. "Dynamic association rules for gene expression data analysis." BMC genomics 16.1 (2015): 786. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4606551/pdf/12864_2015_Article_1970.pdf
Manda, Prashanti, et al. "Information Theoretic Interestingness Measures for Cross-Ontology Data Mining in the Mouse Anatomy Ontology and the Gene Ontology." America (2015). https://pdfs.semanticscholar.org/8d48/cf8af46710a933fed268334b65b8a27a0416.pdf
Clustering of Gene Expression Data
Bellazzi, Riccardo, and Blaž Zupan. "Towards knowledge-based gene expression data mining." Journal of biomedical informatics 40.6 (2007): 787-802. https://www.sciencedirect.com/science/article/pii/S1532046407000536
Cheng, Jill, et al. "A knowledge-based clustering algorithm driven by gene ontology." Journal of biopharmaceutical statistics 14.3 (2004): 687-700. https://www.tandfonline.com/doi/abs/10.1081/BIP-200025659
Evaluation
Subramanian, Aravind, et al. "Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles." Proceedings of the National Academy of Sciences 102.43 (2005): 15545-15550. http://www.pnas.org/content/102/43/15545
B. Multi-level Data Integration in Systems Medicine of Heart Failure
The Systems Medicine Approach of Heart Failure (SMART) project has produced several levels of patient information. The DNA was sequenced to identify specific mutations and variations of the genome (SNPs). RNA and protein measurements of heart tissue yielded expression data that describe the molecular makeup of the heart tissue. In clinical screenings, e.g. magnetic resonance imaging, exercise testing, electrocardiography and many more, the patients’ phenotypes were assessed thoroughly. In combination, all datasets (clinicome, proteome, transcriptome, genome) comprise more than 3 million variables per patient and could, when analyzed together, result in a holistic understanding of heart failure.
Several approaches exist to link two or more levels of information to derive mechanisms of the disease. The SMART data is ready to be analyzed and you may choose the approach from the following topics:
- Calculate and validate expression Quantitative Trait Loci (genome+transcriptome)
- Calculate and validate protein Quantitative Trait Loci (genome+proteome)
- Assess the feasibility of expressed QTLs (transcriptome+proteome)
- Bayesian Clustering of Multi-Omics (genome+transcriptome+proteome)
- Similarity Network Fusion on Multi-Omics (all data sets)
- Acceptance of the DEAME application for clinical research (transcriptome + clinicome)
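As a rough illustration of the QTL topics, an expression QTL is often modeled as a linear relationship between the allele dosage at a SNP (0, 1 or 2 copies of the alternative allele) and the expression level of a gene. The sketch below fits that additive model by ordinary least squares on hypothetical toy data. It only estimates the effect size; significance testing, covariate adjustment and multiple-testing correction, which a real analysis needs, are left out, and none of the names or values come from the SMART data.

```python
def eqtl_fit(genotypes, expression):
    """Least-squares slope and intercept of expression on allele dosage."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    slope = sxy / sxx  # expression change per additional allele copy
    intercept = mean_e - slope * mean_g
    return slope, intercept


# Hypothetical toy data: six patients, one SNP, one gene.
genotypes = [0, 0, 1, 1, 2, 2]                # allele dosage per patient
expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]   # expression of the gene

slope, intercept = eqtl_fit(genotypes, expression)
print(f"effect per allele: {slope:.2f}, baseline: {intercept:.2f}")
```

The same model applies to protein QTLs when the expression vector is replaced by protein abundance.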
General information and conceptual reviews:
Trans-Omics: How To Reconstruct Biochemical Networks Across Multiple ‘Omic’ Layers, Yugi, K., (2016), Cell Press
The role of regulatory variation in complex traits and disease, Albert, F. W., & Kruglyak, L., (2015), Nature Reviews Genetics
Enabling precision cardiology through multiscale biology and systems medicine, Johnson, K., (2017), JACC: Basic to Translational Science 2.3
Multi-omics clustering:
iClusterBayes
Similarity Network Fusion
DEAME application
C. Interpretability Approaches applied to Clinical Predictive Modeling
The field of machine learning has witnessed many advances in the last decades, especially with recent developments in deep learning. However, this progress has not been translated into practice in application domains that require the ability to understand 'why' specific predictions have been made, such as medicine. Different approaches exist that aim to lend intelligibility to sophisticated machine learning algorithms. In fact, practitioners recommend using different techniques in combination in order to obtain a more complete picture of the inner workings of specific algorithms.
Your task will consist in implementing, comparing and validating different interpretability approaches together with medical experts in the context of a given clinical question. Specifically, you will:
- Develop a clinical prediction model (CPM) in the context of nephrology using Python and scikit-learn
- Conduct literature research on applicable interpretability approaches in the clinical context
- Implement the identified methods using the CPM developed in the first step
- Evaluate the chosen methods with respect to different criteria, such as a) computational complexity, b) medical feedback and c) interpretability desiderata
- Identify key areas of improvement for current algorithms regarding interpretability approaches as applied in the clinical domain.
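The first and third steps above can be sketched as follows. This is a minimal example under stated assumptions: the data are synthetic (generated with scikit-learn's `make_classification`) rather than nephrology data, a random forest stands in for the CPM, and permutation importance is used as one example of a model-agnostic interpretability method. The actual choice of model and methods is part of your task.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a clinical dataset: 6 features, 3 of them informative.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Step 1: fit a prediction model (here a random forest as a placeholder CPM).
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# Step 3: permutation importance measures how much the held-out score drops
# when the values of a single feature are shuffled.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)

# Rank features from most to least important.
ranking = result.importances_mean.argsort()[::-1]
print(f"test accuracy: {test_accuracy:.2f}")
for idx in ranking:
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"feature {idx}: {mean:.3f} +/- {std:.3f}")
```

Computing the importances on held-out data rather than on the training data keeps the ranking from reflecting features the model merely overfitted to.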
To get a good overview of the topic, refer to Lipton's work [1]. To get an idea of what a practical implementation using mimic learning looks like, refer to Che and colleagues [2]. For a more comprehensive view of the available approaches (this is a great start for the literature review), take a look at Hall & Gill [3]. Finally, for a more formal treatment of the concept of interpretability, refer to Doshi-Velez & Kim [4].
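The general idea behind mimic learning can be sketched in a few lines: a complex "teacher" model is trained first, and a simpler "student" model is then fitted to the teacher's soft predictions instead of the original labels. The example below is a generic illustration on synthetic data and is not the setup used in [2]; the gradient boosting teacher, the depth-3 regression tree student and the fidelity measure are assumptions chosen for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in for a clinical dataset.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Teacher: the complex model whose behavior should be explained.
teacher = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Student: a shallow tree trained on the teacher's predicted probabilities
# (soft labels) rather than on the original 0/1 outcomes.
soft_labels = teacher.predict_proba(X_train)[:, 1]
student = DecisionTreeRegressor(max_depth=3, random_state=0)
student.fit(X_train, soft_labels)

# Fidelity: share of held-out cases where student and teacher agree.
student_pred = (student.predict(X_test) >= 0.5).astype(int)
teacher_pred = teacher.predict(X_test)
fidelity = (student_pred == teacher_pred).mean()
print(f"fidelity to teacher: {fidelity:.2f}")

# The student's decision rules can be read directly.
feature_names = [f"x{i}" for i in range(X.shape[1])]
print(export_text(student, feature_names=feature_names))
```

A low fidelity would indicate that the readable student is a poor proxy for the teacher, in which case its rules should not be presented as an explanation of the teacher's predictions.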
[1] Lipton, Z. C. The Mythos of Model Interpretability (2016). Available at http://arxiv.org/abs/1606.03490
[2] Zhengping Che, Sanjay Purushotham, Robinder Khemani and Yan Liu: Interpretable Deep Models for ICU Outcome Prediction (2017). Available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5333206/
[3] Hall, P., & Gill, N. An Introduction to Machine Learning Interpretability: An Applied Perspective on Fairness, Accountability, Transparency, and Explainable AI. O’Reilly Media (2018). Available at http://www.oreilly.com/data/free/an-introduction-to-machine-learning-interpretability.csp
[4] Doshi-Velez, F., & Kim, B. Towards A Rigorous Science of Interpretable Machine Learning (2017). Available at https://arxiv.org/abs/1702.08608