Marta Lemanczyk, M.Sc.

Doctoral student in the Data Analytics and Computational Statistics Group

Topic: Interpretability of interactions in genomic convolutional neural networks

Contact

Office:	K-E.16
Phone:	+49 331 5509 - 4975
E-Mail:	Marta.Lemanczyk(at)hpi.de
Twitter:	@m_lemanczyk

Introduction

Deep neural networks can learn non-linear interactions between features that affect the network's decisions. Explaining these decisions remains challenging because a neural network is a black-box model. Especially in medical applications, where tasks are sensitive, it is of great importance to understand how decisions are made. One direct application in the biomedical field is deep learning on genomic sequences. This field has become more accessible in recent years thanks to novel techniques in Next Generation Sequencing. Convolutional Neural Networks (CNNs) are popular for this kind of task because of their ability to learn patterns in the input space. One way to find relevant patterns for a specific prediction task is to use post-hoc interpretability methods to calculate contribution scores for single nucleotides, which together form biologically significant motifs. However, explaining the outcome only in terms of important motifs is often not enough, since biological mechanisms can also involve complex interactions between those motifs. Additionally, it is not fully understood how dependencies between features, or in this case higher-level interactions, affect the performance of these methods. My research focuses on the effects of such interactions between motifs on post-hoc interpretability methods.

Background

Sequence data and CNNs

Genomic sequences are long strings with one of the four nucleotides (adenine, guanine, cytosine, thymine) at each position. To obtain a numeric representation, sequences are one-hot encoded, resulting in a matrix of size sequence length x 4. These matrices can then be used as input for the network.
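The encoding step can be sketched in a few lines of plain Python. This is a minimal illustration, not the group's actual preprocessing code: the column order A, C, G, T and the all-zero row for unknown symbols such as N are assumptions, and the function name `one_hot_encode` is hypothetical.

```python
# Column order is an assumption; the text does not specify one.
NUCLEOTIDES = "ACGT"


def one_hot_encode(sequence):
    """Return a (sequence length x 4) matrix as a list of rows."""
    index = {nt: i for i, nt in enumerate(NUCLEOTIDES)}
    matrix = []
    for nt in sequence.upper():
        row = [0] * len(NUCLEOTIDES)
        if nt in index:  # unknown symbols such as N stay all-zero (assumption)
            row[index[nt]] = 1
        matrix.append(row)
    return matrix


encoded = one_hot_encode("ACGTN")
for row in encoded:
    print(row)
```

Each row has exactly one non-zero entry for a known nucleotide, so a sequence of length n yields an n x 4 matrix.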

A CNN's main architecture consists of two kinds of neural network layers.

1. Convolutional layers consist of matrices (so-called filters or kernels) which learn local representations of patterns within the data that are relevant for the prediction task. In the case of genomic sequences, convolutional layers learn (sub-)motifs. Depending on the network's architecture, motifs are learned directly in the first layer or distributed among deeper layers.

Convolutional neural network architecture for genomic sequences.

2. Dense layers contain fully connected nodes, and the last dense layer represents the network's output. These layers are often responsible for learning complex interactions between the patterns learned by the convolutional layers. We assume that interactions between multiple motifs are learned in dense layers.
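The interplay of the two layer types can be sketched as a tiny forward pass in plain Python. This is an illustrative toy under stated assumptions, not the architecture used in this work: the filter weights are set by hand rather than learned, global max pooling is assumed between the convolutional and dense parts, and all names (`conv1d`, `forward`, `tat_kernel`) are hypothetical.

```python
NUCLEOTIDES = "ACGT"  # column order is an assumption


def one_hot(sequence):
    """Encode a sequence as a (length x 4) matrix of floats."""
    return [[1.0 if nt == ref else 0.0 for ref in NUCLEOTIDES] for nt in sequence]


def conv1d(encoded, kernel):
    """Slide one filter (width x 4) along the sequence; return ReLU activations."""
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(4):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        activations.append(max(total, 0.0))
    return activations


def forward(sequence, kernels, dense_weights, bias):
    """Convolution, global max pooling per filter, then one dense output node."""
    encoded = one_hot(sequence)
    pooled = [max(conv1d(encoded, kernel)) for kernel in kernels]
    return sum(w * p for w, p in zip(dense_weights, pooled)) + bias


# Hand-set toy filter that responds to "TAT": +1 for a match, -1 for a mismatch.
tat_kernel = [[1.0 if nt == m else -1.0 for nt in NUCLEOTIDES] for m in "TAT"]
output = forward("GGTATCC", [tat_kernel], dense_weights=[2.0], bias=-1.0)
print(output)
```

In a trained network the filter and dense weights would be fitted to data, and with several filters the dense layer is where combinations of motif activations could be weighted against each other.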

CNNs can be applied to classification as well as regression tasks.

Interactions

An interaction in genomic sequences can be defined as a set of motifs which have biological relevance for a given task. The output depends on the presence or absence of the motifs in this set. Interactions can be categorized as additive or multiplicative. Additive interactions can be compared with linear combinations: a certain value is added to the output for each motif, depending on whether that motif is present or absent, and each motif's contribution is independent of the other motifs. In multiplicative interactions, motifs have an additional shared contribution to the output which depends on the presence or absence of the other motifs in the interaction set.
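The difference between the two categories can be illustrated with a small Python sketch. It is one possible reading of the definitions above, not the formalization used in this work: the motif strings, effect sizes and function names are hypothetical, and the shared contribution is modeled as a single term that applies only when all motifs in the set are present.

```python
def contains(sequence, motif):
    """Presence indicator: 1 if the motif occurs in the sequence, else 0."""
    return 1 if motif in sequence else 0


def additive_output(sequence, motif_effects):
    """Sum of per-motif effects; each motif contributes independently."""
    return sum(effect * contains(sequence, motif)
               for motif, effect in motif_effects.items())


def multiplicative_output(sequence, motif_effects, shared_effect):
    """Additive part plus a shared term that requires all motifs to co-occur."""
    all_present = all(contains(sequence, motif) for motif in motif_effects)
    shared = shared_effect if all_present else 0.0
    return additive_output(sequence, motif_effects) + shared


# Toy motifs and effect sizes, chosen only for illustration.
effects = {"TATAAA": 1.0, "GATAAG": 0.5}
seq_both = "CCTATAAACCGGATAAGG"  # contains both motifs
seq_one = "CCTATAAACCGGCCGG"     # contains only the first motif

print(additive_output(seq_both, effects))
print(multiplicative_output(seq_both, effects, shared_effect=2.0))
print(multiplicative_output(seq_one, effects, shared_effect=2.0))
```

With only one motif present, both definitions give the same output; they differ only when the motifs co-occur.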

Contribution scores for a subsequence containing a motif.

Interpretability of CNNs

Neural networks usually contain a large number of weights, making it difficult to explain what the model has learned. In recent years, multiple attribution methods have been developed to interpret neural networks on the input level. For each input instance, contribution scores are assigned to all input features based on how much each feature contributes to the model's outcome. In the case of sequence data, a contribution score is assigned to each position in the sequence, highlighting the importance of the given nucleotide. In this way, task-specific motifs can be detected for individual sequences. One remaining issue is that while these methods can display local interactions between features on the nucleotide level (= motif formation), it is difficult to identify interactions between motifs on a higher level. It is also not clear how interactions can influence the attribution methods themselves.
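To make the idea of per-position contribution scores concrete, here is a minimal sketch using a simple perturbation approach (in silico mutagenesis). The text above does not name specific attribution methods, so this choice is an assumption made for illustration, and `toy_model` is a hypothetical stand-in for a trained network.

```python
NUCLEOTIDES = "ACGT"


def toy_model(sequence):
    """Hypothetical stand-in for a trained network: 1.0 if the toy motif occurs."""
    return 1.0 if "TATAAA" in sequence else 0.0


def mutagenesis_scores(sequence, model):
    """Score each position by the mean output drop over all single substitutions."""
    reference = model(sequence)
    scores = []
    for i, original in enumerate(sequence):
        drops = []
        for nt in NUCLEOTIDES:
            if nt == original:
                continue
            mutated = sequence[:i] + nt + sequence[i + 1:]
            drops.append(reference - model(mutated))
        scores.append(sum(drops) / len(drops))
    return scores


scores = mutagenesis_scores("GGTATAAAGG", toy_model)
print(scores)
```

Positions inside the motif receive high scores because any substitution there changes the model output, while flanking positions receive a score of zero.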

Current Work

We investigate how interactions in genomic sequence data can influence the contribution scores assigned to individual positions within a genomic sequence, and therefore the detection of motifs. To this end, we compare models trained on data that only contain additive interactions with models trained on data with more complex multiplicative interactions. We hypothesize that, due to dependencies between high-level features (here: motifs), complex multiplicative interactions are more difficult to interpret.

Experimental Setup

The main approach is to investigate differences between models containing interactions and models without interactions. Genomic sequence data with interactions and known ground truth is often not available. Therefore, we formalized possible interactions and, based on those definitions, generated sequences containing real biological motifs from the JASPAR database. Another obstacle is to design model architectures in which design choices influence interpretability as little as possible (e.g. distributed vs. local pattern learning). We include both classification and regression tasks. To evaluate interpretability, we used the approach described by Koo et al., in which AUPRC values are calculated for contribution scores with respect to the actual motif positions. This quantifies how well individual motifs are captured by the contribution scores.
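The evaluation idea can be sketched as follows: positions inside a ground-truth motif are treated as positives, contribution scores as the ranking, and the area under the precision-recall curve is estimated step-wise (average precision). This is a simplified sketch, not a reproduction of the procedure by Koo et al.; the function name, the tie handling and the toy numbers are assumptions.

```python
def average_precision(scores, labels):
    """Step-wise estimate of the area under the precision-recall curve.

    scores: one contribution score per sequence position.
    labels: 1 if the position lies inside a ground-truth motif, else 0.
    Ties between scores are broken by position order (a simplification).
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("need at least one ground-truth motif position")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_positives += 1
            # Each recovered positive adds its precision, weighted by the recall step.
            area += true_positives / rank
    return area / total_positives


# Toy example: five positions, with the motif covering positions 1 and 2.
toy_scores = [0.1, 0.9, 0.8, 0.2, 0.7]
perfect = average_precision(toy_scores, [0, 1, 1, 0, 0])
partial = average_precision(toy_scores, [0, 1, 0, 0, 1])
print(perfect, partial)
```

A value of 1.0 means every motif position is ranked above every non-motif position; lower values indicate that high scores are also assigned outside the motif.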

Preliminary results

We observe that, despite similar predictive performance, attribution methods perform worse in terms of interpretability on more complex multiplicative interactions and detect fewer motifs. This suggests that attribution methods could miss important motifs when applied to real-life data containing interactions, regardless of model performance.

Teaching

Winter term 2022/2023:

Master seminar: From fairness to cyberbiosecurity: accountability in machine learning for biology and medicine

Summer term 2022:

Master seminar: Mishaps in Statistics and ML

Winter term 2021/2022:

Master seminar: Verantwortung in der Informatik - Accountability in AI
Schülerkolleg ('Introduction to programming', 7th-8th grade)

Presentations (2022)

Lightning Talk at the Trustworthy ML 2nd Anniversary Symposium
Poster presentation at the European Conference on Computational Biology (ECCB): "Influence of motif interactions on post-hoc attribution methods in genomic CNN's"
Talk at the Joint Workshop of the German Research Training Groups in Computer Science in Dagstuhl

Publications

[In Prep] Interpretability of motif interactions in genomic convolutional neural networks
Hauschild, A. C., Lemanczyk, M., Matschinske, J., Frisch, T., Zolotareva, O., Holzinger, A., ... & Heider, D. (2022). Federated Random Forests can improve local performance of predictive models for various healthcare applications. Bioinformatics, 38(8), 2278-2286. https://doi.org/10.1093/bioinformatics/btac065

Other Activities

Co-organizer Trustworthy ML Initiative
ITU/WHO Focus Group on Artificial Intelligence for Health (FG-AI4H): Implementation of trial audits
Junior chair for roundtable at ML4H Symposium on "Post-approval monitoring and validation of AI systems in health care" (Co-located with NeurIPS in New Orleans, Nov 2022)
Lipari Summer School Computational Complex and Social Systems - DATA SCIENCE: Models, Algorithms, AI and Beyond (Jul 2022)

Academic CV

2020 - present	Ph.D. student at the chair for Data Analytics and Computational Statistics at Hasso Plattner Institute in Potsdam, Germany
2017 - 2019	Master in computer science at Philipps University in Marburg, Germany
2014	Semester abroad at University of Vermont, USA
2013 - 2017	Bachelor in computer science at Philipps University in Marburg, Germany