 

Interpretability of interactions in genomic convolutional neural networks

Marta Lemanczyk

Ph.D. student in the Data Analytics and Computational Statistics Group

Contact Information

Office: F-E.08
Tel.: +49 331 5509-4975
Email: Marta.Lemanczyk(at)hpi.de

Supervisor: Prof. Dr. Bernhard Renard
 

Introduction

Deep neural networks can learn non-linear interactions between features that affect the network's decisions. Explaining these decisions remains challenging because a neural network is a black-box model. Particularly in medical applications, where tasks are sensitive, it is of great importance to understand the decisions made. One direct application in the biomedical field is deep learning on genomic sequences. This field has become more accessible in recent years due to novel techniques in Next Generation Sequencing. Convolutional Neural Networks (CNNs) are popular for this kind of task because of their ability to learn patterns in the input space. One way to find relevant patterns for a specific prediction task is to calculate contribution scores for single nucleotides, which together form biologically significant motifs. However, important motifs alone are often not enough to explain the outcome, because biological mechanisms can also involve complex interactions between those motifs. Our research focuses on how to identify such interactions learned by CNNs from genomic sequences.

    Background

    One-hot encoding for genomic sequences.

    Sequence data and CNNs

    Genomic sequences are long strings with one of the four nucleotides (Adenine, Guanine, Cytosine, Thymine) at each position. To obtain a numeric representation, sequences are one-hot encoded, resulting in a matrix of size sequence length × 4. These matrices can then be used as input for the network.
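The encoding described above can be sketched in a few lines of plain Python. The column order A, C, G, T and the all-zero row for unknown symbols such as N are assumptions made for this illustration; the text does not specify either convention.

```python
NUCLEOTIDES = "ACGT"  # assumed column order; any fixed order works


def one_hot_encode(sequence):
    """Return a (sequence length x 4) matrix as a list of rows."""
    column = {base: i for i, base in enumerate(NUCLEOTIDES)}
    matrix = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        if base in column:  # unknown symbols (e.g. N) stay all-zero (assumption)
            row[column[base]] = 1
        matrix.append(row)
    return matrix


encoded = one_hot_encode("GATTACA")
print(len(encoded), "x", len(encoded[0]))  # sequence length x 4
```

Each row contains exactly one 1, so the matrix carries the same information as the original string in a form the network can consume.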

    The main architecture of a CNN consists of two kinds of neural network layers.

    1. Convolutional layers consist of matrices (so-called filters or kernels) that learn local representations of patterns in the data that are relevant for the prediction task. In the case of genomic sequences, convolutional layers learn (sub-)motifs. Depending on the network's architecture, motifs are learned directly in the first layer or distributed among deeper layers.

    Convolutional neural network architecture for genomic sequences.

    2. Dense layers contain fully connected nodes where the last layer represents the network's output. These layers are often responsible for learning complex interactions between the patterns learned by the convolutional layers. We assume that interactions between multiple motifs are learned in dense layers.

    CNNs can be applied to classification as well as regression tasks.
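To make the two layer types concrete, the following is a minimal forward pass in plain Python with a single convolutional filter followed by one dense output node. All weights are set by hand purely for illustration (a trained network would learn them). The filter pattern `TATA`, the helper names, and the use of global max pooling are assumptions of this sketch, not details taken from the project.

```python
import math


def one_hot(sequence, alphabet="ACGT"):
    """Encode a DNA string as a list of 4-entry rows (assumed column order ACGT)."""
    return [[1.0 if base == letter else 0.0 for letter in alphabet] for base in sequence]


def conv1d(matrix, kernel):
    """Slide a (width x 4) filter over a (length x 4) input and apply ReLU."""
    width = len(kernel)
    activations = []
    for start in range(len(matrix) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(4):
                total += matrix[start + offset][channel] * kernel[offset][channel]
        activations.append(max(0.0, total))
    return activations


def forward(matrix, kernels, dense_weights, bias):
    """Convolution -> global max pooling -> one dense node with sigmoid output."""
    pooled = [max(conv1d(matrix, kernel)) for kernel in kernels]
    logit = sum(w * p for w, p in zip(dense_weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Hand-set filter that responds most strongly to the pattern TATA.
tata_filter = one_hot("TATA")
x = one_hot("GGTATACC")

print(conv1d(x, [row[:] for row in tata_filter]))   # one activation per window
print(forward(x, [tata_filter], dense_weights=[1.0], bias=-2.0))
```

The filter activation peaks at the window that contains the full pattern, and the dense node turns the pooled activation into a class probability. For a regression task, the sigmoid would simply be dropped.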


    Interactions

    An interaction in genomic sequences can be defined as a set of motifs that have biological relevance for a given task. The output depends on the presence or absence of the motifs in this set. Interactions can be categorized as additive or multiplicative. Additive interactions can be compared with linear combinations: a certain value is added to the output for each motif, depending on whether that motif is present or absent, and each motif's contribution is independent of the other motifs. In multiplicative interactions, the motifs have an additional shared contribution to the output that depends on the presence or absence of the other motifs in the interaction set.
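The difference between the two interaction types can be illustrated with a small sketch. The functional form below (a weighted sum, plus one shared term that is only active when every motif of the set is present) is one simple way to realize the definitions above; the function names and weights are hypothetical and not taken from the project.

```python
def additive_output(present, weights):
    """Each present motif adds its own weight, independently of the others."""
    return sum(w for is_present, w in zip(present, weights) if is_present)


def multiplicative_output(present, weights, shared_weight):
    """Additive part plus a shared term that requires all motifs of the set."""
    joint = 1.0
    for is_present in present:
        joint *= 1.0 if is_present else 0.0  # product of presence indicators
    return additive_output(present, weights) + shared_weight * joint


weights = [1.0, 2.0]  # hypothetical per-motif contributions
for present in [(False, False), (True, False), (False, True), (True, True)]:
    print(present,
          additive_output(present, weights),
          multiplicative_output(present, weights, shared_weight=3.0))
```

In the additive case, the effect of the first motif is the same whether or not the second one is present. In the multiplicative case, adding the first motif changes the output by a different amount depending on the second motif, which is exactly the dependency that makes such interactions harder to attribute to single motifs.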

    Contribution scores for a subsequence containing a motif.

    Interpretability of CNNs

    Neural networks usually contain a large number of weights, which makes it difficult to explain what the model has learned. In recent years, multiple attribution methods have been developed to interpret neural networks at the input level. For each input instance, contribution scores are assigned to all input features based on how much each feature contributes to the model's outcome. In the case of sequence data, a contribution score is assigned to each position in the sequence, highlighting the importance of the given nucleotide. In this way, task-specific motifs can be detected for individual sequences. One remaining issue is that, while these methods can display local interactions between features at the nucleotide level (= motif formation), it is difficult to identify higher-level interactions between motifs. It is also unclear how interactions influence the attribution methods themselves.
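As an illustration of how an attribution method assigns contribution scores, the sketch below implements Integrated Gradients for a toy function in plain Python. The toy model, the all-zero baseline, and the finite-difference gradient are assumptions chosen to keep the example self-contained; in practice the gradients would come from the trained network through an automatic differentiation framework.

```python
def toy_model(x):
    """Hypothetical model: x[0] and x[1] interact multiplicatively, x[2] is additive."""
    return x[0] * x[1] + 2.0 * x[2]


def numerical_gradient(f, x, eps=1e-5):
    """Central finite differences, standing in for backpropagation."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def integrated_gradients(f, x, baseline, steps=50):
    """Average the gradient along the straight path from baseline to x,
    then scale by (x - baseline). Uses a midpoint Riemann sum."""
    summed = [0.0] * len(x)
    for step in range(steps):
        alpha = (step + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = numerical_gradient(f, point)
        summed = [s + g for s, g in zip(summed, grad)]
    return [(xi - b) * s / steps for xi, b, s in zip(x, baseline, summed)]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
scores = integrated_gradients(toy_model, x, baseline)
print(scores)                                       # approximately [0.5, 0.5, 2.0]
print(sum(scores), toy_model(x) - toy_model(baseline))
```

The scores sum to the difference between the model output at the input and at the baseline. The joint term `x[0] * x[1]` is split evenly between the two interacting features, so the scores alone do not reveal that these two features only matter together.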

    Current Work

    We investigate how interactions in genomic sequence data influence the contribution scores assigned to individual positions within a genomic sequence and, as a consequence, the detection of motifs. To this end, we compare models that only contain additive interactions with models trained on data with more complex multiplicative interactions. We hypothesize that, due to dependencies between high-level features (here: motifs), complex multiplicative interactions are more difficult to interpret.

    Experiment Setup

    Genomic sequence data with interactions and known ground truth is often not available, so we generate sequences containing real biological motifs from the JASPAR database. We include both classification and regression tasks. Different target labels for additive and multiplicative interactions are created for the same sequence data set to make the resulting models comparable. Additionally, the two models in each pair must achieve similar predictive performance so that differences in interpretability cannot be attributed to differences in model performance. To evaluate interpretability, we use the approach described by Koo et al., where AUPRC values are calculated for contribution scores with respect to the actual motif positions. In this way, it can be quantified how well individual motifs are captured by contribution scores.
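The evaluation idea can be sketched as follows: a motif is embedded at a known position, which yields a ground-truth mask, and the contribution scores are then ranked against that mask. The sketch estimates the AUPRC with the average-precision formula; whether the referenced approach uses exactly this estimator is an assumption. The motif string and the contribution scores below are made-up placeholders, not JASPAR entries or real model outputs.

```python
def embed_motif(background, motif, start):
    """Insert a motif into a background sequence and return the new sequence
    together with a 0/1 mask marking the motif positions (ground truth)."""
    end = start + len(motif)
    if start < 0 or end > len(background):
        raise ValueError("motif does not fit into the background sequence")
    sequence = background[:start] + motif + background[end:]
    mask = [1 if start <= i < end else 0 for i in range(len(background))]
    return sequence, mask


def average_precision(scores, labels):
    """Average-precision estimate of the area under the precision-recall curve.
    Positions with tied scores are treated as one threshold."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one motif position is required")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    prev_recall = 0.0
    ap = 0.0
    for rank, i in enumerate(order):
        if labels[i]:
            tp += 1
        else:
            fp += 1
        last_of_tie = rank + 1 == len(order) or scores[order[rank + 1]] != scores[i]
        if last_of_tie:
            recall = tp / n_pos
            precision = tp / (tp + fp)
            ap += (recall - prev_recall) * precision
            prev_recall = recall
    return ap


sequence, mask = embed_motif("AAAAAAAA", "TGCA", start=2)  # placeholder motif
# Hypothetical per-position contribution scores from some attribution method.
contribution_scores = [0.05, 0.10, 0.90, 0.80, 0.20, 0.70, 0.30, 0.00]
print(sequence, mask)
print(average_precision(contribution_scores, mask))
```

A value of 1.0 would mean that all motif positions receive higher scores than all background positions. In this example one background position outranks a motif position, so the value is lower.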

    Preliminary results

    We observe that, despite similar predictive performance, attribution methods perform worse in terms of interpretability on the more complex multiplicative interactions and detect fewer motifs. This holds for all of the tested attribution methods (Integrated Gradients, DeepLIFT, DeepSHAP). This suggests that attribution methods could miss important motifs when applied to real-life data containing interactions, regardless of model performance.

    Teaching

    In the winter term 2021/2022, I will co-organize the master seminar "Verantwortung in der Informatik - Accountability in AI". I also co-organize the Schülerkolleg ("Introduction to programming", 7th-8th grade).

    Presentations

    • Poster Presentation at ISMB conference 2021: "Evaluation Of Convolutional Neural Networks Containing Interactions Between Genomic Motifs" (MLCSB & Representational Learning in Biology sessions) 
    • Presentation in the weekly DSE research school seminar (June 2021)

    Academic CV

    2020 - present

    Ph.D. student at the chair for Data Analytics and Computational Statistics at Hasso Plattner Institute in Potsdam, Germany

    2017 - 2019

    Master in computer science at Philipps University in Marburg, Germany

    2013 - 2017

    Bachelor in computer science at Philipps University in Marburg, Germany