Cohen, Sarel; Hershcovitch, Moshik; Taraz, Martin; Kißig, Otto; Wood, Andrew; Waddington, Daniel; Chin, Peter; Friedrich, Tobias Drug Repurposing using Link Prediction on Knowledge Graphs with Applications to Non-Volatile MemoryComplex Networks and their Applications (ComplexNetworks) 2021: 742–753
The active global SARS-CoV-2 pandemic caused more than 167 million cases and 3.4 million deaths worldwide. The development of completely new drugs for such a novel disease is a challenging, time intensive process and despite researchers around the world working on this task, no effective treatments have been developed yet. This emphasizes the importance of \emph{drug repurposing, where treatments are found among existing drugs that are meant for different diseases. A common approach to this is based on \emph{knowledge graphs, that condense relationships between entities like drugs, diseases and genes. Graph neural networks (GNNs) can then be used for the task at hand by predicting links in such knowledge graphs. Expanding on state-of-the-art GNN research, Doshi et al. recently developed the Dr-COVID model. We further extend their work using additional output interpretation strategies. The best aggregation strategy derives a top-100 ranking of candidate drugs, 32 of which currently being in COVID-19-related clinical trials. Moreover, we present an alternative application for the model, the generation of additional candidates based on a given pre-selection of drug candidates using collaborative filtering. In addition, we improved the implementation of the Dr-COVID model by significantly shortening the inference and pre-processing time by exploiting data-parallelism. As drug repurposing is a task that requires high computation and memory resources, we further accelerate the post-processing phase using a new emerging hardware --- we propose a new approach to leverage the use of high-capacity Non-Volatile Memory for aggregate drug ranking.
Bilò, Davide; Cohen, Sarel; Friedrich, Tobias; Schirneck, Martin Near-Optimal Deterministic Single-Source Distance Sensitivity OraclesEuropean Symposium on Algorithms (ESA) 2021: 18:1–18:17
Given a graph with a distinguished source vertex \(s\), the Single Source Replacement Paths (SSRP) problem is to compute and output, for any target vertex \(t\) and edge \(e\), the length \(d(s,t,e)\) of a shortest path from \(s\) to \(t\) that avoids a failing edge \(e\). A Single-Source Distance Sensitivity Oracle (Single-Source DSO) is a compact data structure that answers queries of the form \((t,e)\) by returning the distance \(d(s,t,e)\). We show how to compress the output of the SSRP problem on \(n\)-vertex, \(m\)-edge graphs with integer edge weights in the range \([1,M]\) into a deterministic Single-Source DSO that has size \(O(M^{1/2 n^{3/2})\) and query time \(\widetildeO(1)\). We prove that the space requirement is optimal (up to the word size). Our techniques can also handle vertex failures within the same bounds. Chechik and Cohen [SODA 2019] presented a combinatorial randomized \(\widetildeO(m\sqrt{n}+n^2)\) time SSRP algorithm for undirected and unweighted graphs. We derandomize their algorithm with the same asymptotic running time and apply our compression to obtain a deterministic Single-Source DSO with \(\widetildeO(m\sqrt{n}+n^2)\) preprocessing time, \(O(n^{3/2})\) space, and \(\widetildeO(1)\) query time. Our combinatorial Single-Source DSO has near-optimal space, preprocessing and query time for dense unweighted graphs, improving the preprocessing time by a \(\sqrt{n}\)-factor compared to previous results. Grandoni and Vassilevska Williams [FOCS 2012, TALG 2020] gave an algebraic randomized \(\widetildeO(Mn^\omega)\) time SSRP algorithm for (undirected and directed) graphs with integer edge weights in the range \([1,M]\), where \(\omega < 2.373\) is the matrix multiplication exponent. We derandomize their algorithm for undirected graphs and apply our compression to obtain an algebraic Single-Source DSO with \(\widetildeO(Mn^\omega)\) preprocessing time, \(O(M^{1/2 n^{3/2})\) space, and \(\widetildeO(1)\) query time. This improves the preprocessing time of algebraic Single-Source DSOs by polynomial factors compared to previous results. We also present further improvements of our Single-Source DSOs. We show that the query time can be reduced to a constant at the cost of increasing the size of the oracle to \(O(M^{1/3 n^{5/3})\) and that all our oracles can be made path-reporting. On sparse graphs with \(m=O(\frac{n^{5/4-\varepsilon}}{M^{7/4}})\) edges, for any constant \(\varepsilon > 0\), we reduce the preprocessing to randomized \(\widetildeO(M^7/8 m^1/2 n^{11/8}) = O(n^{2-\varepsilon/2})\) time. To the best of our knowledge, this is the first truly subquadratic time algorithm for building Single-Source DSOs on sparse graphs.
Berger, Julian; Bleidt, Tibor; Büßemeyer, Martin; Ding, Marcus; Feldmann, Moritz; Feuerpfeil, Moritz; Jacoby, Janusch; Schröter, Valentin; Sievers, Bjarne; Spranger, Moritz; Stadlinger, Simon; Wullenweber, Paul; Cohen, Sarel; Doskoč, Vanja; Friedrich, Tobias Fine-Grained Localization, Classification and Segmentation of Lungs with Various DiseasesCVPR Workshop on Fine-Grained Visual Categorization (FGVC@CVPR) 2021
The fine-grained localization and classification of various lung abnormalities is a challenging yet important task for combating diseases and, also, pandemics. In this paper, we present one way to detect and classify abnormalities within chest X-ray scans. In particular, we investigate the use of binary image classification (to distinguish between healthy and infected chests) and the weighted box fusion (which constructs a detection box using the proposed boxes within range). We observe that both methods increase the performance of a base model significantly. Furthermore, we improve state of the art on lung segmentation, even in the presence of abnormalities. We do so using transfer learning to fine-tune a UNet model on the Montgomery and Shenzhen datasets. In our experiments, we compare standard augmentations (like crop, pad, rotate, warp, zoom, brightness, and contrast variations) to more complex ones (for example, block masking and diffused noise augmentations). This way, we obtain a state-of-the-art model with a dice score of 97.9%. In particular, we show that simple augmentations outperform complex ones in our setting.
Wood, Andrew; Hershcovitch, Moshik; Waddington, Daniel; Cohen, Sarel; Chin, Peter Non-Volatile Memory Accelerated Posterior EstimationHigh Performance and Embedded Computing (HPEC) 2021
Bayesian inference allows machine learning models to express uncertainty. Current machine learning models use only a single learnable parameter combination when making predictions, and as a result are highly overconfident when their predictions are wrong. To use more learnable parameter combinations efficiently, these samples must be drawn from the posterior distribution. Unfortunately computing the posterior directly is infeasible, so often researchers approximate it with a well known distribution such as a Gaussian. In this paper, we show that through the use of high-capacity persistent storage, models whose posterior distribution was too big to approximate are now feasible, leading to improved predictions in downstream tasks.
Wood, Andrew; Hershcovitch, Moshik; Waddington, Daniel; Cohen, Sarel; Wolf, Meredith; Suh, Hongjun; Zong, Weiyu; Chin, Peter Non-Volatile Memory Accelerated Geometric Multi-Scale Resolution AnalysisHigh Performance and Embedded Computing (HPEC) 2021: 1–7
Dimensionality reduction algorithms are standard tools in a researcher's toolbox. Dimensionality reduction algorithms are frequently used to augment downstream tasks such as machine learning, data science, and also are exploratory methods for understanding complex phenomena. For instance, dimensionality reduction is commonly used in Biology as well as Neuroscience to understand data collected from biological subjects. However, dimensionality reduction techniques are limited by the von-Neumann architectures that they execute on. Specifically, data intensive algorithms such as dimensionality reduction techniques often require fast, high capacity, persistent memory which historically hardware has been unable to provide at the same time. In this paper, we present a re-implementation of an existing dimensionality reduction technique called Geometric Multi-Scale Resolution Analysis (GMRA) which has been accelerated via novel persistent memory technology called Memory Centric Active Storage (MCAS). Our implementation uses a specialized version of MCAS called PyMM that provides native support for Python datatypes including NumPy arrays and PyTorch tensors. We compare our PyMM implementation against a DRAM implementation, and show that when data fits in DRAM, PyMM offers competitive runtimes. When data does not fit in DRAM, our PyMM implementation is still able to process the data.
Bilò, Davide; Cohen, Sarel; Friedrich, Tobias; Schirneck, Martin Space-Efficient Fault-Tolerant Diameter OraclesMathematical Foundations of Computer Science (MFCS) 2021: 18:1–18:16
We design \(f\)-edge fault-tolerant diameter oracles (\(f\)-FDO, or simply FDO if \(f=1\)). For a given directed or undirected and possibly edge-weighted graph \(G\) with \(n\) vertices and \(m\) edges and a positive integer \(f\), we preprocess the graph and construct a data structure that, when queried with a set \(F\) of edges, where \(|F| \leq f\), returns the diameter of \(G - F\). An \(f\)-FDO has stretch \(\sigma \geq 1\) if the returned value \(\widehat D\) satisfies \(\operatornamediam(G - F) leq widehat D leq sigma \operatornamediam(G - F)\). For the case of a single edge failure (\(f=1\)) in an unweighted directed graph, there exists an approximate FDO by Henzinger et al. [ITCS 2017] with stretch \((1+\varepsilon)\), constant query time, space \(O(m)\), and a combinatorial preprocessing time of \(\widetildeO(mn + n^1.5 \sqrt{Dm/\varepsilon})\), where \(D\) is the diameter. We present a near-optimal FDO with the same stretch, query time, and space. It has a preprocessing time of \(\widetildeO(mn + n^2/\varepsilon)\), which is better for any constant \(\varepsilon > 0\). The preprocessing time nearly matches a conditional lower bound for combinatorial algorithms, also by Henzinger et al. When using fast matrix multiplication instead, we achieve a preprocessing time of \(\widetildeO(n^2.5794 + n^2/\varepsilon)\). We further prove an information-theoretic lower bound showing that any FDO with stretch better than \(3/2\) requires \(\Omega(m)\) bits of space. Thus, for constant \(0 < \varepsilon < 3/2\), our combinatorial \((1+ \varepsilon)\)-approximate FDO is near-optimal in all the parameters. In the case of multiple edge failures (\(f>1\)) in undirected graphs with non-negative edge weights, we give an \(f\)-FDO with stretch \((f+2)\), query time \(O(f^2\log^2{n})\), \(\widetildeO(fn)\) space, and preprocessing time \(\widetildeO(fm)\). We complement this with a lower bound excluding any finite stretch in \(o(fn)\) space. Many real-world networks have polylogarithmic diameter. We show that for those graphs and up to \(f = o(\log n/ \log\log n)\) failures one can swap approximation for query time and space. We present an exact combinatorial \(f\)-FDO with preprocessing time \(mn^{1+o(1)}\), query time \(n^{o(1)}\), and space \(n^{2+o(1)}\). With fast matrix multiplication, the preprocessing time can be improved to \(n^{\omega+o(1)}\), where \(\omega < 2.373\) is the matrix multiplication exponent.
Kißig, Otto; Taraz, Martin; Cohen, Sarel; Doskoč, Vanja; Friedrich, Tobias Drug Repurposing for Multiple COVID Strains using Collaborative FilteringICLR Workshop on Machine Learning for Preventing and Combating Pandemics (MLPCP@ICLR) 2021
The ongoing COVID-19 pandemic demands for a swift discovery of suitable treatments. The development of completely new compounds for such a novel disease is a challenging, time intensive process. This amplifies the relevance of drug repurposing, a technique where existing drugs are used to treat other diseases. A common bioinformatical approach to this is based on knowledge graphs, which compile relationships between drugs, diseases, genes and other biomedical entities. Then, graph neural networks (GNNs) are used for the drug repurposing task as they provide a good link prediction performance on such knowledge graphs. Building on state-of-the-art GNN research, Doshi & Chepuri (2020) construct the remarkable model DR-COVID. We re-implement their model and extend the approach to perform significantly better. We propose and evaluate several strategies for the aggregation of link predictions into drug recommendation rankings. With the help of clustering of similar target diseases we improve the model by a substantial margin, compiling a top-100 ranking of candidates including 32 currently being in COVID-19-related clinical trials. Regarding the re-implementation, we offer more flexibility in the selection of the graph neighborhood sizes fed into the model and reduce the training time significantly by making use of data parallelism.
Kißig, Otto; Taraz, Martin; Cohen, Sarel; Friedrich, Tobias Drug Repurposing Using Link Prediction on Knowledge GraphsICML Workshop on Computational Biology (WCB@ICML) 2021: 742–753
The active global SARS-CoV-2 pandemic caused more than 167 million cases and 3.4 million deaths worldwide. The development of completely new drugs for such a novel disease is a challenging, time intensive process and despite researchers around the world working on this task, no effective treatments have been developed yet. This emphasizes the importance of drug repurposing, where treatments are found among existing drugs that are meant for different diseases. A common approach to this is based on knowledge graphs, that condense relationships between entities like drugs, diseases and genes. Graph neural networks (GNNs) can then be used for the task at hand by predicting links in such knowledge graphs. Expanding on state-of-the-art GNN research, Doshi et al. recently developed the Dr-COVID model. We further extend their work using additional output interpretation strategies. The best aggregation strategy derives a top-100 ranking of candidate drugs, 32 of which currently being in COVID-19-related clinical trials. Moreover, we present an alternative application for the model, the generation of additional candidates based on a given pre-selection of drug candidates using collaborative filtering. In addition, we improved the implementation of the Dr-COVID model by significantly shortening the inference and pre-processing time by exploiting data-parallelism. As drug repurposing is a task that requires high computation and memory resources, we further accelerate the post-processing phase using a new emerging hardware—we propose a new approach to leverage the use of high-capacity Non-Volatile Memory for aggregate drug ranking.