In recent years, advances in bioinformatics have enabled the large-scale acquisition and processing of proteomics data. These data are usually obtained from mass spectrometry (MS) experiments, which produce large volumes of high-dimensional measurements. With current methods, this volume is often hard to analyze within practical time and resource constraints. It is therefore important to develop processing methods that are not only faster but also retain adequate sensitivity while keeping false discovery rates low. This need has become more pressing as the SARS-CoV-2 pandemic has prompted proteomics studies of unprecedented size, which must be analyzed thoroughly and efficiently.
Recently, methods from the field of machine learning have proven successful in the analysis of complex proteomics data, for example in the detection and intensity estimation of peptide features. In this work, we aim to apply machine learning methods to efficiently analyze large amounts of MS data and to predict sample conditions, e.g., disease presence or progression, without sequence-dependent preprocessing. We focus in particular on an unbiased understanding, and the prediction, of disease outcomes in patients infected with SARS-CoV-2. To this end, we represent samples as image-like data structures so that they can be processed with fine-tuned deep-learning models from the computer vision domain. The pixels of these pseudo-images represent the abundances of the detected peptides and thus provide unbiased raw features of the proteins involved in the disease, without prior knowledge of sequence or modification. This peptide information is key to explaining differences in the course of disease and could lead to novel biomarkers.
With the help of our machine learning model, we hope to gain new insights into SARS-CoV-2 infection, increase analysis efficiency, and thus support the treatment of patients infected with SARS-CoV-2.
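
The image-like sample representation described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the text does not specify the axes or resolution of the pseudo-images, so the sketch assumes that peaks are binned on a retention time by m/z grid and that pixel values are summed, log-scaled intensities. The function name `peaks_to_image`, its parameters, and the default 224 x 224 size are hypothetical choices made for illustration.

```python
import numpy as np


def peaks_to_image(rt, mz, intensity, rt_range, mz_range, shape=(224, 224)):
    """Bin MS peaks into a 2D pseudo-image (assumed layout: rows = retention
    time, columns = m/z). Peaks outside the given ranges are ignored."""
    rt = np.asarray(rt, dtype=float)
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)

    # Sum the intensities of all peaks that fall into the same (rt, m/z) bin.
    image, _, _ = np.histogram2d(
        rt, mz, bins=shape, range=[rt_range, mz_range], weights=intensity
    )

    # Compress the large dynamic range of MS intensities (assumed scaling).
    image = np.log1p(image)

    # Scale to [0, 1] so the array resembles a single-channel image.
    peak = image.max()
    return image / peak if peak > 0 else image
```

An array produced this way could be replicated across three channels and passed to a pretrained vision model for fine-tuning. Whether log scaling and max normalization are appropriate depends on the instrument and the acquisition method, so both should be read as placeholders.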