In recent years, advances in mass spectrometry (MS) have made it feasible to process proteomics data on a large scale, so that proteomics begins to rival genomics in analysis depth and breadth. Analyzing protein data is desirable because proteins are much closer to the phenotype than genes and transcripts: they are the molecules that carry out the functions of an organism. However, their behavior cannot easily be inferred from the genome alone, because many regulatory steps before, between, and after transcription and translation alter the structure and function of the encoded proteins. The regulatory events that take place after translation, called post-translational modifications (PTMs), are of special interest to my research. Recent studies have shown that PTMs contribute to diseases such as cancer and Alzheimer's disease. Analyzing PTMs can therefore substantially improve the accuracy of disease prediction algorithms.
Proteomics data are usually obtained through MS experiments that yield large amounts of high-dimensional data. With current methods, this sheer mass of data is often hard to analyze, as processing is usually time- or resource-intensive. These and other factors leave algorithms unable to identify the underlying peptide sequence for the majority (>70%) of acquired tandem MS spectra in a standard sample. It is therefore important to develop improved processing methods that are not only faster but also sufficiently sensitive while keeping false discovery rates low. In recent years, methods from the field of machine learning have proven successful in the analysis of complex proteomics data, for example in the detection and intensity estimation of peptide features. The logical next step is to investigate methods that can also be used in a classification setting, such as disease prediction.
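The trade-off between sensitivity and false discovery rate (FDR) is commonly quantified in proteomics via target-decoy competition: peptide-spectrum matches (PSMs) against a decoy (e.g., reversed) database estimate how many target matches above a score threshold are false. A minimal sketch, assuming a simplified PSM input format of `(score, is_decoy)` tuples (hypothetical, not a specific tool's API):

```python
def fdr_at_threshold(psms, threshold):
    """Estimate the FDR among target PSMs scoring at or above `threshold`.

    `psms` is a list of (score, is_decoy) tuples -- a hypothetical,
    simplified representation of search-engine output.
    """
    # Count target and decoy matches that pass the score threshold.
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    # Decoy hits approximate the number of false hits among the targets.
    return decoys / targets

# Example: above a threshold of 0.5, three targets and one decoy pass,
# giving an estimated FDR of 1/3.
psms = [(0.9, False), (0.8, False), (0.75, True), (0.7, False), (0.2, True)]
print(fdr_at_threshold(psms, 0.5))
```

In practice, the threshold is chosen as the lowest score at which the estimated FDR stays below a chosen level (commonly 1%), which makes the sensitivity/FDR trade-off explicit.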