Computational methods for characterization of the human post-translational modification landscape

Yannick Hartmaring

Data Analytics and Computational Statistics
Hasso Plattner Institute

Office: HPI Campus I, K-E.16
Tel.: +49 (0)331 5509 - 518
Email: yannick.hartmaring(at)hpi.de

Supervisors: Prof. Dr. Bernhard Renard and Dr. Christoph Schlaffner

Research

Background

A post-translational modification (PTM) is a chemical alteration of a protein that turns the plain amino acid sequence into a functional proteoform. Such a modification can change the protein's function and therefore greatly increases the complexity of the overall proteome. PTMs regulate not only the physical and chemical properties of proteins but also their structure, stability, and cellular location. Overall, they affect almost all cellular processes and are involved in many diseases.
Mass spectrometry has become the method of choice for detecting PTMs on proteins. In the bottom-up approach, proteins are separated and digested into smaller pieces called peptides; the mass-to-charge ratios and intensities of the peptides and their fragments are then measured and represented as spectra. The resulting spectra are matched either against known reference spectra from earlier, well-characterised analyses or against theoretical spectra generated from a protein sequence database. These matches are called peptide-spectrum matches (PSMs).
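The matching against theoretical spectra can be illustrated with a minimal sketch. It assumes singly charged b- and y-ions only, a fixed mass tolerance, and a simple shared-peak count as the score; real search engines use far more elaborate scoring. The peptide PEPTIDE, the phosphorylation on its threonine, and all function names are hypothetical choices for illustration.

```python
# Monoisotopic masses in dalton (only the residues needed for the example).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "T": 101.04768, "I": 113.08406, "D": 115.02694, "E": 129.04259,
}
WATER = 18.010565
PROTON = 1.007276
PHOSPHO = 79.96633  # mass shift added by a phosphorylation


def theoretical_ions(peptide, mods=None):
    """Return singly charged b- and y-ion m/z values for a peptide.

    mods maps a 0-based residue index to a mass shift (assumed format).
    """
    mods = mods or {}
    masses = [RESIDUE_MASS[aa] + mods.get(i, 0.0) for i, aa in enumerate(peptide)]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:            # b1 .. b(n-1), read from the N-terminus
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # y1 .. y(n-1), read from the C-terminus
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


def shared_peak_count(observed_mz, peptide, mods=None, tol=0.02):
    """Count theoretical ions that have an observed peak within tol dalton."""
    b_ions, y_ions = theoretical_ions(peptide, mods)
    return sum(
        any(abs(obs - theo) <= tol for obs in observed_mz)
        for theo in b_ions + y_ions
    )


# Simulated spectrum of PEPTIDE carrying a phosphate on the threonine (index 3).
b_obs, y_obs = theoretical_ions("PEPTIDE", {3: PHOSPHO})
observed = b_obs + y_obs

score_modified = shared_peak_count(observed, "PEPTIDE", {3: PHOSPHO})
score_unmodified = shared_peak_count(observed, "PEPTIDE")
print(score_modified, score_unmodified)
```

In this toy case the phosphorylated candidate explains more peaks than the unmodified one, because every fragment containing the threonine is shifted by the phosphate mass.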
The best-known and most frequently searched PTMs are phosphorylation, glycosylation, ubiquitination, acetylation and methylation, but more than 300 PTMs exist. Because the individual modifications often interact with and depend on each other, the degree of combinatorial variation is much higher still. Disturbances in these interactions are linked to the development of various diseases. For example, a disturbed O-GlcNAc function is implicated in chronic illnesses such as Alzheimer's disease and diabetes.
The shotgun approach, in which proteins are analysed as shorter subsequences, drastically increases the accuracy of mass spectrometry. At the same time, splitting the proteins into peptides makes it impossible to make statements about the co-occurrence of PTMs across different peptides, because it is no longer clear which PTMs are located simultaneously on the same protein molecule and which do not occur together.

Approach

To address this problem, I want to build a tool that reconstructs the original proteoforms using statistical methods. First, I analyse which PTMs occur in a sample. Then I use the PTM identifications and the quantitative peptide information to initialise all possible proteoforms. Using an unsupervised machine learning approach, I then calculate a probability for each proteoform, reconstructing those proteoforms that are most likely expressed in the sample and excluding those that are not supported by the data.
The results can then be used to synthesise the specific proteoforms in the lab for targeted and functional experiments, cutting costs and uncertainty. The tool can also be applied to already available search results of public datasets, e.g. those provided by the PRIDE database, to get a better overview of the human PTM landscape.
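The reconstruction step can be sketched on a toy example. This is not the planned tool. It swaps in a plain expectation-maximisation (EM) mixture model as one possible unsupervised approach, and it assumes that each site is either modified or unmodified (so n sites yield 2^n candidate proteoforms) and that each peptide reports a quantity for the modification pattern it carries. The function names, the observation format and the numbers are hypothetical.

```python
from itertools import product


def enumerate_proteoforms(n_sites):
    """All candidate proteoforms as 0/1 tuples (1 = site modified)."""
    return list(product((0, 1), repeat=n_sites))


def em_proteoform_weights(observations, n_sites, n_iter=200):
    """Estimate proteoform proportions with a simple EM mixture model.

    observations: list of (sites, pattern, quantity), where `sites` are the
    site indices covered by a peptide and `pattern` is the observed 0/1
    state of those sites (assumed input format).
    """
    forms = enumerate_proteoforms(n_sites)
    theta = {f: 1.0 / len(forms) for f in forms}  # uniform start
    for _ in range(n_iter):
        expected = {f: 0.0 for f in forms}
        # E-step: split each peptide quantity among compatible proteoforms.
        for sites, pattern, quantity in observations:
            compatible = [f for f in forms
                          if tuple(f[s] for s in sites) == pattern]
            norm = sum(theta[f] for f in compatible)
            if norm == 0.0:
                continue  # no remaining proteoform explains this peptide
            for f in compatible:
                expected[f] += quantity * theta[f] / norm
        # M-step: renormalise the expected quantities to proportions.
        assigned = sum(expected.values())
        if assigned == 0.0:
            break
        theta = {f: expected[f] / assigned for f in forms}
    return theta


# Toy data: one peptide covers sites 0 and 1, another covers only site 2.
observations = [
    ((0, 1), (1, 1), 60.0),  # both sites modified
    ((0, 1), (0, 0), 40.0),  # both sites unmodified
    ((2,), (1,), 50.0),
    ((2,), (0,), 50.0),
]
theta = em_proteoform_weights(observations, n_sites=3)
for form, weight in sorted(theta.items(), key=lambda kv: -kv[1]):
    print(form, round(weight, 3))
```

Proteoforms in which only one of sites 0 and 1 is modified lose their weight, since no peptide supports them. The link between site 2 and the other two sites remains unresolved in this sketch, because no peptide covers them together; this is the information gap described in the background section.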