Hasso-Plattner-InstitutSDG am HPI

Simulating a million-dollar instrument - with deep learning from molecules to data (Summer Semester 2021)

Lecturers: Prof. Dr. Bernhard Renard (Data Analytics and Computational Statistics), Tom Altenburg (Data Analytics and Computational Statistics), Dr. Sven Giese (Data Analytics and Computational Statistics)
Course Website: https://hpi.de/friedrich/moodle/course/view.php?id=152

General Information

  • Weekly Hours: 4
  • Credits: 6
  • Graded: yes
  • Enrolment Deadline: 18.03.2021 - 09.04.2021
  • Teaching Form: Seminar
  • Enrolment Type: Compulsory Elective Module
  • Course Language: English
  • Maximum number of participants: 8

Programs & Modules

IT-Systems Engineering MA
  • OSIS-Konzepte und Methoden
  • OSIS-Techniken und Werkzeuge
  • OSIS-Spezialisierung
  • SAMT-Konzepte und Methoden
  • SAMT-Techniken und Werkzeuge
  • SAMT-Spezialisierung
Data Engineering MA
Cybersecurity MA
Digital Health MA


Deep learning (DL) is a powerful class of machine learning algorithms. DL methods are typically applied to high-dimensional data with many training examples in order to perform a single prediction task accurately. With sufficient training data, these models can also be used to simulate realistic-looking real-world data that arise from complex, interconnected events. An example of such an involved multi-step process is the analysis of proteins via mass spectrometry (MS). Modern healthcare applications increasingly use MS to analyze the proteins present in a cell in health and disease. Unfortunately, MS instruments are expensive, and tedious method development in the wet lab is necessary to optimize and fine-tune the existing algorithms that turn raw data into biological information.

The goal of this practical seminar is to leverage modern machine learning algorithms together with the ever-increasing availability of data (e.g. 310.77 TB in the public database MassIVE-KB) to simulate a million-dollar instrument. Instead of measuring real samples, we want to simulate realistic raw data for any given set of input proteins. In particular, we will build and apply models that can handle large amounts of protein sequence data to simulate the steps commonly performed experimentally in MS analysis. The complex biochemical and physical properties of biomolecules cause them to behave differently in the mass spectrometer. Capturing these properties is necessary to deliver realistic-looking raw data. Deep learning offers desirable properties here, e.g. end-to-end learning, learning from millions of observations, and defining multiple prediction tasks on the same input. With the ability to simulate realistic-looking raw data, we will be able to deliver a tool for software developers and MS practitioners. Such a tool has many benefits, for example faster turnaround times in algorithm development, ground-truth datasets for complex benchmarks, optimization of instrument-specific parameters, and many more. The challenges are manifold: handling large amounts of data, encoding sequence (protein) data, defining the prediction tasks, building and adapting a suitable network architecture / machine learning model, and finally bringing it all together.
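To make the "encoding sequence (protein) data" challenge more concrete, the following is a minimal sketch of one common option, one-hot encoding with zero padding. It is an illustration only and not the approach prescribed by the seminar; the function name `one_hot_encode`, the restriction to the 20 standard amino acids, and the `max_length` padding scheme are all assumptions made for this example.

```python
# Illustrative sketch only: the alphabet, the padding scheme, and the
# function name are assumptions, not part of the course material.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(peptide, max_length=30):
    """Return a max_length x 20 matrix (list of lists) for one peptide.

    Each row has a single 1 in the column of the residue at that position.
    Rows beyond the peptide length stay all-zero (padding).
    """
    if len(peptide) > max_length:
        raise ValueError(f"peptide longer than max_length={max_length}")
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_length)]
    for position, residue in enumerate(peptide):
        if residue not in INDEX:
            raise ValueError(f"unknown residue {residue!r} at position {position}")
        matrix[position][INDEX[residue]] = 1
    return matrix


encoded = one_hot_encode("PEPTIDE", max_length=10)
print(len(encoded), len(encoded[0]))  # matrix shape: rows, columns
print(sum(map(sum, encoded)))         # number of encoded residues
```

A fixed-size matrix like this can be fed to most network architectures. Real data would additionally require a decision on how to represent modified residues, which this sketch simply rejects.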

The practical seminar focuses on developing a usable software product. To achieve this goal, collaborative coding, code reviews, and discussions about features will take place throughout the semester. While the overall goal is clear, we want to engage in scientific discussions and follow the participants' ideas for realizing the project.


Learning Objectives:

  • Ability to organize large amounts of data
  • Ability to dissect complex tasks into manageable sub-tasks
  • Ability to critically plan and implement modern machine / deep learning models
  • Ability to statistically analyze results
  • Understanding of concepts in protein analysis



Requirements

  • Basic programming knowledge in Python or R, or solid skills in another programming language
  • Knowledge of deep learning / machine learning
  • Knowledge of good practices in software design
  • Knowledge of English (the lecture will be given in English, but you may ask questions and submit solutions in German)
  • Fundamental knowledge of biology / chemistry is beneficial but NOT required



Literature

  1. Bouwmeester, R., Gabriels, R., Bossche, T. Van Den, Martens, L., & Degroeve, S. (2020). The Age of Data-Driven Proteomics: How Machine Learning Enables Novel Workflows. PROTEOMICS, 74(20), 1900351. https://doi.org/10.1002/pmic.201900351
  2. Zhou, X. X., Zeng, W. F., Chi, H., Luo, C., Liu, C., Zhan, J., He, S. M., & Zhang, Z. (2017). PDeep: Predicting MS/MS Spectra of Peptides with Deep Learning. Analytical Chemistry, 89(23), 12690–12697. https://doi.org/10.1021/acs.analchem.7b02566
  3. Wen, B., Zeng, W. F., Liao, Y., Shi, Z., Savage, S. R., Jiang, W., & Zhang, B. (2020). Deep Learning in Proteomics. In Proteomics. https://doi.org/10.1002/pmic.201900335 
  4. MassIVE repository: https://massive.ucsd.edu/ProteoSAFe/static/massive.jsp


Lectures and meetings will be held via Zoom, with visual aids where appropriate, or in person.

Call-in details will be provided in due time. Regular, fixed meetings will be organized to discuss project progress. Additional ad-hoc meetings are available upon request.

Depending on the COVID-19 situation and the preferences of the students, the course will be offered online or on site.

Please register via Moodle (SimDeep) by April 12th.


Examination

The students will implement a functional, well-documented software package and participate in code reviews and progress report meetings. The final product will be described and evaluated in a publication-like paper.

The final grade will be derived by weighting the following deliverables:

  1. Introduction and progress report (30%)
  2. Code Quality (20%)
  3. Final ‘publication-ready’ report (50%)


Dates

Thursday, 11:00 - 12:30. Other times can be discussed with the participants during the kick-off meeting.