Expanding the Simulation of a million-dollar instrument - with deep learning from molecules to data (Winter Semester 2021/2022)

Lecturers: Dr. Katharina Baum (Data Analytics and Computational Statistics), Dr. Sven Giese (Data Analytics and Computational Statistics)
Course Website: https://moodle.hpi.de/course/view.php?id=223

General Information

Weekly Hours: 4
Credits: 6
Graded: yes
Enrolment Deadline: 01.10.2021 - 22.10.2021
Teaching Form: Seminar
Enrolment Type: Compulsory Elective Module
Course Language: English
Maximum number of participants: 8

Programs, Module Groups & Modules

IT-Systems Engineering MA

OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-K Konzepte und Methoden
OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-T Techniken und Werkzeuge
OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-S Spezialisierung
SAMT: Software Architecture & Modeling Technology
- HPI-SAMT-K Konzepte und Methoden
SAMT: Software Architecture & Modeling Technology
- HPI-SAMT-S Spezialisierung
SAMT: Software Architecture & Modeling Technology
- HPI-SAMT-T Techniken und Werkzeuge

Data Engineering MA

DATA: Data Analytics
- HPI-DATA-K Konzepte und Methoden
DATA: Data Analytics
- HPI-DATA-T Techniken und Werkzeuge
DATA: Data Analytics
- HPI-DATA-S Spezialisierung
PREP: Data Preparation
- HPI-PREP-T Techniken und Werkzeuge
PREP: Data Preparation
- HPI-PREP-S Spezialisierung
PREP: Data Preparation
- HPI-PREP-K Konzepte und Methoden

Digital Health MA

APAD: Acquisition, Processing and Analysis of Health Data
- HPI-APAD-C Concepts and Methods
APAD: Acquisition, Processing and Analysis of Health Data
- HPI-APAD-T Technologies and Tools
APAD: Acquisition, Processing and Analysis of Health Data
- HPI-APAD-S Specialization
SCAD: Scalable Computing and Algorithms for Digital Health
- HPI-SCAD-C Concepts and Methods
SCAD: Scalable Computing and Algorithms for Digital Health
- HPI-SCAD-T Technologies and Tools
SCAD: Scalable Computing and Algorithms for Digital Health
- HPI-SCAD-S Specialization

Cybersecurity MA

Description

Deep learning (DL) is a powerful family of machine learning methods. DL methods are typically applied to high-dimensional data with many training examples in order to perform a single prediction task accurately. With sufficient training data, these models can also be used to simulate realistic-looking real-world data that result from complex, interconnected events. An example of such an involved multi-step process is the analysis of proteins via mass spectrometry (MS). Modern healthcare applications increasingly use MS to analyze the proteins present in a cell in health and disease. Unfortunately, MS instruments are expensive, and tedious method development in the wet lab is necessary to optimize and fine-tune the algorithms that researchers have developed to turn raw data into biological information.

The overall goal of this praxis seminar is to leverage modern machine learning algorithms together with the ever-increasing availability of data (e.g. 310.77 TB in the public database MassIVE-KB) to simulate a million-dollar instrument. Instead of measuring real samples, we want to simulate realistic raw data for any given set of input proteins. In particular, we will build and apply models that can handle large amounts of protein sequence data in order to simulate the steps commonly performed experimentally in an MS analysis. Biomolecules differ in their complex biochemical and physical properties, and these differences cause them to behave differently in the mass spectrometer. Capturing these properties is necessary to deliver realistic-looking raw data. Deep learning offers desirable properties for this purpose, e.g. end-to-end learning, learning from millions of observations, and defining multiple prediction tasks on the same input. With the ability to simulate realistic-looking raw data, we will be able to deliver a tool for software developers and MS practitioners. Such a tool has many benefits, for example faster turnaround times in algorithm development, ground-truth datasets for complex benchmarks, and optimization of instrument-specific parameters. The challenges are manifold: handling large amounts of data, encoding protein sequence data, defining the prediction tasks, building and adapting a suitable network architecture / machine learning model, and finally bringing it all together in a well-documented software package.
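
To make the "encoding protein sequence data" challenge concrete, here is a minimal sketch of one common option, one-hot encoding with zero padding. It is an illustration only and not necessarily what millipede does; the names `AMINO_ACIDS`, `one_hot_encode`, and `max_length` are hypothetical, and the sketch assumes peptides made of the 20 standard amino acids without modifications.

```python
# Illustrative sketch only: names and the padding length are assumptions,
# not part of the millipede package.

# The 20 standard amino acids in one-letter code; the order defines the columns.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(peptide, max_length=30):
    """Return a max_length x 20 matrix (list of lists) for one peptide.

    Rows beyond the end of the peptide stay all-zero (padding), so that
    peptides of different lengths share one fixed input shape.
    """
    if len(peptide) > max_length:
        raise ValueError(f"peptide longer than max_length={max_length}")
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_length)]
    for position, residue in enumerate(peptide):
        if residue not in AA_INDEX:
            raise ValueError(f"unknown residue {residue!r} at position {position}")
        matrix[position][AA_INDEX[residue]] = 1
    return matrix


encoded = one_hot_encode("PEPTIDE", max_length=10)
print(len(encoded), len(encoded[0]))  # 10 20
```

A fixed-shape matrix like this can be fed to a sequence model. Learned embeddings or physicochemical descriptors are alternative encodings that participants might prefer.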

During the summer term 2020 (the first run of this praxis seminar), the participants implemented a Python package named “millipede”. It builds on the popular TensorFlow package for deep learning applications. This semester, we want to expand the set of supported prediction tasks and increase the complexity of the implemented simulation. In addition, we want to establish common design and software architecture patterns for the prediction tasks. The course will bring together software engineering and practical deep learning in a life science context.

The praxis seminar focuses on the development of a usable software product. To achieve this goal, collaborative code writing, code reviews, and feature discussions will take place throughout the semester. While the overall scientific goal is to produce a publishable software package, we also want to engage in scientific discussions and follow participants' ideas for realizing the project. We will follow an agile development cycle and split tasks among small groups of students.

Learning Objectives:

Ability to organize large amounts of data
Ability to dissect complex tasks into manageable sub-tasks
Ability to critically plan and implement modern machine / deep learning models
Ability to statistically analyze results
Understanding of concepts in protein analysis

Requirements

Basic programming knowledge in Python or R, or strong skills in another programming language
Knowledge of deep learning / machine learning
Knowledge of good practices in software design
Knowledge of English
Fundamental knowledge of biology / chemistry is beneficial but NOT required

Literature

Bouwmeester, R., Gabriels, R., Van Den Bossche, T., Martens, L., & Degroeve, S. (2020). The Age of Data-Driven Proteomics: How Machine Learning Enables Novel Workflows. Proteomics, 20(21–22), 1900351. https://doi.org/10.1002/pmic.201900351
Zhou, X. X., Zeng, W. F., Chi, H., Luo, C., Liu, C., Zhan, J., He, S. M., & Zhang, Z. (2017). pDeep: Predicting MS/MS Spectra of Peptides with Deep Learning. Analytical Chemistry, 89(23), 12690–12697. https://doi.org/10.1021/acs.analchem.7b02566
Wen, B., Zeng, W. F., Liao, Y., Shi, Z., Savage, S. R., Jiang, W., & Zhang, B. (2020). Deep Learning in Proteomics. Proteomics. https://doi.org/10.1002/pmic.201900335
MassIVE repository: https://massive.ucsd.edu/ProteoSAFe/static/massive.jsp

Learning

Lectures and meetings will be held via Zoom with visual aids or, when appropriate, in person. Hybrid formats may also be an option for selected lectures / meetings.

Call-in details will be provided in good time. Regular, fixed meetings will be organized to discuss project progress. Additional ad hoc meetings are available upon request.

Depending on the COVID-19 situation and the preferences of the students, the course will be offered online or on site.

Please register by enrolling in the corresponding Moodle course: moodle.hpi.de/course/view.php

Examination

The students will implement a functional, well-documented software package and participate in code reviews and progress report meetings. The final product will be described and evaluated in a publication-style paper.

The final grade is a weighted combination of:

Introduction (20%) and progress presentations (30%)
Final ‘publication-ready’ report (50%)

The first assessment (presentation) will take place on 02.03.2022.

Dates

Kick-off Meeting: Monday, 25.10.2021 (13:30 - 15:00). Either in A1.1 (currently preferred) or online.

The format will be adapted in consultation with the participants.

The course takes place in room F.E.-06.

The first assessment (presentation) will take place on 02.03.2022 (opt-out by 24.02.2022).
