Simulating a million-dollar instrument - with deep learning from molecules to data (Summer Semester 2021)

Lecturers: Prof. Dr. Bernhard Renard (Data Analytics and Computational Statistics), Tom Altenburg (Data Analytics and Computational Statistics), Dr. Sven Giese (Data Analytics and Computational Statistics)
Course website: https://hpi.de/friedrich/moodle/course/view.php?id=152

General Information

Weekly hours per semester: 4
ECTS: 6
Graded: Yes
Enrollment period: 18.03.2021 - 09.04.2021
Course type: Seminar
Enrollment type: Compulsory elective module
Teaching language: English
Maximum number of participants: 8

Degree Programs, Module Groups & Modules

IT-Systems Engineering MA

OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-K Concepts and Methods
- HPI-OSIS-T Techniques and Tools
- HPI-OSIS-S Specialization
SAMT: Software Architecture & Modeling Technology
- HPI-SAMT-K Concepts and Methods
- HPI-SAMT-T Techniques and Tools
- HPI-SAMT-S Specialization

Data Engineering MA

DATA: Data Analytics
- HPI-DATA-K Concepts and Methods
- HPI-DATA-T Techniques and Tools
- HPI-DATA-S Specialization
CODS: Complex Data Systems
- HPI-CODS-K Concepts and Methods
- HPI-CODS-T Techniques and Tools
- HPI-CODS-S Specialization

Cybersecurity MA

CYAD: Cyber Attack and Defense
- HPI-CYAD-K Concepts and Methods
- HPI-CYAD-T Techniques and Tools
- HPI-CYAD-S Specialization
SECA: Security Analytics
- HPI-SECA-K Concepts and Methods
- HPI-SECA-T Techniques and Tools
- HPI-SECA-S Specialization

Digital Health MA

Description

Deep learning (DL) is a powerful class of machine learning methods. DL methods are typically applied to high-dimensional data with many training examples in order to perform a single prediction task accurately. With sufficient training data, these models can also be used to simulate realistic-looking real-world data that arise from complex, interconnected events. One example of such an involved multi-step process is the analysis of proteins via mass spectrometry (MS). Modern healthcare applications increasingly use MS to analyze the proteins present in a cell in health and disease. Unfortunately, MS instruments are expensive, and tedious method development in the wet lab is needed to optimize and fine-tune the existing algorithms that researchers have developed to turn raw data into biological information.

The goal of this practical seminar is to leverage modern machine learning algorithms together with the ever-increasing availability of data (e.g. 310.77 TB in the public database MassIVE-KB) to simulate a million-dollar instrument. Instead of measuring real samples, we want to simulate realistic raw data for any given set of input proteins. In particular, we will build and apply models that can handle large amounts of protein sequence data to simulate steps that are commonly performed experimentally in MS analysis. The complex biochemical and physical properties of biomolecules cause them to behave differently in the mass spectrometer, and uncovering these properties is necessary to deliver realistic-looking raw data. Deep learning offers desirable properties here, e.g. end-to-end learning, learning from millions of observations, and defining multiple prediction tasks on the same input. With the ability to simulate realistic-looking raw data, we will be able to deliver a tool for software developers and MS practitioners. Such a tool has many benefits, for example faster turnaround times in algorithm development, ground-truth datasets for complex benchmarks, and optimization of instrument-specific parameters. The challenges are manifold: handling large amounts of data, encoding sequence (protein) data, defining the prediction tasks, building and adapting a suitable network architecture / machine learning model, and finally bringing it all together.
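
To make the sequence-encoding challenge concrete, here is a minimal sketch of one common option: one-hot encoding a peptide over the 20 standard amino acids, padded to a fixed length. This is an illustration only, not the encoding used in the course; the function name `one_hot_encode`, the `max_length` parameter, and the example peptide are assumptions made for this sketch.

```python
# The 20 standard amino acids in one-letter code (alphabetical order).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(peptide, max_length=30):
    """Return a max_length x 20 matrix (list of lists) for one peptide.

    Hypothetical helper: rows beyond the end of the peptide stay
    all-zero, which serves as padding to a fixed input size.
    """
    if len(peptide) > max_length:
        raise ValueError(f"peptide longer than max_length={max_length}")
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_length)]
    for position, residue in enumerate(peptide):
        if residue not in AA_INDEX:
            raise ValueError(f"unknown residue {residue!r} at position {position}")
        matrix[position][AA_INDEX[residue]] = 1
    return matrix


# Example peptide chosen purely for illustration.
encoded = one_hot_encode("PEPTIDE", max_length=10)
print(len(encoded), len(encoded[0]))  # rows (positions) and columns (amino acids)
print(sum(map(sum, encoded)))         # number of active entries, one per residue
```

A fixed-size matrix like this can be fed directly into most network architectures; learned embeddings or encodings that also capture chemical modifications are alternatives that a real project would have to weigh against this simple baseline.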

The practical seminar focuses on developing a usable software product. To achieve this goal, collaborative coding, code reviews, and discussions about features will take place throughout the semester. While the overall goal is clear, we want to engage in scientific discussions and follow the participants' ideas for realizing the project.

Learning Objectives:

Ability to organize large amounts of data
Ability to dissect complex tasks into manageable sub-tasks
Ability to critically plan and implement modern machine / deep learning models
Ability to statistically analyze results
Understanding of concepts in protein analysis

Prerequisites

Basic programming knowledge in Python or R, or strong skills in another programming language
Knowledge of deep learning / machine learning
Knowledge of good practices in software design
Knowledge of English (the course will be taught in English, but you can ask questions and submit solutions in German)
Fundamental knowledge of biology / chemistry is beneficial but NOT required

Literature

Bouwmeester, R., Gabriels, R., Bossche, T. Van Den, Martens, L., & Degroeve, S. (2020). The Age of Data-Driven Proteomics: How Machine Learning Enables Novel Workflows. PROTEOMICS, 74(20), 1900351. https://doi.org/10.1002/pmic.201900351
Zhou, X. X., Zeng, W. F., Chi, H., Luo, C., Liu, C., Zhan, J., He, S. M., & Zhang, Z. (2017). PDeep: Predicting MS/MS Spectra of Peptides with Deep Learning. Analytical Chemistry, 89(23), 12690–12697. https://doi.org/10.1021/acs.analchem.7b02566
Wen, B., Zeng, W. F., Liao, Y., Shi, Z., Savage, S. R., Jiang, W., & Zhang, B. (2020). Deep Learning in Proteomics. In Proteomics. https://doi.org/10.1002/pmic.201900335
MassIVE repository: https://massive.ucsd.edu/ProteoSAFe/static/massive.jsp

Learning and Teaching Formats

Lectures and meetings will be held via Zoom, with visual aids where appropriate, or in person.

Call-in details will be provided in good time. Regular, fixed meetings will be organized to discuss project progress. Additional ad-hoc meetings are available upon request.

Depending on the COVID-19 situation and the students' preferences, the course will be offered online or on site.

Please register via Moodle (SimDeep) by April 12th.

Assessment

The students will implement a functional, well-documented software package and participate in code reviews and progress report meetings. The final product will be described and evaluated in a publication-like paper.

The final grade will be derived by weighting the following deliverables:

Introduction and progress report (30%)
Code Quality (20%)
Final ‘publication-ready’ report (50%)

Schedule

Thursday 11:00 - 12:30. Other times can also be discussed with the participants during the kick-off meeting.
