Simulating a million-dollar instrument - with deep learning from molecules to data (Summer Semester 2021)
Lecturers:
Prof. Dr. Bernhard Renard (Data Analytics and Computational Statistics),
Tom Altenburg (Data Analytics and Computational Statistics),
Dr. Sven Giese (Data Analytics and Computational Statistics)
Course website:
https://hpi.de/friedrich/moodle/course/view.php?id=152
General Information
- Weekly hours per semester: 4
- ECTS: 6
- Graded: Yes
- Enrollment period: 18.03.2021 - 09.04.2021
- Course format: Seminar
- Enrollment type: Compulsory elective module
- Language of instruction: English
- Maximum number of participants: 8
Degree Programs, Module Groups & Modules
- OSIS: Operating Systems & Information Systems Technology
  - HPI-OSIS-K Concepts and Methods
  - HPI-OSIS-T Technologies and Tools
  - HPI-OSIS-S Specialization
- SAMT: Software Architecture & Modeling Technology
  - HPI-SAMT-K Concepts and Methods
  - HPI-SAMT-T Technologies and Tools
  - HPI-SAMT-S Specialization
- DATA: Data Analytics
  - HPI-DATA-K Concepts and Methods
  - HPI-DATA-T Technologies and Tools
  - HPI-DATA-S Specialization
- CODS: Complex Data Systems
  - HPI-CODS-K Concepts and Methods
  - HPI-CODS-T Technologies and Tools
  - HPI-CODS-S Specialization
- CYAD: Cyber Attack and Defense
  - HPI-CYAD-K Concepts and Methods
  - HPI-CYAD-T Technologies and Tools
  - HPI-CYAD-S Specialization
- SECA: Security Analytics
  - HPI-SECA-K Concepts and Methods
  - HPI-SECA-T Technologies and Tools
  - HPI-SECA-S Specialization
- APAD: Acquisition, Processing and Analysis of Health Data
  - HPI-APAD-T Technologies and Tools
  - HPI-APAD-C Concepts and Methods
  - HPI-APAD-S Specialization
Description
Deep learning (DL) is a powerful class of machine learning methods. DL methods are typically applied to high-dimensional data with many training examples in order to perform a single prediction task accurately. Given sufficient training data, these models can also be used to simulate realistic-looking real-world data that arise from complex, interconnected events. One example of such an involved multi-step process is the analysis of proteins via mass spectrometry (MS). Modern healthcare applications increasingly use MS to analyze the proteins present in a cell in health and disease. Unfortunately, MS instruments are expensive, and tedious method development in the wet lab is needed to optimize and fine-tune the algorithms that researchers have developed to turn raw data into biological information.
The goal of this practical seminar is to leverage modern machine learning algorithms together with the ever-increasing availability of data (e.g. 310.77 TB in the public database MassIVE-KB) to simulate a million-dollar instrument. Instead of measuring real samples, we want to simulate realistic raw data for any given set of input proteins. In particular, we will build and apply models that can handle large amounts of protein sequence data in order to simulate the steps commonly performed experimentally in an MS analysis. The complex biochemical and physical properties of biomolecules cause them to behave differently in the mass spectrometer, and capturing these properties is necessary to deliver realistic-looking raw data. Deep learning offers desirable properties here, e.g. end-to-end learning, learning from millions of observations, and defining multiple prediction tasks on the same input. With the ability to simulate realistic-looking raw data, we will be able to deliver a tool for software developers and MS practitioners. Such a tool has many benefits, for example faster turnaround times in algorithm development, ground-truth datasets for complex benchmarks, and optimization of instrument-specific parameters. The challenges are manifold: handling large amounts of data, encoding sequence (protein) data, defining the prediction tasks, building and adapting a suitable network architecture / machine learning model, and finally bringing it all together.
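One of the challenges named above, encoding sequence (protein) data, can be illustrated with a minimal sketch. It assumes a one-hot encoding over the 20 standard amino acids with zero-padding to a fixed length; the function name `one_hot_encode`, the parameter `max_len`, and the choice of one-hot encoding itself are illustrative assumptions, not the encoding prescribed by the course.

```python
# Illustrative sketch only: one possible way to encode a peptide sequence
# as numeric model input. Names and the encoding scheme are assumptions.

# The 20 standard amino acids in one-letter code; the order defines the columns.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(peptide, max_len):
    """Return a max_len x 20 matrix (list of lists) for one peptide.

    Rows beyond the peptide length stay all-zero (padding).
    """
    if len(peptide) > max_len:
        raise ValueError(f"peptide longer than max_len={max_len}: {peptide}")
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for position, residue in enumerate(peptide):
        if residue not in AA_INDEX:
            raise ValueError(f"unknown residue {residue!r} in {peptide}")
        matrix[position][AA_INDEX[residue]] = 1
    return matrix


encoded = one_hot_encode("PEPTIDE", max_len=10)
print(len(encoded), len(encoded[0]))  # matrix shape: rows x columns
```

A fixed-size matrix like this can be batched and fed to a sequence model; modified residues or learned embeddings would need a larger alphabet or a different scheme.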
The practical seminar focuses on the development of a usable software product. To achieve this goal, participants will write code collaboratively, review each other's code, and discuss features throughout the semester. While the overall goal is clear, we want to engage in scientific discussions and follow the participants' ideas for realizing the project.
Learning Objectives:
- Ability to organize large amounts of data
- Ability to dissect complex tasks into manageable sub-tasks
- Ability to critically plan and implement modern machine / deep learning models
- Ability to statistically analyze results
- Understanding of concepts in protein analysis
Prerequisites
- Basic programming knowledge in Python or R, or strong skills in another programming language
- Knowledge of deep learning / machine learning
- Knowledge of good practices in software design
- Knowledge of English (The lecture will be given in English, but you can ask questions in German and submit German solutions etc.)
- Fundamental knowledge of biology / chemistry is beneficial but NOT required
Literature
- Bouwmeester, R., Gabriels, R., Bossche, T. Van Den, Martens, L., & Degroeve, S. (2020). The Age of Data-Driven Proteomics: How Machine Learning Enables Novel Workflows. PROTEOMICS, 74(20), 1900351. https://doi.org/10.1002/pmic.201900351
- Zhou, X. X., Zeng, W. F., Chi, H., Luo, C., Liu, C., Zhan, J., He, S. M., & Zhang, Z. (2017). PDeep: Predicting MS/MS Spectra of Peptides with Deep Learning. Analytical Chemistry, 89(23), 12690–12697. https://doi.org/10.1021/acs.analchem.7b02566
- Wen, B., Zeng, W. F., Liao, Y., Shi, Z., Savage, S. R., Jiang, W., & Zhang, B. (2020). Deep Learning in Proteomics. In Proteomics. https://doi.org/10.1002/pmic.201900335
- MassIVE repository: https://massive.ucsd.edu/ProteoSAFe/static/massive.jsp
Teaching and Learning Formats
Lectures and meetings will be held via Zoom, with visual aids where appropriate, or in person.
Call-in details will be provided in due time. Regular, fixed meetings will be organized to discuss project progress. Additional ad hoc meetings are available upon request.
Depending on the COVID-19 situation and the students' preferences, the course will be offered online or on site.
Please register via Moodle (SimDeep) by April 12th.
Assessment
The students will implement a functional, well-documented software package and participate in code reviews and progress report meetings. The final product will be described and evaluated in a publication-like paper.
The final grade will be derived by weighting the following deliverables:
- Introduction and progress report (30%)
- Code Quality (20%)
- Final ‘publication-ready’ report (50%)
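As a worked illustration of the weighting, the sketch below combines three per-deliverable grades into one weighted average. The grade values and the names `WEIGHTS` and `final_grade` are hypothetical; how the result is rounded to an official grade step is not specified here and is left out.

```python
# Illustrative sketch: weighted average of the three deliverables.
# The weights come from the list above; the example grades are made up.
WEIGHTS = {
    "progress_report": 0.30,  # introduction and progress report
    "code_quality": 0.20,
    "final_report": 0.50,
}


def final_grade(grades):
    """Return the weighted average of one grade per deliverable."""
    if set(grades) != set(WEIGHTS):
        raise ValueError("expected exactly one grade per deliverable")
    return sum(WEIGHTS[name] * grades[name] for name in WEIGHTS)


example = {"progress_report": 1.7, "code_quality": 2.0, "final_report": 1.3}
print(round(final_grade(example), 2))  # 0.3*1.7 + 0.2*2.0 + 0.5*1.3
```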
Schedule
Thursday, 11:00 - 12:30. Other times can also be discussed with the participants during the kick-off meeting.