Expanding the Simulation of a million-dollar instrument - with deep learning from molecules to data (Wintersemester 2021/2022)
Lecturers: Dr. Katharina Baum (Data Analytics and Computational Statistics), Dr. Sven Giese (Data Analytics and Computational Statistics)
Course Website: https://moodle.hpi.de/course/view.php?id=223
General Information
- Weekly Hours: 4
- Credits: 6
- Graded: yes
- Enrolment Deadline: 01.10.2021 - 22.10.2021
- Teaching Form: Seminar
- Enrolment Type: Compulsory Elective Module
- Course Language: English
- Maximum number of participants: 8
Programs, Module Groups & Modules
- OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-K Konzepte und Methoden
- OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-T Techniken und Werkzeuge
- OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-S Spezialisierung
- SAMT: Software Architecture & Modeling Technology
- HPI-SAMT-K Konzepte und Methoden
- SAMT: Software Architecture & Modeling Technology
- HPI-SAMT-S Spezialisierung
- SAMT: Software Architecture & Modeling Technology
- HPI-SAMT-T Techniken und Werkzeuge
- DATA: Data Analytics
- HPI-DATA-K Konzepte und Methoden
- DATA: Data Analytics
- HPI-DATA-T Techniken und Werkzeuge
- DATA: Data Analytics
- HPI-DATA-S Spezialisierung
- PREP: Data Preparation
- HPI-PREP-T Techniken und Werkzeuge
- PREP: Data Preparation
- HPI-PREP-S Spezialisierung
- PREP: Data Preparation
- HPI-PREP-K Konzepte und Methoden
- APAD: Acquisition, Processing and Analysis of Health Data
- HPI-APAD-C Concepts and Methods
- APAD: Acquisition, Processing and Analysis of Health Data
- HPI-APAD-T Technologies and Tools
- APAD: Acquisition, Processing and Analysis of Health Data
- HPI-APAD-S Specialization
- SCAD: Scalable Computing and Algorithms for Digital Health
- HPI-SCAD-C Concepts and Methods
- SCAD: Scalable Computing and Algorithms for Digital Health
- HPI-SCAD-T Technologies and Tools
- SCAD: Scalable Computing and Algorithms for Digital Health
- HPI-SCAD-S Specialization
- SECA: Security Analytics
- HPI-SECA-K Konzepte und Methoden
- SECA: Security Analytics
- HPI-SECA-T Techniken und Werkzeuge
- SECA: Security Analytics
- HPI-SECA-S Spezialisierung
Description
Deep learning (DL) is a powerful class of machine learning algorithms. Typically, DL methods are applied to high-dimensional data with many training examples in order to perform a single prediction task accurately. With sufficient training data, these models can also be used to simulate realistic-looking real-world data that arise from complex, interconnected events. An example of such an involved multi-step process is the analysis of proteins via mass spectrometry (MS). Modern healthcare applications have turned to MS to analyze the proteins present in a cell in health and disease. Unfortunately, MS instruments are expensive, and tedious method development in the wet lab is necessary to optimize and fine-tune the existing algorithms that researchers have developed to turn raw data into biological information.
Overall, the goal of this praxis seminar is to leverage modern machine learning algorithms together with the ever-increasing availability of data (e.g. 310.77 TB in the public database MassIVE-KB) to simulate a million-dollar instrument. Instead of measuring real samples, we want to simulate realistic raw data for any given set of input proteins. In particular, we will build and apply models that can handle large amounts of sequence data (proteins) to simulate steps that are commonly performed experimentally in MS analysis. The complex biochemical and physical properties of biomolecules cause them to behave differently in the mass spectrometer. Uncovering these properties is necessary to deliver realistic-looking raw data. Deep learning offers desirable properties for this purpose, e.g. end-to-end learning, learning from millions of observations, and defining multiple prediction tasks based on the same input. With the ability to simulate realistic-looking raw data, we will be able to deliver a tool for software developers and MS practitioners. Such a tool has many benefits, for example faster turnaround times in algorithm development, ground truth datasets for complex benchmarks, and optimization of instrument-specific parameters. The challenges are manifold: handling large amounts of data, encoding sequence (protein) data, defining the prediction tasks, building and adapting a suitable network architecture / machine learning model, and finally bringing it all together in a well-documented software package.
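To make the "encoding sequence (protein) data" challenge concrete, here is a minimal sketch of one common option, one-hot encoding with zero padding. It is an illustration only and is not taken from the millipede package: the function name `one_hot_encode`, the `max_length` parameter, and the restriction to the 20 standard amino acids are assumptions.

```python
# Illustrative sketch only; names and defaults are hypothetical.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids, one-letter codes
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(peptide, max_length=30):
    """Return a max_length x 20 matrix (list of lists) for a peptide string.

    Rows beyond the end of the peptide stay all-zero (padding), so that
    peptides of different lengths share one fixed input shape.
    """
    if len(peptide) > max_length:
        raise ValueError(f"peptide longer than max_length={max_length}")
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_length)]
    for position, residue in enumerate(peptide):
        if residue not in AA_INDEX:
            raise ValueError(f"unknown residue {residue!r} at position {position}")
        matrix[position][AA_INDEX[residue]] = 1
    return matrix


encoded = one_hot_encode("PEPTIDE", max_length=10)
print(len(encoded), len(encoded[0]))  # 10 20
```

A fixed-size matrix like this can be fed to a sequence model; learned embeddings are a frequent alternative when the vocabulary also has to cover modified residues.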
During the summer term 2020 (the first run of this praxis seminar), the participants implemented a Python package named “millipede”, which uses the popular TensorFlow package for deep learning applications. This semester, we want to focus on expanding the supported prediction tasks and increasing the complexity of the implemented simulation. In addition, we want to establish common design and software architecture patterns for the prediction tasks. The course will bring together software engineering and practical deep learning in a life science context.
The praxis seminar focuses on the development of a usable software product. To achieve this goal, collaborative code writing, code reviews, and discussions about features will take place throughout the semester. While the overall scientific goal is to produce a publishable software package, we want to engage in scientific discussions and follow the participants' ideas for the realization of the project. We will follow an agile development cycle and split tasks among small groups of students.
Learning Objectives:
- Ability to organize large amounts of data
- Ability to dissect complex tasks into manageable sub-tasks
- Ability to critically plan and implement modern machine / deep learning models
- Ability to statistically analyze results
- Understanding of concepts in protein analysis
Requirements
- Basic programming knowledge in Python or R, or strong skills in another programming language
- Knowledge of deep learning / machine learning
- Knowledge of good practices in software design
- Knowledge of English
- Fundamental knowledge of biology / chemistry is beneficial but NOT required
Literature
- Bouwmeester, R., Gabriels, R., Bossche, T. Van Den, Martens, L., & Degroeve, S. (2020). The Age of Data-Driven Proteomics: How Machine Learning Enables Novel Workflows. PROTEOMICS, 74(20), 1900351. https://doi.org/10.1002/pmic.201900351
- Zhou, X. X., Zeng, W. F., Chi, H., Luo, C., Liu, C., Zhan, J., He, S. M., & Zhang, Z. (2017). PDeep: Predicting MS/MS Spectra of Peptides with Deep Learning. Analytical Chemistry, 89(23), 12690–12697. https://doi.org/10.1021/acs.analchem.7b02566
- Wen, B., Zeng, W. F., Liao, Y., Shi, Z., Savage, S. R., Jiang, W., & Zhang, B. (2020). Deep Learning in Proteomics. In Proteomics. https://doi.org/10.1002/pmic.201900335
- MassIVE repository: https://massive.ucsd.edu/ProteoSAFe/static/massive.jsp
Learning
Lectures and meetings will be held via Zoom with visual aids or, when appropriate, in person. Hybrid formats may also be viable for selected lectures / meetings.
Call-in details will be provided in time. Regular, fixed meetings will be organized to discuss project progress. Additional ad hoc meetings are available upon request.
Depending on the COVID-19 situation and the preferences of the students, the course will be offered online or on site.
Please register by enrolling in the corresponding Moodle course: moodle.hpi.de/course/view.php
Examination
The students will implement a functional, well-documented software package and participate in code reviews and progress report meetings. The final product will be described and evaluated in a publication-like paper.
The final grade will be derived from the following weighting:
- Introduction (20%) and progress presentations (30%)
- Final ‘publication-ready’ report (50%)
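As a worked illustration of the weighting above, the final grade can be read as a weighted average of the three components. The weights come from the list; the component grades in the example and the function name `final_grade` are hypothetical, and any official rounding rule is not stated here.

```python
# Weights as listed above; everything else is an illustrative assumption.
WEIGHTS = {"introduction": 0.20, "progress": 0.30, "report": 0.50}


def final_grade(grades):
    """Combine component grades into one weighted average."""
    if set(grades) != set(WEIGHTS):
        raise ValueError("grades must cover exactly: " + ", ".join(WEIGHTS))
    return sum(WEIGHTS[name] * grades[name] for name in WEIGHTS)


# Hypothetical component grades on a 1.0 (best) to 5.0 scale.
example = {"introduction": 1.3, "progress": 2.0, "report": 1.7}
print(round(final_grade(example), 2))  # 1.71
```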
The first assessment (presentation) will take place on 02.03.2022.
Dates
Kick-off Meeting: Monday, 25.10.2021 (13:30 - 15:00). Either in A1.1 (currently preferred) or online.
The format will be adapted based on personal communication.
The course takes place in room F.E.-06.
The first assessment (presentation) will take place on 02.03.2022 (opt-out by 24.02.2022).