Online refinement learning of peptide property prediction via DLOmix in Oktoberfest
Topic
We are building a variety of domain-specific frameworks for proteomics (Figures 1-3), among them:
- DLOmix, a deep learning framework that facilitates building, training, and using deep learning models,
- Koina, a decentralized framework for hosting and serving (DLOmix) models, and
- Oktoberfest, a framework that utilizes models (via Koina) to re-assess proteomics data.
Students participating in the practical course will have the chance to contribute to this ecosystem of tools in a way that matches their background, skills, and interests. The project work will span multiple areas with varying focus, including but not limited to training, serving, and utilizing models, exploring state-of-the-art model implementations, experimenting with machine learning tasks in proteomics, implementing/extending features of our frameworks, and testing/benchmarking.
Aim
Students will be offered a larger topic that can be split into sub-projects matching their preferences and the skills they bring to the practical course. Overall, the project aims to extend the functionality of Oktoberfest to allow online refinement (or transfer) learning of pre-trained models, subsequent local hosting, and the utilization of such models in Oktoberfest.
Online training dataset generation in Oktoberfest: Paramount to machine learning is the existence of high-quality training data. Oktoberfest's processing pipeline should be extended to allow the generation of relevant training data in DLOmix format for subsequent model training. This should be done in alignment with our previous efforts to generate reusable training data (see PROSPECT).
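As a rough sketch of what the export step could look like, the snippet below writes high-confidence PSMs to a Parquet file; the column names, the q-value cutoff, and the function name are assumptions loosely modeled on PROSPECT-style data and must be aligned with the schema DLOmix actually expects:

```python
import pandas as pd

def export_training_data(psms: pd.DataFrame, out_path: str) -> None:
    """Write confidently identified, annotated PSMs as a refinement-training dataset.

    `psms` is assumed to already contain annotated fragment intensities,
    e.g. produced by Oktoberfest's spectrum annotation step.
    """
    columns = [
        "modified_sequence",   # peptide sequence including modifications
        "precursor_charge",    # integer charge state
        "collision_energy",    # (normalized) collision energy
        "intensities_raw",     # annotated fragment ion intensities
    ]
    high_confidence = psms.loc[psms["q_value"] < 0.01, columns]  # keep confident PSMs only
    high_confidence.to_parquet(out_path, index=False)            # Parquet keeps the data reusable

# export_training_data(annotated_psms, "refinement_train.parquet")
```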
Efficient refinement (or transfer) learning in DLOmix: DLOmix already allows the training of models from scratch using existing annotated data. For refinement (or transfer) learning, DLOmix needs to load a pre-trained model's weight files and resume training using the annotated data provided by Oktoberfest. This may include the implementation of cross-validation, or alternative approaches such as test-time training, to avoid substantial overfitting.
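The refinement step itself could follow the plain TensorFlow/Keras pattern sketched below; which layers to freeze, the learning rate, the loss, and the file path are placeholders, and in practice DLOmix's own model classes and losses would be used:

```python
import tensorflow as tf

def refine_model(pretrained_path: str,
                 train_ds: tf.data.Dataset,
                 val_ds: tf.data.Dataset) -> tf.keras.Model:
    """Continue training a pre-trained intensity model on newly annotated data."""
    model = tf.keras.models.load_model(pretrained_path)  # load the prior weights

    # Optionally freeze the encoder and only adapt the final layers;
    # how much to freeze is one of the design choices to explore.
    for layer in model.layers[:-2]:
        layer.trainable = False

    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="mean_squared_error")  # placeholder loss

    # Early stopping on a held-out split guards against overfitting;
    # cross-validation or test-time training are the alternatives mentioned above.
    stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                            patience=3,
                                            restore_best_weights=True)
    model.fit(train_ds, validation_data=val_ds, epochs=20, callbacks=[stop])
    return model
```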
Local DLOmix model hosting: Oktoberfest currently relies on Koina to serve predictions from models. With real-time refinement training, serving the newly trained models is likely most effective when done locally. The aim here is to implement a method that allows Oktoberfest to retrieve predictions from the locally generated model, rather than relying on the external Koina service.
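A local stand-in for the remote prediction call could be as small as the wrapper below; the class name, file path, and input format are hypothetical, and the input encoding would reuse DLOmix's preprocessing:

```python
import numpy as np
import tensorflow as tf

class LocalPredictor:
    """Minimal local replacement for the Koina client: serve predictions
    from a locally refined Keras model instead of a remote endpoint."""

    def __init__(self, model_path: str):
        self.model = tf.keras.models.load_model(model_path)

    def predict(self, batch: dict) -> np.ndarray:
        # `batch` is assumed to be a dict of already-encoded model inputs.
        return self.model.predict(batch, verbose=0)

# predictor = LocalPredictor("refined_model.keras")
# intensities = predictor.predict(encoded_peptides)
```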
Performance assessment: After implementing the full workflow, the process needs to be tested and evaluated. The aim is to locate and process relevant data with the newly developed pipeline and to show, in the best case, improved performance with the adjusted workflow.
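One common metric for this comparison is the normalized spectral angle between predicted and observed fragment intensities; a minimal NumPy version is sketched below (variable names are illustrative):

```python
import numpy as np

def spectral_angle(pred: np.ndarray, obs: np.ndarray) -> np.ndarray:
    """Normalized spectral angle between intensity vectors
    (1 = identical, 0 = orthogonal)."""
    pred = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + 1e-9)
    obs = obs / (np.linalg.norm(obs, axis=-1, keepdims=True) + 1e-9)
    cosine = np.clip(np.sum(pred * obs, axis=-1), -1.0, 1.0)
    return 1.0 - 2.0 * np.arccos(cosine) / np.pi

# Compare the base model against the refined model on the same spectra:
# print(spectral_angle(base_preds, observed).mean(),
#       spectral_angle(refined_preds, observed).mean())
```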
Optional tasks: If interested and time allows, there are a number of further extensions that would fit naturally with the proposed functionality. First, in order to decide whether, and on which subset of the data, refinement training should be applied, out-of-distribution detection may be necessary and can be implemented in DLOmix. Second, since models may not perform equally well on all subsets of the provided data, the integration of confidence prediction might be relevant to further boost performance. In a prior advanced practical course, substantial progress was made on models that estimate prediction confidence; this could be integrated into the scoring process. Further, ongoing work in our group may provide additional methods for efficiently transferring models to new peptide classes. These could be integrated into the process as well.
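As one illustrative (not prescribed) direction for the confidence-prediction extension, Monte Carlo dropout gives a rough uncertainty estimate, assuming the model contains dropout layers; this is only a sketch and not the approach developed in the prior course:

```python
import numpy as np
import tensorflow as tf

def mc_dropout_predict(model: tf.keras.Model, inputs, n_samples: int = 20):
    """Run the model with dropout kept active and use the spread of the
    predictions as a simple confidence proxy."""
    draws = np.stack([model(inputs, training=True).numpy()  # dropout stays on
                      for _ in range(n_samples)])
    return draws.mean(axis=0), draws.std(axis=0)  # mean prediction, uncertainty

# mean_pred, uncertainty = mc_dropout_predict(refined_model, encoded_batch)
```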
General Schedule
Phase 1: Methods, Tools, Techniques
The first phase consists of a series of seminars in which you will learn the basics of various topics necessary for the project. The seminars will be a mix of presentations by team members of our research group, practical sessions where applicable, and short presentations prepared by the participants:
Kickoff Seminar - In the first seminar, we will discuss the organizational aspects and provide you with insights into the project. We will give a high-level overview of our workflows to enable you to select your focus area, scope the project work, and later on deliver a successful project.
Machine Learning - This lecture will cover the machine learning techniques we apply, mainly deep learning, refinement (and transfer) learning, and model interpretation. It will include an introduction to DLOmix. You will get insights into our current architecture, the different possibilities for extensions, access to our infrastructure, and how we use it.
Utilizing models - This session will discuss in more detail how machine learning models are used in proteomics research. Specifically, this lecture covers Oktoberfest, its processing steps, structure, and purpose. Afterwards, you will understand where and how to integrate the DLOmix extensions into Oktoberfest.
Experimentation best practices - This will be organized as a block session covering experimentation best practices from our experience with deep learning and software development, touching on topics such as implementing modules, testing individual components, iterating fast, etc.
Additional Topics - In case you want to gain deeper knowledge, we are open to holding an additional seminar on a topic of your choice. You decide.
Phase 2: Research project planning
In the second phase, we want you to prepare a detailed project plan. At the end of this phase, you will present your plan and discuss it with us. We will assist you during the planning of your project and provide you with feedback to ensure that you are able to bring your project to success. Most importantly, you should discuss the following points:
Focus Areas Selection You, as a group, will select and agree on a strategy to integrate the new functionality. Based on that, we will provide you with resources and alternatives to consider for scoping and planning your project.
Requirement analysis You, as a group, will have to discuss and plan which of the research questions proposed by us (or by you) will be covered and by whom. This will include a requirement analysis of the extensions to the framework and the features to implement.
Organization Projects work best if they are well organized. In your plan, we expect to see clear milestones, a time plan, and a reasonable task distribution among individual team members. If you like, you could use tools like Monday.com to organize your work. Communication in our lab is conducted via Slack.
Frameworks and Languages We use the common Python stack for numerical computing (NumPy, pandas) along with TensorFlow/Keras for deep learning. In selected cases, we also use R for plotting and/or analyses.
Phase 3: Implementation and Research
This is the main phase of your project. According to your plan, you will implement, integrate, and test your work with our existing architecture. We will hold weekly meetings to discuss your progress.
Semester Work - We would like you to work on the project during the semester. You do not need to come in person all the time, although we provide a large room where you can work. However, we encourage you to come to our research group. If you are on-site, we will be able to assist and guide you more directly. Virtually, we can also offer office hours for addressing questions.
Full Time Block - Depending on how you plan your project and on your progress, we can have a two- to three-week intensive block during or at the end of the semester. The specific time and requirements will be discussed with you.
Submission - At the end of this project, you will need to provide deliverables that include research questions answered, features implemented, code documentation, and tests. Ideally, these will evolve as your implementation progresses during the semester. We expect you to write a report and give a presentation of your work in our research group.
Skills Gained
By participating in this project, you will improve your programming skills, gain knowledge in data analysis for proteomics, and gain experience in deep learning. You will also gain a solid basis in software engineering and development that will be very helpful for your career, both in academia and in industry.
Organisation
Programming language Python
Must-have skills
- Intermediate programming skills in Python
- Basic understanding of machine learning concepts
- Interest in applied research and writing good quality code
- Interest in life sciences (or specifically proteomics)
Good-to-have skills
- Knowledge of good software engineering practices
- Knowledge of common and state-of-the-art deep learning architectures (Transformers, RNNs, CNNs)
- Knowledge of TensorFlow/Keras and PyTorch
- Basic skills in Git and GitHub
Supervisors
- Mario Picciani - primary - Oktoberfest specialist
- Joel Lapin - primary - Deep learning specialist
- Wassim Gabriel - secondary - Prosit and DLOmix specialist
- Omar Shouman - secondary - DLOmix specialist
- Mathias Wilhelm - secondary - Overall guidance
Grading Presentation [max. 30 minutes for the whole team] and report [~10 pages including everything] – everyone must make some contribution.
Team size This project is designed for a team of 4-6 people.
Submission Git branch/commit/pull request and documentation; report including a detailed use case.
Location/Rooms We are very flexible with regard to time slots and location. As of right now, we expect phases 1-3 to take place either in Freising/Weihenstephan or via Zoom (the latter is not preferred). Lectures are preferably held in the afternoon (tentatively Monday 16-18), and project meetings preferably in the same slot; however, we are flexible with respect to the day and will decide with your input. We also have a student room that you can use, but you may also work from home if you prefer. We would like to welcome you at our institute. The full-time block will take place, if required, in Freising/Weihenstephan; the exact weeks will be decided once we know whether it is required. Online participation is possible for all parts except the full-time block.
Material
All materials are made available in TUM Moodle.
Literature
GitHub repos of relevance:
- DLOmix framework *
- DLOmix resource with Colab examples *
- PROSPECT training data
- Koina serving predictions
- Oktoberfest rescoring
Relevant supported models and training data:
- Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning
- PROSPECT: Labeled Tandem Mass Spectrometry Dataset for Machine Learning in Proteomics *
Serving and utilizing models:
- Koina: a community driven endeavour to make AI models for life science research accessible
- Oktoberfest: Open-source spectral library generation and rescoring pipeline based on Prosit *
Entries marked with * are the key references.