Extending the Proteomics-Specific Deep Learning Framework DLOmix
Topic
We are building a domain-specific deep learning framework for proteomics, DLOmix (Figure 1). The main idea is to facilitate building, training, and using deep learning models and workflows for various proteomics tasks; specifically, tasks related to predicting peptide properties (Figure 2).
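To give a flavor of what such peptide-property models consume, the sketch below encodes peptide sequences as fixed-length integer vectors, a common preprocessing step for sequence models. The amino-acid vocabulary and padding scheme here are illustrative assumptions, not DLOmix's actual implementation.

```python
import numpy as np

# Hypothetical vocabulary over the 20 standard amino acids; 0 is reserved for padding.
AA_ALPHABET = {aa: i + 1 for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}

def encode_peptide(sequence, max_len=30):
    """Map a peptide string to a fixed-length integer vector (zero-padded)."""
    ids = [AA_ALPHABET[aa] for aa in sequence]
    padded = np.zeros(max_len, dtype=np.int32)
    padded[:len(ids)] = ids
    return padded

# A batch of encoded peptides, ready to feed into an embedding layer.
batch = np.stack([encode_peptide(p) for p in ["ACDK", "GILNV"]])
```

Fixed-length integer encoding pairs naturally with an embedding layer as the first stage of a sequence model; the `max_len` of 30 is an arbitrary choice for this sketch.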
Students participating in the practical course will have the chance to select an area of depth that matches their background, skills, and interests. The project work will span multiple areas with varying focus; examples include exploring state-of-the-art model implementations, experimenting with new machine learning tasks in proteomics, implementing or extending features of our framework, and testing and benchmarking.
Aim
Students will be offered a range of topics to choose from based on their preferences and the skills they bring to the practical course. Most topics will span several areas of depth, but the main work package will lie in at most one or two areas. Example topics are listed below:
A new machine learning task in proteomics: the steps here would involve reading literature around the provided task, exploring model architectures, implementing a selected model, training/validating with our internal datasets, and integrating it into the current codebase.
A new evaluation and reporting for existing tasks: the steps here would include exploring literature to understand loss functions and evaluation metrics for tasks already implemented in our framework, re-designing/extending reports for the chosen task(s), and implementing report generation accordingly.
A new model architecture for existing tasks: the workflow would be to identify from literature a state-of-the-art model that was used or can be used on one of the existing tasks, implement the new architecture, train/validate with our internal datasets, and integrate the implementation into the current codebase.
Other ideas are also welcome depending on the participants' interests, including but not limited to deeper error analysis of existing models' performance, extending current models to quantify confidence in their predictions, ideas related to multi-task or few-shot learning, or ideas to reduce model size or training time through distillation, quantization, etc.
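As a minimal sketch for the evaluation-and-reporting topic above, the helpers below compute two metrics commonly reported for retention-time prediction in the literature. The exact Δt95-style window formulation shown here (twice the 95th percentile of absolute error) is one common convention, chosen for illustration; it is not necessarily the definition used in our framework.

```python
import numpy as np

def delta_t95(y_true, y_pred):
    """Width of the window containing 95% of absolute prediction errors.

    One common formulation from the retention-time literature (illustrative
    assumption): twice the 95th percentile of the absolute error.
    """
    err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return 2.0 * float(np.percentile(err, 95))

def pearson_r(y_true, y_pred):
    """Pearson correlation between observed and predicted values."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])
```

Such helpers are a natural unit for the report-generation work package: each metric can be implemented, unit-tested, and then aggregated into a task-specific report.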
General Schedule
Phase 1: Methods, Tools, Techniques
The first phase consists of a series of seminars in which you will learn the basics of various topics necessary for the project. The seminars will be a mix of presentations by team members of our research group, practical sessions where applicable, and short presentations prepared by the participants:
Kickoff Seminar - In the first seminar, we will discuss organizational matters and provide you with insights into the project. We will demonstrate a high-level overview of our workflows to enable you to select your focus area, scope the project work, and later on deliver a successful project.
Machine Learning - The second lecture will cover the machine learning techniques we apply: mainly deep learning, hyperparameter optimization, and model interpretation.
DLOmix - The third lecture will cover DLOmix. You will get insights into our current architecture, the different possibilities for extensions, access to our infrastructure, and how we use it.
Experimentation Best Practices - This will be organized as a block session covering experimentation best practices from our experience with deep learning, touching on topics such as implementing modules, testing individual components, iterating fast, etc.
Additional Topics - If you want deeper knowledge of a particular topic, we are open to holding an additional seminar on a subject of your choice. You decide.
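The "testing individual components" practice above can be sketched with Python's built-in unittest; `normalize` is a made-up toy function standing in for a framework module.

```python
import unittest

def normalize(values):
    """Scale a list of numbers so they sum to 1 (toy module under test)."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize an all-zero input")
    return [v / total for v in values]

class TestNormalize(unittest.TestCase):
    def test_sums_to_one(self):
        self.assertAlmostEqual(sum(normalize([1, 2, 3])), 1.0)

    def test_rejects_zero_input(self):
        with self.assertRaises(ValueError):
            normalize([0, 0])

# Run with: python -m unittest <module_name>
```

Small, fast tests like these make it safe to iterate quickly on individual components before integrating them into a larger pipeline.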
Phase 2: Research project planning
In the second phase, we want you to prepare a detailed project plan. At the end of this phase, you will present your plan and discuss it with us. We will assist you during the planning of your project and provide you with feedback to ensure that you are able to bring your project to a successful conclusion. Most importantly, you should discuss the following points:
Focus Areas Selection You, as a group, will select and agree on a focus area as explained above. Based on that, we will provide you with resources and alternatives to consider for scoping and planning your project.
Requirement analysis You, as a group, will have to discuss and plan which of the research questions proposed by us (or by you) will be covered, and by whom. This will include a requirement analysis of the extensions to the framework and the features to implement.
Organization Projects work best if they are well organized. In your plan, we expect to see clear milestones, a time plan, and a reasonable task distribution among individual team members. If you like, you could use tools like Monday.com to organize your work. Communication in our lab is conducted via Slack.
Frameworks and Languages We use the common Python stack for numerical computing (numpy, pandas) along with TensorFlow/Keras for deep learning. In selected cases, we also use R for plotting and/or analyses.
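To give a flavor of this stack, the snippet below builds a small peptide table with pandas and derives simple features with numpy. The table contents and column names are made up for this example and do not reflect our actual data schema.

```python
import numpy as np
import pandas as pd

# Illustrative peptide table; the columns are invented for this example.
df = pd.DataFrame({
    "sequence": ["ACDK", "GILNV", "PEPTIDE"],
    "retention_time": [12.3, 45.6, 30.1],
})

# Derive a simple feature and a normalized target with the usual stack.
df["length"] = df["sequence"].str.len()
rt = df["retention_time"].to_numpy()
df["rt_norm"] = (rt - rt.mean()) / rt.std()
```

In practice, this kind of pandas/numpy preprocessing typically feeds into a TensorFlow/Keras training pipeline.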
Phase 3: Implementation and Research
This is the main phase of your project. According to your plan, you will implement, integrate, and test your work with our existing architecture. We will hold weekly meetings to discuss your progress.
Semester Work - We would like you to work during the semester. You do not need to come in person all the time, although we provide a large room where you can work. However, we encourage you to come to our research group: if you are on-site, we will be able to assist and guide you more directly. Virtually, we can also offer office hours for addressing questions.
Full Time Block - Depending on how you plan your project and on your progress, we can have a two- to three-week-long intensive block during or at the end of your semester.
Submission - At the end of the project, you will need to provide deliverables that include answered research questions, implemented features, code documentation, and tests. Ideally, these would evolve as your implementation progresses during the semester. We expect you to write a report and give a presentation of your work in our research group, as well as to your peers at the final conference at the end of the semester.
Skills Gained
By participating in this project, you will improve your programming skills, gain knowledge in data analysis for proteomics, and gain experience in deep learning. You will also gain a solid basis in software engineering and development that will be very helpful for your career, both in academia and in industry.
Organisation
Programming language Python
Must-have skills
- Intermediate programming skills in Python
- Basic understanding of machine learning concepts
- Interest in applied research and writing good quality code
- Interest in life sciences (or specifically proteomics)
Good-to-have skills
- Knowledge of good software engineering practices
- Knowledge of common and state-of-the-art deep learning architectures (Transformers, RNNs, CNNs)
- Knowledge of TensorFlow/Keras and PyTorch
- Basic skills in Git and GitHub
Team Size 3-5 Participants
Supervisors
Grading Presentation [15 minutes, whole team] and a short thesis [5-10 pages including everything] – everyone must contribute
Submission Git branch/commit/pull request and documentation; report including a detailed use case
Location/Rooms We are very flexible with regard to time slots and location. As of right now, we expect phases 1-3 to take place either in Freising/Weihenstephan or via Zoom. Lectures are preferably held in the afternoon, project meetings preferably in the evening around 17:00. We are flexible with respect to the day and will decide with your input. We also have a student room that you can use, but you may also work from home if you so desire. We would like to welcome you in our institute. The full time block will take place - if required - in Freising/Weihenstephan. The exact weeks will be decided once we know whether it is required. Online participation is possible for all parts except the full time block.
Material
All materials are made available in TUM Moodle.
Literature
GitHub repos of relevance:
- DLOmix framework *
- DLOmix resource with Colab examples *
- PROSPECT training data
Currently supported models:
- Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning
- Deep learning boosts sensitivity of mass spectrometry-based immunopeptidomics
- DeepLC can predict retention times for peptides that carry as-yet unseen modifications
Training data format:
Other similar publications:
Entries marked with * are the key references.