Advanced Practical Course Bioinformatics - 2022

Modeling cell viability with expression data

Topic

The large majority of drugs (medicines) used to date act on proteins. However, particularly in cancer, different cohorts of patients respond differently to first-line treatments. While personalized treatments promise to circumvent current shortcomings of standardized treatments, tools for suggesting personalized treatments are scarce. Over the last decade, we collected hundreds of proteomes from cancer cell lines in ProteomicsDB, alongside their response to drug treatments in the form of viability data. In previous research, we have successfully combined the two data streams using elastic nets to gain insight into potential sensitivity and resistance markers. This information can be used to assist personalized treatment suggestions.

Aim

The aim of the project is to develop an online application that allows us and external users to study sensitivity and resistance markers discovered by e.g. elastic net models. For this purpose, participants of this course will develop a pipeline that retrieves data from ProteomicsDB, applies machine learning (e.g. elastic net), and stores the resulting models in ProteomicsDB. For result visualization, a web-based user interface has to be developed that allows us and others to explore the trained models and their results. The application (pipeline and interface) will be made available on ProteomicsDB and thus may also allow users to perform viability prediction on custom data. We want to answer the following research questions:

Which proteins are markers for drug resistance/sensitivity? We want to systematically train machine learning models (i.e. 100 bootstrap elastic net models) for each drug on the data that we have available in ProteomicsDB (100s-1000s). The findings need to be visualized in a UI. Additionally, we want to create an interface for users to explore individual markers for a given drug. Furthermore, we expect you to explore and quantify the effect of missing values on the drug response prediction.
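
To make the idea of bootstrap elastic net models more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the pipeline used in ProteomicsDB: the coordinate-descent solver, the function names, the hyperparameter values and the synthetic data are all hypothetical, and in practice you would likely rely on an established library. The sketch fits one elastic net per bootstrap resample and reports how often each protein receives a non-zero coefficient.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold (the L1 part of the penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, alpha=0.1, l1_ratio=0.5, n_iter=100):
    """Fit elastic net coefficients by cyclic coordinate descent.

    X is a list of samples (rows), y a list of responses. Both are centered
    internally, so no intercept is returned. Hypothetical helper.
    """
    n = len(X)
    y_mean = sum(y) / n
    # Centered feature columns.
    cols = []
    for col in zip(*X):
        col_mean = sum(col) / n
        cols.append([v - col_mean for v in col])
    weights = [0.0] * len(cols)
    residual = [v - y_mean for v in y]  # y - Xw for w = 0
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # constant column in this resample
            rho = sum(c * (r + c * weights[j]) for c, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
            delta = new_w - weights[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                weights[j] = new_w
    return weights


def bootstrap_selection_frequency(X, y, n_models=100, seed=0, **kwargs):
    """Fraction of bootstrap models in which each feature has a non-zero weight."""
    rng = random.Random(seed)
    n = len(X)
    counts = [0] * len(X[0])
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        weights = fit_elastic_net([X[i] for i in idx], [y[i] for i in idx], **kwargs)
        for j, w in enumerate(weights):
            if w != 0.0:
                counts[j] += 1
    return [c / n_models for c in counts]


# Synthetic toy data (not real measurements): only feature 0 drives the response.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(30)]
y = [2.0 * row[0] + data_rng.gauss(0, 0.1) for row in X]
freq = bootstrap_selection_frequency(X, y, n_models=100)
print(freq)
```

Proteins that are selected in most resamples are more stable marker candidates than those selected only occasionally. How to set the selection threshold, and how missing values are handled before fitting, are design decisions left to the project.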

How do markers behave for similar drugs? Your task will be to explore similarities and differences between drug response predictions for classes of similar drugs. Given drugs that target the same or similar pathways, we expect to see similar resistance and sensitivity markers. To answer this question, you will also need to differentiate between markers that are drug targets and those that show an indirect effect.
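
One simple way to quantify marker overlap between drugs is the Jaccard index of their marker sets. The sketch below is only one possible approach; the drug names, protein names and target annotations are placeholders invented for illustration, and the helper functions are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two marker sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def split_by_target(marker_set, target_set):
    """Separate markers that are annotated drug targets from indirect markers."""
    direct = set(marker_set) & set(target_set)
    indirect = set(marker_set) - direct
    return direct, indirect


# Placeholder data, not real results.
markers = {
    "drug_A": {"protein_1", "protein_2", "protein_3"},
    "drug_B": {"protein_2", "protein_3", "protein_4"},
    "drug_C": {"protein_9"},
}
targets = {"drug_A": {"protein_1"}}  # hypothetical target annotation

# Pairwise similarity between all drugs, keyed by sorted drug-name pairs.
similarity = {
    (d1, d2): jaccard(markers[d1], markers[d2])
    for d1, d2 in combinations(sorted(markers), 2)
}
direct, indirect = split_by_target(markers["drug_A"], targets["drug_A"])
print(similarity)
print(direct, indirect)
```

Drugs acting on the same pathway would be expected to score higher than unrelated pairs. A set-based measure ignores coefficient sign and magnitude, so a correlation of coefficient vectors may be a useful complement.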

How does my cell line respond to a drug? We want to offer the possibility to perform predictions using data from individual users. For this purpose, you will extend the pipeline and user interface. Data submission to ProteomicsDB is already available. In this context, we are also interested in which drug targets are responsible for cells dying.
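
As a rough sketch of how a prediction for a user-supplied cell line could work, the example below averages the predictions of several linear models (for instance, the bootstrap models for one drug) and reports their spread. The model representation, the function name and the choice to fill unmeasured proteins with a fixed value are assumptions made for illustration; all numbers are placeholders.

```python
from statistics import mean, stdev


def predict_viability(models, sample, fill_value=0.0):
    """Average the predictions of an ensemble of linear models.

    models: list of (intercept, {protein: coefficient}) tuples (assumed format).
    sample: {protein: expression} for one user-supplied cell line.
    fill_value: value used for proteins missing from the sample (an assumption;
    the real pipeline may impute differently).
    """
    predictions = []
    for intercept, coefficients in models:
        prediction = intercept + sum(
            coef * sample.get(protein, fill_value)
            for protein, coef in coefficients.items()
        )
        predictions.append(prediction)
    spread = stdev(predictions) if len(predictions) > 1 else 0.0
    return mean(predictions), spread


# Placeholder ensemble and sample, not real models or measurements.
models = [
    (0.5, {"protein_1": -0.2, "protein_2": 0.1}),
    (0.6, {"protein_1": -0.3}),
    (0.4, {"protein_1": -0.1, "protein_3": 0.05}),
]
sample = {"protein_1": 1.0, "protein_2": 2.0}
estimate, spread = predict_viability(models, sample)
print(estimate, spread)
```

The spread across ensemble members gives a first impression of how stable the prediction is, which may be worth showing in the user interface next to the estimate itself.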

Can we generalize the approach (optional)? Can we extend the pipeline developed in the first research question to allow model training/fitting for other data sources? Do the predictions work with data other than expression? We anticipate reusing the pipeline for e.g. protein-drug binding affinity or transcriptomics data.

General Schedule

Phase 1: Methods, Tools, Techniques

The first project phase consists of a series of seminars in which you will learn the basics of the topics covered in this project. The seminars will include lecture-style presentations by us, practical sessions where applicable, and short presentations prepared by the students:

Kickoff Seminar - In the first seminar we will discuss organizational matters and give you insight into the project. We will show you where our data (proteomics and cell viability) comes from and how to interpret and work with it.

DB + SAP HANA - The second part will deal with how you get access to ProteomicsDB. You will get insight into our infrastructure and how we use SAP HANA. This will include a recap on SQL and databases.

Machine Learning - The third part deals with the machine learning techniques we apply, mainly elastic nets, bootstrapping, hyperparameter optimization and model interpretation.

UI + Vue.js - The last part deals with our server infrastructure and how you can work with it to create a user interface that researchers can make use of.

Additional Topics - If you want to gain deeper knowledge, we are open to holding an additional seminar on a topic of your choice. You decide.

Phase 2: Research project planning

In the second phase, we want you to prepare a detailed project plan. At the end of this phase, you will present your plan and discuss it with us. We will assist you during the planning of your project and provide feedback to make sure that you are able to bring this project to a success. Most importantly, you should discuss the following points:

Requirement analysis - You as a group will have to discuss and plan exactly which research questions proposed by us (or by you) will be covered and by whom. This will include a requirement analysis of the entire application (machine learning, storage and user interface).

Organization - Projects work best if well organized. You will be required to come up with milestones and issues and show us how you want to distribute individual tasks between team members and over time. We strongly encourage you to use available project management tools to keep track of your work. We can recommend Monday.com for your project management. Communication in our lab is conducted via Slack. You can set up milestones and issues in your GitLab repository. You should also have a look at project management methods such as Waterfall, Agile or Scrum.

Frameworks and Languages - We make use of SAP HANA Studio and MySQL Workbench to manage our DB. Bear in mind that not all researchers are bioinformaticians. A user interface that is easy to use and interpret is most important for the accessibility and applicability of your tool, which is why we provide a frontend written with Vue.js. Familiarize yourself with these tools and decide on the programming language you want to use (R / Python). Discuss how you want to integrate your models with our existing infrastructure.

Research plan - You need to come up with a plan for how you will answer a few research questions using the tools you create. We do have a few questions in mind that we would like to answer (see the 'Aim' section above), but you are also free to come up with your own research questions.

Phase 3: Implementation and Research

This is the main phase of your project. According to your plan, you will implement and deploy your work. We will hold weekly meetings during that time in which you need to present your progress in a short presentation of at most 15 minutes. We are flexible but prefer Monday mornings at 10:00 am (CEST).

Semester Work - We would like you to work on the project during your semester. You do not need to come in person all the time, although we provide a large room where you can work. However, we strongly encourage you to come twice every week for a full day. If you are on site, we will be able to assist and guide you more directly, and questions that arise can be addressed without delay. Ideally, you can implement your models, provide a functioning user interface on our website and answer our primary research questions by the end of that part.

Full Time Block - Depending on how you plan your project and on your progress, we can have a two- to three-week intensive block during or at the end of your semester.

Submission - At the end of this project, you will need to package your work (i.e. as an installable Python / R package) including documentation. Ideally, you do that on the fly during your implementation. We expect you to write a report and give a presentation of your work in our research group as well as to your peers at the final conference at the end of the semester.

Skills Gained

This project requires the full interdisciplinary skill set of a bioinformatician, from understanding different types of data, through building complex models using machine learning techniques, to delivering a finished software product using state-of-the-art web development technologies and industry-leading database solutions. By participating in this project, you will therefore improve your existing programming skills, gain knowledge in data analysis for proteomics and phenomics studies and get in touch with precision oncology. You will also gain a solid basis in software engineering and deployment that will be helpful for your career both in academia and the private sector.

Organisation

Programming Tools Python, R, SQL, JS, Vue.js, (SAP HANA)

Team Size 3-5 Participants

Preferable Skills If you plan to enroll in this practical course, you should have a strong interest in (or may even have previously worked with) one of the following: SQL/SAP HANA, JS/Vue, Unix/Server Administration, Python/Machine Learning.

Supervisors

Ludwig Lautenbacher (M.Sc. MBT)
Mario Picciani (M.Sc. Bioinformatics)

Grading Presentation and small thesis [5-10 pages including everything] – everyone with some contribution

Submission Git repository and documentation; light-weight user manual; report including a detailed use case

Location/Rooms Phases 1-3 will take place either on site in Freising/Weihenstephan or via Zoom. We have a student room that you can use, but you may also work from home if you so desire. We would like to welcome you in our institute so long as you obey the rules imposed by political decisions related to the COVID-19 pandemic.

Material

All materials are made available in TUM Moodle.