Machine and deep learning
Machine learning is the study of algorithms and statistical models that computers use to perform a specific task without explicit instructions. Its goal is to find patterns in data in order to infer the label of an item (classification) or to predict the value of a quantitative property (regression).
Classical machine learning uses methods such as support vector machines and random forests. These methods have a long history in proteomics because they do not depend on huge amounts of data. They have been applied, for example, to distinguish correctly from incorrectly assigned peptide-spectrum matches, to label normal and disease samples, and to predict the outcome of phenotypic assays. For most of our projects, we rely on these basic machine learning tools, as they typically let us learn which factors are important and how we could improve our data or methods.
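To illustrate how a simple model can reveal which factors are important, the following is a minimal sketch using only the Python standard library. It is not taken from any of our projects. The data are synthetic, the feature names ("informative" and "noise") are hypothetical, and a nearest-centroid classifier stands in for a support vector machine or random forest to keep the example short. Importance is estimated by permutation: one feature is shuffled at a time and the drop in held-out accuracy is measured.

```python
import random


def make_data(n_per_class, rng):
    """Synthetic two-class data: feature 0 is informative, feature 1 is noise."""
    rows, labels = [], []
    for label, mean in ((0, 0.0), (1, 3.0)):
        for _ in range(n_per_class):
            rows.append([rng.gauss(mean, 1.0), rng.gauss(0.0, 1.0)])
            labels.append(label)
    return rows, labels


def fit_centroids(rows, labels):
    """Return the mean feature vector of each class (the 'model')."""
    centroids = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        centroids[cls] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def predict(centroids, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(centroids, key=lambda cls: sq_dist(centroids[cls]))


def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, r) == y for r, y in zip(rows, labels))
    return hits / len(labels)


def permutation_importance(centroids, rows, labels, rng):
    """Accuracy drop after shuffling each feature column in turn."""
    baseline = accuracy(centroids, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(baseline - accuracy(centroids, shuffled, labels))
    return baseline, importances


rng = random.Random(0)  # fixed seed so the run is reproducible
train_rows, train_labels = make_data(200, rng)
test_rows, test_labels = make_data(200, rng)

model = fit_centroids(train_rows, train_labels)
baseline, importances = permutation_importance(model, test_rows, test_labels, rng)

print(f"held-out accuracy: {baseline:.2f}")
for name, imp in zip(("informative", "noise"), importances):
    print(f"{name}: accuracy drop {imp:.2f}")
```

Shuffling the informative feature should reduce accuracy far more than shuffling the noise feature. The same permutation idea carries over to random forests and other models, since it only needs predictions on held-out data.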
With proteomics entering the high-throughput era, the amount of available data has grown exponentially and now allows the application of deep learning, specifically to tasks related to the acquired spectra. One of our flagship projects is Prosit (https://www.proteomicsdb.org/prosit/), a deep learning architecture for learning various peptide properties (e.g. fragmentation and retention time). The training data originated from ProteomeTools (http://www.proteometools.org/), a set of high-quality in-house reference datasets with peptide properties acquired from more than 1.3 million synthetic peptides. Prosit was developed using TensorFlow and Keras. Besides Prosit, we are currently investigating the application of deep learning to predict, e.g., phenotypic outcomes of viability assays or the target proteins of protein kinase drugs. For training, we have access to two in-house GPU servers with three GPUs, an external GPU server attached to ProteomicsDB and the openPower provided by the UCC in Garching.
Available Projects:
- Kinase substrate prediction - MA, BA, Internship
- Similarity between cell line and patient proteome profiles - MA