Statistical methods
Unravelling complex trait architecture using DNA sequencing data
Christian Stricker, Chris-Carolin Schön
Project duration: 01.11.2016 - 31.10.2019
Project partner: Martin Schlather, Universität Mannheim
Funding: Deutsche Forschungsgemeinschaft (DFG)
Project description:
Advances in molecular biology have made massive collections of genomic data available for humans, animals, and plants. These data, together with phenotypic and genealogical information on a large number of individuals, promise new biological insights into the genetic mechanisms controlling complex traits. Despite the constantly increasing size of experimental data sets, one of the most prominent problems in genomic data analysis is that the number of unknown parameters in the statistical models exceeds, often by far, the available sample size (the n << p setting). Various Bayesian linear regression models with proper priors have been suggested in the context of genome-wide prediction of genetic values. Because these priors shrink the coefficients, the models can be applied when the number of covariates or predictors is larger than the number of observations. However, the performance of these methods with respect to inference on model parameters and identification of functional mutations in sequence data is largely unknown.
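To illustrate how a proper prior makes regression feasible when n << p, the following minimal sketch computes the posterior mean of marker effects under a Gaussian prior (ridge-type shrinkage). It is not one of the project's models: the simulated genotypes, the function name posterior_mean_marker_effects, and the fixed variance ratio lam are assumptions made for illustration only.

```python
import numpy as np


def posterior_mean_marker_effects(X, y, lam):
    """Posterior mean of marker effects, assuming beta ~ N(0, sigma_b^2 I)
    and residuals ~ N(0, sigma_e^2 I), with lam = sigma_e^2 / sigma_b^2
    treated as known (an assumption of this sketch).

    Uses the dual form X'(XX' + lam I)^{-1} y, which only requires solving
    an n x n system and is therefore cheap when n << p.
    """
    n = X.shape[0]
    alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
    return X.T @ alpha


# Simulated toy data (hypothetical): 50 individuals, 500 biallelic markers.
rng = np.random.default_rng(1)
n, p = 50, 500
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
X -= X.mean(axis=0)            # center marker codes
beta_true = np.zeros(p)
beta_true[:5] = 1.0            # five markers with a true effect
y = X @ beta_true + rng.normal(size=n)

# A larger lam corresponds to a tighter prior and stronger shrinkage.
beta_weak = posterior_mean_marker_effects(X, y, lam=1.0)
beta_strong = posterior_mean_marker_effects(X, y, lam=100.0)
print(np.linalg.norm(beta_weak), np.linalg.norm(beta_strong))
```

Ordinary least squares has no unique solution here because X'X is singular. The prior adds lam to its diagonal, so all 500 effects can be estimated from 50 observations, at the cost of shrinking every estimate toward zero.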
The overall objective of this study is to identify optimal Bayesian regression methods for inference on marker effects in high-dimensional data. We will investigate the sensitivity and Bayesian learning properties of a new generation of models with different prior distribution settings, including different types of mixture distributions, in simulated and experimental data. Whole-genome regression models will be enhanced by biologically driven grouping of markers prior to model fitting. Implementation issues, such as the computational efficiency and behavior of the associated MCMC algorithms, will be explored. Guidelines will be provided for assessing the uncertainty of inferences and the Bayesian sensitivity with respect to the different priors used in hierarchical models. By studying more than 1000 sequenced Arabidopsis thaliana accessions for which high-quality phenotypic data are available, we will assess the ability of whole-genome regression models to learn about statistical trait architecture (e.g., genomic regions involved, effect sizes, contributions to variance) and compare the results to biological prior knowledge. The results of the project will make it possible to identify optimal statistical methods for harnessing the large amounts of genomic data that are already available or forthcoming for many plant species.
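As a rough illustration of why inference under mixture priors can be sensitive to prior settings, the sketch below computes a posterior inclusion probability for a single marker. It assumes a strongly simplified setting that is not the project's whole-genome model: one effect estimate b_hat with known standard error se, and a two-component prior with a point mass at zero (weight 1 - pi_slab) and a normal slab N(0, tau^2) (weight pi_slab). All names and numerical values are hypothetical.

```python
import math


def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def inclusion_probability(b_hat, se, pi_slab, tau):
    """Posterior probability that the marker effect is non-zero.

    Marginally, b_hat ~ N(0, se^2 + tau^2) if the effect comes from the
    slab and b_hat ~ N(0, se^2) if the effect is exactly zero; Bayes'
    rule combines these two densities with the prior weights.
    """
    slab = pi_slab * normal_pdf(b_hat, se ** 2 + tau ** 2)
    spike = (1.0 - pi_slab) * normal_pdf(b_hat, se ** 2)
    return slab / (slab + spike)


# Same three effect estimates, three different prior weights for the slab.
for pi_slab in (0.01, 0.05, 0.20):
    probs = [inclusion_probability(b, se=1.0, pi_slab=pi_slab, tau=1.0)
             for b in (0.0, 2.0, 5.0)]
    print(pi_slab, [round(q, 3) for q in probs])
```

Even in this toy case, the inclusion probability of a moderately sized estimate changes noticeably with the prior weight, whereas a very large estimate is assigned to the slab under all three settings. In a full whole-genome regression these quantities would be estimated jointly for all markers, typically by MCMC.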