Efficient and accurate multiple-phenotype regression method for high dimensional data considering population structure

Jong Wha (Joanne) Joo developed an approach to simultaneously analyze multiple phenotypes in a genome-wide association study (GWAS) dataset. She introduces this new methodology, GAMMA (Generalized Analysis of Molecular variance for Mixed model Analysis), in a paper recently published in Genetics.

GWASs have identified many genetic variants involved in traits and in the development of human diseases by testing for correlation between each genotype and a single phenotype, one phenotype at a time. Since the initial development of the standard GWAS approach, GWAS datasets have become larger in scale and higher in resolution. Today’s large-scale datasets include expression data and often contain thousands of phenotypes per individual. Performing the standard single-phenotype analysis on these datasets is slow and may fail to detect unmeasured aspects of complex biological networks.

Analyzing many phenotypes simultaneously increases the power to detect more variants and capture previously unmeasured aspects of the genome. However, standard GWAS approaches capable of simultaneously testing multiple phenotypes fail to account for the distorting effects of population structure, a phenomenon present in large cohorts that inevitably contain individuals sharing common ancestry from multiple populations. As a result, standard GWAS approaches either fail to detect true effects or produce many false positive identifications.

GAMMA is an efficient, robust approach capable of simultaneously analyzing many phenotypes while correcting for population structure. GAMMA uses the principles behind existing linear mixed models to correct for population structure and a multiple regression technique to analyze many phenotypes simultaneously.

Joanne’s paper presents the results of testing GAMMA for accuracy in three scenarios: a simulated dataset containing population structure, a yeast dataset containing many trans-regulatory hotspots, and a complex gut microbiome dataset. In the simulated study using data implanted with true population structure effects, GAMMA accurately identifies these true effects without producing false positives. In the yeast dataset, GAMMA successfully corrects for the bias of technical artifacts such as batch effects and identifies significant signals at most of the putative hotspots. In the third test, Joanne and her team assess GAMMA’s ability to perform a multiple-phenotype analysis with microbiome data. Here, the results identify nine loci likely to have true biological mechanisms in the taxa.

In each scenario, the results of GAMMA were compared to those of the standard t-test, EMMA, and MDMR. The standard t-test and EMMA failed to identify true variants because the phenotypic effects in each example are smaller than the effects these methods are powered to detect. MDMR produced no significant signals in the yeast dataset and identified many false associations in the simulated and gut microbiome datasets. Both GAMMA and MDMR have sufficient power to detect small association signals in these complex datasets, but only GAMMA successfully corrects for population structure.

This project was led by Joanne Joo and involved Eun Yong Kang and Farhad Hormozdiari. The article is available at: http://www.genetics.org/content/204/4/1379.

GAMMA was developed by Joanne Joo, Eun Yong Kang, Elin Org, Nick Furlotte, Brian Parks, Aldons J. Lusis, and Eleazar Eskin. Visit the following page to download GAMMA: http://genetics.cs.ucla.edu/GAMMA/

The full citation to our paper is:

Joo, Jong Wha J; Kang, Eun Yong; Org, Elin; Furlotte, Nick; Parks, Brian; Hormozdiari, Farhad; Lusis, Aldons J; Eskin, Eleazar

Efficient and Accurate Multiple-Phenotype Regression Method for High Dimensional Data Considering Population Structure. Journal Article

In: Genetics, 204 (4), pp. 1379-1390, 2016, ISSN: 1943-2631.

The results of GAMMA and three standard GWAS methods applied to a simulated dataset. The x-axis shows SNP locations and the y-axis shows the log10 p-value of associations between each SNP and all the genes. Blue arrows show the locations of the true trans-regulatory hotspots.

Colocalization of GWAS and eQTL Signals Detects Target Genes

Farhad Hormozdiari recently developed a method for combining genome-wide association studies (GWASs) and expression quantitative trait loci (eQTL) studies in a statistical framework that quantifies the probability of each variant being causal while allowing an arbitrary number of causal variants. Together with collaborators at the University of Oxford and the Broad Institute of MIT and Harvard, we present a paper in The American Journal of Human Genetics. Here, we describe eQTL and GWAS CAusal Variants Identification in Associated Regions (eCAVIAR). We apply our approach to datasets from several GWASs and eQTL studies in order to assess its accuracy and potential contributions to colocalization and fine-mapping.

Integrating GWASs and eQTL studies is a promising way to explore the mechanisms by which non-coding variants affect disease. Integration of GWAS and eQTL data is challenging due to the uncertainty induced by linkage disequilibrium (LD), which is the non-random association of alleles at different loci, and by the presence of loci that harbor multiple causal variants (allelic heterogeneity). Current methods assume that each locus contains a single causal variant and expect loci to be independent and associated randomly.

eCAVIAR is a novel probabilistic model for integrating GWAS and eQTL data that extends the CAVIAR (Hormozdiari et al. 2014) framework to explicitly estimate the posterior probability of the same variant being causal in both GWAS and eQTL studies, while accounting for allelic heterogeneity and LD. Our approach can quantify the strength of the link between a causal variant and its associated signals in both studies, and it can be used to colocalize variants that pass the genome-wide significance threshold in GWAS. For any given peak variant identified in GWAS, eCAVIAR treats the collection of variants around that peak variant as a single locus.
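
To make the idea of "causal in both studies" concrete, here is a minimal sketch, not the eCAVIAR implementation. It assumes that each study has already produced a per-variant posterior probability of causality for the locus, and that the two studies are independent, so the probability that a variant is causal in both is simply the product of the two. The function name `colocalization_posterior` and all numbers are hypothetical and chosen only for illustration.

```python
import numpy as np


def colocalization_posterior(gwas_post, eqtl_post):
    """Per-variant probability of being causal in both studies.

    Assumption (for illustration only): the two studies are independent,
    so the joint probability is the product of the per-study posteriors.
    """
    gwas_post = np.asarray(gwas_post, dtype=float)
    eqtl_post = np.asarray(eqtl_post, dtype=float)
    if gwas_post.shape != eqtl_post.shape:
        raise ValueError("both studies must cover the same variants")
    return gwas_post * eqtl_post


# Made-up posteriors for three variants at one locus.
gwas_post = [0.05, 0.80, 0.10]
eqtl_post = [0.02, 0.90, 0.60]

joint = colocalization_posterior(gwas_post, eqtl_post)
best = int(np.argmax(joint))
print(f"variant {best} has the highest joint posterior: {joint[best]:.2f}")
```

In this toy locus the third variant looks plausible in the eQTL data alone, but only the second variant is well supported in both studies, which is the situation a colocalization analysis is meant to separate.
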

We apply eCAVIAR to the Meta-Analyses of Glucose and Insulin-related traits Consortium (MAGIC) dataset and the GTEx dataset to detect the target gene and most relevant tissue for each GWAS risk locus. When applied to the MAGIC dataset’s two phenotypes, eCAVIAR identifies genetic variants that are causal in both eQTL and GWAS. Further, eCAVIAR detects a large number of loci where the GWAS causal variants are clearly distinct from the causal variants in the eQTL data. Interestingly, eCAVIAR also identifies genes that colocalize in one tissue yet can be excluded in others. For the majority of loci in which we identify a single variant causal for both GWAS and eQTL, eCAVIAR implicates more than one causal variant across the 45 tissues.

We observe that eCAVIAR outperforms existing methods even when there are different values of non-colocalization. Using simulated datasets, we compared accuracy, precision, and recall rate of eCAVIAR to RTC (Nica et al. 2010) and COLOC (Giambartolomei et al. 2014), two current methods for eQTL and GWAS colocalization. Our results show that eCAVIAR has high confidence for selecting loci to be colocalized between the GWAS and eQTL data and is conservative in selecting a locus to be colocalized.

We hope that future applications of eCAVIAR will advance identification of specific GWAS loci that share a causal variant with eQTL studies in a tissue, thus providing insight into presently unclear disease mechanisms.


Overview of eCAVIAR.


eCAVIAR was created by Farhad Hormozdiari, Ayellet V. Segrè, Martijn van de Bunt, Xiao Li, Jong Wha J Joo, Michael Bilow, Jae Hoon Sul, Bogdan Pasaniuc, and Eleazar Eskin. The article is available at: http://www.cell.com/ajhg/abstract/S0002-9297(16)30439-6.

Visit the following page to download CAVIAR and eCAVIAR: http://genetics.cs.ucla.edu/caviar/

The full citation to our paper is:

Hormozdiari, Farhad; van de Bunt, Martijn; Segrè, Ayellet V; Li, Xiao; Joo, Jong Wha J; Bilow, Michael; Sul, Jae Hoon; Sankararaman, Sriram; Pasaniuc, Bogdan; Eskin, Eleazar

Colocalization of GWAS and eQTL Signals Detects Target Genes. Journal Article

In: Am J Hum Genet, 2016, ISSN: 1537-6605.

Our paper builds upon a method introduced in a previous publication:

Hormozdiari, Farhad; Kostem, Emrah; Kang, Eun Yong; Pasaniuc, Bogdan; Eskin, Eleazar

Identifying causal variants at Loci with multiple signals of association. Journal Article

In: Genetics, 198 (2), pp. 497-508, 2014, ISSN: 1943-2631.

Multiple testing correction in linear mixed models

Our group recently published a new paper on multiple testing applied to genetic studies with population structure. This project was led by Jong Wha (Joanne) Joo and also involved Farhad Hormozdiari. The project was joint with Buhm Han’s group. The approach builds upon Buhm Han’s previous work, SLIDE (Han et al. 2009; Han and Eskin 2012).

Genome-wide association studies (GWAS) have discovered many variants that are associated with complex traits in the human genome. In GWAS, researchers collect both phenotypic information and genetic information on variants spread through the genome from a population. In order to identify the set of variants associated with a trait of interest, we assess correlations between the phenotype and the genetic information at each variant, which we call the genotype. GWAS are now routinely performed on tens of thousands of individuals and millions of genetic variants.

GWAS methodology must address specific problems that are tied to this exceptionally large scale of analysis. One major challenge in GWAS is multiple hypothesis testing. In routine analyses, significance is assessed by comparing each marker’s p value to a per-marker threshold. However, GWAS involves computing up to millions of statistical tests in a single study. When using traditional association study techniques, multiple hypothesis testing can generate false positives or spurious associations, and the p value threshold for significance must be adjusted to control the overall false positive rate. Several approaches are useful in correcting these potential pitfalls, including the Bonferroni correction and the permutation test.

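
As a simple illustration of how the per-marker threshold is adjusted, the Bonferroni correction divides the desired overall false positive rate by the number of tests. The sketch below is ours, not from the paper; the function name `bonferroni_threshold` and the choice of one million markers are assumptions made for the example.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-marker p value threshold that keeps the overall
    false positive rate at or below alpha across n_tests tests."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Hypothetical study: one million markers, overall rate of 0.05.
threshold = bonferroni_threshold(alpha=0.05, n_tests=1_000_000)
print(f"per-marker threshold: {threshold:.1e}")
```

Bonferroni treats every test as independent, so it tends to be overly strict when markers are correlated. The permutation test avoids this by estimating the threshold from the data, at a much higher computational cost.
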
Recently, researchers have accepted the linear mixed model (LMM) as standard practice for performing GWAS. The LMM can address two important challenges in GWAS: population structure and insufficient power. Population structure refers to the complex relatedness structure among individuals, which can cause errors such as false positives. In many cases, LMM approaches can increase statistical power and avoid generating false positives by explicitly modeling the genetic relationships that make up the population structure. Nonetheless, multiple hypothesis testing with LMM approaches may still generate spurious associations. Unfortunately, the current approaches for multiple hypothesis testing correction cannot be applied to LMM. This is because population structure actually affects the correlation structure of the statistics, as we show in the paper.

To address this issue, we developed the first gold standard approach for multiple hypothesis testing correction in LMM. This method, called multiple testing in transformed space (MultiTrans), can efficiently correct for multiple testing in LMM approaches. MultiTrans is a parametric bootstrap approach that is the equivalent of the permutation test. Specifically, our approach samples randomized null phenotypes from the distribution fitted by LMM.

Straightforward parametric bootstrapping, in which phenotypes are sampled, is prohibitively expensive computationally. MultiTrans instead uses a multivariate normal distribution to sample the association statistics directly. The figure shows an overview of our methodology.

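
The following is a rough sketch of the general idea of sampling association statistics directly from a multivariate normal distribution. It is not the MultiTrans implementation: it assumes the correlation matrix of the null statistics is already known, whereas MultiTrans derives that structure from transformed genotype data under the LMM. The function name `mvn_per_marker_threshold` and the toy correlation matrix are hypothetical.

```python
import math

import numpy as np


def mvn_per_marker_threshold(corr, alpha=0.05, n_samples=10_000, seed=0):
    """Estimate a per-marker p value threshold from sampled null statistics.

    corr is assumed to be the correlation matrix of the association
    statistics under the null hypothesis (taken as given in this sketch).
    """
    rng = np.random.default_rng(seed)
    n_markers = corr.shape[0]
    # Draw null z statistics for all markers jointly.
    z = rng.multivariate_normal(np.zeros(n_markers), corr, size=n_samples)
    # Strongest signal seen in each simulated null study.
    max_abs_z = np.abs(z).max(axis=1)
    # Only a fraction alpha of null studies should exceed this value.
    z_threshold = float(np.quantile(max_abs_z, 1.0 - alpha))
    # Convert the z threshold to a two-sided p value threshold.
    return math.erfc(z_threshold / math.sqrt(2.0))


# Toy example: 200 markers whose correlation decays with distance.
n_markers = 200
idx = np.arange(n_markers)
corr = 0.95 ** np.abs(idx[:, None] - idx[None, :])

p_threshold = mvn_per_marker_threshold(corr)
bonferroni = 0.05 / n_markers
print(f"sampled threshold: {p_threshold:.2e}, Bonferroni: {bonferroni:.2e}")
```

Because the markers in this toy example are strongly correlated, the sampled threshold is less strict than the Bonferroni threshold, which shows why the correlation structure of the statistics matters.
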
The full citation to our paper is:

Joo, Jong Wha J; Hormozdiari, Farhad; Han, Buhm; Eskin, Eleazar

Multiple testing correction in linear mixed models. Journal Article

In: Genome Biol, 17 (1), pp. 62, 2016, ISSN: 1474-760X.

Multiple hypothesis testing is an essential step in GWAS analysis. The correct per-marker threshold differs as a function of species, marker densities, genetic relatedness, and trait heritability—and no previous multiple testing correction methods can comprehensively account for these factors. The method we developed to address this issue, MultiTrans, is an efficient and accurate multiple testing correction approach for LMM. Our method (a) performs a unique transformation of genotype data to account for actual genetic relatedness and heritability under LMM approaches, and (b) efficiently utilizes the multivariate normal distribution. Using MultiTrans, we accurately estimated per-marker thresholds in mouse, yeast, and human datasets—while reducing computation time from months to hours.