Colocalization of GWAS and eQTL Signals Detects Target Genes

Farhad Hormozdiari recently developed a method for combining genome-wide association studies (GWASs) and expression quantitative trait loci (eQTL) studies in a statistical framework that quantifies the probability that each variant is causal while allowing an arbitrary number of causal variants. Together with collaborators at the University of Oxford and the Broad Institute of MIT and Harvard, we describe this method, eQTL and GWAS CAusal Variants Identification in Associated Regions (eCAVIAR), in a paper in The American Journal of Human Genetics. We apply our approach to datasets from several GWASs and eQTL studies to assess its accuracy and potential contributions to colocalization and fine-mapping.

Integrating GWASs and eQTL studies is a promising way to explore the mechanisms by which non-coding variants affect disease. This integration is challenging because of the uncertainty induced by linkage disequilibrium (LD), the non-random association of alleles at different loci, and because of the presence of loci that harbor multiple causal variants (allelic heterogeneity). Current methods assume that each locus contains a single causal variant and expect loci to be independent and associated randomly.

eCAVIAR is a novel probabilistic model for integrating GWAS and eQTL data that extends the CAVIAR framework (Hormozdiari et al. 2014) to explicitly estimate the posterior probability that the same variant is causal in both GWAS and eQTL studies, while accounting for allelic heterogeneity and LD. Our approach can quantify the strength of the association between a causal variant and its signals in both studies, and it can be used to colocalize variants that pass the genome-wide significance threshold in GWAS. For any given peak variant identified in GWAS, eCAVIAR treats the collection of variants around that peak variant as a single locus.
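To make the idea of a per-variant colocalization posterior concrete, here is a minimal sketch. It assumes that each study has already been fine-mapped separately, giving a posterior probability of causality for every variant in the locus, and that the two studies are independent, so the joint probability is the product of the two posteriors. The function names, the variant IDs, the numbers, and the cutoff are all hypothetical and chosen only for illustration; this is not the eCAVIAR software.

```python
def clpp(gwas_posteriors, eqtl_posteriors):
    """Per-variant probability of being causal in both studies.

    Assumption: the studies are independent, so the joint posterior
    is the product of the two marginal posteriors.
    """
    if len(gwas_posteriors) != len(eqtl_posteriors):
        raise ValueError("both studies must cover the same variants")
    return [g * e for g, e in zip(gwas_posteriors, eqtl_posteriors)]


def colocalized_variants(variant_ids, gwas_posteriors, eqtl_posteriors, cutoff):
    """Return (variant, score) pairs whose joint posterior exceeds the cutoff."""
    scores = clpp(gwas_posteriors, eqtl_posteriors)
    return [(v, s) for v, s in zip(variant_ids, scores) if s > cutoff]


# Made-up posteriors for three variants in one locus.
variants = ["rs_a", "rs_b", "rs_c"]
gwas = [0.70, 0.25, 0.05]
eqtl = [0.60, 0.02, 0.38]

# The cutoff is an illustrative value, not a recommendation.
hits = colocalized_variants(variants, gwas, eqtl, cutoff=0.1)
for variant, score in hits:
    print(variant, round(score, 3))
```

In this toy locus only the first variant has high posterior probability in both studies. The third variant is a plausible eQTL but an unlikely GWAS variant, so its joint score stays low.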

We apply eCAVIAR to the Meta-Analyses of Glucose and Insulin-related traits Consortium (MAGIC) dataset and GTEx dataset to detect the target gene and most relevant tissue for each GWAS risk locus. When applied to the MAGIC dataset’s 2 phenotypes, eCAVIAR identifies genetic variants that are causal in both eQTL and GWAS. Further, eCAVIAR detects a large number of loci where the GWAS causal variants are clearly distinct from the causal variants in the eQTL data. Interestingly, eCAVIAR also identifies genes that colocalize in one tissue yet can be excluded in others. For the majority of loci in which we identify a single variant causal for both GWAS and eQTL, eCAVIAR implicates more than one causal variant across the 45 tissues.

We observe that eCAVIAR outperforms existing methods even when there are different values of non-colocalization. Using simulated datasets, we compared accuracy, precision, and recall rate of eCAVIAR to RTC (Nica et al. 2010) and COLOC (Giambartolomei et al. 2014), two current methods for eQTL and GWAS colocalization. Our results show that eCAVIAR has high confidence for selecting loci to be colocalized between the GWAS and eQTL data and is conservative in selecting a locus to be colocalized.

We hope that future applications of eCAVIAR will advance identification of specific GWAS loci that share a causal variant with eQTL studies in a tissue, thus providing insight into presently unclear disease mechanisms.


Overview of eCAVIAR.


eCAVIAR was created by Farhad Hormozdiari, Ayellet V. Segrè, Martijn van de Bunt, Xiao Li, Jong Wha J Joo, Michael Bilow, Jae Hoon Sul, Sriram Sankararaman, Bogdan Pasaniuc and Eleazar Eskin. The article is available at:

Visit the following page to download CAVIAR and eCAVIAR:

The full citation to our paper is:

Hormozdiari, Farhad; van de Bunt, Martijn; Segrè, Ayellet V; Li, Xiao; Joo, Jong Wha J; Bilow, Michael; Sul, Jae Hoon; Sankararaman, Sriram; Pasaniuc, Bogdan; Eskin, Eleazar

Colocalization of GWAS and eQTL Signals Detects Target Genes. Journal Article

In: Am J Hum Genet, 2016, ISSN: 1537-6605.


Our paper builds upon a method introduced in a previous publication:

Hormozdiari, Farhad; Kostem, Emrah; Kang, Eun Yong; Pasaniuc, Bogdan; Eskin, Eleazar

Identifying causal variants at loci with multiple signals of association. Journal Article

In: Genetics, 198 (2), pp. 497-508, 2014, ISSN: 1943-2631.


A general framework for meta-analyzing dependent studies with overlapping subjects in association mapping

Meta-analyses of genome-wide association studies (GWASs) have become essential to identifying new loci associated with human diseases. We recently developed a novel framework that improves the accuracy and power of meta-analyses, which we describe in our recent Human Molecular Genetics paper. This framework can be applied to the fixed effects (FE) model, which assumes that effect sizes of genetic variants are constant across studies, and the random effects (RE) model, which assumes that effect sizes can be different among studies.
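For readers unfamiliar with the fixed effects model, the sketch below shows the textbook inverse-variance weighted estimate for independent studies, together with Cochran's Q, a common measure of how much the per-study effects disagree. The function names and the input numbers are illustrative assumptions and do not come from the paper.

```python
import math


def fixed_effects_meta(betas, std_errs):
    """Inverse-variance weighted fixed effects estimate for independent studies.

    Returns the pooled effect size, its standard error, and the z-score.
    """
    weights = [1.0 / se ** 2 for se in std_errs]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    return beta, se, beta / se


def cochran_q(betas, std_errs):
    """Cochran's Q: weighted squared deviation of each study from the pooled effect."""
    pooled, _, _ = fixed_effects_meta(betas, std_errs)
    return sum((b - pooled) ** 2 / se ** 2 for b, se in zip(betas, std_errs))


# Made-up effect sizes and standard errors for three studies.
study_betas = [0.10, 0.20, 0.15]
study_ses = [0.05, 0.05, 0.05]

pooled_beta, pooled_se, pooled_z = fixed_effects_meta(study_betas, study_ses)
heterogeneity = cochran_q(study_betas, study_ses)
print(pooled_beta, pooled_se, pooled_z, heterogeneity)
```

A large Q relative to the number of studies suggests that effect sizes differ between studies, which is the situation the random effects model is meant to handle.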

Almost all GWAS publications today employ meta-analysis methodologies, the majority of which assume that component studies are independent and that individuals among studies are unrelated. Yet many studies today use shared controls to reduce genotyping or sequencing cost. These “shared control” individuals can inadvertently overlap between multiple studies and, if not accounted for in the methodology, induce false associations in GWAS results. Most meta-analysis tools, including the RE model, cannot account for these overlapping subjects.

In our paper, we propose a general framework for adjusting association statistics to account for overlapping subjects within a meta-analysis. The key idea of our method is to transform the covariance structure of the data so it can be used in methods that strictly assume independence between studies. Specifically, our method decouples dependent studies into independent studies and adjusts association statistics to account for uncertainties in dependent studies. As a result, our approach enables general meta-analysis methods, including the FE and RE models, to account for overlapping subjects. Existing pipelines implementing these models can be reused for dependent studies if our framework is applied at the front end of the analysis procedure.
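The following sketch illustrates one way such a decoupling could work for the fixed effects model. It is an assumption made for illustration, not necessarily the exact construction in the paper: each study receives a decoupled inverse variance equal to the corresponding row sum of the inverse covariance matrix. The covariance matrix `omega`, the effect sizes, and all function names are hypothetical.

```python
def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(matrix)
    aug = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def decoupled_variances(omega):
    """Assumed decoupling: inverse variance of study i is row sum i of omega^-1."""
    return [1.0 / sum(row) for row in invert(omega)]


def dependent_fixed_effects(betas, omega):
    """Generalized least squares estimate that uses the full covariance matrix."""
    inv = invert(omega)
    n = len(betas)
    numerator = sum(inv[i][j] * betas[j] for i in range(n) for j in range(n))
    denominator = sum(inv[i][j] for i in range(n) for j in range(n))
    return numerator / denominator, 1.0 / denominator


def independent_fixed_effects(betas, variances):
    """Standard inverse-variance estimate that assumes independent studies."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    return sum(w * b for w, b in zip(weights, betas)) / total, 1.0 / total


# Made-up covariance of three studies; A-B and B-C share some controls.
omega = [[0.010, 0.003, 0.000],
         [0.003, 0.010, 0.001],
         [0.000, 0.001, 0.010]]
betas = [0.12, 0.08, 0.15]

decoupled = decoupled_variances(omega)
full_estimate = dependent_fixed_effects(betas, omega)
plain_estimate = independent_fixed_effects(betas, decoupled)
print(decoupled, full_estimate, plain_estimate)
```

With this choice, an unmodified fixed effects routine run on the decoupled variances reproduces the estimate and variance obtained from the full covariance matrix. The decoupled variances are larger than the original ones, reflecting the information lost to sharing. Row sums of the inverse can become non-positive when correlations are very strong, a case this sketch does not handle.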


A simple example of our decoupling approach. Ω and ΩDecoupled are the covariance matrices of the statistics of three studies A, B and C before and after decoupling, respectively. The thickness of the edges denotes the amount of correlation between the studies. After decoupling, the size of the nodes reflects the information that the studies contain in terms of the inverse variance.

We tested our framework for accuracy and power with five simulated datasets, each containing 1000 to 5000 individuals and 10,000 shared controls. A standard approach produced an inflated number of false positives. Our decoupling method, which systematically accounts for overlapping individuals in meta-analysis, and a standard splitting method, which splits controls among individual studies, both correctly controlled the type I error rate. The advantage of our framework is apparent when assessing power; in one scenario, we gained 25% power in accounting for overlapping subjects with the decoupling method compared to the splitting method.

Next, we assessed the potential of our framework in identifying causal loci shared by multiple diseases and in leveraging information from multiple tissues to increase power for eQTL identification. The decoupling and splitting methods controlled false-positive rates and produced significant p-values at several previously identified candidate shared loci among the three autoimmune conditions present in the Wellcome Trust Case Control Consortium (WTCCC) data. In comparison to the splitting method, our decoupling framework increased the significance of p-values in the shared loci test and increased the number of discovered eQTLs by 19%.

Our approach is flexible and allows many meta-analysis methods, such as the RE model, to account for dependency between studies and overlapping subjects. We developed this approach to complement standard software packages in the meta-analysis of GWAS. This project was led by Buhm Han and involved Dat Duong and Jae Hoon Sul. The article is available at:

The full citation to our paper is:

Han, Buhm; Duong, Dat; Sul, Jae Hoon; de Bakker, Paul I W; Eskin, Eleazar; Raychaudhuri, Soumya

A general framework for meta-analyzing dependent studies with overlapping subjects in association mapping. Journal Article

In: Hum Mol Genet, 2016, ISSN: 1460-2083.


Using genomic annotations increases statistical power to detect eGenes

Our group developed a novel method for detecting eGenes, the genes whose expression levels are associated with variation at a particular genetic variant. Identification of eGenes is increasingly important to studies of expression quantitative trait loci (eQTLs), the genetic variants that affect gene expression. Mapped eGenes help guide eQTL studies of complex human disease. However, standard approaches cannot efficiently detect these complex features in today’s large genomic datasets. Func-eGene, which we describe and test in a recent Bioinformatics paper, significantly increases the statistical power of existing association study methods and detects more eGenes in comparison to standard approaches.

Standard statistical methods for classifying a gene as an eGene first perform association testing at all variants near the gene of interest, then use a permutation test to conduct multiple-testing correction for results. The permutation test effectively corrects for potential biases introduced by multiple testing and obtains a p value for each gene. However, the permutation test is computationally inefficient when processing the increasingly large sample sizes of today’s eQTL datasets and has become a computational bottleneck in eQTL studies.
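The permutation procedure described above can be sketched as follows. This is a simplified illustration, not the pipeline used in the paper: it uses the maximum absolute Pearson correlation across variants as the gene-level statistic (a stand-in for the minimum p-value, to which it is monotonically related at a fixed sample size), and the genotypes and expression values are simulated. All function names are hypothetical.

```python
import random


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def max_abs_corr(genotypes, expression):
    """Gene-level statistic: strongest association over all nearby variants."""
    return max(abs(pearson_r(g, expression)) for g in genotypes)


def egene_permutation_pvalue(genotypes, expression, n_perm, seed=0):
    """Empirical p-value obtained by shuffling expression across individuals.

    Shuffling breaks the genotype-expression link but keeps the LD between
    variants, so the multiple-testing burden is accounted for automatically.
    """
    rng = random.Random(seed)
    observed = max_abs_corr(genotypes, expression)
    shuffled = list(expression)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if max_abs_corr(genotypes, shuffled) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


# Simulated data: 3 variants, 40 individuals, variant 0 drives expression.
data_rng = random.Random(1)
genotypes = [[data_rng.choice([0, 1, 2]) for _ in range(40)] for _ in range(3)]
expression = [2.0 * g + data_rng.gauss(0.0, 0.3) for g in genotypes[0]]

p_value = egene_permutation_pvalue(genotypes, expression, n_perm=200, seed=2)
print(p_value)
```

Every permutation recomputes the statistic for every variant, so the cost grows with the number of permutations, variants, and individuals. That cost is the bottleneck the paragraph above refers to.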

Our new approach, Func-eGene, incorporates genomic annotation of variants to improve the statistical power of eQTL studies. Variants located near transcription start sites (TSSs), or near certain histone modifications, often regulate gene expression. Standard approaches do not consider genomic annotations, but we found that annotation of these variants can help locate and associate more causal variants using less time and computing power. To do this, we expand upon the standard multithreshold association test, which specifies a different significance threshold for each variant when correcting for multiple testing. Func-eGene increases power by assigning less stringent significance thresholds to variants that are likely to contribute to gene expression.
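As a rough illustration of per-variant thresholds, the sketch below uses a weighted Bonferroni correction: the overall significance level is divided among variants in proportion to prior weights derived from annotation. This is a deliberately simpler technique than the multithreshold test used by Func-eGene, and the weights, p-values, and function names are hypothetical.

```python
def weighted_thresholds(prior_weights, alpha=0.05):
    """Split alpha across variants in proportion to their prior weights.

    The thresholds sum to alpha, so by the union bound the chance of any
    false rejection for the gene stays at or below alpha.
    """
    if any(w < 0 for w in prior_weights) or sum(prior_weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    total = sum(prior_weights)
    return [alpha * w / total for w in prior_weights]


def is_egene(p_values, prior_weights, alpha=0.05):
    """Call the gene an eGene if any variant passes its own threshold."""
    thresholds = weighted_thresholds(prior_weights, alpha)
    return any(p <= t for p, t in zip(p_values, thresholds))


# Made-up p-values for four variants; the first one lies close to the TSS.
p_values = [0.02, 0.20, 0.60, 0.90]
uniform_weights = [1, 1, 1, 1]
annotation_weights = [4, 1, 1, 1]

print(is_egene(p_values, uniform_weights))      # every variant gets 0.05 / 4
print(is_egene(p_values, annotation_weights))   # TSS variant gets 0.05 * 4 / 7
```

With uniform weights the best variant misses its threshold of 0.0125. With annotation-informed weights the same variant gets a threshold of roughly 0.029 and the gene is detected. The gain comes at the price of stricter thresholds for the remaining variants.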

However, this association test still depends on the time-consuming permutation test and requires a known prior based on annotation for genetic variants. Func-eGene avoids these difficulties by reducing runtime and selecting an appropriate prior. To reduce runtime, we replace the permutation test with the Mvn-sampling procedure described in Sul et al. (2015). To find an appropriate prior, we run a grid search over possible sets of scores assigned to annotation categories. Func-eGene then seeks a set of scores that maximizes the number of eGenes and uses a cross-validation strategy to avoid data re-use and over-fitting. Thus, there are two ways to apply Func-eGene to eQTL data. Permutation Func-eGene uses the traditional permutation test to calculate the null density of the observed statistic, whereas Mvn Func-eGene relies on the Mvn-sampling procedure.
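The idea of replacing permutations with sampling from a multivariate normal distribution can be sketched as follows. The sketch assumes that, under the null hypothesis, the association z-scores of the variants near a gene follow a multivariate normal distribution with mean zero and covariance equal to the LD correlation matrix. It is a simplified illustration rather than the procedure of Sul et al. (2015), and the LD matrix, z-scores, and function names are hypothetical.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                diag = matrix[i][i] - partial
                if diag <= 0:
                    raise ValueError("matrix is not positive definite")
                lower[i][j] = math.sqrt(diag)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def mvn_sampling_pvalue(observed_z, ld, n_samples, seed=0):
    """Empirical gene-level p-value from simulated null z-scores.

    Null z-scores are drawn as L @ e, where L is the Cholesky factor of the
    LD matrix and e is a vector of independent standard normal draws.
    """
    rng = random.Random(seed)
    lower = cholesky(ld)
    n = len(ld)
    observed = max(abs(z) for z in observed_z)
    exceed = 0
    for _ in range(n_samples):
        noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
        null_z = [sum(lower[i][k] * noise[k] for k in range(i + 1)) for i in range(n)]
        if max(abs(z) for z in null_z) >= observed:
            exceed += 1
    return (exceed + 1) / (n_samples + 1)


# Made-up LD between two variants and two sets of observed z-scores.
ld = [[1.0, 0.8],
      [0.8, 1.0]]
strong_p = mvn_sampling_pvalue([6.0, 5.2], ld, n_samples=2000, seed=3)
weak_p = mvn_sampling_pvalue([0.10, 0.05], ld, n_samples=2000, seed=3)
print(strong_p, weak_p)
```

Each null draw costs a small matrix-vector product over variants and does not touch individual-level data, which is why this kind of sampling scales better with sample size than reshuffling expression values.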

We applied our method to the liver Genotype-Tissue Expression (GTEx) dataset. We used the following genomic annotations of variants: distance from TSSs, DNase hypersensitivity sites, and six histone modifications. Notably, the distance from TSS annotation detected the highest number of candidate eGenes; using this annotation, our new method discovered 50% more candidate eGenes when compared to the standard permutation method. Our simulations show that Func-eGene successfully controls the rate of false-positive associations when using either the permutation or the Mvn procedure. However, implementing Func-eGene with a traditional permutation test is inefficient. Instead, we can obtain the same results with considerably faster runtime when using Mvn sampling.


Graphs comparing eGene detection and statistical power of the permutation and Mvn approaches. (a) Q–Q plots of the uniform density quantiles against the simulated eGene P-value quantiles using Func-eGene at the gene ENSG00000204219.5 under the null hypothesis. (b) Func-eGene simulated statistical power at the gene ENSG00000204219.5.


This project was led by Dat Duong and involved Jennifer Zou, Farhad Hormozdiari, and Jae Hoon Sul. The article is available at:

The full citation to our paper is: 

Duong, Dat; Zou, Jennifer; Hormozdiari, Farhad; Sul, Jae Hoon; Ernst, Jason; Han, Buhm; Eskin, Eleazar

Using genomic annotations increases statistical power to detect eGenes. Journal Article

In: Bioinformatics, 32 (12), pp. i156-i163, 2016, ISSN: 1367-4811.


Func-eGene was developed by Dat Duong and is available for download at: