Incorporating prior information into association studies

Genome-wide association studies (GWAS) seek to identify genetic variants involved in specific traits. GWAS are advantageous for linking variants with traits, because they interrogate the genome in a uniform way. In other words, they examine the whole genome without a preconceived notion of where the associations may lie.

However, we now know a lot about the putative function of genetic variants due to tremendous progress in functional genomics. In many cases, we even know which variants are more likely to be involved in disease when compared to others. Advancements in our understanding of functional genomics motivate the strategic incorporation of prior information in GWAS.

Our group has been interested in this problem for many years. One challenge in addressing it is that the standard GWAS approach evaluates an association statistic at each single nucleotide polymorphism (SNP), considering only one SNP at a time. The results are then adjusted for multiple testing, and an association is declared if a statistic exceeds a certain threshold. This is a frequentist approach. Incorporating prior information on which SNPs are likely to be the causal variants affecting the trait, on the other hand, is an inherently Bayesian concept. Reconciling the two is not straightforward.

Average power under varying relative risks. For more information, see our paper.

In a 2008 paper published in Genome Research, our group proposed a modification of the multiple testing framework to address this problem. Instead of using the same specific threshold for all of the association statistics, we use a different threshold for each association statistic, where the thresholds are adjusted based on the prior information. Our method takes advantage of the correlation structure by considering multiple markers within a region. In our paper, we demonstrate how to set the thresholds in order to optimally utilize prior information and maximize statistical power.
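To make the idea of per-statistic thresholds concrete, here is a minimal sketch. It is not the optimized procedure from the paper: it uses a simple weighted Bonferroni split, in which each SNP receives a share of the overall significance budget proportional to its prior, and it ignores correlation between markers. The function names, the prior values, and the non-centrality parameter are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

def per_snp_thresholds(priors, alpha=0.05):
    """Split the error budget `alpha` across SNPs in proportion to their priors.

    Weighted Bonferroni stand-in (an assumption, not the paper's optimizer).
    Priors must be strictly positive so every threshold is usable.
    """
    total = sum(priors)
    if total <= 0 or min(priors) <= 0:
        raise ValueError("priors must be strictly positive")
    return [alpha * p / total for p in priors]

def power(threshold, ncp):
    """Power of a two-sided z-test at p-value cutoff `threshold`,
    when the statistic has mean `ncp` and unit variance."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - threshold / 2)
    return nd.cdf(ncp - z) + nd.cdf(-ncp - z)

# Hypothetical priors that each of five SNPs is the causal variant.
priors = [0.70, 0.10, 0.10, 0.05, 0.05]
alpha = 0.05
ncp = 3.0  # hypothetical effect strength at the causal SNP

thresholds = per_snp_thresholds(priors, alpha)
uniform = alpha / len(priors)  # standard Bonferroni threshold

# Average power, weighting each SNP by its prior of being causal.
weighted_power = sum(p * power(t, ncp) for p, t in zip(priors, thresholds))
uniform_power = sum(p * power(uniform, ncp) for p in priors)

print("thresholds:", thresholds)
print("average power, prior-weighted thresholds:", weighted_power)
print("average power, uniform threshold:", uniform_power)
```

Because the thresholds still sum to alpha, the union bound keeps the overall false-positive rate at or below alpha, while SNPs with a higher prior are tested at a more lenient threshold.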

Using prior information in genetic association studies increases power over traditional association studies while maintaining the same overall false-positive rate. Compared to standard methods, our approach is equally simple to apply to association studies, produces interpretable results as p-values, and is optimal in its use of prior information with regard to statistical power.

In 2012, we extended this work to use only tag SNPs for the putative causal variant. This project was developed by Gregory Darnell (then UCLA undergraduate, now PhD student at Princeton University), Dat Duong (then UCLA undergraduate, now UCLA PhD student), and Buhm Han.

More recently, we have applied this framework to incorporate functional information in analysis of eQTL data. In this case, incorporating genomic annotation of variants significantly increases the statistical power of existing eQTL methods and detects more eGenes in comparison to standard approaches. Read the blog post on this paper, and download the full article.

For more information on our general approach, see our paper, which is available for download through Bioinformatics:
In addition, the open source implementation of our 2012 paper, MASA, which was developed by Greg Darnell and Dat Duong, is freely available for download at

The full citations to our papers on this topic are:

Darnell, Gregory; Duong, Dat; Han, Buhm; Eskin, Eleazar. “Incorporating prior information into association studies.” Bioinformatics, 28(12):i147-i153, 2012. ISSN: 1367-4811.


Eleazar Eskin. “Increasing Power in Association Studies by using Linkage Disequilibrium Structure and Molecular Function as Prior Information.” Genome Research, 18(4):653-60. Special Issue: Proceedings of the 12th Annual Conference on Research in Computational Molecular Biology (RECOMB-2008), 2008.

ZarLab goes to Vancouver for ASHG!


Last week many members of our group traveled to Vancouver, British Columbia, for the annual meeting of the American Society of Human Genetics. The 66th Annual Meeting, which took place October 18-22, 2016, featured over 3000 talks, workshops, and poster presentations on topics such as bioinformatics and computational methods, developmental genetics and gene function, cancer and cardiovascular diseases, evolutionary and population genetics, and genetic counseling.

ZarLab contributed eight poster presentations and one research talk. Serghei Mangul discussed his recent work on dumpster-diving techniques in a talk titled “Comprehensive analysis of RNA-sequencing to find the source of every last read across 544 individuals from 53 tissues,” as part of the Interpreting the Transcriptome in Health and Disease symposium. You can view his slides here:

ZarLab in Vancouver!


Recent alumnus Farhad Hormozdiari received a Reviewers’ Choice ribbon for his poster titled “Joint fine mapping of GWAS and eQTL detects target gene and relevant tissue.” Only the top 10% of posters by topic receive this honor, as determined by the reviewers’ scores of the submitted abstracts. Congratulations, Farhad!

Other posters presented by members of our group:

  • Prevalence of allelic heterogeneity in complex traits. Eleazar Eskin
  • Modeling the covariance of effect sizes in a meta-analysis. Dat Duong
  • Estimating regional heritability in the presence of linkage disequilibrium. Lisa Gai
  • Linear mixed models for quantitative traits in health-system scale data. Michael Bilow
  • Utilizing allele specific expression to identify cis-regulatory variants. Jennifer Zou
  • Haplotype-based predictors for complex trait association. Rob Brown
  • Repeat elements expression profile across different tissues in GTEx samples. Harry Yang

A general framework for meta-analyzing dependent studies with overlapping subjects in association mapping

Meta-analyses of genome-wide association studies (GWASs) have become essential to identifying new loci associated with human diseases. We recently developed a novel framework that improves the accuracy and power of meta-analyses, which we describe in our recent Human Molecular Genetics paper. This framework can be applied to the fixed effects (FE) model, which assumes that effect sizes of genetic variants are constant across studies, and the random effects (RE) model, which assumes that effect sizes can be different among studies.
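The difference between the two models can be sketched in a few lines. This is an illustrative example for independent studies only: the FE estimate is the usual inverse-variance weighted average, and the RE estimate uses the classical DerSimonian-Laird moment estimator of between-study variance, which is an assumption made here for illustration and not necessarily the RE model used in our paper. All function names and input values are hypothetical.

```python
from math import sqrt

def fixed_effects(betas, ses):
    """Inverse-variance weighted estimate, assuming one shared effect size."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    return beta, sqrt(1.0 / total)

def between_study_variance(betas, ses):
    """DerSimonian-Laird moment estimate of tau^2 (needs at least two studies)."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta_fe, _ = fixed_effects(betas, ses)
    # Cochran's Q: weighted squared deviations from the FE estimate.
    q = sum(w * (b - beta_fe) ** 2 for w, b in zip(weights, betas))
    scale = total - sum(w ** 2 for w in weights) / total
    return max(0.0, (q - (len(betas) - 1)) / scale)

def random_effects(betas, ses):
    """Re-weight each study by 1 / (se^2 + tau^2), allowing effects to differ."""
    tau2 = between_study_variance(betas, ses)
    return fixed_effects(betas, [sqrt(se ** 2 + tau2) for se in ses])

# Hypothetical per-study effect estimates and standard errors for one variant.
betas = [0.30, 0.05, -0.10]
ses = [0.05, 0.05, 0.05]

fe_beta, fe_se = fixed_effects(betas, ses)
re_beta, re_se = random_effects(betas, ses)
print("FE estimate and standard error:", fe_beta, fe_se)
print("RE estimate and standard error:", re_beta, re_se)
```

When the study estimates disagree more than their standard errors would predict, the estimated between-study variance is positive and the RE standard error is wider than the FE one.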

Almost all GWAS publications today employ meta-analysis methodologies, the majority of which assume that component studies are independent and that individuals among studies are unrelated. Yet many studies today use shared controls to reduce genotyping or sequencing cost. These “shared control” individuals can inadvertently overlap between multiple studies and, if not accounted for in the methodology, induce false associations in GWAS results. Most meta-analysis tools, including the RE model, cannot account for these overlapping subjects.

In our paper, we propose a general framework for adjusting association statistics to account for overlapping subjects within a meta-analysis. The key idea of our method is to transform the covariance structure of the data so it can be used in methods that strictly assume independence between studies. Specifically, our method decouples dependent studies into independent studies and adjusts association statistics to account for uncertainties in dependent studies. As a result, our approach enables general meta-analysis methods, including the FE and RE models, to account for overlapping subjects. Existing pipelines implementing these models can be reused for dependent studies if our framework is applied at the front end of the analysis procedure.


A simple example of our decoupling approach. Ω and ΩDecoupled are the covariance matrices of the statistics of three studies A, B and C before and after decoupling, respectively. The thickness of the edges denotes the amount of correlation between the studies. After decoupling, the size of the nodes reflects the information that the studies contain in terms of the inverse variance.
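The decoupling idea can be illustrated with a small numerical sketch. It assumes that the decoupled covariance is the diagonal matrix whose inverse is diag(eᵀΩ⁻¹), where e is a vector of ones; this construction is stated here as an assumption for illustration and may differ in detail from the released implementation. The helper names, standard errors, and correlations below are hypothetical. The point is that a standard inverse-variance FE routine, which assumes independence, reproduces the estimate that uses the full covariance once it is fed the decoupled variances.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * p for a, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def pooled_with_full_covariance(betas, omega):
    """Estimate (e' O^-1 b) / (e' O^-1 e) and its variance 1 / (e' O^-1 e)."""
    weights = solve(omega, [1.0] * len(betas))  # O^-1 e, since O is symmetric
    total = sum(weights)
    return sum(w * b for w, b in zip(weights, betas)) / total, 1.0 / total

def decoupled_variances(omega):
    """Per-study variances of the diagonal matrix with inverse diag(e' O^-1).

    Assumes every entry of O^-1 e is positive (mild correlations)."""
    weights = solve(omega, [1.0] * len(omega))
    return [1.0 / w for w in weights]

def fixed_effects(betas, variances):
    """Standard inverse-variance FE estimate that assumes independent studies."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    return sum(w * b for w, b in zip(weights, betas)) / total, 1.0 / total

# Hypothetical studies A, B and C with correlated statistics.
betas = [0.12, 0.08, 0.15]
ses = [0.05, 0.06, 0.08]
corr = [[1.0, 0.3, 0.2],
        [0.3, 1.0, 0.1],
        [0.2, 0.1, 1.0]]
omega = [[corr[i][j] * ses[i] * ses[j] for j in range(3)] for i in range(3)]

full_beta, full_var = pooled_with_full_covariance(betas, omega)
dec_vars = decoupled_variances(omega)
dec_beta, dec_var = fixed_effects(betas, dec_vars)
print("full covariance:", full_beta, full_var)
print("decoupled + FE: ", dec_beta, dec_var)
```

In this example the decoupled variances are larger than the original squared standard errors for studies that share information with others, which is how the uncertainty from the dependence is carried into the independent-study analysis.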

We tested our framework for accuracy and power with five simulated datasets, each containing 1000 to 5000 individuals and 10,000 shared controls. A standard approach produced an inflated number of false positives. Our decoupling method, which systematically accounts for overlapping individuals in meta-analysis, and a standard splitting method, which splits controls into individual studies, both correctly controlled type I errors. The advantage of our framework is apparent when assessing power; in one scenario, the decoupling method gained 25% power over the splitting method in accounting for overlapping subjects.

Next, we assessed the potential of our framework in identifying causal loci shared by multiple diseases and in leveraging information from multiple tissues to increase power for eQTL identification. The decoupling and splitting methods controlled false-positive rates and produced significant p-values at several previously identified candidate shared loci among the three autoimmune conditions present in the Wellcome Trust Case Control Consortium (WTCCC) data. In comparison to the splitting method, our decoupling framework increased the significance of p-values in the shared loci test and increased the number of discovered eQTLs by 19%.

Our approach is flexible and allows many meta-analysis methods, such as the RE model, to account for dependency between studies and overlapping subjects. We developed this approach to complement standard software packages in the meta-analysis of GWAS. This project was led by Buhm Han and involved Dat Duong and Jae Hoon Sul. The article is available at:

The full citation to our paper is:

Han, Buhm; Duong, Dat; Sul, Jae Hoon; de Bakker, Paul I W; Eskin, Eleazar; Raychaudhuri, Soumya. “A general framework for meta-analyzing dependent studies with overlapping subjects in association mapping.” Hum Mol Genet, 2016. ISSN: 1460-2083.
