Widespread Allelic Heterogeneity in Complex Traits

This week, our group published a paper in the American Journal of Human Genetics that presents a new computational method for improving the accuracy of genome-wide association studies. ZarLab alumnus Farhad Hormozdiari (PhD, 2016) developed the method, CAVIAR (CAusal Variants Identification in Associated Regions), a statistical framework that quantifies the probability that each variant is causal while allowing for an arbitrary number of causal variants.

Genome-wide association studies (GWASs) identify genetic variants associated with diseases and traits. Recent successes in GWASs make it possible to address important questions about the genetic architecture of complex traits, such as allele frequency and effect size. A more comprehensive understanding of these aspects will guide the development of new methods for fine mapping and association mapping of complex traits—and the discovery of new biomarkers for disease diagnosis and treatment.

One lesser-known aspect of complex traits is the extent of allelic heterogeneity (AH). Allelic heterogeneity occurs when different mutations at the same locus affect the same phenotype. AH is very common in Mendelian traits, but we know little about the extent to which AH contributes to common, complex disease. Undetected AH could bias the results of an association study, leading to false positives.

Levels of Allelic Heterogeneity in eQTL Studies. For more information, see our paper.

To take AH into account while conducting a GWAS, we developed a computational method to infer the probability of AH. Our method quantifies the number of independent causal variants at a locus that can be responsible for the association signals observed in a GWAS. The method is incorporated into the CAVIAR approach and is based on the principle of jointly analyzing association signals (i.e., summary-level z-scores) and LD structure to estimate the number of causal variants.
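To make the idea of jointly analyzing z-scores and LD concrete, here is a minimal Python sketch. It is not the CAVIAR software: it assumes a simplified model in which, given a causal configuration c, the z-scores follow N(0, LD + sigma2 · LD · diag(c) · LD), and each variant is causal independently with prior probability gamma. The function names and the values of sigma2, gamma, and max_causal are illustrative assumptions, and the brute-force enumeration is only practical for a handful of variants.

```python
from itertools import combinations

import numpy as np


def log_mvn_zero_mean(z, cov):
    """Log density of a zero-mean multivariate normal N(0, cov) at z."""
    _, logdet = np.linalg.slogdet(cov)
    quad = z @ np.linalg.solve(cov, z)
    return -0.5 * (len(z) * np.log(2 * np.pi) + logdet + quad)


def posterior_num_causal(z, ld, max_causal=3, sigma2=25.0, gamma=0.01):
    """Posterior over the number of causal variants (0..max_causal) at a locus.

    Assumed model (a simplification, see the lead-in):
    z | c ~ N(0, LD + sigma2 * LD @ diag(c) @ LD), prior gamma per variant.
    """
    m = len(z)
    log_post = np.full(max_causal + 1, -np.inf)
    for k in range(max_causal + 1):
        log_prior = k * np.log(gamma) + (m - k) * np.log1p(-gamma)
        terms = []
        # Enumerate every configuration with exactly k causal variants.
        for causal in combinations(range(m), k):
            c = np.zeros(m)
            c[np.array(causal, dtype=int)] = 1.0
            cov = ld + sigma2 * ld @ np.diag(c) @ ld
            terms.append(log_prior + log_mvn_zero_mean(z, cov))
        log_post[k] = np.logaddexp.reduce(terms)
    # Normalize over the enumerated configurations.
    log_post -= np.logaddexp.reduce(log_post)
    return np.exp(log_post)


# Hypothetical three-SNP locus with made-up z-scores and LD.
ld = np.array([[1.0, 0.6, 0.1],
               [0.6, 1.0, 0.2],
               [0.1, 0.2, 1.0]])
z = np.array([5.8, 3.1, 4.9])
print(posterior_num_causal(z, ld))
```

Under this sketch, a locus would be flagged as showing AH when the posterior mass on two or more causal variants is high; the threshold is a choice left to the analyst.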

Our results show that our method is more accurate than the standard conditional method (CM). We applied our method to three GWASs and four expression quantitative trait loci (eQTL) datasets. We identified a total of 4,152 loci with strong evidence of AH. The proportion of all loci with identified AH is 4%–23% in eQTLs, 35% in GWASs of high-density lipoprotein (HDL), and 23% in GWASs of schizophrenia. For eQTLs, we observed a strong correlation between sample size and the proportion of loci with AH, indicating that limited statistical power prevents identification of AH at other loci.

One of the main benefits of our method is that it requires only summary statistics. Summary statistics of a GWAS or eQTL study are widely available, so our method is applicable to most existing datasets. We have shown that AH is widespread and more common than previously estimated in complex traits, both in GWASs and eQTL studies.

Our results highlight the importance of accounting for the presence of multiple causal variants when characterizing the mechanism of genetic association in complex traits. Failing to account for AH can reduce the power to detect true causal variants and can explain the limited success of fine mapping of GWASs.

In a related study, researchers at the University of California, Irvine, and the University of Kansas identified an analogous signal in eQTLs from genetic sequencing of flies. King et al. (2014) observed that the vast majority of genes with eQTLs are more consistent with heterogeneity than with bi-allelism. Read more about this related study, “Genetic Dissection of the Drosophila melanogaster Female Head Transcriptome Reveals Widespread Allelic Heterogeneity.”

CAVIAR was created by Farhad Hormozdiari, Emrah Kostem, Eun Yong Kang, Bogdan Pasaniuc and Eleazar Eskin. Software is freely available for download: http://genetics.cs.ucla.edu/caviar/

For more information, see our full paper, which can be accessed through AJHG: http://www.cell.com/ajhg/abstract/S0002-9297(17)30149-0

The full citation of our paper:
Hormozdiari F, Zhu A, Kichaev G, Ju CJ, Segrè AV, Joo JW, Won H, Sankararaman S, Pasaniuc B, Shifman S, Eskin E. Widespread allelic heterogeneity in complex traits. The American Journal of Human Genetics. 2017 May 4;100(5):789-802.

Causal Variants Identification in Associated Regions

Figure 1 (A and B) Simulated data for two regions with different LD patterns that contain 35 SNPs. A and B are obtained by considering the 100 kbp upstream and downstream of rs10962894 and rs4740698, respectively, from the Wellcome Trust Case–Control Consortium study for coronary artery disease (CAD). (C and D) The rank of the causal SNP in additional simulations for the regions in A and B, respectively. We obtain these histograms from simulation data by randomly generating GWAS statistics using multivariate normal distribution. We apply the simulation 1000 times.


Our group, in collaboration with the group of our UCLA colleague Bogdan Pasaniuc, recently published two papers on “statistical fine mapping”. We published a paper on a method called CAVIAR in the journal Genetics, and Bogdan’s lab published a method called PAINTOR in PLoS Genetics. The software is available at http://genetics.cs.ucla.edu/caviar/ and http://bogdan.bioinformatics.ucla.edu/software/PAINTOR/.

Although genome-wide association studies have successfully identified thousands of regions of the genome that contain genetic variation involved in disease, only a handful of the biologically causal variants responsible for these associations have been identified. Because of the correlation structure of genetic variants, each region contains many variants that are associated with disease. The process of predicting which subset of the genetic variants is actually responsible for the association is referred to as statistical fine mapping.

Current statistical methods for identifying causal variants at risk loci either use the strength of the association signal in an iterative conditioning framework or estimate probabilities for variants to be causal. A main drawback of existing methods is that they rely on the simplifying assumption of a single causal variant at each risk locus, which is invalid at many risk loci. In our papers, we propose a new statistical framework that allows for an arbitrary number of causal variants when estimating the posterior probability of a variant being causal. A direct benefit of our approach is that we predict a set of variants for each locus that, under reasonable assumptions, will contain all of the true causal variants with a high confidence level (e.g., 95%) even when the locus contains multiple causal variants. We use simulations to show that our approach provides a 20-50% improvement in our ability to identify the causal variants compared to existing methods at loci harboring multiple causal variants.

Figure 2 Simulated association with two causal SNPs. (A) The 100-kbp region around the rs10962894 SNP and simulated statistics at each SNP generated assuming two SNPs are causal. In this example SNP25 and SNP29 are considered as the causal SNPs. However, the most significant SNP is SNP27. (B) The causal set selected by CAVIAR (our method) and the top k SNPs method. We ranked the selected SNPs based on the association statistics. The gray bars indicate the SNPs selected by both methods, the green bars indicate the SNPs selected by the top k SNPs method only, and the blue bars indicate the SNPs selected by CAVIAR only. The CAVIAR set consists of SNP17, SNP20, SNP21, SNP25, SNP26, SNP28, and SNP29. For the top k SNPs method to capture the two causal SNPs we have to set k to 11, as one of the causal SNPs is ranked 11th based on its significance score. Unfortunately, knowing the value of k beforehand is not possible. Even if the value of k is known, the causal set selected by our method excludes SNP30–SNP35 from the follow-up studies and reduces the cost of follow-up studies by 30% compared to the top k method.


From the CAVIAR paper:
Overview of statistical fine mapping

Our approach, CAVIAR, takes as input the association statistics for all of the SNPs (variants) at the locus together with the correlation structure between the variants obtained from a reference data set such as the HapMap (Gibbs et al. 2003; Frazer et al. 2007) or 1000 Genomes project (Abecasis et al. 2010) data. Using this information, our method predicts a subset of the variants that has the property that all the causal SNPs are contained in this set with probability r (we term this set the “r causal set”). In practice we set r to values close to 100%, typically ≥95%, and let CAVIAR find the set with the fewest SNPs that contains the causal SNPs with probability at least r. The causal set can be viewed as a confidence interval. We use the causal set in the follow-up studies by validating only the SNPs that are present in the set. While in this article we discuss SNPs for simplicity, our approach can be applied to any type of genetic variant, including structural variants.
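The definition of the r causal set can be illustrated with a small Python sketch. It assumes we already have a posterior probability for each causal configuration (which SNPs are causal), and it searches exhaustively for a smallest set that contains all causal SNPs with probability at least r. The posterior values below are made up for illustration, the function names are hypothetical, and exhaustive search is only feasible for tiny loci; this is not a description of how the CAVIAR software performs the search.

```python
from itertools import combinations


def causal_set_probability(config_posterior, snp_set):
    """Probability that every causal SNP lies inside snp_set.

    config_posterior maps a frozenset of causal SNP indices to its
    posterior probability; a configuration counts only if it is a
    subset of snp_set.
    """
    return sum(p for config, p in config_posterior.items() if config <= snp_set)


def smallest_causal_set(config_posterior, n_snps, r=0.95):
    """Exhaustively find a smallest SNP set whose probability is at least r."""
    for size in range(n_snps + 1):
        best = None
        for subset in combinations(range(n_snps), size):
            prob = causal_set_probability(config_posterior, frozenset(subset))
            # Among sets of this size, keep the one with the highest probability.
            if prob >= r and (best is None or prob > best[1]):
                best = (set(subset), prob)
        if best is not None:
            return best
    raise ValueError("configuration posteriors sum to less than r")


# Hypothetical posterior over causal configurations at a four-SNP locus.
config_posterior = {
    frozenset({0, 1}): 0.55,
    frozenset({0}): 0.25,
    frozenset({1}): 0.12,
    frozenset({2}): 0.06,
    frozenset({0, 3}): 0.02,
}
causal_set, prob = smallest_causal_set(config_posterior, n_snps=4, r=0.95)
print(causal_set, prob)
```

In this toy example no set of two SNPs reaches 95%, so the search returns a three-SNP set; raising r can only keep the set the same size or make it larger.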

We used simulations to show the effect of LD on the resolution of fine mapping. We selected two risk loci (with large and small LD) to showcase the effect of LD on fine mapping (see Figure 1, A and B). The first region is obtained by considering 100 kbp upstream and downstream of the rs10962894 SNP from the coronary artery disease (CAD) case–control study. As shown in Figure 1A, the correlation between the significant SNP and the neighboring SNPs is high. We simulated GWAS statistics for this region by taking advantage of the fact that the statistics follow a multivariate normal distribution, as shown in Han et al. (2009) and Zaitlen et al. (2010) (see Materials and Methods). CAVIAR selects the true causal SNP, which is SNP8, together with six additional variants (Figure 1A). Thus, when following up this locus, we have only to consider these SNPs to identify the true causal SNPs. The second region showcases loci with lower LD (see Figure 1B). In this region only the true causal SNP is selected by CAVIAR (SNP18). As expected, the size of the r causal set is a function of the LD pattern in the locus and the value of r, with higher values of r resulting in larger sets (see Table S1 and Table S2).
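A simulation of this kind can be sketched in a few lines of Python. The sketch assumes that the z-scores are drawn from a multivariate normal distribution with covariance equal to the LD matrix and mean LD · lam, where lam is nonzero only at the causal SNP. The LD matrices here are synthetic (correlation decaying as rho^|i-j|), not the CAD regions from the paper, and the effect size, locus size, and function names are illustrative assumptions.

```python
import numpy as np


def ar1_ld(m, rho):
    """Synthetic LD matrix whose correlation decays as rho ** |i - j|."""
    idx = np.arange(m)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def simulate_z_scores(ld, causal_index, effect, n_sims, rng):
    """Draw z-scores from N(LD @ lam, LD), lam nonzero only at the causal SNP."""
    lam = np.zeros(ld.shape[0])
    lam[causal_index] = effect
    return rng.multivariate_normal(ld @ lam, ld, size=n_sims)


def causal_rank(z, causal_index):
    """Rank (1 = most significant) of the causal SNP by |z| in each simulation."""
    order = np.argsort(-np.abs(z), axis=1)
    return np.argmax(order == causal_index, axis=1) + 1


rng = np.random.default_rng(0)
for label, rho in [("high LD", 0.95), ("low LD", 0.2)]:
    z = simulate_z_scores(ar1_ld(10, rho), causal_index=4, effect=5.0,
                          n_sims=1000, rng=rng)
    ranks = causal_rank(z, causal_index=4)
    print(label, "fraction of simulations with causal SNP ranked first:",
          np.mean(ranks == 1))
```

A histogram of the ranks produced by causal_rank corresponds to the kind of summary shown in panels C and D of Figure 1.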

We also showcase the scenario of multiple causal variants (see Figure 2). We simulated data as before and considered SNP25 and SNP29 as the causal SNPs. Interestingly, the most significant SNP (SNP27, see Figure 2) tags the true causal variants but it is not itself causal, making the selection based on strength of association alone under the assumption of a single causal or iterative conditioning highly suboptimal. To capture both causal SNPs at least 11 SNPs must be selected in ranking based on P-values or probabilities estimated under a single causal variant assumption. As opposed to existing approaches, CAVIAR selects both SNPs in the 95% causal set together with five additional variants. The gain in accuracy of our approach comes from accurately disregarding SNP30–SNP35 from consideration since their effects can be captured by other SNPs.
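A toy calculation shows why a non-causal SNP can be the most significant one. Under the assumption that the expected z-score vector is LD · lam, where lam holds the effects of the causal SNPs, a SNP that is in high LD with two causal SNPs picks up signal from both. The three-SNP LD matrix and effect sizes below are invented for illustration and are not the region from Figure 2.

```python
import numpy as np

# Hypothetical LD: SNP index 1 is strongly correlated with both SNP 0 and SNP 2,
# while SNP 0 and SNP 2 are only weakly correlated with each other.
ld = np.array([[1.0, 0.8, 0.3],
               [0.8, 1.0, 0.8],
               [0.3, 0.8, 1.0]])

# Assumed effects: SNP 0 and SNP 2 are causal, SNP 1 is not.
lam = np.array([4.0, 0.0, 4.0])

# Expected association statistic at each SNP under the assumed model.
expected_z = ld @ lam
top = int(np.argmax(np.abs(expected_z)))

print(expected_z)  # approximately [5.2, 6.4, 5.2]
print(top)         # 1: the non-causal tag SNP has the largest expected statistic
```

Ranking by strength of association would therefore put the tag SNP first, which is the behavior that selection under a single-causal-variant assumption cannot correct.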

PAINTOR extended the CAVIAR model to also take into account the function of the genetic variation.

The full citations for the two papers are:

Hormozdiari, Farhad; Kostem, Emrah; Kang, Eun Yong; Pasaniuc, Bogdan; Eskin, Eleazar. Identifying causal variants at loci with multiple signals of association. Genetics, 198 (2), pp. 497-508, 2014, ISSN: 1943-2631.

Kichaev, Gleb; Yang, Wen-Yun Y; Lindstrom, Sara; Hormozdiari, Farhad; Eskin, Eleazar; Price, Alkes L; Kraft, Peter; Pasaniuc, Bogdan. Integrating functional data to prioritize causal variants in statistical fine-mapping studies. PLoS Genet, 10 (10), pp. e1004722, 2014, ISSN: 1553-7404.

Emrah Kostem’s talk about his research

Emrah Kostem, who graduated this year and is now at Illumina, gave a talk this summer at our retreat about the research he completed in the lab. It is available here and gives a good overview of the goals of our group and some details of the projects that Emrah completed in the lab.

One of the topics he discusses is his recently published work on estimating heritability, that is, quantifying the proportion of a trait's variance that genetics accounts for. He discusses his work on how to partition heritability into the contributions of genomic regions (doi:10.1016/j.ajhg.2013.03.010).

He also talks about his work that takes advantage of the insight that association statistics follow a multivariate normal distribution and applies it to two problems. The first is selecting follow-up SNPs using the results of an association study (doi:10.1534/genetics.111.128595). The second is speeding up eQTL studies using a two-stage approach in which only a fraction of the association tests are performed but virtually all of the significant associations are still discovered (doi:10.1089/cmb.2013.0087).

Details of what he talked about are in his papers:

Kostem, Emrah; Eskin, Eleazar. Improving the accuracy and efficiency of partitioning heritability into the contributions of genomic regions. Am J Hum Genet, 92 (4), pp. 558-64, 2013, ISSN: 1537-6605.

Kostem, Emrah; Eskin, Eleazar. Efficiently Identifying Significant Associations in Genome-wide Association Studies. J Comput Biol, 20 (10), pp. 817-30, 2013, ISSN: 1557-8666.

Kostem, Emrah; Lozano, Jose A; Eskin, Eleazar. Increasing Power of Genome-wide Association Studies by Collecting Additional SNPs. Genetics, 2011, ISSN: 1943-2631.
