Causal Variants Identification in Associated Regions

Figure 1 (A and B) Simulated data for two regions with different LD patterns that contain 35 SNPs. A and B are obtained by considering the 100 kbp upstream and downstream of rs10962894 and rs4740698, respectively, from the Wellcome Trust Case–Control Consortium study for coronary artery disease (CAD). (C and D) The rank of the causal SNP in additional simulations for the regions in A and B, respectively. We obtain these histograms from simulation data by randomly generating GWAS statistics using multivariate normal distribution. We apply the simulation 1000 times.

Figure 1 (A and B) Simulated data for two regions with different LD patterns that contain 35 SNPs. A and B are obtained by considering the 100 kbp upstream and downstream of rs10962894 and rs4740698, respectively, from the Wellcome Trust Case–Control Consortium study for coronary artery disease (CAD).

Our group in collaboration with our UCLA colleague Bogdan Pasaniuc’s group recently published two papers focusing on “statistical fine mapping”. We published a paper on a method called CAVIAR in the journal Genetics and Bogdan’s lab published a method called PAINTOR in PLoS Genetics. The software is available at http://genetics.cs.ucla.edu/caviar/ and http://bogdan.bioinformatics.ucla.edu/software/PAINTOR/.

Although genome-wide association studies have successfully identified thousands of regions of the genome which contain genetic variation involved in disease, only a handful of the biologically causal variants, responsible for these associations, have been successfully identified. Because of the correlation structure of genetic variants, in each region, there are many variants that are associated with disease. The process of predicting which subset of the genetic variants are actually responsible for the association is referred to as statistical mapping.

Current statistical methods for identifying causal variants at risk loci either use the strength of association signal in an iterative conditioning framework, or estimate probabilities for variants to be causal. A main drawback of existing methods is that they rely on the simplifying assumption of a single causal variant at each risk locus which is typically invalid at many risk loci. In our papers, we propose a new statistical framework that allows for the possibility of an arbitrary number of causal variants when estimating the posterior probability of a variant being causal. A direct benefit of our approach is that we predict a set of variants for each locus that under reasonable assumptions will contain all of the true causal variants with a high confidence level (e.g. 95%) even when the locus contains multiple causal variants. We use simulations to show that our approach provides 20-50% improvement in our ability to identify the causal variants compared to the existing methods at loci harboring multiple causal variants.

Figure 2 Simulated association with two causal SNPs. (A) The 100-kbp region around the rs10962894 SNP and simulated statistics at each SNP generated assuming two SNPs are causal. In this example SNP25 and SNP29 are considered as the causal SNPs. However, the most significant SNP is the SNP27. (B) The causal set selected by CAVIAR (our method) and the top k SNPs method. We ranked the selected SNPs based on the association statistics. The gray bars indicate the selected SNPs by both methods, the green bars indicate the selected SNPs by the top k SNPs method only, and the blue bars indicate the selected SNPs by CAVIAR only. The CAVIAR set consists of SNP17, SNP20, SNP21, SNP25, SNP26, SNP28, and SNP29. For the top k SNPs method to capture the two causal SNPs we have to set k to 11, as one of the causal SNPs is ranked 11th based on its significant score. Unfortunately, knowing the value of k beforehand is not possible. Even if the value of k is known, the causal set selected by our method excludes SNP30–SNP35 from the follow-up studies and reduces the cost of follow-up studies by 30% compared to the top k method.

Figure 2 Simulated association with two causal SNPs. (A) The 100-kbp region around the rs10962894 SNP and simulated statistics at each SNP generated assuming two SNPs are causal. In this example SNP25 and SNP29 are considered as the causal SNPs. However, the most significant SNP is the SNP27. (B) The causal set selected by CAVIAR (our method) and the top k SNPs method. We ranked the selected SNPs based on the association statistics. The gray bars indicate the selected SNPs by both methods, the green bars indicate the selected SNPs by the top k SNPs method only, and the blue bars indicate the selected SNPs by CAVIAR only. The CAVIAR set consists of SNP17, SNP20, SNP21, SNP25, SNP26, SNP28, and SNP29. For the top k SNPs method to capture the two causal SNPs we have to set k to 11, as one of the causal SNPs is ranked 11th based on its significant score. Unfortunately, knowing the value of k beforehand is not possible. Even if the value of k is known, the causal set selected by our method excludes SNP30–SNP35 from the follow-up studies and reduces the cost of follow-up studies by 30% compared to the top k method.

From the CAVIAR paper:
Overview of statistical fine mapping

Our approach, CAVIAR, takes as input the association statistics for all of the SNPs (variants) at the locus together with the correlation structure between the variants obtained from a reference data set such as the HapMap (Gibbs et al. 2003; Frazer et al. 2007) or 1000 Genomes project (Abecasis et al. 2010) data. Using this information, our method predicts a subset of the variants that has the property that all the causal SNPs are contained in this set with the probability r (we term this set the “r causal set”). In practice we set r to values close to 100%, typically $95%, and let CAVIAR find the set with the fewest number of SNPs that contains the causal SNPs with probability at least r. The causal set can be viewed as a confidence interval. We use the causal set in the follow-up studies by validating only the SNPs that are present in the set. While in this article we discuss SNPs for simplicity, our approach can be applied to any type of genetic variants, including structural variants.

We used simulations to show the effect of LD on the resolution of fine mapping. We selected two risk loci (with large and small LD) to showcase the effect of LD on fine mapping (see Figure 1, A and B). The first region is obtained by considering 100 kbp upstream and downstream of the rs10962894 SNP from the coronary artery disease (CAD) case–control study. As shown in the Figure 1A, the correlation between the significant SNP and the neighboring SNPs is high. We simulated GWAS statistics for this region by taking advantage that the statistics follow a multivariate normal dis- tribution, as shown in Han et al. (2009) and Zaitlen et al. (2010) (see Materials and Methods). CAVIAR selects the true causal SNP, which is SNP8, together with six additional variants (Figure 1A). Thus, when following up this locus, we have only to consider these SNPs to identify the true causal SNPs. The second region showcases loci with lower LD (see Figure 1B). In this region only the true causal SNP is selected by CAVIAR (SNP18). As expected, the size of the r causal set is a function of the LD pattern in the locus and the value of r, with higher values of r resulting in larger sets (see Table S1 and Table S2).

We also showcase the scenario of multiple causal variants (see Figure 2). We simulated data as before and considered SNP25 and SNP29 as the causal SNPs. Interestingly, the most significant SNP (SNP27, see Figure 2) tags the true causal variants but it is not itself causal, making the selection based on strength of association alone under the assumption of a single causal or iterative conditioning highly suboptimal. To capture both causal SNPs at least 11 SNPs must be selected in ranking based on P-values or probabilities estimated under a single causal variant assumption. As opposed to existing approaches, CAVIAR selects both SNPs in the 95% causal set together with five additional variants. The gain in accuracy of our approach comes from accurately disregarding SNP30–SNP35 from consideration since their effects can be captured by other SNPs.

PAINTOR extended the CAVIAR model to also take into account the function of the genetic variation.

The full citations for the two papers are:

Hormozdiari, Farhad; Kostem, Emrah ; Kang, Eun Yong ; Pasaniuc, Bogdan ; Eskin, Eleazar

Identifying causal variants at Loci with multiple signals of association. Journal Article

In: Genetics, 198 (2), pp. 497-508, 2014, ISSN: 1943-2631.

Abstract | Links | BibTeX

Kichaev, Gleb; Yang, Wen-Yun Y; Lindstrom, Sara ; Hormozdiari, Farhad ; Eskin, Eleazar ; Price, Alkes L; Kraft, Peter ; Pasaniuc, Bogdan

Integrating functional data to prioritize causal variants in statistical fine-mapping studies. Journal Article

In: PLoS Genet, 10 (10), pp. e1004722, 2014, ISSN: 1553-7404.

Abstract | Links | BibTeX

Emrah Kostem’s talk about his research

Emrah Kostem, who graduated this year and is now at Illumina, gave a talk about the research he completed in the lab this summer at our retreat.  It is available here and gives a good overview of what the goals of our group are and some details of the projects that Emrah completed in the lab.

One of the topics he discusses is his recently published work on estimating heritability, which is quantifying the amount that genetics accounts for the variance of a trait.  He discusses his work on how to partition heritability into the contributions of genomic regions(10.1016/j.ajhg.2013.03.010).

He also talks about his work which takes advantage of the insight that association statistics follow the multivariate normal distribution and applies this to two problems.  The first is the problem of selecting follow up SNPs using the results of an association study(10.1534/genetics.111.128595).  The second problem is the problem of speeding up eQTL studies using a two stage approach where only a fraction of the association tests are performed but virtually all of the significant associations are still discovered(10.1089/cmb.2013.0087).

Details of what he talked about are in his papers:

Kostem, Emrah; Eskin, Eleazar

Improving the accuracy and efficiency of partitioning heritability into the contributions of genomic regions. Journal Article

In: Am J Hum Genet, 92 (4), pp. 558-64, 2013, ISSN: 1537-6605.

Abstract | Links | BibTeX

Kostem, Emrah; Eskin, Eleazar

Efficiently Identifying Significant Associations in Genome-wide Association Studies. Journal Article

In: J Comput Biol, 20 (10), pp. 817-30, 2013, ISSN: 1557-8666.

Abstract | Links | BibTeX

Kostem, Emrah; Lozano, Jose A; Eskin, Eleazar

Increasing Power of Genome-wide Association Studies by Collecting Additional SNPs. Journal Article

In: Genetics, 2011, ISSN: 1943-2631.

Abstract | Links | BibTeX

Bibliography

GRAT: Speeding up Expression Quantitative Trail Loci (eQTL) Studies

Fig. 1. An example of applying GRAT in two hypothetical regions. First, the proxy SNP (rectangle) is tested and its statistics is compared to the threshold (dashed line). If the statistic is above the threshold, the remaining SNPs in the region are tested.

Fig. 1. An example of applying GRAT in two hypothetical regions. First, the proxy SNP (rectangle) is tested and its statistics is compared to the threshold (dashed line). If the statistic is above the threshold, the remaining SNPs in the region are tested.

Emrah Kostem in our group has recently published a new method for association analysis targeted toward eQTL analysis called GRAT(10.1007/978-3-642-37195-0_10).  GRAT is designed to speed up eQTL studies and the software is available at http://genetics.cs.ucla.edu/GRAT

Over the past few years, the genome wide association study (GWAS) approach has been applied to identify regions of the genome, which harbor genetic variation that affects gene expression levels. These regions are referred to as expression quantitative trait loci (eQTL)(10.1038/nrg2537).(10.1038/nrg1964). In a typical eQTL study, the GWAS approach is applied to tens of thousands of gene expression levels using millions of SNPs, resulting in billions of association statistics to be computed. This results in a tremendous computational burden, which is only increasing with sequencing technology collecting more genetic variations and high-throughput genomic data collecting more phenotypic data such as isoform expression(10.1016/j.tig.2010.10.006). This problem is compounded by the fact that some of the statistical techniques for analyzing eQTLs utilize mixed models and themselves are computationally expensive(10.1038/ng.548),(10.1038/nmeth.1681).(10.1038/ng.2310).

We recently published a paper on a method, GRAT, to perform association analysis in high-throughput phenotype datasets, such as the eQTL studies.
The key idea behind GRAT is that we first test a subset of the SNPs and only in regions where the statistic is above a threshold, we test the remaining regions. In contrast to testing all SNPs, our approach tests around 10% of the SNPs in two-stages and guarantees to identify all significant associations with a very high accuracy.

Here is a description of our method from the paper:

Genome-Wide Rapid Association Testing (GRAT)
In Figure 1, we consider two possible scenarios for a genomic region in a GWAS. In (a) the region contains no significant associations and in (b) the region con- tains a causal SNP. In (a) and (b), the statistics for each SNP are shown, denoting what could have been observed in each scenario had all the SNPs in the region been tested. Let m2 be the proxy SNP for this region to decide whether or not to test the rest of the SNPs. We refer to the SNPs other than the proxy SNP ( m1, m3, m4, m5, m6 and m7 ) as the “remainder SNPs”. If the observed statistic of the proxy SNP is stronger than a threshold value, which in this example is 3.0, the remainder SNPs are tested.
In the first-stage, only the proxy SNP is tested and its association statistic is observed. In (a), where the region contains no associations, the statistic of the proxy SNP is 0.7. The observed statistic of the proxy is less than the threshold value ( 0.7 < 3.0 ) and hence none of the remainder SNPs within the region are tested. In (b), the region contains associations and the proxy SNP captures this information. The observed statistic of the proxy SNP is stronger than the thresh- old value ( 5.0 > 3.0 ), which leads to testing each of the remainder SNPs in the region. This results in identifying all the significant SNPs ( m3, m4 and m5 ).

In the paper, we introduce a novel approach for choosing the proxy SNPs and the threshold values, which provide guarantees that all statistically significant associations will be discovered while computing the least amount of association tests. Due to the complexity of linkage disequilibrium (LD) across the genome, we use a separate threshold value for each remainder SNP rather than using a common threshold value for all the remainders SNPs in an LD region. This is performed by pairing each remainder SNP with its most strongly correlated proxy SNP and a threshold value is used for the pair to decide whether or not to test the remainder SNP. We have precomputed the proxy SNPs for the 1000 Genomes Project and studies imputing to SNPs in this reference can benefit from our method. Even though the LD structure among the SNPs in the study and the reference dataset may be different, our method guarantees to discover all significant associations with high-probability. This is achieved by updating the threshold values using the LD structure observed in the study. We term our novel two-stage testing procedure as Genome-wide Rapid Association Testing (GRAT).

GRAT can be applied to a wide range of statistical models, such as case- control studies, quantitative traits and linear mixed models (LMM). In particu- lar, the LMM approach has recently become popular due to its effective control of population structure. Computing the LMM association statistic is compu- tationally expensive and recently its efficient computation has attracted great interest(10.1038/ng.548),(10.1038/nmeth.1681).(10.1038/ng.2310). The speed-up due to GRAT is cumulative with these efforts.

There are some interesting aspects to the computational method. For given proxy SNPs, our method solves an optimization problem that minimizes the number of SNPs tested that will result in a given recall rate. We prove that this problem is convex and show that it can be solved very efficiently.
Furthermore, we propose a greedy algorithm to search for the optimal proxy SNPs.

The full citation is:

Kostem, Emrah; Eskin, Eleazar

Efficiently Identifying Significant Associations in Genome-Wide Association Studies Conference

Research in Computational Molecular Biology, University of California Springer Berlin Heidelberg, 2013.

Abstract | Links | BibTeX

Bibliography