GRAT: Speeding up Expression Quantitative Trait Loci (eQTL) Studies

Fig. 1. An example of applying GRAT in two hypothetical regions. First, the proxy SNP (rectangle) is tested and its statistic is compared to the threshold (dashed line). If the statistic is above the threshold, the remaining SNPs in the region are tested.

Emrah Kostem in our group has recently published GRAT (10.1007/978-3-642-37195-0_10), a new association analysis method targeted toward eQTL studies. GRAT is designed to speed up eQTL studies, and the software is available at http://genetics.cs.ucla.edu/GRAT

Over the past few years, the genome-wide association study (GWAS) approach has been applied to identify regions of the genome that harbor genetic variation affecting gene expression levels. These regions are referred to as expression quantitative trait loci (eQTL) (10.1038/nrg2537), (10.1038/nrg1964). In a typical eQTL study, the GWAS approach is applied to tens of thousands of gene expression levels using millions of SNPs, resulting in billions of association statistics to be computed. This creates a tremendous computational burden, which is only increasing as sequencing technology captures more genetic variation and high-throughput genomic assays collect more phenotypic data, such as isoform expression (10.1016/j.tig.2010.10.006). The problem is compounded by the fact that some of the statistical techniques for analyzing eQTLs use mixed models, which are themselves computationally expensive (10.1038/ng.548), (10.1038/nmeth.1681), (10.1038/ng.2310).

We recently published a paper on a method, GRAT, for performing association analysis in high-throughput phenotype datasets such as eQTL studies.
The key idea behind GRAT is that we first test a subset of the SNPs, and we test the remaining SNPs only in regions where the statistic is above a threshold. Instead of testing all SNPs, our two-stage approach tests around 10% of the SNPs and is designed to identify all significant associations with very high accuracy.

Here is a description of our method from the paper:

Genome-Wide Rapid Association Testing (GRAT)
In Figure 1, we consider two possible scenarios for a genomic region in a GWAS. In (a) the region contains no significant associations and in (b) the region contains a causal SNP. In (a) and (b), the statistics for each SNP are shown, denoting what could have been observed in each scenario had all the SNPs in the region been tested. Let m2 be the proxy SNP for this region to decide whether or not to test the rest of the SNPs. We refer to the SNPs other than the proxy SNP (m1, m3, m4, m5, m6 and m7) as the “remainder SNPs”. If the observed statistic of the proxy SNP is stronger than a threshold value, which in this example is 3.0, the remainder SNPs are tested.
In the first-stage, only the proxy SNP is tested and its association statistic is observed. In (a), where the region contains no associations, the statistic of the proxy SNP is 0.7. The observed statistic of the proxy is less than the threshold value (0.7 < 3.0) and hence none of the remainder SNPs within the region are tested. In (b), the region contains associations and the proxy SNP captures this information. The observed statistic of the proxy SNP is stronger than the threshold value (5.0 > 3.0), which leads to testing each of the remainder SNPs in the region. This results in identifying all the significant SNPs (m3, m4 and m5).
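The two scenarios above can be sketched in a few lines of Python. This is only an illustration of the single-proxy, single-threshold example from Figure 1, not the GRAT software itself: the function name `two_stage_test` is hypothetical, and the statistics for the remainder SNPs are made up (only the proxy values 0.7 and 5.0 and the threshold 3.0 come from the text).

```python
from typing import Callable, Dict, List


def two_stage_test(
    proxy: str,
    remainders: List[str],
    compute_statistic: Callable[[str], float],
    threshold: float,
) -> Dict[str, float]:
    """Test the proxy SNP; test the remainder SNPs only if the proxy passes."""
    results = {proxy: compute_statistic(proxy)}
    if results[proxy] > threshold:
        # Second stage: the proxy exceeded the threshold, so test the rest.
        for snp in remainders:
            results[snp] = compute_statistic(snp)
    return results


# Made-up statistics for illustration. Only the proxy values (0.7 and 5.0)
# and the threshold (3.0) are taken from the example in the text.
region_a = {"m1": 0.4, "m2": 0.7, "m3": 1.1, "m4": 0.9, "m5": 0.6, "m6": 0.2, "m7": 0.5}
region_b = {"m1": 1.0, "m2": 5.0, "m3": 6.5, "m4": 7.2, "m5": 6.1, "m6": 2.0, "m7": 1.2}
remainders = ["m1", "m3", "m4", "m5", "m6", "m7"]

tested_a = two_stage_test("m2", remainders, region_a.__getitem__, threshold=3.0)
tested_b = two_stage_test("m2", remainders, region_b.__getitem__, threshold=3.0)

print(sorted(tested_a))  # region (a): only the proxy SNP was tested
print(len(tested_b))     # region (b): the proxy plus all six remainder SNPs
```

In region (a) a single test is computed instead of seven, which is where the savings come from when most regions contain no associations.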

In the paper, we introduce a novel approach for choosing the proxy SNPs and the threshold values, which provides guarantees that all statistically significant associations will be discovered while computing the fewest association tests. Due to the complexity of linkage disequilibrium (LD) across the genome, we use a separate threshold value for each remainder SNP rather than using a common threshold value for all the remainder SNPs in an LD region. This is done by pairing each remainder SNP with its most strongly correlated proxy SNP, and a threshold value for the pair is used to decide whether or not to test the remainder SNP. We have precomputed the proxy SNPs for the 1000 Genomes Project, and studies imputing to SNPs in this reference can benefit from our method. Even though the LD structure among the SNPs in the study and the reference dataset may be different, our method guarantees to discover all significant associations with high probability. This is achieved by updating the threshold values using the LD structure observed in the study. We term our novel two-stage testing procedure Genome-wide Rapid Association Testing (GRAT).
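The pairing step can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure from the paper: "most strongly correlated" is taken to mean the largest absolute Pearson correlation between genotype vectors, the function names are hypothetical, and the per-pair thresholds are supplied as inputs because the way GRAT derives them is not described here. The genotypes, statistics, and thresholds are made up.

```python
from math import sqrt
from typing import Dict, List, Sequence, Set, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def pair_with_proxies(
    genotypes: Dict[str, List[int]], proxies: Set[str]
) -> Dict[str, Tuple[str, float]]:
    """Pair each remainder SNP with the proxy of largest absolute correlation."""
    pairs = {}
    for snp, g in genotypes.items():
        if snp in proxies:
            continue
        best = max(sorted(proxies), key=lambda p: abs(pearson(g, genotypes[p])))
        pairs[snp] = (best, pearson(g, genotypes[best]))
    return pairs


def snps_to_test(
    pairs: Dict[str, Tuple[str, float]],
    proxy_stats: Dict[str, float],
    pair_thresholds: Dict[str, float],
) -> List[str]:
    """Keep a remainder SNP only if its paired proxy exceeds the pair's threshold."""
    return [
        snp
        for snp, (proxy, _) in pairs.items()
        if proxy_stats[proxy] > pair_thresholds[snp]
    ]


# Made-up genotypes (0/1/2 allele counts) for six individuals.
genotypes = {
    "p1": [0, 1, 2, 1, 0, 2],
    "p2": [2, 0, 1, 0, 2, 1],
    "r1": [0, 1, 2, 2, 0, 2],
    "r2": [2, 0, 1, 0, 1, 1],
}
pairs = pair_with_proxies(genotypes, proxies={"p1", "p2"})
selected = snps_to_test(
    pairs,
    proxy_stats={"p1": 5.0, "p2": 0.7},      # hypothetical first-stage statistics
    pair_thresholds={"r1": 3.0, "r2": 3.0},  # hypothetical per-pair thresholds
)
print(pairs["r1"][0], pairs["r2"][0], selected)
```

Because each remainder SNP carries its own threshold, two SNPs paired with the same proxy can receive different decisions, which is the point of not using one threshold per LD region.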

GRAT can be applied to a wide range of statistical models, such as case-control studies, quantitative traits and linear mixed models (LMM). In particular, the LMM approach has recently become popular due to its effective control of population structure. Computing the LMM association statistic is computationally expensive and recently its efficient computation has attracted great interest (10.1038/ng.548), (10.1038/nmeth.1681), (10.1038/ng.2310). The speed-up due to GRAT is cumulative with these efforts.

There are some interesting aspects to the computational method. Given a set of proxy SNPs, our method solves an optimization problem that minimizes the number of SNPs tested while achieving a given recall rate. We prove that this problem is convex and show that it can be solved very efficiently.
Furthermore, we propose a greedy algorithm to search for the optimal proxy SNPs.

The full citation is:

Kostem, Emrah; Eskin, Eleazar

Efficiently Identifying Significant Associations in Genome-Wide Association Studies

Research in Computational Molecular Biology, University of California Springer Berlin Heidelberg, 2013.

How much does part of a genome contribute to a trait?

Both genetic and environmental factors contribute to a trait. The genetic factors that contribute to a trait are typically spread over the genome. Emrah Kostem in our group recently published a paper on estimating how much a specific genomic region (such as a single chromosome) contributes to a trait (10.1016/j.ajhg.2013.03.010) and released software for performing this analysis, called HEIDI, which is available at http://genetics.cs.ucla.edu/heritability/. This type of analysis is referred to as “partitioning heritability into the contributions of genomic regions.”

Estimating the heritability of a trait, i.e., measuring the influence of nature vs. nurture, has been a fundamental question in genetics. Traditionally, heritabilities were estimated using related individuals with known pedigrees, such as twins or family cohorts. With the availability of high-throughput genomic technologies, it has been shown that heritability estimates similar to the traditional ones can be obtained from genome-wide association study (GWAS) datasets of unrelated individuals (10.1038/ng.608). In these approaches, the genetic similarities, or kinships, among the individuals are computed from the observed SNPs rather than inferred from pedigree data.

Additionally, high-throughput SNP data also makes it possible to estimate local genetic similarities, which have recently been used to partition the heritability of a trait into the contributions of genomic regions (10.1038/ng.823). A naive approach estimates the heritability contributions using a linear mixed model (LMM), where each region is modeled using a separate variance component.

We presented a method called HEIDI (Heritability Estimations Distributed) to improve the accuracy and computational efficiency of partitioning the heritability of a trait into the contributions of genomic regions. We show that the naive approach is not accurate for a large number of regions and does not scale beyond several partitions per chromosome in a study with 5000 individuals. We proposed an alternative approach, where the heritability contribution of a region is obtained using a model that includes the region and its genetic complement, i.e., the rest of the genome. The advantage of this two-component model is that it is computationally efficient and fast to fit. It also makes it possible to parallelize the heritability estimations, since the computation for each region can be performed separately across computers.
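To make the region-versus-complement split concrete, here is a minimal sketch of building the two kinship matrices that such a two-component model would take as input. It is not the HEIDI implementation: the kinship estimator (the average product of standardized genotypes) is a commonly used choice that is assumed here, the function names are hypothetical, the genotypes are made up, and the variance-component fitting itself is omitted.

```python
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[float]]


def standardize(column: Sequence[float]) -> List[float]:
    """Center and scale one SNP's genotypes; constant SNPs become all zeros."""
    n = len(column)
    mean = sum(column) / n
    var = sum((g - mean) ** 2 for g in column) / n
    if var == 0:
        return [0.0] * n
    sd = var ** 0.5
    return [(g - mean) / sd for g in column]


def kinship(snps: Sequence[Sequence[float]]) -> Matrix:
    """Kinship between individuals, averaged over the given SNPs."""
    n = len(snps[0])
    z = [standardize(col) for col in snps]
    m = len(z)
    return [[sum(col[i] * col[j] for col in z) / m for j in range(n)] for i in range(n)]


def region_and_complement(
    snps_by_region: Dict[str, List[List[float]]], region: str
) -> Tuple[Matrix, Matrix]:
    """Return (kinship from the region's SNPs, kinship from all other SNPs)."""
    inside = snps_by_region[region]
    outside = [
        col
        for name, cols in snps_by_region.items()
        if name != region
        for col in cols
    ]
    return kinship(inside), kinship(outside)


# Made-up genotypes: each inner list is one SNP across four individuals.
snps_by_region = {
    "chr1": [[0, 1, 2, 1], [2, 2, 0, 1]],
    "chr2": [[1, 0, 0, 2], [0, 1, 1, 2], [2, 1, 0, 0]],
}

# Each region is an independent job, so this loop could be spread across machines.
inputs = {name: region_and_complement(snps_by_region, name) for name in snps_by_region}
k_region, k_complement = inputs["chr1"]
print(len(k_region), len(k_complement))
```

Since every region needs only its own pair of matrices, no job depends on the result of another, which is what makes the estimations easy to distribute.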

We show that the estimates of heritability contributions are inflated when the region and its genetic complement have SNPs that are in linkage disequilibrium (LD), and we introduce a normalization procedure to mitigate the effect of LD. We normalize the contributions of the chromosomes so that their sum equals the genome-wide heritability estimate, and within each chromosome the regions’ contributions are normalized so that they sum to the chromosome’s contribution.
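The two-level normalization can be illustrated with a short sketch. It assumes simple proportional rescaling, which satisfies the sum constraints described above but may differ in detail from the procedure in the paper; the function name `rescale` and all numbers are hypothetical.

```python
from typing import Dict


def rescale(values: Dict[str, float], target: float) -> Dict[str, float]:
    """Scale the values proportionally so that they sum to `target`."""
    total = sum(values.values())
    if total == 0:
        return {name: 0.0 for name in values}
    return {name: value * target / total for name, value in values.items()}


# Made-up raw estimates, inflated relative to the genome-wide estimate.
genome_h2 = 0.5
chrom_raw = {"chr1": 0.4, "chr2": 0.2}
regions_raw = {
    "chr1": {"chr1:a": 0.3, "chr1:b": 0.2},
    "chr2": {"chr2:a": 0.25},
}

# Level 1: chromosome contributions sum to the genome-wide heritability.
chrom = rescale(chrom_raw, genome_h2)

# Level 2: region contributions sum to their chromosome's contribution.
regions = {c: rescale(r, chrom[c]) for c, r in regions_raw.items()}

print(chrom)
print(regions)
```

After both steps, the region contributions add up to the chromosome contributions, which in turn add up to the genome-wide estimate.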

The full citation to the paper is:

Kostem, Emrah; Eskin, Eleazar

Improving the accuracy and efficiency of partitioning heritability into the contributions of genomic regions.

Am J Hum Genet, 92 (4), pp. 558-64, 2013, ISSN: 1537-6605.
