GRAT: Speeding up Expression Quantitative Trait Loci (eQTL) Studies

Fig. 1. An example of applying GRAT in two hypothetical regions. First, the proxy SNP (rectangle) is tested and its statistic is compared to the threshold (dashed line). If the statistic is above the threshold, the remaining SNPs in the region are tested.

Emrah Kostem in our group has recently published GRAT (10.1007/978-3-642-37195-0_10), a new association analysis method targeted toward eQTL studies. GRAT is designed to speed up eQTL studies and the software is available at

Over the past few years, the genome-wide association study (GWAS) approach has been applied to identify regions of the genome that harbor genetic variation affecting gene expression levels. These regions are referred to as expression quantitative trait loci (eQTL) (10.1038/nrg2537), (10.1038/nrg1964). In a typical eQTL study, the GWAS approach is applied to tens of thousands of gene expression levels using millions of SNPs, resulting in billions of association statistics to be computed. This creates a tremendous computational burden, which is only increasing as sequencing technology collects more genetic variation and high-throughput genomic assays collect more phenotypic data, such as isoform expression (10.1016/j.tig.2010.10.006). The problem is compounded by the fact that some of the statistical techniques for analyzing eQTLs utilize mixed models and are themselves computationally expensive (10.1038/ng.548), (10.1038/nmeth.1681), (10.1038/ng.2310).

We recently published a paper on a method, GRAT, to perform association analysis in high-throughput phenotype datasets, such as eQTL studies.
The key idea behind GRAT is that we first test a subset of the SNPs, and we test the remaining SNPs only in regions where the statistic is above a threshold. In contrast to testing all SNPs, our approach tests around 10% of the SNPs in two stages while identifying the significant associations with very high accuracy.
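
To get a rough sense of the scale involved, the following back-of-the-envelope sketch counts association tests with and without a two-stage scheme. The study dimensions (20,000 expression traits and 1,000,000 SNPs) are illustrative assumptions, not figures from the paper; only the roughly 10% tested fraction comes from the text above.

```python
def count_tests(n_traits: int, n_snps: int, tested_fraction: float = 1.0) -> int:
    """Number of SNP-trait association statistics that must be computed."""
    return round(n_traits * n_snps * tested_fraction)


# Hypothetical study dimensions (assumptions for illustration only).
N_TRAITS = 20_000
N_SNPS = 1_000_000

full_scan = count_tests(N_TRAITS, N_SNPS)
two_stage = count_tests(N_TRAITS, N_SNPS, tested_fraction=0.10)

print(f"full scan: {full_scan:,} tests")
print(f"two-stage: {two_stage:,} tests")
```

Under these assumed dimensions, the full scan needs tens of billions of statistics, and testing about 10% of the SNPs cuts that by an order of magnitude.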

Here is a description of our method from the paper:

Genome-Wide Rapid Association Testing (GRAT)
In Figure 1, we consider two possible scenarios for a genomic region in a GWAS. In (a) the region contains no significant associations and in (b) the region contains a causal SNP. In (a) and (b), the statistics for each SNP are shown, denoting what could have been observed in each scenario had all the SNPs in the region been tested. Let m2 be the proxy SNP for this region, used to decide whether or not to test the rest of the SNPs. We refer to the SNPs other than the proxy SNP (m1, m3, m4, m5, m6 and m7) as the “remainder SNPs”. If the observed statistic of the proxy SNP is stronger than a threshold value, which in this example is 3.0, the remainder SNPs are tested.
In the first stage, only the proxy SNP is tested and its association statistic is observed. In (a), where the region contains no associations, the statistic of the proxy SNP is 0.7. The observed statistic of the proxy is less than the threshold value (0.7 < 3.0) and hence none of the remainder SNPs within the region are tested. In (b), the region contains associations and the proxy SNP captures this information. The observed statistic of the proxy SNP is stronger than the threshold value (5.0 > 3.0), which leads to testing each of the remainder SNPs in the region. This results in identifying all the significant SNPs (m3, m4 and m5).
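
The two-stage decision in the Figure 1 example can be sketched in a few lines of Python. This is a minimal illustration, not the GRAT software: the function name `scan_region`, the significance cutoff of 5.5, and every statistic other than the two proxy values (0.7 and 5.0) are hypothetical values chosen to match the description above.

```python
from typing import Callable, Dict, Sequence


def scan_region(proxy: str,
                remainders: Sequence[str],
                compute_statistic: Callable[[str], float],
                proxy_threshold: float) -> Dict[str, float]:
    """Two-stage scan of one region; returns the statistics actually computed."""
    # Stage 1: test only the proxy SNP.
    tested = {proxy: compute_statistic(proxy)}
    # Stage 2: test the remainder SNPs only if the proxy exceeds the threshold.
    if tested[proxy] > proxy_threshold:
        for snp in remainders:
            tested[snp] = compute_statistic(snp)
    return tested


PROXY = "m2"
REMAINDERS = ["m1", "m3", "m4", "m5", "m6", "m7"]
PROXY_THRESHOLD = 3.0       # threshold from the Figure 1 example
SIGNIFICANCE_CUTOFF = 5.5   # hypothetical cutoff, chosen so m3, m4, m5 are the hits

# Hypothetical statistics; only the m2 values (0.7 and 5.0) come from the text.
REGION_A = {"m1": 0.3, "m2": 0.7, "m3": 1.1, "m4": 0.9, "m5": 0.4, "m6": 0.2, "m7": 0.5}
REGION_B = {"m1": 2.1, "m2": 5.0, "m3": 6.0, "m4": 7.5, "m5": 6.2, "m6": 3.4, "m7": 1.8}

results = {}
for label, region in (("a", REGION_A), ("b", REGION_B)):
    # region.__getitem__ stands in for an expensive association test.
    tested = scan_region(PROXY, REMAINDERS, region.__getitem__, PROXY_THRESHOLD)
    hits = sorted(snp for snp, stat in tested.items() if stat > SIGNIFICANCE_CUTOFF)
    results[label] = (len(tested), hits)
    print(f"region ({label}): {len(tested)} SNPs tested, significant: {hits}")
```

In region (a) only the proxy is tested, so six of the seven tests are skipped; in region (b) all seven SNPs are tested and the three significant ones are recovered.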

In the paper, we introduce a novel approach for choosing the proxy SNPs and the threshold values, which provides guarantees that all statistically significant associations will be discovered while computing as few association tests as possible. Due to the complexity of linkage disequilibrium (LD) across the genome, we use a separate threshold value for each remainder SNP rather than a common threshold value for all the remainder SNPs in an LD region. This is done by pairing each remainder SNP with its most strongly correlated proxy SNP; a threshold value for the pair is then used to decide whether or not to test the remainder SNP. We have precomputed the proxy SNPs for the 1000 Genomes Project, and studies imputing to SNPs in this reference can benefit from our method. Even though the LD structure among the SNPs in the study and in the reference dataset may differ, our method is guaranteed to discover all significant associations with high probability. This is achieved by updating the threshold values using the LD structure observed in the study. We term our novel two-stage testing procedure Genome-wide Rapid Association Testing (GRAT).
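
The pairing step described above can be illustrated with a small sketch. It assumes genotypes coded as minor-allele counts (0, 1, 2), uses the Pearson correlation as the LD measure, and takes the per-pair thresholds as given inputs. The function names, toy genotypes, proxy statistics, and threshold values are all hypothetical; in particular, the paper derives its thresholds from an optimization that this sketch does not reproduce.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes both SNPs are polymorphic (non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def pair_with_best_proxy(genotypes: Dict[str, Sequence[float]],
                         proxies: List[str]) -> Dict[str, Tuple[str, float]]:
    """Map each remainder SNP to (most strongly correlated proxy, correlation)."""
    pairs = {}
    for snp, g in genotypes.items():
        if snp in proxies:
            continue
        best = max(proxies, key=lambda p: abs(pearson_r(g, genotypes[p])))
        pairs[snp] = (best, pearson_r(g, genotypes[best]))
    return pairs


def remainders_to_test(pairs: Dict[str, Tuple[str, float]],
                       proxy_stats: Dict[str, float],
                       thresholds: Dict[str, float]) -> List[str]:
    """Remainder SNPs whose paired proxy statistic exceeds the pair's threshold."""
    return sorted(snp for snp, (proxy, _) in pairs.items()
                  if abs(proxy_stats[proxy]) > thresholds[snp])


# Toy genotypes for six individuals (hypothetical data).
GENOTYPES = {
    "p1": [0, 1, 2, 0, 1, 2],
    "p2": [2, 0, 1, 0, 2, 1],
    "r1": [0, 1, 2, 0, 1, 1],
    "r2": [2, 0, 1, 1, 2, 1],
}
pairs = pair_with_best_proxy(GENOTYPES, proxies=["p1", "p2"])

# Hypothetical stage-one proxy statistics and placeholder per-pair thresholds.
to_test = remainders_to_test(pairs,
                             proxy_stats={"p1": 4.2, "p2": 0.9},
                             thresholds={"r1": 3.0, "r2": 2.5})
print(pairs)
print("remainder SNPs to test in stage two:", to_test)
```

Because each remainder SNP carries its own threshold, two SNPs that share a proxy can still receive different stage-two decisions, which is the point of not using one common threshold per LD region.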

GRAT can be applied to a wide range of statistical models, such as case-control studies, quantitative traits and linear mixed models (LMM). In particular, the LMM approach has recently become popular due to its effective control of population structure. Computing the LMM association statistic is computationally expensive, and its efficient computation has recently attracted great interest (10.1038/ng.548), (10.1038/nmeth.1681), (10.1038/ng.2310). The speed-up due to GRAT is cumulative with these efforts.

There are some interesting aspects to the computational method. For given proxy SNPs, our method solves an optimization problem that minimizes the number of SNPs tested subject to achieving a given recall rate. We prove that this problem is convex and show that it can be solved very efficiently.
Furthermore, we propose a greedy algorithm to search for the optimal proxy SNPs.

The full citation is:

Kostem, Emrah; Eskin, Eleazar

Efficiently Identifying Significant Associations in Genome-Wide Association Studies.

Research in Computational Molecular Biology, University of California Springer Berlin Heidelberg, 2013.


How much does part of a genome contribute to a trait?

Both genetic and environmental factors contribute to a trait. The genetic factors that contribute to a trait are typically spread over the genome. Emrah Kostem in our group recently published a paper on estimating how much a specific genomic region (such as a single chromosome) contributes to a trait (10.1016/j.ajhg.2013.03.010) and released software for performing this analysis, called HEIDI, which is available at  This type of analysis is referred to as “partitioning heritability into the contributions of genomic regions.”

Estimating the heritability of a trait, e.g., measuring the influence of nature vs. nurture, has been a fundamental question in genetics. Traditionally, heritabilities were estimated using related individuals with known pedigrees, such as twins or family cohorts. With the availability of high-throughput genomic technologies, it has been shown that heritability estimates similar to those obtained traditionally can be computed from genome-wide association study (GWAS) datasets of unrelated individuals (10.1038/ng.608). In these approaches, the genetic similarities, or kinships, among the individuals are computed from the observed spectrum of the SNPs rather than inferred from given pedigree data.

Additionally, high-throughput SNP data also makes it possible to estimate local genetic similarities, which have recently been used to partition the heritability of a trait into the contributions of genomic regions (10.1038/ng.823). A naive approach estimates the heritability contributions using a linear mixed model (LMM), where each region is modeled using a separate variance component.

We presented a method called HEIDI (Heritability Estimations Distributed) to improve the accuracy and computational efficiency of partitioning the heritability of a trait into the contributions of genomic regions. We show that the naive approach is not accurate for a large number of regions and also does not scale beyond several partitions per chromosome in a study with 5000 individuals. We proposed an alternative approach, where the heritability contribution of a region is obtained using a model that includes the region and its genetic complement, i.e., the rest of the genome. The advantage of this two-component model is that it is computationally efficient and fast to fit. It also makes it possible to parallelize the heritability estimations, since the computation for each region can be performed separately across computers.

We show that the estimates of heritability contributions are inflated when the region and its genetic complement have SNPs that are in linkage disequilibrium (LD), and we introduce a normalization procedure to mitigate the effect of LD. We normalize the contributions of the chromosomes so that their sum equals the genome-wide heritability estimate, and within each chromosome the regions’ contributions are normalized so that they sum to the chromosome’s contribution.
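
The two-level normalization can be sketched as follows. This sketch assumes simple proportional rescaling at each level, which is one natural reading of the description above and may differ from the exact procedure in the paper. The function names and all numeric values are hypothetical.

```python
from typing import Dict, Tuple


def rescale_to_total(raw: Dict[str, float], total: float) -> Dict[str, float]:
    """Scale the raw estimates proportionally so that they sum to `total`."""
    raw_sum = sum(raw.values())
    if raw_sum <= 0:
        raise ValueError("raw contributions must have a positive sum")
    return {name: value * total / raw_sum for name, value in raw.items()}


def normalize_contributions(
    genome_h2: float,
    chrom_raw: Dict[str, float],
    region_raw: Dict[str, Dict[str, float]],
) -> Tuple[Dict[str, float], Dict[str, Dict[str, float]]]:
    """Normalize chromosomes to the genome-wide estimate, then regions to chromosomes."""
    chrom = rescale_to_total(chrom_raw, genome_h2)
    regions = {c: rescale_to_total(r, chrom[c]) for c, r in region_raw.items()}
    return chrom, regions


# Hypothetical inflated estimates from per-region two-component fits.
chrom_norm, region_norm = normalize_contributions(
    genome_h2=0.5,
    chrom_raw={"chr1": 0.4, "chr2": 0.4},
    region_raw={"chr1": {"region_a": 0.3, "region_b": 0.2}},
)
print(chrom_norm)
print(region_norm)
```

In this toy case the raw chromosome estimates sum to 0.8, more than the genome-wide estimate of 0.5, so each is scaled down; the regions of chr1 are then scaled so that they add up to the normalized chr1 contribution.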

The full citation to the paper is:

Kostem, Emrah; Eskin, Eleazar

Improving the accuracy and efficiency of partitioning heritability into the contributions of genomic regions.

Am J Hum Genet, 92 (4), pp. 558-64, 2013, ISSN: 1537-6605.


Genes, Diet, and Body Weight (in Mice)

What affects body weight? Is it genetic factors or is it diet or is it both?

Our group collaborated with Jake Lusis’ group on a mouse study led by Brian Parks that aimed to address this question and the results were published in Cell Metabolism (10.1016/j.cmet.2012.12.007). From our group, Emrah Kostem contributed to the study. The study received a lot of press coverage including Science News, Huffington Post, and

It turns out that not only do both genes and diet contribute to body weight, but the interaction between genes and diet is also a significant factor. Some strains of mice gained a significant amount of weight on a high-fat diet, while others did not. These types of interactions, or “gene-by-environment” interactions, are of great interest to our group and we are working on several projects on this topic.

What is also exciting about this study is that it is the first published report from our second round of studies using the Hybrid Mouse Diversity Panel (HMDP) (10.1007/s00335-012-9411-5). The first round of studies reported associations for lipids (10.1101/gr.099234.109), bone traits (10.1371/journal.pgen.1002038), and fear conditioning (10.1186/1752-0509-5-43). This next round focuses on gene-by-environment interactions. The HMDP is no longer just a UCLA project: several groups at other institutions, including USC, UC Berkeley and the University of Washington, are now also involved in HMDP studies.


Full Citation:

Parks, B.W., Nam, E., Org, E., Kostem, E., Norheim, F., Hui, S.T., Pan, C., Civelek, M., Rau, C.D., Bennett, B.J., Mehrabian, M., Ursell, L.K., He, A., Castellani, L.W., Zinker, B., Kirby, M., Drake, T.A., Drevon, C.A., Knight, R., Gargalovic, P., Kirchgessner, T., Eskin, E. & Lusis, A.J., 2013, Genetic control of obesity and gut microbiota composition in response to high-fat, high-sucrose diet in mice, Cell Metab, 17(1), pp. 141-52.


Obesity is a highly heritable disease driven by complex interactions between genetic and environmental factors. Human genome-wide association studies (GWAS) have identified a number of loci contributing to obesity; however, a major limitation of these studies is the inability to assess environmental interactions common to obesity. Using a systems genetics approach, we measured obesity traits, global gene expression, and gut microbiota composition in response to a high-fat/high-sucrose (HF/HS) diet of more than 100 inbred strains of mice. Here we show that HF/HS feeding promotes robust, strain-specific changes in obesity that are not accounted for by food intake and provide evidence for a genetically determined set point for obesity. GWAS analysis identified 11 genome-wide significant loci associated with obesity traits, several of which overlap with loci identified in human studies. We also show strong relationships between genotype and gut microbiota plasticity during HF/HS feeding and identify gut microbial phylotypes associated with obesity.