Identification of causal genes for complex traits (CAVIAR-gene)

Although genome-wide association studies (GWAS) have identified thousands of variants associated with common diseases and complex traits, only a handful of these variants are validated to be causal. We consider ‘causal variants’ as variants which are responsible for the association signal at a locus. As opposed to association studies that benefit from linkage disequilibrium (LD), the main challenge in identifying causal variants at associated loci lies in distinguishing among the many closely correlated variants due to LD. This is particularly important for model organisms such as inbred mice, where LD extends much further than in human populations, resulting in large stretches of the genome with significantly associated variants. Furthermore, these model organisms are highly structured and require correction for population structure to remove potential spurious associations.

In our recently published work, we propose CAVIAR-Gene (CAusal Variants Identification in Associated Regions), a novel method that operates across large LD regions of the genome while also correcting for population structure. A key feature of our approach is that its output is a minimally sized set of genes that captures the genes harboring causal variants with probability q. Through extensive simulations, we demonstrate that our method not only speeds up computation, but also has an average of 10% higher recall than existing approaches. We validate our method using real mouse high-density lipoprotein (HDL) data and show that CAVIAR-Gene is able to identify Apoa2 (a gene known to harbor causal variants for HDL), while reducing the number of genes that need to be tested for functionality by a factor of 2.

In the context of association studies, the genetic variants that are responsible for the association signal at a locus are referred to in the genetics literature as the ‘causal variants.’ Causal variants have a biological effect on the phenotype.

CAVIAR-Gene provides better ranking of the causal genes for Outbred, F2, and HMDP datasets. Panels a and b illustrate the results for Outbred genotypes with one and two causal genes, respectively. Panels c and d illustrate the results for F2 genotypes with one and two causal genes, respectively. Panels e and f illustrate the results for HMDP genotypes with one and two causal genes, respectively.

Generally, variants can be categorized into three main groups. The first group is the causal variants which have a biological effect on the phenotype and are responsible for the association signal. The second group is the variants which are statistically associated with the phenotype due to LD with a causal variant. Even though association tests for these variants may be statistically significant, under our definition, they are not causal variants. The third group is the variants which are not statistically associated with the phenotype and are not causal.

CAVIAR-Gene is a statistical method for fine mapping that addresses two main limitations of existing methods. First, as opposed to existing approaches that focus on individual variants, we propose to search only over the space of gene combinations that explain the statistical association signal, which drastically reduces runtime. Second, CAVIAR-Gene extends the existing framework for fine mapping to account for population structure. The output of our approach is a minimal set of genes that will contain the true causal gene at a pre-specified significance level. This gene set, together with each gene's individual probability of causality, provides a natural way of prioritizing genes for functional testing (e.g. knockout strategies) in model organisms. Through extensive simulations, we demonstrate that CAVIAR-Gene is superior to existing methodologies, requiring the smallest set of genes to follow up in order to capture the true causal gene(s).

Building off our previous work with CAVIAR, CAVIAR-Gene takes four inputs: the marginal statistics for each variant at a locus, an LD matrix of pairwise Pearson correlations between the genotypes of each pair of variants, a partitioning of the variants in the locus into genes, and a kinship matrix that indicates the genetic similarity between each pair of individuals. Marginal statistics are computed using methods that correct for population structure. We consider a variant to be causal when it is responsible for the association signal at a locus, and we aim to discriminate these variants from ones that are merely correlated with them due to LD.
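As a rough illustration of two of these inputs, the sketch below computes an LD matrix and a kinship matrix from a toy genotype matrix. The function names and the data are hypothetical, and the kinship estimate (standardized genotypes, Z Zᵀ / m) is one common choice that we assume here; it is not necessarily the estimator used in the paper.

```python
import numpy as np

def ld_matrix(genotypes):
    """Pairwise Pearson correlation between variants.

    genotypes: (n_individuals, n_variants) array of allele counts (0/1/2).
    """
    return np.corrcoef(genotypes, rowvar=False)

def kinship_matrix(genotypes):
    """Assumed kinship estimate: K = Z Z^T / m with standardized columns Z.

    Assumes no variant is monomorphic (each column has non-zero variance).
    """
    z = (genotypes - genotypes.mean(axis=0)) / genotypes.std(axis=0)
    return z @ z.T / genotypes.shape[1]

rng = np.random.default_rng(0)
# Toy data: 50 individuals, 8 variants (random, for illustration only).
G = rng.integers(0, 3, size=(50, 8)).astype(float)

ld = ld_matrix(G)        # (8, 8), variants x variants
K = kinship_matrix(G)    # (50, 50), individuals x individuals
```

The LD matrix is indexed by variants while the kinship matrix is indexed by individuals, which is why both are needed: one captures correlation among variants, the other relatedness among samples.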

In model organisms, the large stretches of LD regions result in a large number of variants associated in each region, thus making CAVIAR computationally infeasible. Instead of producing a rho causal set of SNPs, CAVIAR-Gene detects a ‘q causal gene set’ which is a set of genes in the locus that will contain the actual causal genes with probability of at least q.
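To make the idea of a q causal gene set concrete, here is a minimal sketch under a simplifying assumption: exactly one gene in the locus is causal and each gene already has a posterior probability of being that gene. Under that assumption, taking genes in decreasing order of posterior until the total reaches q gives the smallest such set. The function name and the posteriors are hypothetical, and the published method handles the more general case of multiple causal genes.

```python
def q_causal_gene_set(gene_posteriors, q=0.95):
    """Smallest set of genes whose posteriors sum to at least q.

    Assumes exactly one causal gene, so the posteriors sum to 1 and the
    probability that a set contains the causal gene is the sum over its members.
    """
    ranked = sorted(gene_posteriors.items(), key=lambda kv: kv[1], reverse=True)
    selected, total = [], 0.0
    for gene, prob in ranked:
        if total >= q:
            break
        selected.append(gene)
        total += prob
    return selected, total

# Hypothetical posteriors for four genes in one locus.
posteriors = {"geneA": 0.60, "geneB": 0.25, "geneC": 0.10, "geneD": 0.05}
genes, covered = q_causal_gene_set(posteriors, q=0.9)
```

Here three of the four genes are needed to reach q = 0.9, so one gene can be dropped from functional follow-up.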

For further details of our new method, CAVIAR-Gene, view our full paper here:

Hormozdiari, Farhad; Kichaev, Gleb; Yang, Wen-Yun Y; Pasaniuc, Bogdan; Eskin, Eleazar

Identification of causal genes for complex traits.

In: Bioinformatics, 31 (12), pp. i206-i213, 2015, ISSN: 1367-4811.

How much does part of a genome contribute to a trait?

Both genetic and environmental factors contribute to a trait. The genetic factors that contribute to a trait are typically spread across the genome. Emrah Kostem in our group recently published a paper on estimating how much a specific genomic region (such as a single chromosome) contributes to a trait (doi:10.1016/j.ajhg.2013.03.010) and released software for performing this analysis, called HEIDI, which is available at http://genetics.cs.ucla.edu/heritability/. This type of analysis is referred to as “partitioning heritability into the contributions of genomic regions.”

Estimating the heritability of a trait, that is, measuring the influence of nature vs. nurture, has been a fundamental question in genetics. Traditionally, heritability was estimated using related individuals with known pedigrees, such as twins or family cohorts. With the availability of high-throughput genomic technologies, it has been shown that heritability estimates similar to the traditional ones can be obtained from genome-wide association study (GWAS) datasets of unrelated individuals (doi:10.1038/ng.608). In these approaches, the genetic similarities, or kinships, among the individuals are computed from the observed SNPs rather than inferred from pedigree data.

Additionally, high-throughput SNP data also make it possible to estimate local genetic similarities, which have recently been used to partition the heritability of a trait into the contributions of genomic regions (doi:10.1038/ng.823). A naive approach estimates the heritability contributions using a linear mixed model (LMM), where each region is modeled using a separate variance component.
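In standard variance-component notation, the naive model can be written roughly as follows. The symbols are chosen here for illustration and may differ from the paper's notation: y is the phenotype vector, g_r is the genetic effect of region r, K_r is the kinship matrix computed from the SNPs in region r only, and R is the number of regions.

```latex
\[
\begin{aligned}
y &= \mu\mathbf{1} + \sum_{r=1}^{R} g_r + e, \\
g_r &\sim \mathcal{N}\!\left(0,\ \sigma_r^2 K_r\right), \qquad
e \sim \mathcal{N}\!\left(0,\ \sigma_e^2 I\right), \\
h_r^2 &= \frac{\sigma_r^2}{\sum_{s=1}^{R} \sigma_s^2 + \sigma_e^2}.
\end{aligned}
\]
```

Every region adds one variance parameter that must be fitted jointly with all the others, which is where the cost of the naive approach comes from as R grows.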

We presented a method called HEIDI (Heritability Estimations Distributed) to improve the accuracy and computational efficiency of partitioning the heritability of a trait into the contributions of genomic regions. We show that the naive approach is not accurate for a large number of regions and does not scale beyond several partitions per chromosome in a study with 5000 individuals. We proposed an alternative approach, where the heritability contribution of a region is obtained using a model that includes only the region and its genetic complement, that is, the rest of the genome. The advantage of this two-component model is that it is computationally efficient and fast to fit. It also makes it possible to parallelize the heritability estimation, since the computation for each region can be performed separately across computers.
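The setup for the two-component model can be sketched as follows: for each region, one kinship matrix is built from the region's SNPs and one from all remaining SNPs. This is only an illustration of the region/complement split, not HEIDI's implementation; the function names, the toy data, and the kinship estimate (Z Zᵀ / m on standardized genotypes) are assumptions, and the mixed-model fit itself is omitted.

```python
import numpy as np

def kinship(z):
    """Assumed kinship estimate from standardized genotypes: Z Z^T / m."""
    return z @ z.T / z.shape[1]

def region_and_complement(z, region_cols):
    """Return kinships for one region and for the rest of the genome."""
    mask = np.zeros(z.shape[1], dtype=bool)
    mask[region_cols] = True
    return kinship(z[:, mask]), kinship(z[:, ~mask])

rng = np.random.default_rng(1)
# Toy data: 40 individuals, 12 SNPs split into three regions of 4 SNPs.
G = rng.integers(0, 3, size=(40, 12)).astype(float)
Z = (G - G.mean(axis=0)) / G.std(axis=0)

regions = [range(0, 4), range(4, 8), range(8, 12)]
# Each pair depends only on its own region, so the loop body could be
# dispatched to a separate machine per region.
pairs = [region_and_complement(Z, list(r)) for r in regions]
```

Because each region's pair of matrices is built independently of the other regions, the per-region fits have no shared state, which is what allows them to be distributed.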

We show that the estimates of the heritability contributions are inflated when the region and its genetic complement have SNPs that are in linkage disequilibrium (LD), and we introduce a normalization procedure to mitigate the effect of LD. We normalize the contributions of the chromosomes so that their sum equals the genome-wide heritability estimate, and within each chromosome we normalize the regions’ contributions so that they sum to the chromosome's contribution.
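The two-level normalization can be sketched as below. We assume a simple proportional rescaling at each level, which satisfies the sum constraints described above but may not match the exact procedure in the paper; the function name and all numbers are hypothetical.

```python
def normalize_contributions(genome_h2, chrom_h2, region_h2):
    """Rescale estimates so regions sum to chromosomes and chromosomes to the genome.

    chrom_h2: {chrom: raw estimate}; region_h2: {chrom: [raw region estimates]}.
    Proportional rescaling is an assumption made for this sketch.
    """
    chrom_total = sum(chrom_h2.values())
    chrom_norm = {c: genome_h2 * h / chrom_total for c, h in chrom_h2.items()}
    region_norm = {}
    for c, raw in region_h2.items():
        total = sum(raw)
        region_norm[c] = [chrom_norm[c] * h / total for h in raw]
    return chrom_norm, region_norm

# Hypothetical raw estimates, inflated relative to the genome-wide value.
chrom_norm, region_norm = normalize_contributions(
    genome_h2=0.5,
    chrom_h2={"chr1": 0.4, "chr2": 0.2},
    region_h2={"chr1": [0.3, 0.2], "chr2": [0.1, 0.15]},
)
```

After normalization the chromosome contributions sum to the genome-wide estimate, and the regions on each chromosome sum to that chromosome's normalized contribution, while the relative ordering of regions within a chromosome is preserved.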

The full citation to the paper is:

Kostem, Emrah; Eskin, Eleazar

Improving the accuracy and efficiency of partitioning heritability into the contributions of genomic regions.

In: Am J Hum Genet, 92 (4), pp. 558-64, 2013, ISSN: 1537-6605.
