Widespread Allelic Heterogeneity in Complex Traits

This week, our group published a paper in the American Journal of Human Genetics that presents a new computational method for improving the accuracy of genome-wide association studies. ZarLab alumnus Farhad Hormozdiari (PhD, 2016) developed the method, CAVIAR (CAusal Variants Identification in Associated Regions), a statistical framework that quantifies the probability that each variant is causal while allowing an arbitrary number of causal variants.

Genome-wide association studies (GWASs) identify genetic variants associated with diseases and traits. Recent successes in GWASs make it possible to address important questions about the genetic architecture of complex traits, such as allele frequency and effect size. A more comprehensive understanding of these aspects will guide the development of new methods for fine mapping and association mapping of complex traits, as well as the discovery of new biomarkers for disease diagnosis and treatment.

One lesser-known aspect of complex traits is the extent of allelic heterogeneity (AH). Allelic heterogeneity occurs when different mutations at the same locus affect the same phenotype. AH is very common in Mendelian traits, but we know little about the extent to which AH contributes to common, complex disease. Undetected AH could bias the results of an association study, leading to false positives.

Levels of Allelic Heterogeneity in eQTL Studies. For more information, see our paper.

In order to take AH into account while conducting a GWAS, we developed a computational method to infer the probability of AH. Our method quantifies the number of independent causal variants at a locus that can be responsible for the observed association signals detected in a GWAS. Our method is incorporated into the CAVIAR approach, and it is based on the principle of jointly analyzing association signals (i.e., summary level Z-scores) and LD structure in order to estimate the number of causal variants.
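As a rough illustration of this joint analysis, the sketch below enumerates small sets of candidate causal variants and scores each set against the Z-scores and the LD matrix. It assumes a CAVIAR-style model in which the Z-scores are multivariate normal with zero mean and covariance Σ + Σ Σ_c Σ, where Σ is the LD matrix and Σ_c is diagonal with a fixed variance at the causal positions. The function names, the per-variant prior `gamma`, the effect-size variance `sigma2`, and the cap `max_causal` are illustrative assumptions, not values taken from the paper or the released software.

```python
import math
from itertools import combinations


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def mvn_logpdf_zero_mean(z, cov):
    """Log-density of a zero-mean multivariate normal evaluated at z."""
    low = cholesky(cov)
    n = len(z)
    y = []  # forward substitution: solve low * y = z
    for i in range(n):
        y.append((z[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2 * math.pi) + log_det + sum(v * v for v in y))


def posterior_num_causal(z, ld, max_causal=2, gamma=0.01, sigma2=25.0):
    """Posterior over the number of causal variants (0..max_causal).

    Assumed prior: each variant is causal independently with probability gamma.
    The LD matrix must be positive definite (no pair of variants in perfect LD).
    """
    m = len(z)
    log_terms = {}  # number of causal variants -> list of log(prior * likelihood)
    for k in range(max_causal + 1):
        log_prior = k * math.log(gamma) + (m - k) * math.log(1 - gamma)
        for causal in combinations(range(m), k):
            # Covariance under this configuration: LD + LD * diag(sigma2) * LD
            cov = [[ld[i][j] + sigma2 * sum(ld[i][c] * ld[c][j] for c in causal)
                    for j in range(m)] for i in range(m)]
            log_terms.setdefault(k, []).append(log_prior + mvn_logpdf_zero_mean(z, cov))
    # Normalize with the log-sum-exp trick for numerical stability.
    top = max(v for vs in log_terms.values() for v in vs)
    total = sum(math.exp(v - top) for vs in log_terms.values() for v in vs)
    return {k: sum(math.exp(v - top) for v in vs) / total for k, vs in log_terms.items()}


# Toy locus: two strong signals at variants in almost no LD with each other.
ld = [[1.0, 0.2, 0.0], [0.2, 1.0, 0.1], [0.0, 0.1, 1.0]]
posterior = posterior_num_causal([6.0, 1.0, 5.5], ld)
print(posterior)
```

In this toy locus most of the posterior mass falls on two causal variants, which is the kind of evidence for AH described above. Exhaustive enumeration is only practical for small loci and small `max_causal`.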

Our results show that our method is more accurate than the standard conditional method (CM). We applied our method to three GWASs and four expression quantitative trait loci (eQTL) datasets. We identified a total of 4,152 loci with strong evidence of AH. The proportion of all loci with identified AH is 4%–23% in eQTLs, 35% in GWASs of high-density lipoprotein (HDL), and 23% in GWASs of schizophrenia. For eQTLs, we observed a strong correlation between sample size and the proportion of loci with AH, indicating that limited statistical power prevents identification of AH in other loci.

One of the main benefits of our method is that it requires only summary statistics. Summary statistics of a GWAS or eQTL study are widely available, so our method is applicable to most existing datasets. We have shown that AH is widespread and more common than previously estimated in complex traits, both in GWASs and eQTL studies.

Our results highlight the importance of accounting for the presence of multiple causal variants when characterizing the mechanism of genetic association in complex traits. Failing to account for AH can reduce the power to detect true causal variants and may explain the limited success of fine mapping of GWASs.

In a related study, researchers at the University of California, Irvine, and the University of Kansas identified an analogous signal in eQTLs from genetic sequencing of flies. King et al. (2014) observed that the vast majority of genes with eQTLs are more consistent with heterogeneity than with bi-allelism. Read more about this related study, “Genetic Dissection of the Drosophila melanogaster Female Head Transcriptome Reveals Widespread Allelic Heterogeneity.”

CAVIAR was created by Farhad Hormozdiari, Emrah Kostem, Eun Yong Kang, Bogdan Pasaniuc and Eleazar Eskin. Software is freely available for download: http://genetics.cs.ucla.edu/caviar/

For more information, see our full paper, which can be accessed through AJHG: http://www.cell.com/ajhg/abstract/S0002-9297(17)30149-0

The full citation of our paper:
Hormozdiari F, Zhu A, Kichaev G, Ju CJ, Segrè AV, Joo JW, Won H, Sankararaman S, Pasaniuc B, Shifman S, Eskin E. Widespread allelic heterogeneity in complex traits. The American Journal of Human Genetics. 2017 May 4;100(5):789-802.

IPED2: Inheritance Path based Pedigree Reconstruction Algorithm for Complicated Pedigrees

Our group, in an effort led by former UCLA PhD student Dan He, developed an algorithm for reconstructing pedigrees from genotype data. This novel approach is presented in a paper recently published in IEEE/ACM Transactions on Computational Biology and Bioinformatics.

Pedigree inference plays an important role in population genetics. Pedigrees, commonly known as family trees, represent genetic relationships between individuals of a family. A pedigree diagram provides a model to compute the inheritance probability for the observed genotype and encodes all possible inheritance options for an allele in an individual. Pedigree reconstruction methods face several challenges. First, there can be an exponential number of possible pedigree graphs, and, second, the number of unknown ancestors can become very large as the height of the pedigree increases.

Examples of sequentially labeling the half-sibling graph. For more information, see our paper.

Our project uses genotype data to reconstruct pedigrees with computational efficiency despite these challenges. Our previous method, IPED, is the only known algorithm scalable to large pedigrees with reasonable accuracy for cases involving both outbreeding and inbreeding. IPED starts from extant individuals and reconstructs the pedigree generation by generation, backwards in time. For each generation, IPED predicts the pairwise relationships between the individuals at the current generation and creates parents for them according to their relationships.
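The "create parents from pairwise relationships" step can be pictured with a small toy. The sketch below is not the IPED implementation: it assumes the full-sibling predictions are already given as input (IPED infers them from genotype data), groups siblings into families with a union-find structure, and creates one hypothetical father and mother per family. All function and node names are made up for illustration.

```python
def create_parents(individuals, sibling_pairs, generation):
    """Group predicted full siblings into families and create two parents per family.

    Returns (pedigree, new_generation): pedigree maps each child to a
    (father, mother) tuple, and new_generation lists the created parents.
    """
    parent = {x: x for x in individuals}  # union-find forest over individuals

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge every pair predicted to be full siblings into one family.
    for a, b in sibling_pairs:
        parent[find(a)] = find(b)

    families = {}
    for x in individuals:
        families.setdefault(find(x), []).append(x)

    pedigree = {}
    new_generation = []
    for idx, members in enumerate(families.values()):
        father = f"g{generation}_f{idx}"  # hypothetical ancestor labels
        mother = f"g{generation}_m{idx}"
        for child in members:
            pedigree[child] = (father, mother)
        new_generation += [father, mother]
    return pedigree, new_generation


pedigree, ancestors = create_parents(["a", "b", "c", "d"], [("a", "b")], generation=1)
print(pedigree)
print(ancestors)
```

Here `a` and `b` end up with the same parents, while `c` and `d` each get their own pair. Repeating the step on the created ancestors would require relationship predictions for that older generation, which this toy does not attempt.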

Existing methods, including IPED, only consider pedigrees with simple structure; they cannot handle populations where, for example, two children share only one parent. To improve pedigree reconstruction when populations have complex structure, we proposed the novel method IPED2. Our approach uses a new statistical test to detect half-sibling relationships and a new graph-based algorithm to reconstruct the pedigree when half-siblings are allowed.

In order to test the performance of our method on complicated pedigrees, we use simulated pedigrees with different parameter settings and, instead of genotype data, we simulate haplotypes directly. Our experiments show that IPED2 outperforms IPED and two other existing approaches for cases where there are half-siblings.

To our knowledge, this is the first method that can, using just genotype data, reconstruct pedigrees with half-siblings and inbreeding. IPED2 is also scalable to large pedigrees. In future work, we would like to consider additional genetic actions, such as insertion, deletion, and replacement, to resolve the conflicts. We also plan to refine IPED2 to consider cases where genotypes of ancestral individuals are known and where genotypes of extant individuals that are not on the lowest generations are known.

For more information, see our paper, which is available for download through IEEE Xplore: http://ieeexplore.ieee.org/abstract/document/7888513/

In addition, the open source implementation of IPED2, which was developed by Dan He, is freely available for download at http://genetics.cs.ucla.edu/Dan/Software/IPED2.html.

The full citation to our paper is:
He, D., Wang, Z., Parida, L. and Eskin, E., 2017. IPED2: Inheritance path based pedigree reconstruction algorithm for complicated pedigrees. IEEE/ACM Transactions on Computational Biology and Bioinformatics.

Incorporating prior information into association studies

Genome-wide association studies (GWAS) seek to identify genetic variants involved in specific traits. GWAS are advantageous for linking variants with traits, because they interrogate the genome in a uniform way. In other words, they examine the whole genome without a preconceived notion of where the associations may lie.

However, we now know a lot about the putative function of genetic variants due to tremendous progress in functional genomics. In many cases, we even know which variants are more likely to be involved in disease when compared to others. Advancements in our understanding of functional genomics motivate the strategic incorporation of prior information in GWAS.

Our group has been interested in this problem for many years. One challenge to addressing this problem is that the widely utilized approach for GWAS involves evaluating an association statistic at each single nucleotide polymorphism (SNP), and these methods take into account only one SNP at a time. The results are then adjusted for multiple testing, and an association is identified if a statistic exceeds a certain threshold. This approach can be described as a frequentist approach. On the other hand, one can incorporate prior information on which SNPs are likely to be the causal variants affecting the trait. This approach is inherently a Bayesian concept. Reconciling these two approaches is not straightforward.
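The standard single-SNP procedure described above can be sketched in a few lines. This is a minimal illustration that assumes each association statistic is a Z-score that is standard normal under the null, and that uses a Bonferroni correction for multiple testing; the function names and the significance level `alpha` are illustrative choices.

```python
import math


def two_sided_p(z):
    """Two-sided p-value of a Z-score under the standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))


def bonferroni_hits(z_scores, alpha=0.05):
    """Indices of SNPs whose p-value passes the same Bonferroni threshold."""
    threshold = alpha / len(z_scores)  # identical threshold for every SNP
    return [i for i, z in enumerate(z_scores) if two_sided_p(z) < threshold]


print(bonferroni_hits([0.3, 5.0, -1.2, 2.0]))
```

Every SNP is tested on its own and held to the same threshold, so no prior knowledge about which SNPs are likely to be causal enters the analysis.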

Average power under varying relative risks. For more information, see our paper.

In a 2008 paper published in Genome Research, our group proposed a modification of the multiple testing framework to address this problem. Instead of using the same specific threshold for all of the association statistics, we use a different threshold for each association statistic, where the thresholds are adjusted based on the prior information. Our method takes advantage of the correlation structure by considering multiple markers within a region. In our paper, we demonstrate how to set the thresholds in order to optimally utilize prior information and maximize statistical power.
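To make the idea of per-marker thresholds concrete, the sketch below uses a weighted Bonferroni rule: the overall significance level is split across SNPs in proportion to their prior weights, so the thresholds still sum to the overall level. This is a simpler stand-in and not the optimization from the paper, which chooses thresholds to maximize power and accounts for correlation between markers. The function names and prior weights are hypothetical.

```python
import math


def two_sided_p(z):
    """Two-sided p-value of a Z-score under the standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))


def prior_weighted_thresholds(priors, alpha=0.05):
    """Split alpha across SNPs in proportion to their prior weights."""
    total = sum(priors)
    return [alpha * p / total for p in priors]


def weighted_hits(z_scores, priors, alpha=0.05):
    """Indices of SNPs whose p-value passes its own prior-weighted threshold."""
    thresholds = prior_weighted_thresholds(priors, alpha)
    return [i for i, (z, t) in enumerate(zip(z_scores, thresholds))
            if two_sided_p(z) < t]


z_scores = [2.2, 0.5]
print(weighted_hits(z_scores, priors=[1, 1]))  # equal priors: plain Bonferroni
print(weighted_hits(z_scores, priors=[9, 1]))  # first SNP favored by the prior
```

With equal priors the first SNP misses the shared threshold, but when the prior favors it, its threshold is relaxed and it is detected. Because the thresholds sum to `alpha`, the union bound keeps the overall false-positive rate at or below `alpha`.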

Using prior information in genetic association studies increases power over traditional association studies while maintaining the same overall false-positive rate. Compared to standard methods, our approach is equally simple to apply to association studies, produces interpretable results as p-values, and is optimal in its use of prior information with regard to statistical power.

In 2012, we extended this work to use only tag SNPs for the putative causal variant. This project was developed by Gregory Darnell (then UCLA undergraduate, now PhD student at Princeton University), Dat Duong (then UCLA undergraduate, now UCLA PhD student), and Buhm Han.

More recently, we have applied this framework to incorporate functional information in analysis of eQTL data. In this case, incorporating genomic annotation of variants significantly increases the statistical power of existing eQTL methods and detects more eGenes in comparison to standard approaches. Read the blog post on this paper, and download the full article.

For more information on our general approach, see our paper, which is available for download through Bioinformatics.
In addition, the open source implementation of our 2012 paper, MASA, which was developed by Greg Darnell and Dat Duong, is freely available for download at http://masa.cs.ucla.edu/.

The full citations to our papers on this topic are:

Darnell, Gregory; Duong, Dat; Han, Buhm; Eskin, Eleazar. Incorporating prior information into association studies. In: Bioinformatics, 28 (12), pp. i147-i153, 2012, ISSN: 1367-4811.

Eleazar Eskin. “Increasing Power in Association Studies by using Linkage Disequilibrium Structure and Molecular Function as Prior Information.” Genome Research. 18(4):653-60 Special Issue Proceedings of the 12th Annual Conference on Research in Computational Biology (RECOMB-2008), 2008.