Widespread Allelic Heterogeneity in Complex Traits

This week, our group published a paper in the American Journal of Human Genetics that presents a new computational method for improving the accuracy of genome-wide association studies. ZarLab alumnus Farhad Hormozdiari (PhD, 2016) developed the method, CAVIAR (CAusal Variants Identification in Associated Regions), a statistical framework that quantifies the probability that each variant is causal while allowing an arbitrary number of causal variants.

Genome-wide association studies (GWASs) identify genetic variants associated with diseases and traits. Recent successes in GWASs make it possible to address important questions about the genetic architecture of complex traits, such as allele frequency and effect size. A more comprehensive understanding of these aspects will guide the development of new methods for fine mapping and association mapping of complex traits—and the discovery of new biomarkers for disease diagnosis and treatment.

One lesser-known aspect of complex traits is the extent of allelic heterogeneity (AH). Allelic heterogeneity occurs when different mutations at the same locus affect the same phenotype. AH is very common in Mendelian traits, but we know little about the extent to which AH contributes to common, complex disease. Undetected AH could bias the results of an association study, leading to false positive results.

Levels of Allelic Heterogeneity in eQTL Studies. For more information, see our paper.

To account for AH while conducting a GWAS, we developed a computational method to infer the probability of AH. Our method quantifies the number of independent causal variants at a locus that can be responsible for the observed association signals detected in a GWAS. Our method is incorporated into the CAVIAR approach, and it is based on the principle of jointly analyzing association signals (i.e., summary-level Z-scores) and LD structure in order to estimate the number of causal variants.
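To make the idea of jointly analyzing Z-scores and LD more concrete, here is a minimal sketch in Python. It is not the CAVIAR software. It assumes a simplified model in which the Z-scores at a locus follow a zero-mean multivariate normal whose covariance depends on which variants are causal, and it enumerates small causal configurations to obtain a posterior over the number of causal variants. The function names, the prior probability `prior_causal`, the effect variance `effect_var`, and the cap `max_causal` are all illustrative assumptions.

```python
import itertools
import numpy as np


def log_mvn_zero_mean(z, cov):
    """Log density of a zero-mean multivariate normal N(0, cov) at z."""
    _, logdet = np.linalg.slogdet(cov)
    quad = z @ np.linalg.solve(cov, z)
    return -0.5 * (len(z) * np.log(2 * np.pi) + logdet + quad)


def causal_count_posterior(z, ld, max_causal=3, prior_causal=0.01, effect_var=25.0):
    """Posterior over the number of causal variants at one locus.

    Assumed (simplified) model, for illustration only:
      z | configuration ~ N(0, LD + LD @ D @ LD),
    where D is diagonal with effect_var at causal positions and 0 elsewhere.
    Each variant is causal independently with probability prior_causal.
    The LD matrix must be positive definite.
    """
    m = len(z)
    log_post = {}
    for k in range(max_causal + 1):
        terms = []
        for config in itertools.combinations(range(m), k):
            d = np.zeros(m)
            for j in config:
                d[j] = effect_var
            cov = ld + ld @ np.diag(d) @ ld
            log_prior = k * np.log(prior_causal) + (m - k) * np.log1p(-prior_causal)
            terms.append(log_mvn_zero_mean(z, cov) + log_prior)
        # Sum the (unnormalized) posterior over all configurations of size k.
        log_post[k] = np.logaddexp.reduce(terms)
    total = np.logaddexp.reduce(list(log_post.values()))
    return {k: float(np.exp(v - total)) for k, v in log_post.items()}


# Toy locus: two strong signals at variants in no LD with each other.
z = np.array([6.0, 1.2, 6.0])
ld = np.array([[1.0, 0.1, 0.0],
               [0.1, 1.0, 0.1],
               [0.0, 0.1, 1.0]])
posterior = causal_count_posterior(z, ld)
print(posterior)
```

In this toy locus the two strong Z-scores sit at variants that are not in LD with each other, so most of the posterior mass falls on two causal variants, which is the kind of evidence for AH described above. Exhaustive enumeration is only practical for small loci and a small `max_causal`.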

Our results show that our method is more accurate than the standard conditional method (CM). We applied our novel method to three GWASs and four expression quantitative trait loci (eQTL) datasets. We identified a total of 4,152 loci with strong evidence of the presence of AH. The proportion of all loci with identified AH is 4%–23% in eQTLs, 35% in GWASs of high-density lipoprotein (HDL), and 23% in GWASs of schizophrenia. For eQTLs, we observed a strong correlation between sample size and the proportion of loci with AH, indicating that limited statistical power prevents identification of AH in other loci.

One of the main benefits of our method is that it requires only summary statistics. Summary statistics of a GWAS or eQTL study are widely available, so our method is applicable to most existing datasets. We have shown that AH is widespread and more common than previously estimated in complex traits, both in GWASs and eQTL studies.

Our results highlight the importance of accounting for the presence of multiple causal variants when characterizing the mechanism of genetic association in complex traits. Failing to account for AH can reduce the power to detect true causal variants and can explain the limited success of fine mapping of GWASs.

In a related study, researchers at the University of California, Irvine, and the University of Kansas identified an analogous signal in eQTLs from genetic sequencing of flies. King et al. (2014) observe that the vast majority of genes with eQTLs are more consistent with heterogeneity than bi-allelism. Read more about this related study, “Genetic Dissection of the Drosophila melanogaster Female Head Transcriptome Reveals Widespread Allelic Heterogeneity.”

CAVIAR was created by Farhad Hormozdiari, Emrah Kostem, Eun Yong Kang, Bogdan Pasaniuc and Eleazar Eskin. Software is freely available for download: http://genetics.cs.ucla.edu/caviar/

For more information, see our full paper, which can be accessed through AJHG: http://www.cell.com/ajhg/abstract/S0002-9297(17)30149-0

The full citation of our paper:
Hormozdiari F, Zhu A, Kichaev G, Ju CJ, Segrè AV, Joo JW, Won H, Sankararaman S, Pasaniuc B, Shifman S, Eskin E. Widespread allelic heterogeneity in complex traits. The American Journal of Human Genetics. 2017 May 4;100(5):789-802.

The Genetic Basis of Host Preference and Resting Behavior in the Major African Malaria Vector, Anopheles arabiensis

We recently published the first study to report a genetic component to host choice behavior in the major malaria vector Anopheles arabiensis. In a collaboration with the University of California Davis, University of Glasgow, and the Environmental Health and Ecological Sciences Group, Ifakara Health Institute, Ifakara, United Republic of Tanzania, we assess the genetic basis for An. arabiensis host choice and resting behavior. We link human-fed behavior to allelic variation between the 3Ra inversion states. This effort was led by researchers at UC Davis, including Bradley Main, Yoosook Lee, Travis Collier, Anthony Cornel, Catelyn Nieman, Allison Weakley, and Gregory Lanzaro. Eleazar Eskin and Eun Yong Kang contributed data analysis and interpretation.

Mosquitoes that feed on human blood pose an enormous public health threat by transmitting numerous pathogens, such as dengue virus, Zika virus, and the parasites that cause malaria. Together, these mosquito-borne diseases kill more than one million people per year. Human exposure to malaria is driven by variable mosquito behaviors such as (1) propensity to feed on humans relative to other animals (anthropophily) and (2) preference for living in close proximity to humans, as reflected by biting and residing inside houses (endophily).

Our project focused on the potential for An. arabiensis, the only remaining malaria vector in many parts of Africa, to adapt its behavior to avoid control measures such as insecticide-treated nets and indoor residual sprays. To investigate the genetic basis of host choice and resting behavior, we sequenced the genomes of 23 human-fed and 25 cattle-fed mosquitoes collected both indoors and outdoors in the Kilombero Valley, Tanzania. We tested for genetic associations with each of the four phenotypes: human-fed, cow-fed, resting indoors, and resting outdoors.

With these genomes, we identified a set of 4,820,851 segregating SNPs after imposing a minor allele frequency threshold of 10%. We estimated the genetic component (or “SNP heritability”) for each phenotype. Results suggest a genetic component for host choice and no genetic component for resting behavior.
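The minor allele frequency filter mentioned above can be illustrated with a short Python sketch. This is not the pipeline used in the paper. It assumes diploid genotypes coded as 0, 1, or 2 copies of the alternate allele (with `None` for missing calls), and it assumes that a SNP is kept when its minor allele frequency is at least the threshold. The function names and the toy SNP identifiers are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """Minor allele frequency for one SNP.

    genotypes: alternate-allele counts (0, 1, or 2) per diploid individual,
    with None marking a missing call (an assumed encoding).
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def filter_by_maf(snps, threshold=0.10):
    """Keep SNPs whose minor allele frequency is at least the threshold.

    Whether the original analysis used >= or > is not stated; >= is assumed.
    """
    return {
        snp_id: genotypes
        for snp_id, genotypes in snps.items()
        if minor_allele_frequency(genotypes) >= threshold
    }


# Toy data with hypothetical SNP identifiers.
snps = {
    "snp_a": [0, 1, 2, 1, 0],                   # alt frequency 0.40
    "snp_b": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],    # alt frequency 0.05
    "snp_c": [2, 2, 2, 2, 2],                   # monomorphic
}
kept = filter_by_maf(snps)
print(sorted(kept))
```

Taking the minimum of the alternate and reference allele frequencies means the filter treats both alleles symmetrically, so a SNP that is nearly fixed for the alternate allele is dropped just like one that is nearly fixed for the reference allele.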

To test for the existence of genetic structure within our set of 48 sequenced genomes, individuals were partitioned by genetic relatedness using a Principal Component Analysis (Genome-Wide Complex Trait Analysis software, GCTA) applied to all SNPs. Using this approach, we observed three discrete genetic clusters. We used a novel population-scale inversion genotyping method to identify an association between the standard arrangement of 3Ra (3R+) and cattle-fed An. arabiensis. We highlight two intriguing candidate genes within the 3Ra, including the odorant binding protein Obp5, and the odorant receptor Or65. The enrichment of 3R+ among cattle-fed mosquitoes provides support for a genetic component to host choice, which is consistent with the report that zoophily can be selected for.
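As a rough illustration of how principal components can reveal genetic structure, the following Python sketch runs a PCA on a small simulated genotype matrix. It is not GCTA's implementation. It assumes genotypes coded as 0, 1, or 2 and uses a plain singular value decomposition of the standardized matrix. The function name `genotype_pcs` and the simulated allele frequencies are illustrative assumptions.

```python
import numpy as np


def genotype_pcs(genotypes, n_components=2):
    """Principal component scores for an (individuals x SNPs) genotype matrix.

    Genotypes are assumed to be coded 0/1/2. Each SNP is centered and scaled,
    monomorphic SNPs are dropped, and scores come from an SVD.
    """
    g = np.asarray(genotypes, dtype=float)
    g = g - g.mean(axis=0)
    sd = g.std(axis=0)
    variable = sd > 0
    g = g[:, variable] / sd[variable]
    u, s, _ = np.linalg.svd(g, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Simulate two strongly differentiated groups of 10 individuals each.
rng = np.random.default_rng(0)
group_a = rng.binomial(2, 0.1, size=(10, 200))
group_b = rng.binomial(2, 0.9, size=(10, 200))
pcs = genotype_pcs(np.vstack([group_a, group_b]))

# Individuals from the same group share the sign of their first PC score.
print(np.sign(pcs[:, 0]))
```

With real data the groups are rarely this cleanly separated, so clusters are usually identified by inspecting a plot of the leading components rather than by sign alone.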

Genetic variation explained by the 2Rb and 3Ra inversions. For more information, see our paper.

Our multiplex genotyping assays allowed us to directly estimate relationships between host choice and genotype in wild mosquitoes in a high-throughput and economical fashion. Given the importance of mosquito feeding and resting behavior to the effectiveness of malaria control and transmission, there is an urgent need to understand the underlying biological determinants of these behaviors and their short- and long-term impact on the effectiveness of current public health interventions.

For more information, see our paper, which is available for download through PLoS Genetics: http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1006303.

The full citation to our paper is:
Main, B.J., Lee, Y., Ferguson, H.M., Kreppel, K.S., Kihonda, A., Govella, N.J., Collier, T.C., Cornel, A.J., Eskin, E., Kang, E.Y. and Nieman, C.C., 2016. The Genetic Basis of Host Preference and Resting Behavior in the Major African Malaria Vector, Anopheles arabiensis. PLoS Genet, 12(9), p.e1006303.

 

Efficient and accurate multiple-phenotype regression method for high dimensional data considering population structure

Jong Wha (Joanne) Joo developed an approach to simultaneously analyze multiple phenotypes in a genome-wide association study (GWAS) dataset. She introduces this new methodology, referred to as GAMMA (Generalized Analysis of Molecular variance for Mixed model Analysis), in a paper recently published in Genetics.

GWASs have identified many genetic variants involved in traits and the development of human diseases by testing for correlation between individual genotypes and a single phenotype, one phenotype at a time. Since the initial development of the standard GWAS approach, GWAS data collection has become larger in scale and higher in resolution. Today’s large-scale datasets include expression data and often contain thousands of phenotypes per individual. Performing the standard single-phenotype analysis on these datasets is slow and potentially fails to detect unmeasured aspects of complex biological networks.

Analyzing many phenotypes simultaneously increases the power to detect more variants and capture previously unmeasured aspects of the genome. However, standard GWAS approaches capable of simultaneously testing multiple phenotypes fail to account for the distorting effects of population structure, a phenomenon present in large cohorts that inevitably contain individuals sharing common ancestry from multiple populations. As a result, standard GWAS approaches either fail to detect true effects or produce many false positive identifications.

GAMMA is an efficient, robust approach capable of simultaneously analyzing many phenotypes while correcting for population structure. GAMMA uses the principles behind existing linear mixed models to analyze many phenotypes simultaneously and a multiple regression technique to correct for population structure.

Joanne’s paper presents the results of testing GAMMA for accuracy in three scenarios: a simulated dataset containing population structure, a yeast dataset containing many trans-regulatory hotspots, and a complex gut microbiome dataset. In the simulated study using data implanted with true population structure effects, GAMMA accurately identifies these true effects without producing false positives. In the yeast dataset, GAMMA successfully corrected for the bias of technical artifacts such as batch effects and identified significant signals on most of the putative hotspots. In the third test, Joanne and her team assess GAMMA’s ability to perform a multiple-phenotype analysis with microbiome data. Here, results identified nine loci likely to have true biological mechanisms in the taxa.

In each scenario, results of GAMMA were compared to those of the standard t-test, EMMA, and MDMR. The standard t-test and EMMA failed to identify true variants, because the phenotypic effects in each example are smaller than the amount these methods are powered to detect. MDMR produced no significant signals in the yeast dataset and identified many false associations in the simulated and gut microbiome datasets. Both GAMMA and MDMR have sufficient power to detect small association signals in these complex datasets, but only GAMMA successfully corrects for population structure.

This project was led by Joanne Joo and involved Eun Yong Kang and Farhad Hormozdiari. The article is available at: http://www.genetics.org/content/204/4/1379.

GAMMA was developed by Joanne Joo, Eun Yong Kang, Elin Org, Nick Furlotte, Brian Parks, Aldons J. Lusis, and Eleazar Eskin. Visit the following page to download GAMMA: http://genetics.cs.ucla.edu/GAMMA/


The results of GAMMA and three standard GWAS methods applied to a simulated dataset. The x-axis shows SNP locations and the y-axis shows the -log10 p-value of associations between each SNP and all the genes. Blue arrows show the location of the true trans-regulatory hotspots.