The Multivariate Normal Distribution Framework for Analyzing Association Studies: Overview

The multivariate normal (MVN) model has been a powerful tool in our group's research, and we have used it in many of our papers. Jose Lozano (University of the Basque Country, San Sebastian, Spain), along with Eleazar Eskin and three ZarLab alumni, Farhad Hormozdiari (postdoc at Harvard), Jong Wha (Joanne) Joo (faculty at Dongguk University in Seoul), and Buhm Han (faculty at University of Ulsan College of Medicine in Seoul), recently published a review of the MVN distribution framework in genome-wide association studies (GWAS).

Genome-wide association studies (GWAS) have discovered thousands of variants involved in common human diseases. In these studies, frequencies of genetic variants are compared between a population of individuals with a disease (cases) and a population of healthy individuals (controls). Any variant that has a significantly different frequency between the two populations is considered an associated variant.
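The case/control comparison described above can be sketched as a two-proportion z-test on allele counts. This is only a minimal illustration, not the test used in any particular study; the function name and the allele counts below are hypothetical.

```python
import math


def allele_frequency_z_test(case_alt, case_total, control_alt, control_total):
    """Two-proportion z-test comparing an allele's frequency in cases vs. controls."""
    p_case = case_alt / case_total
    p_control = control_alt / control_total
    # Pooled frequency under the null hypothesis of no association.
    p_pooled = (case_alt + control_alt) / (case_total + control_total)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / case_total + 1 / control_total))
    z = (p_case - p_control) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 2,000 chromosomes sampled in each group.
z, p = allele_frequency_z_test(case_alt=600, case_total=2000,
                               control_alt=500, control_total=2000)
print(f"z = {z:.2f}, p = {p:.2g}")
```

The resulting z-score is the kind of per-variant association statistic that the MVN framework models jointly across correlated variants.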

A major challenge in the analysis of GWAS is that human population history causes nearby genetic variants in the genome to be correlated with each other. In this review, we demonstrate how to use the MVN distribution to explicitly take into account the correlation between genetic variants and provide a comprehensive framework for the analysis of GWAS.

In this paper, we show how the MVN framework can be applied to perform association testing, correct for multiple hypothesis testing, estimate statistical power, and perform fine mapping and imputation. In future blog posts, we will highlight different ways the MVN framework can be used in association studies.
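As a rough illustration of two of these uses, the sketch below samples z-scores from an MVN distribution to find a multiple-testing threshold and to estimate power. It assumes that, under the null, the z-scores of m variants follow N(0, Σ), where Σ is the correlation (LD) matrix, and that under the alternative the mean at each variant is the causal variant's non-centrality parameter scaled by its correlation with the causal variant. The function names, the toy LD matrix, and the parameter values are all hypothetical.

```python
import numpy as np
from statistics import NormalDist


def mvn_threshold(ld, alpha=0.05, n_samples=20000, seed=0):
    """Per-variant |z| cutoff that keeps the region-wide false positive rate near alpha."""
    rng = np.random.default_rng(seed)
    m = ld.shape[0]
    # Under the null, z-scores of correlated variants are drawn from N(0, LD).
    null_z = rng.multivariate_normal(np.zeros(m), ld, size=n_samples)
    return float(np.quantile(np.abs(null_z).max(axis=1), 1 - alpha))


def bonferroni_threshold(m, alpha=0.05):
    """Two-sided Bonferroni cutoff, which treats the m tests as independent."""
    return NormalDist().inv_cdf(1 - alpha / (2 * m))


def mvn_power(ld, causal, ncp, threshold, n_samples=20000, seed=1):
    """Probability that at least one variant in the region exceeds the threshold."""
    rng = np.random.default_rng(seed)
    # Assumed alternative: mean at variant i is ncp times its correlation with the causal variant.
    mean = ncp * ld[:, causal]
    alt_z = rng.multivariate_normal(mean, ld, size=n_samples)
    return float((np.abs(alt_z).max(axis=1) > threshold).mean())


# Toy LD matrix: correlation decays with distance between variants (hypothetical).
m = 20
idx = np.arange(m)
ld = 0.9 ** np.abs(idx[:, None] - idx[None, :])

t_mvn = mvn_threshold(ld)
t_bonf = bonferroni_threshold(m)
false_positive_rate = mvn_power(ld, causal=0, ncp=0.0, threshold=t_mvn)
power = mvn_power(ld, causal=10, ncp=5.0, threshold=t_mvn)
print(t_mvn, t_bonf, false_positive_rate, power)
```

Because the variants are correlated, the sampled threshold should come out less stringent than the Bonferroni cutoff while still keeping the false positive rate near the target level.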

An illustration of the multivariate normal model: (a) Type I error; (b) power.

Many of the authors are alumni of our group who pioneered the use of the MVN in various problems in association studies. Here is a list of papers that our group published using the MVN framework:

  • Farhad Hormozdiari, Anthony Zhu, Gleb Kichaev, Chelsea J.-T. Ju, Ayellet V. Segrè, Jong Wha J. Joo, Hyejung Won, Sriram Sankararaman, Bogdan Pasaniuc, Sagiv Shifman, and Eleazar Eskin. Widespread allelic heterogeneity in complex traits. The American Journal of Human Genetics, 100(5):789–802, May 2017.
  • Yue Wu, Farhad Hormozdiari, Jong Wha J. Joo, and Eleazar Eskin. Improving imputation accuracy by inferring causal variants in genetic studies. In Lecture Notes in Computer Science, pages 303–317. Springer International Publishing, 2017.

The paper was written by Jose A. Lozano, Farhad Hormozdiari, Jong Wha (Joanne) Joo, Buhm Han, and Eleazar Eskin, and it is available at: https://www.biorxiv.org/content/early/2017/10/28/208199.

The full citation to our paper is:

Jose A. Lozano, Farhad Hormozdiari, Jong Wha (Joanne) Joo, Buhm Han, Eleazar Eskin. 2017. The Multivariate Normal Distribution Framework for Analyzing Association Studies. bioRxiv doi: https://doi.org/10.1101/208199.

Applying meta-analysis to genotype-tissue expression data from multiple tissues to identify eQTLs and increase the number of eGenes

Dat Duong, a graduate student in our lab, developed a novel method that will help find more eQTLs and eGenes in gene expression data from many tissues. A paper presenting his method is published in an upcoming issue of Bioinformatics.

Genome-wide association studies (GWAS) seek links between single-nucleotide polymorphisms (SNPs) and traits or diseases. SNPs are the most commonly occurring sources of variation in the human genome. Many SNPs identified by GWAS are located in intergenic regions, stretches of DNA sequences located between genes. SNPs identified in these primarily noncoding regions often do not have an obvious relationship to the disease phenotype. Other lines of evidence, such as gene expression, are required to explore this relationship and learn about disease function.

Gene expression, an intermediate phenotype between a causal SNP and a disease, can be used to interpret positive results produced by a GWAS. Common data types include expression quantitative trait loci (eQTLs), genetic variants associated with gene expression in particular tissue types, and eGenes, genes whose expression levels are associated with genetic variants. Both eQTL studies and GWAS focus on SNPs, but eQTL studies may provide biological insights into the disease development mechanism. For this reason, we pay special attention to the variants that are eQTLs or eGenes and have strong association signals identified by GWAS.

Multi-tissue gene expression datasets like the Genotype-Tissue Expression (GTEx) data are used to find eQTLs and eGenes. However, these datasets have small sample sizes in some tissues. Many meta-analysis methods have been designed to increase power for finding eQTLs and eGenes by combining gene expression data across many tissues. However, these techniques cannot scale to datasets containing many tissue types, like the GTEx data. Such methods also ignore a biological principle: the same variant may be associated with the same gene across similar tissues.

 

Venn diagram of the numbers of eGenes found by existing methods and RECOV, along with correlation matrices comparing methods. For more information, read our full paper.

To leverage the analytical power of eQTLs and eGenes in association studies, Duong and his team developed a new meta-analysis method named RECOV. Based on the principle that a SNP may have a similar effect on the same gene in related tissues, RECOV can be applied to large gene expression datasets and can analyze all 44 tissues present in the GTEx data.

In our Bioinformatics paper, we use simulated datasets to show that RECOV correctly controls the false positive rate. When applied to real multi-tissue expression data from the GTEx dataset, RECOV detects 3% more eGenes than previous methods. RECOV is a general framework for meta-analysis that can be used with any covariance matrix. We hope this software will be used by other researchers in the scientific community!
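To give a feel for how a covariance matrix can enter a meta-analysis, here is a minimal sketch of a generalized least squares estimate of one shared effect from correlated per-tissue estimates. This is not RECOV's actual model or test statistic (see the paper and source code for those). The function name, the effect sizes, the standard errors, and the between-tissue correlation are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def correlated_meta_analysis(beta, cov):
    """Shared-effect estimate when per-tissue estimates are modeled as N(mu * 1, cov)."""
    beta = np.asarray(beta, dtype=float)
    ones = np.ones_like(beta)
    precision_ones = np.linalg.solve(cov, ones)  # cov^{-1} 1
    var_mu = 1.0 / (ones @ precision_ones)       # variance of the combined estimate
    mu = var_mu * (precision_ones @ beta)        # generalized least squares estimate
    z = mu / np.sqrt(var_mu)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return mu, z, p_value


# Hypothetical effect of one SNP on one gene in three tissues, each with standard error 0.1.
beta = [0.2, 0.3, 0.1]
se = np.array([0.1, 0.1, 0.1])

# Independent tissues: reduces to the usual inverse-variance weighted average.
cov_indep = np.diag(se ** 2)
mu_indep, z_indep, p_indep = correlated_meta_analysis(beta, cov_indep)

# Related tissues: an assumed correlation of 0.5 between every pair of estimates.
corr = np.full((3, 3), 0.5) + 0.5 * np.eye(3)
cov_corr = corr * np.outer(se, se)
mu_corr, z_corr, p_corr = correlated_meta_analysis(beta, cov_corr)
print(z_indep, z_corr)
```

In this toy setting, modeling the correlation between tissues lowers the combined z-score, because correlated estimates carry less independent evidence than uncorrelated ones.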

RECOV was developed by Dat Duong. The source code for RECOV is freely available at: https://github.com/datduong/RECOV.

Our paper can be downloaded at Bioinformatics: https://academic.oup.com/bioinformatics/article/33/14/i67/3953939/Applying-meta-analysis-to-genotype-tissue

 

The full reference for our paper is:
Duong, D., Gai, L., Snir, S., Kang, E.Y., Han, B., Sul, J.H. and Eskin, E., 2017. Applying meta-analysis to Genotype-Tissue Expression data from multiple tissues to identify eQTLs and increase the number of eGenes. Bioinformatics, 33(14), pp.i67-i74.

Widespread Allelic Heterogeneity in Complex Traits

This week, our group published a paper in the American Journal of Human Genetics that presents a new computational method for improving the accuracy of genome-wide association studies. ZarLab alumnus Farhad Hormozdiari (PhD, 2016) developed the method, CAVIAR (CAusal Variants Identification in Associated Regions), a statistical framework that quantifies the probability that each variant is causal while allowing an arbitrary number of causal variants.

Genome-wide association studies (GWASs) identify genetic variants associated with diseases and traits. Recent successes in GWASs make it possible to address important questions about the genetic architecture of complex traits, such as allele frequency and effect size. A more comprehensive understanding of these aspects will guide the development of new methods for fine mapping and association mapping of complex traits—and the discovery of new biomarkers for disease diagnosis and treatment.

One lesser-known aspect of complex traits is the extent of allelic heterogeneity (AH). Allelic heterogeneity occurs when different mutations at the same locus affect the same phenotype. AH is very common in Mendelian traits, but we know little about the extent to which AH contributes to common, complex disease. Undetected AH could potentially bias the results of an association study, leading to false positive results.

Levels of Allelic Heterogeneity in eQTL Studies. For more information, see our paper.

In order to take AH into account while conducting a GWAS, we developed a computational method to infer the probability of AH. Our method quantifies the number of independent causal variants at a locus that can be responsible for the observed association signals detected in a GWAS. Our method is incorporated into the CAVIAR approach, and it is based on the principle of jointly analyzing association signals (i.e., summary level Z-scores) and LD structure in order to estimate the number of causal variants.
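The idea of combining z-scores with LD to estimate the number of causal variants can be sketched as follows. This is a simplified illustration, not the CAVIAR implementation. It assumes a model in which, given a set of causal variants c, the z-scores follow N(0, Σ + σ² Σ diag(c) Σ), with an independent prior probability γ that each variant is causal. The function names, the values of γ and σ², the cap on the number of causal variants, and the toy data are hypothetical.

```python
import itertools
import numpy as np


def log_mvn_density_zero_mean(z, cov):
    """Log density of a zero-mean multivariate normal evaluated at z."""
    _, logdet = np.linalg.slogdet(cov)
    quad = z @ np.linalg.solve(cov, z)
    return -0.5 * (len(z) * np.log(2 * np.pi) + logdet + quad)


def posterior_num_causal(z, ld, max_causal=2, gamma=0.01, effect_var=25.0):
    """Posterior over the number of causal variants, enumerating all causal sets."""
    z = np.asarray(z, dtype=float)
    m = len(z)
    log_joint = {}
    for k in range(max_causal + 1):
        terms = []
        for causal in itertools.combinations(range(m), k):
            c = np.zeros(m)
            for i in causal:
                c[i] = 1.0
            # Assumed model: causal effects add effect_var * LD diag(c) LD to the covariance.
            cov = ld + effect_var * (ld * c) @ ld
            log_prior = k * np.log(gamma) + (m - k) * np.log(1 - gamma)
            terms.append(log_mvn_density_zero_mean(z, cov) + log_prior)
        log_joint[k] = np.logaddexp.reduce(terms)
    total = np.logaddexp.reduce(list(log_joint.values()))
    return {k: float(np.exp(v - total)) for k, v in log_joint.items()}


# Toy locus: variants 0 and 1 are in LD (r = 0.8); variant 2 is independent of both.
ld = np.array([[1.0, 0.8, 0.0],
               [0.8, 1.0, 0.0],
               [0.0, 0.0, 1.0]])
z = [5.0, 4.0, 5.0]

post = posterior_num_causal(z, ld)
prob_ah = sum(p for k, p in post.items() if k >= 2)
print(post, prob_ah)
```

In this toy example, the signal at variant 1 is explained by LD with variant 0, while variant 2 is uncorrelated with both, so the posterior should place most of its mass on two causal variants. Exhaustive enumeration is only practical for small loci and small caps on the number of causal variants.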

Our results show that our method is more accurate than the standard conditional method (CM). We applied our novel method to three GWASs and four expression quantitative trait loci (eQTL) datasets. We identified a total of 4,152 loci with strong evidence of the presence of AH. The proportion of all loci with identified AH is 4%–23% in eQTLs, 35% in GWASs of high-density lipoprotein (HDL), and 23% in GWASs of schizophrenia. For eQTLs, we observed a strong correlation between sample size and the proportion of loci with AH, indicating that limited statistical power prevents identification of AH at other loci.

One of the main benefits of our method is that it requires only summary statistics. Summary statistics of a GWAS or eQTL study are widely available, so our method is applicable to most existing datasets. We have shown that AH is widespread and more common than previously estimated in complex traits, both in GWASs and eQTL studies.

Our results highlight the importance of accounting for the presence of multiple causal variants when characterizing the mechanism of genetic association in complex traits. Failing to account for AH can reduce the power to detect true causal variants and can explain the limited success of fine mapping of GWASs.

In a related study, researchers at the University of California, Irvine, and the University of Kansas identified an analogous signal in eQTLs from genetic sequencing of flies. King et al. (2014) observed that the vast majority of genes with eQTLs are more consistent with heterogeneity than with bi-allelism. Read more about this related study, “Genetic Dissection of the Drosophila melanogaster Female Head Transcriptome Reveals Widespread Allelic Heterogeneity.”

CAVIAR was created by Farhad Hormozdiari, Emrah Kostem, Eun Yong Kang, Bogdan Pasaniuc and Eleazar Eskin. Software is freely available for download: http://genetics.cs.ucla.edu/caviar/

For more information, see our full paper, which can be accessed through AJHG: http://www.cell.com/ajhg/abstract/S0002-9297(17)30149-0

The full citation of our paper:
Hormozdiari F, Zhu A, Kichaev G, Ju CJ, Segrè AV, Joo JW, Won H, Sankararaman S, Pasaniuc B, Shifman S, Eskin E. Widespread allelic heterogeneity in complex traits. The American Journal of Human Genetics. 2017 May 4;100(5):789-802.