The Multivariate Normal Distribution Framework for Analyzing Association Studies: Overview

The multivariate normal (MVN) model has been a powerful tool in our group's research, and we have used it in many of our papers. Jose Lozano (University of the Basque Country, San Sebastian, Spain), along with Eleazar Eskin and three ZarLab alumni, Farhad Hormozdiari (postdoc at Harvard), Jong Wha (Joanne) Joo (faculty at Dongguk University in Seoul), and Buhm Han (faculty at University of Ulsan College of Medicine in Seoul), recently published a review of the MVN distribution framework in genome-wide association studies (GWAS).

Genome-wide association studies (GWAS) have discovered thousands of variants involved in common human diseases. In these studies, frequencies of genetic variants are compared between a population of individuals with a disease (cases) and a population of healthy individuals (controls). Any variant that has a significantly different frequency between the two populations is considered an associated variant.

A major challenge in the analysis of GWAS is that human population history causes nearby genetic variants in the genome to be correlated with each other. In this review, we demonstrate how to use the MVN distribution to explicitly take into account the correlation between genetic variants and provide a comprehensive framework for the analysis of GWAS.
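To make the idea concrete, here is a commonly used formulation of the model. The symbols are illustrative notation chosen for this post and are not taken from the review: s_i is the normalized association statistic at variant i, r_ij is the genotype correlation between variants i and j, and lambda_c is the non-centrality parameter of a hypothetical causal variant c.

```latex
% Null hypothesis: no variant in the region is associated with the trait.
S = (s_1, \dots, s_m)^{\top} \sim \mathcal{N}(0, \Sigma),
\qquad \Sigma_{ij} = r_{ij}

% Alternative: variant c is causal with non-centrality parameter \lambda_c.
% Every correlated variant i inherits the mean r_{ic} \lambda_c.
S \sim \mathcal{N}(\lambda_c \, \Sigma e_c, \; \Sigma),
\qquad \mathbb{E}[s_i] = r_{ic} \, \lambda_c
```

Here e_c is the indicator vector with a 1 at position c and 0 elsewhere. The covariance is the same in both cases; only the mean changes.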

In this paper, we show how the MVN framework can be applied to perform association testing, correct for multiple hypothesis testing, estimate statistical power, and perform fine mapping and imputation. In future blog posts, we will highlight different ways the MVN framework can be used in association studies.
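As a small illustration of the power-estimation use case, the sketch below assumes that the statistic at a causal variant is normally distributed with mean equal to its non-centrality parameter and unit variance, and that a variant correlated with it at level r has mean r times that parameter. The function names, the non-centrality value of 6.0, and the correlation of 0.8 are hypothetical choices for this example; 5e-8 is the widely used genome-wide significance level.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and standard deviation 1


def power(ncp, alpha):
    """Two-sided power of a test whose statistic is distributed N(ncp, 1)."""
    z_thr = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Probability that the statistic falls above +z_thr or below -z_thr.
    return STD_NORMAL.cdf(-z_thr + ncp) + STD_NORMAL.cdf(-z_thr - ncp)


def power_at_tag(ncp_causal, r, alpha):
    """Power at a genotyped variant whose correlation with the causal variant is r."""
    # Assumption: the tag variant's non-centrality parameter is r * ncp_causal.
    return power(r * ncp_causal, alpha)


alpha = 5e-8                                 # genome-wide significance level
direct = power(6.0, alpha)                   # causal variant itself is tested
indirect = power_at_tag(6.0, 0.8, alpha)     # only a correlated variant is tested
print(f"direct power: {direct:.3f}, indirect power: {indirect:.3f}")
```

With a non-centrality parameter of zero the function returns the significance level itself, which is a quick sanity check that the type I error is controlled.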

An illustration of the multivariate normal model: (a) type I error, (b) power.

Many of the authors are alumni of the group who pioneered the use of the MVN for various problems in association studies. Here is a list of papers that our group published using the MVN framework:

Joo, Jong Wha J; Hormozdiari, Farhad; Han, Buhm; Eskin, Eleazar

Multiple testing correction in linear mixed models. Journal Article

In: Genome Biol, 17 (1), pp. 62, 2016, ISSN: 1474-760X.

Hormozdiari, Farhad; Kang, Eun Yong; Bilow, Michael; Ben-David, Eyal; Vulpe, Chris; McLachlan, Stela; Lusis, Aldons J; Han, Buhm; Eskin, Eleazar

Imputing Phenotypes for Genome-wide Association Studies. Journal Article

In: Am J Hum Genet, 99 (1), pp. 89-103, 2016, ISSN: 1537-6605.

Duong, Dat; Zou, Jennifer; Hormozdiari, Farhad; Sul, Jae Hoon; Ernst, Jason; Han, Buhm; Eskin, Eleazar

Using genomic annotations increases statistical power to detect eGenes. Journal Article

In: Bioinformatics, 32 (12), pp. i156-i163, 2016, ISSN: 1367-4811.

Hormozdiari, Farhad; van de Bunt, Martijn; Segrè, Ayellet V; Li, Xiao; Joo, Jong Wha J; Bilow, Michael; Sul, Jae Hoon; Sankararaman, Sriram; Pasaniuc, Bogdan; Eskin, Eleazar

Colocalization of GWAS and eQTL Signals Detects Target Genes. Journal Article

In: Am J Hum Genet, 2016, ISSN: 1537-6605.

Joo, Jong Wha J; Kang, Eun Yong; Org, Elin; Furlotte, Nick; Parks, Brian; Hormozdiari, Farhad; Lusis, Aldons J; Eskin, Eleazar

Efficient and Accurate Multiple-Phenotype Regression Method for High Dimensional Data Considering Population Structure. Journal Article

In: Genetics, 204 (4), pp. 1379-1390, 2016, ISSN: 1943-2631.

Hormozdiari, Farhad; Kichaev, Gleb; Yang, Wen-Yun Y; Pasaniuc, Bogdan; Eskin, Eleazar

Identification of causal genes for complex traits. Journal Article

In: Bioinformatics, 31 (12), pp. i206-i213, 2015, ISSN: 1367-4811.

Hormozdiari, Farhad; Kostem, Emrah; Kang, Eun Yong; Pasaniuc, Bogdan; Eskin, Eleazar

Identifying causal variants at Loci with multiple signals of association. Journal Article

In: Genetics, 198 (2), pp. 497-508, 2014, ISSN: 1943-2631.

Kichaev, Gleb; Yang, Wen-Yun Y; Lindstrom, Sara; Hormozdiari, Farhad; Eskin, Eleazar; Price, Alkes L; Kraft, Peter; Pasaniuc, Bogdan

Integrating functional data to prioritize causal variants in statistical fine-mapping studies. Journal Article

In: PLoS Genet, 10 (10), pp. e1004722, 2014, ISSN: 1553-7404.

Darnell, Gregory; Duong, Dat; Han, Buhm; Eskin, Eleazar

Incorporating prior information into association studies. Journal Article

In: Bioinformatics, 28 (12), pp. i147-i153, 2012, ISSN: 1367-4811.

Flint, Jonathan; Eskin, Eleazar

Genome-wide association studies in mice. Journal Article

In: Nature Reviews Genetics, 13 (11), pp. 807-17, 2012, ISSN: 1471-0064.

Han, Buhm; Kang, Hyun Min; Eskin, Eleazar

Rapid and accurate multiple testing correction and power estimation for millions of correlated markers. Journal Article

In: PLoS Genet, 5 (4), pp. e1000456, 2009, ISSN: 1553-7404.

Eskin, Eleazar

Increasing power in association studies by using linkage disequilibrium structure and molecular function as prior information. Journal Article

In: Genome Res, 18 (4), pp. 653-60, 2008, ISSN: 1088-9051.

Eskin, Eleazar

Increasing Power in Association Studies by Using Linkage Disequilibrium Structure and Molecular Function as Prior Information Conference

In: Lecture Notes in Computer Science, 4955/2008, Springer Berlin / Heidelberg, 2008, ISSN: 0302-9743 (Print) 1611-3349 (Online).

  • Farhad Hormozdiari, Anthony Zhu, Gleb Kichaev, Chelsea J.-T. Ju, Ayellet V. Segre, Jong Wha J. Joo, Hyejung Won, Sriram Sankararaman, Bogdan Pasaniuc, Sagiv Shifman, and Eleazar Eskin. Widespread allelic heterogeneity in complex traits. The American Journal of Human Genetics, 100(5):789-802, May 2017.
  • Yue Wu, Farhad Hormozdiari, Jong Wha J. Joo, and Eleazar Eskin. Improving imputation accuracy by inferring causal variants in genetic studies. In Lecture Notes in Computer Science, pages 303-317. Springer International Publishing, 2017.

The paper was written by Jose A. Lozano, Farhad Hormozdiari, Jong Wha (Joanne) Joo, Buhm Han, and Eleazar Eskin, and it is available at: https://www.biorxiv.org/content/early/2017/10/28/208199.

The full citation to our paper is:

Jose A. Lozano, Farhad Hormozdiari, Jong Wha (Joanne) Joo, Buhm Han, Eleazar Eskin. 2017. The Multivariate Normal Distribution Framework for Analyzing Association Studies. bioRxiv doi: https://doi.org/10.1101/208199.

Using genomic annotations increases statistical power to detect eGenes

Our group developed a novel method for detecting eGenes, the genes whose expression levels are associated with variation at a particular genetic variant. Identification of eGenes is increasingly important to studies of expression quantitative trait loci (eQTLs), the genetic variants that affect gene expression. Mapped eGenes help guide eQTL studies of complex human disease. However, standard approaches cannot efficiently detect these complex features in today’s large genomic datasets. Func-eGene, which we describe and test in a recent Bioinformatics paper, significantly increases the statistical power of existing association study methods and detects more eGenes in comparison to standard approaches.

Standard statistical methods for classifying a gene as an eGene first perform association testing at all variants near the gene of interest, then use a permutation test to correct the results for multiple testing. The permutation test effectively corrects for potential biases introduced by multiple testing and yields a p-value for each gene. However, the permutation test is computationally inefficient at the increasingly large sample sizes of today’s eQTL datasets and has become a computational bottleneck in eQTL studies.
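The permutation procedure described above can be sketched as follows. This is a minimal illustration rather than the code used in the paper: it assumes the gene-level statistic is the largest absolute genotype-expression correlation across nearby variants, and the function names, the toy data, and the effect size are hypothetical. The toy variants are simulated independently, whereas real nearby variants are correlated.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def max_abs_corr(genotypes, expression):
    """Strongest association between the gene's expression and any nearby variant."""
    return max(abs(pearson(g, expression)) for g in genotypes)


def egene_pvalue_permutation(genotypes, expression, n_perm=1000, seed=0):
    """Gene-level p-value: how often a shuffled dataset beats the observed maximum."""
    rng = random.Random(seed)
    observed = max_abs_corr(genotypes, expression)
    shuffled = list(expression)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks the genotype-expression link, keeps the genotypes intact
        if max_abs_corr(genotypes, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): 100 individuals, 4 variants, allele frequency 0.3.
rng = random.Random(1)
n = 100
genotypes = [[(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(n)]
             for _ in range(4)]
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
expr_signal = [1.0 * g + e for g, e in zip(genotypes[2], noise)]  # variant 2 affects expression
p_signal = egene_pvalue_permutation(genotypes, expr_signal)
p_null = egene_pvalue_permutation(genotypes, noise)
print(p_signal, p_null)
```

Every permutation recomputes the association statistics over all individuals, so the cost grows with both the number of permutations and the sample size.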

Our new approach, Func-eGene, incorporates genomic annotations of variants to improve the statistical power of eQTL studies. Variants located near transcription start sites (TSSs), or near certain histone modifications, often regulate gene expression. Standard approaches do not consider genomic annotations, but we found that annotating these variants can help identify more causal variants using less time and computing power. To do this, we build on the standard multithreshold association test, which specifies a different significance threshold for each variant when correcting for multiple testing. Func-eGene increases power by assigning less stringent significance thresholds to variants that are likely to affect gene expression.

However, this association test still depends on the time-consuming permutation test and requires a known prior based on annotation for genetic variants. Func-eGene avoids these difficulties by reducing runtime and selecting an appropriate prior. To reduce runtime, we replace the permutation test with the Mvn-sampling procedure described in Sul et al. (2015). To find an appropriate prior, we run a grid search over possible sets of scores assigned to annotation categories. Func-eGene then seeks a set of scores that maximizes the number of eGenes and uses a cross-validation strategy to avoid data re-use and over-fitting. Thus, there are two ways to apply Func-eGene to eQTL data. Permutation Func-eGene uses the traditional permutation test to calculate the null density of the observed statistic, whereas Mvn Func-eGene relies on the Mvn-sampling procedure.
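To show the idea behind replacing permutations with sampling, here is a minimal sketch. It is not the procedure from Sul et al. (2015) verbatim: it assumes that, under the null hypothesis, the association statistics at the variants near a gene follow a multivariate normal distribution with mean zero and covariance equal to the correlation matrix of the variants, and that this matrix is positive definite. The function names, the correlation matrix, and the observed statistics are hypothetical, and the code uses only the Python standard library so that it runs without extra dependencies.

```python
import math
import random


def cholesky(sigma):
    """Lower-triangular L such that L times its transpose equals sigma."""
    m = len(sigma)
    L = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                L[i][j] = (sigma[i][j] - s) / L[j][j]
    return L


def egene_pvalue_mvn(z_observed, ld, n_samples=10000, seed=0):
    """Gene-level p-value from null statistics sampled from N(0, ld)."""
    rng = random.Random(seed)
    L = cholesky(ld)
    m = len(ld)
    observed = max(abs(z) for z in z_observed)
    hits = 0
    for _ in range(n_samples):
        e = [rng.gauss(0.0, 1.0) for _ in range(m)]
        # Multiplying independent normals by L yields a draw with covariance ld.
        z = [sum(L[i][k] * e[k] for k in range(i + 1)) for i in range(m)]
        if max(abs(v) for v in z) >= observed:
            hits += 1
    return (hits + 1) / (n_samples + 1)


# Hypothetical correlation between 4 nearby variants: 0.8 ** |i - j|.
ld = [[0.8 ** abs(i - j) for j in range(4)] for i in range(4)]
p_signal = egene_pvalue_mvn([2.1, 3.0, 4.2, 3.5], ld)   # strong association
p_null = egene_pvalue_mvn([0.3, -0.5, 0.1, 0.9], ld)    # nothing of note
print(p_signal, p_null)
```

Unlike a permutation, each sample here only touches a vector with one entry per variant, so the sampling loop does not get slower as the number of individuals grows.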

We applied our method to the liver Genotype-Tissue Expression (GTEx) dataset. We used the following genomic annotations of variants: distance from TSSs, DNase hypersensitivity sites, and six histone modifications. Notably, the distance-from-TSS annotation detected the highest number of candidate eGenes; using this annotation, our new method discovered 50% more candidate eGenes than the standard permutation method. Our simulations show that Func-eGene successfully controls the rate of false-positive associations when using either the permutation or the Mvn procedure. However, implementing Func-eGene with a traditional permutation test is inefficient. Instead, we can obtain the same results with a considerably faster runtime by using Mvn sampling.

Graphs comparing eGene detection and statistical power of the permutation and Mvn approaches. (a) Q–Q plots of the uniform density quantiles against the simulated eGene P-value quantiles using Func-eGene at the gene ENSG00000204219.5 under the null hypothesis. (b) Func-eGene simulated statistical power at the gene ENSG00000204219.5.

This project was led by Dat Duong and involved Jennifer Zou, Farhad Hormozdiari, and Jae Hoon Sul. The article is available at: http://bioinformatics.oxfordjournals.org/content/32/12/i156.abstract

The full citation to our paper is: 

Duong, Dat; Zou, Jennifer; Hormozdiari, Farhad; Sul, Jae Hoon; Ernst, Jason; Han, Buhm; Eskin, Eleazar

Using genomic annotations increases statistical power to detect eGenes. Journal Article

In: Bioinformatics, 32 (12), pp. i156-i163, 2016, ISSN: 1367-4811.

FUNC-eGene was developed by Dat Duong and is available for download at: https://github.com/datduong/FUNC-eGene