The multivariate normal (MVN) model has been a powerful tool in our group's research, and we have used it in many of our papers. Jose Lozano (University of the Basque Country, San Sebastian, Spain), together with Eleazar Eskin and three ZarLab alumni, Farhad Hormozdiari (postdoc at Harvard), Jong Wha (Joanne) Joo (faculty at Dongguk University in Seoul), and Buhm Han (faculty at University of Ulsan College of Medicine in Seoul), recently published a review of the MVN distribution framework in genome-wide association studies (GWAS).
Genome-wide association studies (GWAS) have discovered thousands of variants involved in common human diseases. In these studies, the frequencies of genetic variants are compared between a population of individuals with a disease (cases) and a population of healthy individuals (controls). Any variant whose frequency differs significantly between the two populations is considered an associated variant.
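To make the case/control comparison concrete, here is a minimal sketch of one common way to test a single variant: a two-proportion z-test on allele counts. The function name `allele_frequency_test` and the example counts are hypothetical, and this is only an illustration of the idea, not the specific test used in the review.

```python
import math


def allele_frequency_test(case_alt, case_n, ctrl_alt, ctrl_n):
    """Two-proportion z-test on allele counts (hypothetical helper).

    case_alt / ctrl_alt: number of alternate alleles seen in cases / controls.
    case_n / ctrl_n: total number of alleles (2 per diploid individual).
    Returns (z, two-sided p-value under a standard normal null).
    """
    p_case = case_alt / case_n
    p_ctrl = ctrl_alt / ctrl_n
    # Pooled frequency under the null of no difference between groups.
    p_pool = (case_alt + ctrl_alt) / (case_n + ctrl_n)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / case_n + 1 / ctrl_n))
    if se == 0:
        raise ValueError("variant is monomorphic; the test is undefined")
    z = (p_case - p_ctrl) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: 1000 cases and 1000 controls (2000 alleles each).
z, p = allele_frequency_test(600, 2000, 500, 2000)
print(f"z = {z:.2f}, p = {p:.2g}")
```

A single statistic like `z` is approximately normal under the null, which is what makes a joint normal model over many variants a natural next step.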
A major challenge in the analysis of GWAS is that human population history causes nearby genetic variants in the genome to be correlated with each other. In this review, we demonstrate how to use the MVN distribution to explicitly account for the correlation between genetic variants, and we provide a comprehensive framework for the analysis of GWAS.
In this paper, we show how the MVN framework can be applied to perform association testing, correct for multiple hypothesis testing, estimate statistical power, and perform fine mapping and imputation. In future blog posts, we will highlight different ways the MVN framework can be used in association studies.
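As one illustration of these uses, the sketch below estimates a multiple-testing threshold by sampling null association statistics from an MVN whose covariance is the correlation between variants. Everything here is an assumption made for illustration: the toy correlation matrix (`r**|i-j|`), the helper names, and the plain Monte Carlo procedure. It is not the algorithm from the review or from any of the papers listed below.

```python
import math
import random
from statistics import NormalDist


def ld_matrix(m, r):
    """Toy correlation matrix: variants i and j correlate as r**|i-j| (assumed)."""
    return [[r ** abs(i - j) for j in range(m)] for i in range(m)]


def cholesky(sigma):
    """Lower-triangular L with L * L^T == sigma (sigma must be positive definite)."""
    m = len(sigma)
    L = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                L[i][j] = (sigma[i][j] - s) / L[j][j]
    return L


def mvn_threshold(sigma, alpha=0.05, n_samples=5000, seed=0):
    """Monte Carlo |z| threshold so that, under the null N(0, sigma),
    the chance that any variant exceeds it is roughly alpha."""
    rng = random.Random(seed)
    L = cholesky(sigma)
    m = len(sigma)
    max_abs = []
    for _ in range(n_samples):
        e = [rng.gauss(0.0, 1.0) for _ in range(m)]
        # z = L e is one draw of correlated null statistics.
        z = [sum(L[i][k] * e[k] for k in range(i + 1)) for i in range(m)]
        max_abs.append(max(abs(v) for v in z))
    max_abs.sort()
    # Empirical (1 - alpha) quantile of the maximum statistic.
    idx = min(n_samples - 1, math.ceil((1 - alpha) * n_samples) - 1)
    return max_abs[idx]


def bonferroni_threshold(m, alpha=0.05):
    """Two-sided |z| threshold that treats all m tests as independent."""
    return NormalDist().inv_cdf(1 - alpha / (2 * m))


sigma = ld_matrix(20, 0.9)
print("MVN threshold:       ", round(mvn_threshold(sigma), 2))
print("Bonferroni threshold:", round(bonferroni_threshold(20), 2))
```

Because correlated variants behave like fewer independent tests, the sampled threshold comes out below the Bonferroni one, which is the intuition behind using the correlation structure rather than ignoring it.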

Figure: An illustration of the multivariate normal model. (a) Type I error. (b) Power.
Many of the authors are alumni of our group who pioneered the use of the MVN for various problems in association studies. Here is a list of papers that our group has published using the MVN framework:
Joo, Jong Wha J; Hormozdiari, Farhad; Han, Buhm; Eskin, Eleazar Multiple testing correction in linear mixed models. Journal Article In: Genome Biol, 17 (1), pp. 62, 2016, ISSN: 1474-760X. @article{Joo:GenomeBiol:2016, title = {Multiple testing correction in linear mixed models.}, author = {Jong Wha J. Joo and Farhad Hormozdiari and Buhm Han and Eleazar Eskin}, url = {http://dx.doi.org/10.1186/s13059-016-0903-6}, issn = {1474-760X}, year = {2016}, date = {2016-01-01}, journal = {Genome Biol}, volume = {17}, number = {1}, pages = {62}, address = {England}, abstract = {BACKGROUND: Multiple hypothesis testing is a major issue in genome-wide association studies (GWAS), which often analyze millions of markers. The permutation test is considered to be the gold standard in multiple testing correction as it accurately takes into account the correlation structure of the genome. Recently, the linear mixed model (LMM) has become the standard practice in GWAS, addressing issues of population structure and insufficient power. However, none of the current multiple testing approaches are applicable to LMM. RESULTS: We were able to estimate per-marker thresholds as accurately as the gold standard approach in real and simulated datasets, while reducing the time required from months to hours. We applied our approach to mouse, yeast, and human datasets to demonstrate the accuracy and efficiency of our approach. CONCLUSIONS: We provide an efficient and accurate multiple testing correction approach for linear mixed models. We further provide an intuition about the relationships between per-marker threshold, genetic relatedness, and heritability, based on our observations in real data}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Hormozdiari, Farhad ; Kang, Eun Yong ; Bilow, Michael ; Ben-David, Eyal ; Vulpe, Chris ; McLachlan, Stela ; Lusis, Aldons J; Han, Buhm ; Eskin, Eleazar Imputing Phenotypes for Genome-wide Association Studies. Journal Article In: Am J Hum Genet, 99 (1), pp. 89-103, 2016, ISSN: 1537-6605. @article{Hormozdiari:AmJHumGenet:2016, title = {Imputing Phenotypes for Genome-wide Association Studies.}, author = {Hormozdiari, Farhad and Kang, Eun Yong and Bilow, Michael and Ben-David, Eyal and Vulpe, Chris and McLachlan, Stela and Lusis, Aldons J. and Han, Buhm and Eskin, Eleazar}, url = {https://www.ncbi.nlm.nih.gov/pubmed/27292110}, doi = {10.1016/j.ajhg.2016.04.013}, issn = {1537-6605}, year = {2016}, date = {2016-01-01}, journal = {Am J Hum Genet}, volume = {99}, number = {1}, pages = {89-103}, address = {United States}, abstract = {Genome-wide association studies (GWASs) have been successful in detecting variants correlated with phenotypes of clinical interest. However, the power to detect these variants depends on the number of individuals whose phenotypes are collected, and for phenotypes that are difficult to collect, the sample size might be insufficient to achieve the desired statistical power. The phenotype of interest is often difficult to collect, whereas surrogate phenotypes or related phenotypes are easier to collect and have already been collected in very large samples. This paper demonstrates how we take advantage of these additional related phenotypes to impute the phenotype of interest or target phenotype and then perform association analysis. Our approach leverages the correlation structure between phenotypes to perform the imputation. The correlation structure can be estimated from a smaller complete dataset for which both the target and related phenotypes have been collected. Under some assumptions, the statistical power can be computed analytically given the correlation structure of the phenotypes used in imputation. 
In addition, our method can impute the summary statistic of the target phenotype as a weighted linear combination of the summary statistics of related phenotypes. Thus, our method is applicable to datasets for which we have access only to summary statistics and not to the raw genotypes. We illustrate our approach by analyzing associated loci to triglycerides (TGs), body mass index (BMI), and systolic blood pressure (SBP) in the Northern Finland Birth Cohort dataset}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Duong, Dat ; Zou, Jennifer ; Hormozdiari, Farhad ; Sul, Jae Hoon ; Ernst, Jason ; Han, Buhm ; Eskin, Eleazar Using genomic annotations increases statistical power to detect eGenes. Journal Article In: Bioinformatics, 32 (12), pp. i156-i163, 2016, ISSN: 1367-4811. @article{Duong:Bioinformatics:2016, title = {Using genomic annotations increases statistical power to detect eGenes.}, author = {Duong, Dat and Zou, Jennifer and Hormozdiari, Farhad and Sul, Jae Hoon and Ernst, Jason and Han, Buhm and Eskin, Eleazar}, url = {http://bioinformatics.oxfordjournals.org/content/32/12/i156.abstract}, doi = {10.1093/bioinformatics/btw272}, issn = {1367-4811}, year = {2016}, date = {2016-01-01}, journal = {Bioinformatics}, volume = {32}, number = {12}, pages = {i156-i163}, address = {England}, abstract = {MOTIVATION: Expression quantitative trait loci (eQTLs) are genetic variants that affect gene expression. In eQTL studies, one important task is to find eGenes or genes whose expressions are associated with at least one eQTL. The standard statistical method to determine whether a gene is an eGene requires association testing at all nearby variants and the permutation test to correct for multiple testing. The standard method however does not consider genomic annotation of the variants. In practice, variants near gene transcription start sites (TSSs) or certain histone modifications are likely to regulate gene expression. In this article, we introduce a novel eGene detection method that considers this empirical evidence and thereby increases the statistical power. RESULTS: We applied our method to the liver Genotype-Tissue Expression (GTEx) data using distance from TSSs, DNase hypersensitivity sites, and six histone modifications as the genomic annotations for the variants. Each of these annotations helped us detected more candidate eGenes. 
Distance from TSS appears to be the most important annotation; specifically, using this annotation, our method discovered 50% more candidate eGenes than the standard permutation method. CONTACT: buhm.han@amc.seoul.kr or eeskin@cs.ucla.edu}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Hormozdiari, Farhad; van de Bunt, Martijn; Segrè, Ayellet V; Li, Xiao; Joo, Jong Wha J; Bilow, Michael; Sul, Jae Hoon; Sankararaman, Sriram; Pasaniuc, Bogdan; Eskin, Eleazar Colocalization of GWAS and eQTL Signals Detects Target Genes. Journal Article In: Am J Hum Genet, 2016, ISSN: 1537-6605. @article{Hormozdiari:AmJHumGenet:2016b, title = {Colocalization of GWAS and eQTL Signals Detects Target Genes.}, author = { Farhad Hormozdiari and Martijn van de Bunt and Ayellet V. Segrè and Xiao Li and Jong Wha J. Joo and Michael Bilow and Jae Hoon Sul and Sriram Sankararaman and Bogdan Pasaniuc and Eleazar Eskin}, url = {http://dx.doi.org/10.1016/j.ajhg.2016.10.003}, issn = {1537-6605}, year = {2016}, date = {2016-01-01}, journal = {Am J Hum Genet}, address = {United States}, organization = {Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA.}, abstract = {The vast majority of genome-wide association study (GWAS) risk loci fall in non-coding regions of the genome. One possible hypothesis is that these GWAS risk loci alter the individual's disease risk through their effect on gene expression in different tissues. In order to understand the mechanisms driving a GWAS risk locus, it is helpful to determine which gene is affected in specific tissue types. For example, the relevant gene and tissue could play a role in the disease mechanism if the same variant responsible for a GWAS locus also affects gene expression. Identifying whether or not the same variant is causal in both GWASs and expression quantitative trait locus (eQTL) studies is challenging because of the uncertainty induced by linkage disequilibrium and the fact that some loci harbor multiple causal variants. However, current methods that address this problem assume that each locus contains a single causal variant. In this paper, we present eCAVIAR, a probabilistic method that has several key advantages over existing methods. 
First, our method can account for more than one causal variant in any given locus. Second, it can leverage summary statistics without accessing the individual genotype data. We use both simulated and real datasets to demonstrate the utility of our method. Using publicly available eQTL data on 45 different tissues, we demonstrate that eCAVIAR can prioritize likely relevant tissues and target genes for a set of glucose- and insulin-related trait loci}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Joo, Jong Wha J; Kang, Eun Yong; Org, Elin; Furlotte, Nick; Parks, Brian; Hormozdiari, Farhad; Lusis, Aldons J; Eskin, Eleazar Efficient and Accurate Multiple-Phenotype Regression Method for High Dimensional Data Considering Population Structure. Journal Article In: Genetics, 204 (4), pp. 1379-1390, 2016, ISSN: 1943-2631. @article{Joo:Genetics:2016, title = {Efficient and Accurate Multiple-Phenotype Regression Method for High Dimensional Data Considering Population Structure.}, author = { Jong Wha J. Joo and Eun Yong Kang and Elin Org and Nick Furlotte and Brian Parks and Farhad Hormozdiari and Aldons J. Lusis and Eleazar Eskin}, url = {http://dx.doi.org/10.1534/genetics.116.189712}, issn = {1943-2631}, year = {2016}, date = {2016-01-01}, journal = {Genetics}, volume = {204}, number = {4}, pages = {1379-1390}, address = {United States}, organization = {Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, California.}, abstract = {A typical genome-wide association study tests correlation between a single phenotype and each genotype one at a time. However, single-phenotype analysis might miss unmeasured aspects of complex biological networks. Analyzing many phenotypes simultaneously may increase the power to capture these unmeasured aspects and detect more variants. Several multivariate approaches aim to detect variants related to more than one phenotype, but these current approaches do not consider the effects of population structure. As a result, these approaches may result in a significant amount of false positive identifications. Here, we introduce a new methodology, referred to as GAMMA for generalized analysis of molecular variance for mixed-model analysis, which is capable of simultaneously analyzing many phenotypes and correcting for population structure. 
In a simulated study using data implanted with true genetic effects, GAMMA accurately identifies these true effects without producing false positives induced by population structure. In simulations with this data, GAMMA is an improvement over other methods which either fail to detect true effects or produce many false positive identifications. We further apply our method to genetic studies of yeast and gut microbiome from mice and show that GAMMA identifies several variants that are likely to have true biological mechanisms}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Hormozdiari, Farhad; Kichaev, Gleb; Yang, Wen-Yun Y; Pasaniuc, Bogdan; Eskin, Eleazar Identification of causal genes for complex traits. Journal Article In: Bioinformatics, 31 (12), pp. i206-i213, 2015, ISSN: 1367-4811. @article{Hormozdiari:Bioinformatics:2015b, title = {Identification of causal genes for complex traits.}, author = { Farhad Hormozdiari and Gleb Kichaev and Wen-Yun Y. Yang and Bogdan Pasaniuc and Eleazar Eskin}, url = {http://dx.doi.org/10.1093/bioinformatics/btv240}, issn = {1367-4811}, year = {2015}, date = {2015-01-01}, journal = {Bioinformatics}, volume = {31}, number = {12}, pages = {i206-i213}, address = {England}, abstract = {MOTIVATION: Although genome-wide association studies (GWAS) have identified thousands of variants associated with common diseases and complex traits, only a handful of these variants are validated to be causal. We consider 'causal variants' as variants which are responsible for the association signal at a locus. As opposed to association studies that benefit from linkage disequilibrium (LD), the main challenge in identifying causal variants at associated loci lies in distinguishing among the many closely correlated variants due to LD. This is particularly important for model organisms such as inbred mice, where LD extends much further than in human populations, resulting in large stretches of the genome with significantly associated variants. Furthermore, these model organisms are highly structured and require correction for population structure to remove potential spurious associations. RESULTS: In this work, we propose CAVIAR-Gene (CAusal Variants Identification in Associated Regions), a novel method that is able to operate across large LD regions of the genome while also correcting for population structure. A key feature of our approach is that it provides as output a minimally sized set of genes that captures the genes which harbor causal variants with probability $\rho$. 
Through extensive simulations, we demonstrate that our method not only speeds up computation, but also have an average of 10% higher recall rate compared with the existing approaches. We validate our method using a real mouse high-density lipoprotein data (HDL) and show that CAVIAR-Gene is able to identify Apoa2 (a gene known to harbor causal variants for HDL), while reducing the number of genes that need to be tested for functionality by a factor of 2. AVAILABILITY AND IMPLEMENTATION: Software is freely available for download at genetics.cs.ucla.edu/caviar. CONTACT: eeskin@cs.ucla.edu}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Hormozdiari, Farhad; Kostem, Emrah ; Kang, Eun Yong ; Pasaniuc, Bogdan ; Eskin, Eleazar Identifying causal variants at Loci with multiple signals of association. Journal Article In: Genetics, 198 (2), pp. 497-508, 2014, ISSN: 1943-2631. @article{Hormozdiari:Genetics:2014, title = {Identifying causal variants at Loci with multiple signals of association.}, author = { Farhad Hormozdiari and Emrah Kostem and Eun Yong Kang and Bogdan Pasaniuc and Eleazar Eskin}, url = {http://dx.doi.org/10.1534/genetics.114.167908}, issn = {1943-2631}, year = {2014}, date = {2014-01-01}, journal = {Genetics}, volume = {198}, number = {2}, pages = {497-508}, address = {United States}, abstract = {Although genome-wide association studies have successfully identified thousands of risk loci for complex traits, only a handful of the biologically causal variants, responsible for association at these loci, have been successfully identified. Current statistical methods for identifying causal variants at risk loci either use the strength of the association signal in an iterative conditioning framework or estimate probabilities for variants to be causal. A main drawback of existing methods is that they rely on the simplifying assumption of a single causal variant at each risk locus, which is typically invalid at many risk loci. In this work, we propose a new statistical framework that allows for the possibility of an arbitrary number of causal variants when estimating the posterior probability of a variant being causal. A direct benefit of our approach is that we predict a set of variants for each locus that under reasonable assumptions will contain all of the true causal variants with a high confidence level (e.g., 95%) even when the locus contains multiple causal variants. We use simulations to show that our approach provides 20-50% improvement in our ability to identify the causal variants compared to the existing methods at loci harboring multiple causal variants. 
We validate our approach using empirical data from an expression QTL study of CHI3L2 to identify new causal variants that affect gene expression at this locus. CAVIAR is publicly available online at http://genetics.cs.ucla.edu/caviar/}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Kichaev, Gleb; Yang, Wen-Yun Y; Lindstrom, Sara; Hormozdiari, Farhad; Eskin, Eleazar; Price, Alkes L; Kraft, Peter; Pasaniuc, Bogdan Integrating functional data to prioritize causal variants in statistical fine-mapping studies. Journal Article In: PLoS Genet, 10 (10), pp. e1004722, 2014, ISSN: 1553-7404. @article{Kichaev:PlosGenet:2014b, title = {Integrating functional data to prioritize causal variants in statistical fine-mapping studies.}, author = {Gleb Kichaev and Wen-Yun Y. Yang and Sara Lindstrom and Farhad Hormozdiari and Eleazar Eskin and Alkes L. Price and Peter Kraft and Bogdan Pasaniuc}, url = {http://dx.doi.org/10.1371/journal.pgen.1004722}, issn = {1553-7404}, year = {2014}, date = {2014-01-01}, journal = {PLoS Genet}, volume = {10}, number = {10}, pages = {e1004722}, address = {United States}, abstract = {Standard statistical approaches for prioritization of variants for functional testing in fine-mapping studies either use marginal association statistics or estimate posterior probabilities for variants to be causal under simplifying assumptions. Here, we present a probabilistic framework that integrates association strength with functional genomic annotation data to improve accuracy in selecting plausible causal variants for functional validation. A key feature of our approach is that it empirically estimates the contribution of each functional annotation to the trait of interest directly from summary association statistics while allowing for multiple causal variants at any risk locus. We devise efficient algorithms that estimate the parameters of our model across all risk loci to further increase performance. Using simulations starting from the 1000 Genomes data, we find that our framework consistently outperforms the current state-of-the-art fine-mapping methods, reducing the number of variants that need to be selected to capture 90% of the causal variants from an average of 13.3 to 10.4 SNPs per locus (as compared to the next-best performing strategy). Furthermore, we introduce a cost-to-benefit optimization framework for determining the number of variants to be followed up in functional assays and assess its performance using real and simulation data. We validate our findings using a large scale meta-analysis of four blood lipids traits and find that the relative probability for causality is increased for variants in exons and transcription start sites and decreased in repressed genomic regions at the risk loci of these traits. Using these highly predictive, trait-specific functional annotations, we estimate causality probabilities across all traits and variants, reducing the size of the 90% confidence set from an average of 17.5 to 13.5 variants per locus in this data}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Darnell, Gregory; Duong, Dat; Han, Buhm; Eskin, Eleazar Incorporating prior information into association studies. Journal Article In: Bioinformatics, 28 (12), pp. i147-i153, 2012, ISSN: 1367-4811. @article{Darnell:Bioinformatics:2012, title = {Incorporating prior information into association studies.}, author = {Gregory Darnell and Dat Duong and Buhm Han and Eleazar Eskin}, url = {http://dx.doi.org/10.1093/bioinformatics/bts235}, issn = {1367-4811}, year = {2012}, date = {2012-01-01}, journal = {Bioinformatics}, volume = {28}, number = {12}, pages = {i147-i153}, address = {England}, organization = {Department of Computer Science, University of California, Los Angeles, CA 90095, Department of Statistics, University of California, Berkeley, CA 94720 and Department of Human Genetics, University of }, abstract = {SUMMARY: Recent technological developments in measuring genetic variation have ushered in an era of genome-wide association studies which have discovered many genes involved in human disease. Current methods to perform association studies collect genetic information and compare the frequency of variants in individuals with and without the disease. Standard approaches do not take into account any information on whether or not a given variant is likely to have an effect on the disease. We propose a novel method for computing an association statistic which takes into account prior information. Our method improves both power and resolution by 8% and 27%, respectively, over traditional methods for performing association studies when applied to simulations using the HapMap data. Advantages of our method are that it is as simple to apply to association studies as standard methods, the results of the method are interpretable as the method reports p-values, and the method is optimal in its use of prior information in regards to statistical power. AVAILABILITY: The method presented herein is available at http://masa.cs.ucla.edu CONTACT: eeskin@cs.ucla.edu.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Flint, Jonathan; Eskin, Eleazar Genome-wide association studies in mice Journal Article In: Nature Reviews Genetics, 13 (11), pp. 807-17, 2012, ISSN: 1471-0064. @article{Flint:NatureReviewsGenetics:2012, title = {Genome-wide association studies in mice}, author = {Jonathan Flint and Eleazar Eskin}, url = {http://dx.doi.org/10.1038/nrg3335}, issn = {1471-0064}, year = {2012}, date = {2012-01-01}, journal = {Nature Reviews Genetics}, volume = {13}, number = {11}, pages = {807-17}, publisher = {Nature Publishing Group}, address = {England}, abstract = {Genome-wide association studies (GWASs) have transformed the field of human genetics and have led to the discovery of hundreds of genes that are implicated in human disease. The technological advances that drove this revolution are now poised to transform genetic studies in model organisms, including mice. However, the design of GWASs in mouse strains is fundamentally different from the design of human GWASs, creating new challenges and opportunities. This Review gives an overview of the novel study designs for mouse GWASs, which dramatically improve both the statistical power and resolution compared to classical gene-mapping approaches.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Han, Buhm; Kang, Hyun Min; Eskin, Eleazar Rapid and accurate multiple testing correction and power estimation for millions of correlated markers. Journal Article In: PLoS Genet, 5 (4), pp. e1000456, 2009, ISSN: 1553-7404. @article{Han:PlosGenet:2009, title = {Rapid and accurate multiple testing correction and power estimation for millions of correlated markers.}, author = {Buhm Han and Hyun Min Kang and Eleazar Eskin}, url = {http://dx.doi.org/10.1371/journal.pgen.1000456}, issn = {1553-7404}, year = {2009}, date = {2009-01-01}, journal = {PLoS Genet}, volume = {5}, number = {4}, pages = {e1000456}, address = {United States}, organization = {Department of Computer Science and Engineering, University of California San Diego, La Jolla, California, United States of America.}, abstract = {With the development of high-throughput sequencing and genotyping technologies, the number of markers collected in genetic association studies is growing rapidly, increasing the importance of methods for correcting for multiple hypothesis testing. The permutation test is widely considered the gold standard for accurate multiple testing correction, but it is often computationally impractical for these large datasets. Recently, several studies proposed efficient alternative approaches to the permutation test based on the multivariate normal distribution (MVN). However, they cannot accurately correct for multiple testing in genome-wide association studies for two reasons. First, these methods require partitioning of the genome into many disjoint blocks and ignore all correlations between markers from different blocks. Second, the true null distribution of the test statistic often fails to follow the asymptotic distribution at the tails of the distribution. We propose an accurate and efficient method for multiple testing correction in genome-wide association studies: SLIDE. Our method accounts for all correlation within a sliding window and corrects for the departure of the true null distribution of the statistic from the asymptotic distribution. In simulations using the Wellcome Trust Case Control Consortium data, the error rate of SLIDE's corrected p-values is more than 20 times smaller than the error rate of the previous MVN-based methods' corrected p-values, while SLIDE is orders of magnitude faster than the permutation test and other competing methods. We also extend the MVN framework to the problem of estimating the statistical power of an association study with correlated markers and propose an efficient and accurate power estimation method SLIP. SLIP and SLIDE are available at http://slide.cs.ucla.edu.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Eskin, Eleazar Increasing power in association studies by using linkage disequilibrium structure and molecular function as prior information. Journal Article In: Genome Res, 18 (4), pp. 653-60, 2008, ISSN: 1088-9051. @article{Eskin:GenomeRes:2008, title = {Increasing power in association studies by using linkage disequilibrium structure and molecular function as prior information.}, author = {Eleazar Eskin}, url = {http://dx.doi.org/10.1101/gr.072785.107}, issn = {1088-9051}, year = {2008}, date = {2008-01-01}, journal = {Genome Res}, volume = {18}, number = {4}, pages = {653-60}, address = {United States}, organization = {Department of Computer Science and Human Genetics, University of California, Los Angeles, Los Angeles, California 90095, USA. eeskin@cs.ucla.edu}, abstract = {The availability of various types of genomic data provides an opportunity to incorporate this data as prior information in genetic association studies. This information includes knowledge of linkage disequilibrium structure as well as which regions are likely to be involved in disease. In this paper, we present an approach for incorporating this information by revisiting how we perform multiple-hypothesis correction. In a traditional association study, in order to correct for multiple-hypothesis testing, the significance threshold at each marker, t, is set to control the total false-positive rate. In our framework, we vary the threshold at each marker t(i) and use these thresholds to incorporate prior information. We present a numerical procedure for solving for thresholds that maximizes association study power using prior information. We also present the results of benchmark simulation experiments using the HapMap data, which demonstrate a significant increase in association study power under this framework. We provide a Web server for performing association studies using our method and provide thresholds optimized for the Affymetrix 500 k and Illumina HumanHap 550 chips and demonstrate the application of our framework to the analysis of the Wellcome Trust Case Control Consortium data.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Eskin, Eleazar Increasing Power in Association Studies by Using Linkage Disequilibrium Structure and Molecular Function as Prior Information. In: Lecture Notes in Computer Science, 4955/2008, pp. 434, Springer Berlin / Heidelberg, 2008, ISSN: 0302-9743 (Print) 1611-3349 (Online). @conference{Eskin:LectureNotesInComputerScience:2008, title = {Increasing Power in Association Studies by Using Linkage Disequilibrium Structure and Molecular Function as Prior Information}, author = {Eleazar Eskin}, url = {http://dx.doi.org/10.1007/978-3-540-78839-3}, issn = {0302-9743 (Print) 1611-3349 (Online)}, year = {2008}, date = {2008-01-01}, booktitle = {Lecture Notes in Computer Science}, volume = {4955/2008}, pages = {434}, publisher = {Springer Berlin / Heidelberg}, series = {Lecture Notes in Computer Science}, abstract = {The availability of various types of genomic data provides an opportunity to incorporate this data as prior information in genetic association studies. This information includes knowledge of linkage disequilibrium structure as well as knowledge of which regions are likely to be involved in disease. In this paper, we present an approach for incorporating this information by revisiting how we perform multiple hypothesis correction. In a traditional association study, in order to correct for multiple hypothesis testing, the significance threshold at each marker, t, is set to control the total false positive rate. In our framework, we vary the threshold at each marker ti and use these thresholds to incorporate prior information. We present a novel Multi-threshold Association Study Analysis (MASA) method for setting these thresholds to maximize the statistical power of the study in the context of the additional information. Intuitively, markers which are correlated with many polymorphisms will have higher thresholds than other markers. The simplest approach for encoding prior information is through assuming a causal probability distribution. In this setting, we assume that the causal polymorphism is chosen from this distribution and only one polymorphism is causal. We refer to the probability that the polymorphism i is causal as its causal probability, ci. Given the causal probabilities, using the approach presented in this paper, we can numerically solve for the marker thresholds which maximize power. By taking advantage of this information, we show how our multi-threshold framework can significantly increase the power of association studies while still controlling the overall false positive rate, $\alpha$, of the study as long as ti=$\alpha$. We present a numerical procedure for solving for thresholds that maximize association study power using prior information. We present the results of benchmark simulation experiments using the HapMap data which demonstrate a significant increase in association study power under this framework. Our optimization algorithm is very efficient and we can obtain thresholds for whole genome associations in minutes. We also present an efficient permutation procedure for correctly adjusting the false positive rate for correlated markers and show how this approach increases computational time only slightly relative to performing permutation tests for traditional association studies. We provide a webserver for performing association studies using this method at http://masa.cs.ucla.edu/. On the website, we provide thresholds optimized for the Affymetrix 500k and Illumina HumanHap 550 chips.}, keywords = {}, pubstate = {published}, tppubtype = {conference} }
- Farhad Hormozdiari, Anthony Zhu, Gleb Kichaev, Chelsea J.-T. Ju, Ayellet V. Segre, Jong Wha J. Joo, Hyejung Won, Sriram Sankararaman, Bogdan Pasaniuc, Sagiv Shifman, and Eleazar Eskin. Widespread allelic heterogeneity in complex traits. The American Journal of Human Genetics, 100(5):789-802, May 2017.
- Yue Wu, Farhad Hormozdiari, Jong Wha J. Joo, and Eleazar Eskin. Improving imputation accuracy by inferring causal variants in genetic studies. In Lecture Notes in Computer Science, pages 303-317. Springer International Publishing, 2017.
The paper was written by Jose A. Lozano, Farhad Hormozdiari, Jong Wha (Joanne) Joo, Buhm Han, and Eleazar Eskin, and it is available at: https://www.biorxiv.org/content/early/2017/10/28/208199.
The full citation for our paper is:
Jose A. Lozano, Farhad Hormozdiari, Jong Wha (Joanne) Joo, Buhm Han, Eleazar Eskin. 2017. The Multivariate Normal Distribution Framework for Analyzing Association Studies. bioRxiv doi: https://doi.org/10.1101/208199.
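Several of the papers listed above use the MVN to correct for multiple testing across correlated markers. As a rough illustration of that general idea only (this is not the SLIDE algorithm, nor code from any of these papers), the Python sketch below draws null association statistics from an MVN whose covariance is the correlation matrix between markers, and reads off the per-marker threshold that keeps the family-wise error rate at a chosen level. The function names, the toy correlation matrix in which correlation decays with marker distance, and all parameter values are assumptions made for this example.

```python
import numpy as np


def toy_ld_matrix(n_markers, rho):
    """Hypothetical LD matrix: correlation rho**|i - j| between markers i and j."""
    idx = np.arange(n_markers)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def mvn_per_marker_threshold(ld_corr, alpha=0.05, n_samples=20000, seed=0):
    """Estimate the |z| threshold that controls family-wise error at alpha.

    Assumes that, under the null, the vector of z-scores over all markers
    follows N(0, ld_corr), where ld_corr is the marker correlation matrix.
    """
    rng = np.random.default_rng(seed)
    n_markers = ld_corr.shape[0]
    # Each row is one simulated null study: one z-score per marker.
    z = rng.multivariate_normal(np.zeros(n_markers), ld_corr, size=n_samples)
    # A simulated study has a false positive if its largest |z| passes the threshold.
    max_abs_z = np.abs(z).max(axis=1)
    # The (1 - alpha) quantile of max |z| is the per-marker threshold.
    return float(np.quantile(max_abs_z, 1.0 - alpha))


if __name__ == "__main__":
    independent = mvn_per_marker_threshold(toy_ld_matrix(20, 0.0))
    correlated = mvn_per_marker_threshold(toy_ld_matrix(20, 0.9))
    print("threshold, independent markers:", round(independent, 2))
    print("threshold, correlated markers: ", round(correlated, 2))
```

With independent markers the estimated threshold should sit close to a Bonferroni-style cutoff, while correlated markers behave like fewer effective tests and so need a less stringent threshold. A plain Monte Carlo estimate like this one becomes slow and noisy for very small significance levels and millions of markers, which is part of what the more refined methods above address.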