
UCLA Computational Genetics
Joo, Jong Wha J; Hormozdiari, Farhad; Han, Buhm; Eskin, Eleazar Multiple testing correction in linear mixed models. Journal Article In: Genome Biol, 17 (1), pp. 62, 2016, ISSN: 1474-760X.
@article{Joo:GenomeBiol:2016,
  title = {Multiple testing correction in linear mixed models.},
  author = {Jong Wha J. Joo and Farhad Hormozdiari and Buhm Han and Eleazar Eskin},
  url = {http://dx.doi.org/10.1186/s13059-016-0903-6},
  issn = {1474-760X},
  year = {2016},
  date = {2016-01-01},
  journal = {Genome Biol},
  volume = {17},
  number = {1},
  pages = {62},
  address = {England},
  abstract = {BACKGROUND: Multiple hypothesis testing is a major issue in genome-wide association studies (GWAS), which often analyze millions of markers. The permutation test is considered to be the gold standard in multiple testing correction as it accurately takes into account the correlation structure of the genome. Recently, the linear mixed model (LMM) has become the standard practice in GWAS, addressing issues of population structure and insufficient power. However, none of the current multiple testing approaches are applicable to LMM. RESULTS: We were able to estimate per-marker thresholds as accurately as the gold standard approach in real and simulated datasets, while reducing the time required from months to hours. We applied our approach to mouse, yeast, and human datasets to demonstrate the accuracy and efficiency of our approach. CONCLUSIONS: We provide an efficient and accurate multiple testing correction approach for linear mixed models. We further provide an intuition about the relationships between per-marker threshold, genetic relatedness, and heritability, based on our observations in real data.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
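For readers unfamiliar with the permutation test mentioned in the abstract, the following is a minimal sketch of the classical max-statistic permutation procedure for unrelated samples. It is not the LMM-based method of the paper; the function name `max_stat_permutation_threshold`, its parameters, and the use of the absolute Pearson correlation as the per-marker statistic are assumptions made for illustration. It also assumes every marker is polymorphic (no constant genotype columns).

```python
import numpy as np


def max_stat_permutation_threshold(genotypes, phenotype, n_perm=200, alpha=0.05, seed=0):
    """Estimate a per-marker threshold on |correlation| by permuting the phenotype.

    genotypes: (n_samples, n_markers) array; phenotype: (n_samples,) array.
    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    genotypes = np.asarray(genotypes, dtype=float)
    phenotype = np.asarray(phenotype, dtype=float)
    n = phenotype.shape[0]

    # Standardize each marker so that (g.T @ y) / n is the Pearson correlation.
    g = (genotypes - genotypes.mean(axis=0)) / genotypes.std(axis=0)

    max_stats = np.empty(n_perm)
    for i in range(n_perm):
        # Permuting the phenotype breaks any genotype-phenotype association
        # while keeping the correlation structure among the markers intact.
        y = rng.permutation(phenotype)
        y = (y - y.mean()) / y.std()
        r = g.T @ y / n
        max_stats[i] = np.max(np.abs(r))

    # The (1 - alpha) quantile of the null maximum controls the
    # family-wise error rate across all markers.
    return float(np.quantile(max_stats, 1.0 - alpha))
```

Each permutation requires re-testing every marker, which is why the cost grows quickly once each test involves fitting a mixed model.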
I recently gave a talk on mixed models and confounding factors, a long-time interest of our research group, at a workshop that is part of the Evolutionary Biology and the Theory of Computing program at the Simons Institute on the UC Berkeley campus. The talk was held on February 21st. It spans many years of work in our group, including work by Hyun Min Kang (now at Michigan), Noah Zaitlen (now at UCSF), and Jimmie Ye (now at Harvard), as well as a sneak peek at very recent work by Joanne Joo, Jae-Hoon Sul and Buhm Han.
The video of the talk is available here and is also on our YouTube Channel ZarlabUCLA.
The papers covered in the talk include the EMMA and ICE papers published in 2008 and the EMMAX paper published in 2010, as well as a very new paper that should be coming out soon. The key papers from the talk are:
Kang, Hyun Min; Sul, Jae Hoon; Service, Susan K; Zaitlen, Noah A; Kong, Sit-Yee Y; Freimer, Nelson B; Sabatti, Chiara; Eskin, Eleazar Variance component model to account for sample structure in genome-wide association studies. Journal Article In: Nat Genet, 42 (4), pp. 348-54, 2010, ISSN: 1546-1718.
@article{Kang:NatGenet:2010,
  title = {Variance component model to account for sample structure in genome-wide association studies.},
  author = {Hyun Min Kang and Jae Hoon Sul and Susan K. Service and Noah A. Zaitlen and Sit-Yee Y. Kong and Nelson B. Freimer and Chiara Sabatti and Eleazar Eskin},
  url = {http://dx.doi.org/10.1038/ng.548},
  issn = {1546-1718},
  year = {2010},
  date = {2010-01-01},
  journal = {Nat Genet},
  volume = {42},
  number = {4},
  pages = {348-54},
  address = {United States},
  organization = {Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA.},
  abstract = {Although genome-wide association studies (GWASs) have identified numerous loci associated with complex traits, imprecise modeling of the genetic relatedness within study samples may cause substantial inflation of test statistics and possibly spurious associations. Variance component approaches, such as efficient mixed-model association (EMMA), can correct for a wide range of sample structures by explicitly accounting for pairwise relatedness between individuals, using high-density markers to model the phenotype distribution; but such approaches are computationally impractical. We report here a variance component approach implemented in publicly available software, EMMA eXpedited (EMMAX), that reduces the computational time for analyzing large GWAS data sets from years to hours. We apply this method to two human GWAS data sets, performing association analysis for ten quantitative traits from the Northern Finland Birth Cohort and seven common diseases from the Wellcome Trust Case Control Consortium. We find that EMMAX outperforms both principal component analysis and genomic control in correcting for sample structure.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Kang, Hyun Min; Zaitlen, Noah A; Wade, Claire M; Kirby, Andrew; Heckerman, David; Daly, Mark J; Eskin, Eleazar Efficient control of population structure in model organism association mapping. Journal Article In: Genetics, 178 (3), pp. 1709-23, 2008, ISSN: 0016-6731.
@article{Kang:Genetics:2008,
  title = {Efficient control of population structure in model organism association mapping.},
  author = {Hyun Min Kang and Noah A. Zaitlen and Claire M. Wade and Andrew Kirby and David Heckerman and Mark J. Daly and Eleazar Eskin},
  url = {http://dx.doi.org/10.1534/genetics.107.080101},
  issn = {0016-6731},
  year = {2008},
  date = {2008-01-01},
  journal = {Genetics},
  volume = {178},
  number = {3},
  pages = {1709-23},
  address = {United States},
  organization = {Department of Computer Science, University of California, Los Angeles, California 90095-1596, USA.},
  abstract = {Genomewide association mapping in model organisms such as inbred mouse strains is a promising approach for the identification of risk factors related to human diseases. However, genetic association studies in inbred model organisms are confronted by the problem of complex population structure among strains. This induces inflated false positive rates, which cannot be corrected using standard approaches applied in human association studies such as genomic control or structured association. Recent studies demonstrated that mixed models successfully correct for the genetic relatedness in association mapping in maize and Arabidopsis panel data sets. However, the currently available mixed-model methods suffer from computational inefficiency. In this article, we propose a new method, efficient mixed-model association (EMMA), which corrects for population structure and genetic relatedness in model organism association mapping. Our method takes advantage of the specific nature of the optimization problem in applying mixed models for association mapping, which allows us to substantially increase the computational speed and reliability of the results. We applied EMMA to in silico whole-genome association mapping of inbred mouse strains involving hundreds of thousands of SNPs, in addition to Arabidopsis and maize data sets. We also performed extensive simulation studies to estimate the statistical power of EMMA under various SNP effects, varying degrees of population structure, and differing numbers of multiple measurements per strain. Despite the limited power of inbred mouse association mapping due to the limited number of available inbred strains, we are able to identify significantly associated SNPs, which fall into known QTL or genes identified through previous studies while avoiding an inflation of false positives. An R package implementation and webserver of our EMMA method are publicly available.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Kang, Hyun Min; Ye, Chun; Eskin, Eleazar Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots. Journal Article In: Genetics, 180 (4), pp. 1909-25, 2008, ISSN: 0016-6731.
@article{Kang:Genetics:2008b,
  title = {Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots.},
  author = {Hyun Min Kang and Chun Ye and Eleazar Eskin},
  url = {http://dx.doi.org/10.1534/genetics.108.094201},
  issn = {0016-6731},
  year = {2008},
  date = {2008-01-01},
  journal = {Genetics},
  volume = {180},
  number = {4},
  pages = {1909-25},
  address = {United States},
  organization = {Department of Human Genetics, University of California, Los Angeles, California 90095, USA.},
  abstract = {In genomewide mapping of expression quantitative trait loci (eQTL), it is widely believed that thousands of genes are trans-regulated by a small number of genomic regions called "regulatory hotspots," resulting in "trans-regulatory bands" in an eQTL map. As several recent studies have demonstrated, technical confounding factors such as batch effects can complicate eQTL analysis by causing many spurious associations including spurious regulatory hotspots. Yet little is understood about how these technical confounding factors affect eQTL analyses and how to correct for these factors. Our analysis of data sets with biological replicates suggests that it is this intersample correlation structure inherent in expression data that leads to spurious associations between genetic loci and a large number of transcripts inducing spurious regulatory hotspots. We propose a statistical method that corrects for the spurious associations caused by complex intersample correlation of expression measurements in eQTL mapping. Applying our intersample correlation emended (ICE) eQTL mapping method to mouse, yeast, and human identifies many more cis associations while eliminating most of the spurious trans associations. The concordances of cis and trans associations have consistently increased between different replicates, tissues, and populations, demonstrating the higher accuracy of our method to identify real genetic effects.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Emrah Kostem, who graduated this year and is now at Illumina, gave a talk at our retreat this summer about the research he completed in the lab. It is available here and gives a good overview of the goals of our group and some details of the projects that Emrah completed in the lab.
One of the topics he discusses is his recently published work on estimating heritability, that is, quantifying how much of the variance of a trait is accounted for by genetics. He discusses his work on how to partition heritability into the contributions of genomic regions (doi:10.1016/j.ajhg.2013.03.010).
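As a reminder of what is being partitioned, the standard variance-component definition of heritability can be written as below. The symbols (genetic variance, environmental variance, and the number of regions R) are notation assumed here for illustration and are not taken from the paper; the additive split across regions further assumes that the regions contribute independent genetic variance.

```latex
h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2},
\qquad
\sigma_g^2 = \sum_{i=1}^{R} \sigma_{g,i}^2
\;\Longrightarrow\;
h^2 = \sum_{i=1}^{R} h_i^2,
\quad
h_i^2 = \frac{\sigma_{g,i}^2}{\sigma_g^2 + \sigma_e^2}
```

Here each region's term is that region's share of the total phenotypic variance, so the per-region contributions sum to the total heritability.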
He also talks about his work that takes advantage of the insight that association statistics follow a multivariate normal distribution, and applies this insight to two problems. The first is selecting follow-up SNPs using the results of an association study (doi:10.1534/genetics.111.128595). The second is speeding up eQTL studies using a two-stage approach in which only a fraction of the association tests are performed but virtually all of the significant associations are still discovered (doi:10.1089/cmb.2013.0087).
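To make the multivariate normal insight concrete, here is a minimal sketch of the kind of calculation it enables. It is not the procedure from either paper. It assumes two association statistics that are jointly normal with unit variance, zero mean (the null), and correlation `r`, so that the untested statistic given the tested one is normal with mean `r * z_proxy` and variance `1 - r**2`. The function name `prob_exceeds_given_proxy` and its parameters are hypothetical.

```python
import math


def _normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_exceeds_given_proxy(z_proxy, r, threshold):
    """P(|Z| > threshold | proxy statistic = z_proxy) under a bivariate normal.

    Assumes unit variances, zero means, and correlation r between the
    untested SNP statistic Z and the tested proxy statistic.
    """
    mean = r * z_proxy
    var = 1.0 - r * r
    if var <= 0.0:
        # Perfect correlation: the untested statistic is determined exactly.
        return 1.0 if abs(mean) > threshold else 0.0
    sd = math.sqrt(var)
    upper_tail = 1.0 - _normal_cdf((threshold - mean) / sd)
    lower_tail = _normal_cdf((-threshold - mean) / sd)
    return upper_tail + lower_tail
```

A two-stage scheme could, for example, skip untested SNPs whose probability of exceeding the significance threshold is negligible given the proxy statistics already computed; how the cutoff is chosen is outside the scope of this sketch.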
Details of what he talked about are in his papers:
Kostem, Emrah; Eskin, Eleazar Improving the accuracy and efficiency of partitioning heritability into the contributions of genomic regions. Journal Article In: Am J Hum Genet, 92 (4), pp. 558-64, 2013, ISSN: 1537-6605.
@article{Kostem:AmJHumGenet:2013,
  title = {Improving the accuracy and efficiency of partitioning heritability into the contributions of genomic regions.},
  author = {Emrah Kostem and Eleazar Eskin},
  url = {http://dx.doi.org/10.1016/j.ajhg.2013.03.010},
  issn = {1537-6605},
  year = {2013},
  date = {2013-01-01},
  journal = {Am J Hum Genet},
  volume = {92},
  number = {4},
  pages = {558-64},
  address = {United States},
  organization = {Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA. Electronic address: ekostem@cs.ucla.edu.},
  abstract = {Quantifying heritability, the amount of genetic contribution in a complex trait, has been of fundamental interest to geneticists for decades. Recently, partitioning the heritability accounted for by common variants into the contributions of genomic regions has received a lot of attention given its important applications for understanding the genetic architecture of complex traits. Current methods partition the total heritability by jointly estimating the contributions of all regions. However, these methods are computationally intractable and can be inaccurate when the number of regions is large. In this paper, we present an alternative approach that partitions the total heritability into the contributions of an arbitrary number of regions. We demonstrate by using simulations that our approach is more accurate and computationally efficient than current approaches. Using a data set from a genome-wide association study on human height, we demonstrate the utility of our method by estimating the heritability contributions of chromosomes and subchromosomal regions.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Kostem, Emrah; Eskin, Eleazar Efficiently Identifying Significant Associations in Genome-wide Association Studies. Journal Article In: J Comput Biol, 20 (10), pp. 817-30, 2013, ISSN: 1557-8666.
@article{Kostem:JComputBiol:2013,
  title = {Efficiently Identifying Significant Associations in Genome-wide Association Studies.},
  author = {Emrah Kostem and Eleazar Eskin},
  url = {http://dx.doi.org/10.1089/cmb.2013.0087},
  issn = {1557-8666},
  year = {2013},
  date = {2013-01-01},
  journal = {J Comput Biol},
  volume = {20},
  number = {10},
  pages = {817-30},
  address = {United States},
  organization = {Computer Science Department, University of California, Los Angeles, California.},
  abstract = {Over the past several years, genome-wide association studies (GWAS) have implicated hundreds of genes in common disease. More recently, the GWAS approach has been utilized to identify regions of the genome that harbor variation affecting gene expression or expression quantitative trait loci (eQTLs). Unlike GWAS applied to clinical traits, where only a handful of phenotypes are analyzed per study, in eQTL studies, tens of thousands of gene expression levels are measured, and the GWAS approach is applied to each gene expression level. This leads to computing billions of statistical tests and requires substantial computational resources, particularly when applying novel statistical methods such as mixed models. We introduce a novel two-stage testing procedure that identifies all of the significant associations more efficiently than testing all the single nucleotide polymorphisms (SNPs). In the first stage, a small number of informative SNPs, or proxies, across the genome are tested. Based on their observed associations, our approach locates the regions that may contain significant SNPs and only tests additional SNPs from those regions. We show through simulations and analysis of real GWAS datasets that the proposed two-stage procedure increases the computational speed by a factor of 10. Additionally, efficient implementation of our software increases the computational speed relative to the state-of-the-art testing approaches by a factor of 75.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Kostem, Emrah; Lozano, Jose A; Eskin, Eleazar Increasing Power of Genome-wide Association Studies by Collecting Additional SNPs. Journal Article In: Genetics, 2011, ISSN: 1943-2631.
@article{Kostem:Genetics:2011,
  title = {Increasing Power of Genome-wide Association Studies by Collecting Additional SNPs.},
  author = {Emrah Kostem and Jose A. Lozano and Eleazar Eskin},
  url = {http://dx.doi.org/10.1534/genetics.111.128595},
  issn = {1943-2631},
  year = {2011},
  date = {2011-01-01},
  journal = {Genetics},
  organization = {University of California, Los Angeles},
  abstract = {Genome-wide association studies (GWAS) have been effectively identifying the genomic regions associated with a disease trait. In a typical GWAS, an informative subset of the single nucleotide polymorphisms (SNPs), called tag SNPs, are genotyped in case-control individuals. Once the tag SNP statistics are computed, the genomic regions that are in linkage disequilibrium (LD) with the most significantly associated tag SNPs are believed to contain the causal polymorphisms. However, such LD regions are often large and contain many additional polymorphisms. Following up all the SNPs included in these regions is costly and infeasible for biological validation. In this paper we address how to characterize these regions cost-effectively with the goal of providing investigators a clear direction for biological validation. We introduce a follow-up study approach for identifying all untyped associated SNPs by selecting additional SNPs, called follow-up SNPs, from the associated regions and genotyping them in the original case-control individuals. We introduce a novel SNP selection method with the goal of maximizing the number of associated SNPs among the chosen follow-up SNPs. We show how the observed statistics of the original tag SNPs and human genetic variation reference data such as the HapMap Project can be utilized to identify the follow-up SNPs. We use simulated and real association studies based on the HapMap data and the Wellcome Trust Case-Control Consortium to demonstrate that our method shows superior performance than the correlation and distance based traditional follow-up SNP selection approaches. Our method is publicly available at http://genetics.cs.ucla.edu/followupSNPs.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}