Golden Helix yesterday hosted an excellent webcast on correcting for population structure in association studies using mixed models, and they highlighted our EMMA (10.1534/genetics.107.080101) and EMMAX (10.1038/ng.548) algorithms. The webcast was given by Greta Peterson and is available at http://www.goldenhelix.com/Events/recordings/mlm/index.html. It is a great overview of mixed models applied to population structure in general, and it also shows specifically how to apply mixed models in association studies with the Golden Helix software.
An interesting aspect of the story is that we found out about the webcast from an email advertising that it would cover the EMMAX algorithm. It turns out that 863 people registered for the webcast, which surpassed their previous record (for a webcast on NGS) by almost 100! It is exciting to see how much interest there is in mixed models and in our EMMA paper, which we published in 2008.
On our website, we have a bunch of resources for mixed models, including the EMMA, EMMAX and ICE software. We recently posted an overview of mixed models here. Below is a list of our papers related to mixed models.
1. | Yao, Douglas W; Balanis, Nikolas G; Eskin, Eleazar; Graeber, Thomas G A linear mixed model approach to gene expression-tumor aneuploidy association studies Journal Article In: Sci. Rep., 9 (1), pp. 11944, 2019, ISSN: 2045-2322. @article{Yao2019-zq, title = {A linear mixed model approach to gene expression-tumor aneuploidy association studies}, author = {Douglas W Yao and Nikolas G Balanis and Eleazar Eskin and Thomas G Graeber}, url = {http://dx.doi.org/10.1038/s41598-019-48302-1}, doi = {10.1038/s41598-019-48302-1}, issn = {2045-2322}, year = {2019}, date = {2019-08-01}, journal = {Sci. Rep.}, volume = {9}, number = {1}, pages = {11944}, address = {England}, abstract = {Aneuploidy, defined as abnormal chromosome number or somatic DNA copy number, is a characteristic of many aggressive tumors and is thought to drive tumorigenesis. Gene expression-aneuploidy association studies have previously been conducted to explore cellular mechanisms associated with aneuploidy. However, in an observational setting, gene expression is influenced by many factors that can act as confounders between gene expression and aneuploidy, leading to spurious correlations between the two variables. These factors include known confounders such as sample purity or batch effect, as well as gene co-regulation which induces correlations between the expression of causal genes and non-causal genes. We use a linear mixed-effects model (LMM) to account for confounding effects of tumor purity and gene co-regulation on gene expression-aneuploidy associations. 
When applied to patient tumor data across diverse tumor types, we observe that the LMM both accounts for the impact of purity on aneuploidy measurements and identifies a new association between histone gene expression and aneuploidy.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
2. | Sul, Jae Hoon; Martin, Lana S; Eskin, Eleazar Population structure in genetic studies: Confounding factors and mixed models. Journal Article In: PLoS Genet, 14 (12), pp. e1007309, 2018, ISSN: 1553-7404. @article{Sul:PlosGenet:2018, title = {Population structure in genetic studies: Confounding factors and mixed models.}, author = {Jae Hoon Sul and Lana S. Martin and Eleazar Eskin}, url = {http://dx.doi.org/10.1371/journal.pgen.1007309}, issn = {1553-7404}, year = {2018}, date = {2018-01-01}, journal = {PLoS Genet}, volume = {14}, number = {12}, pages = {e1007309}, address = {United States}, organization = {Department of Psychiatry and Biobehavioral Sciences, University of California Los Angeles, Los Angeles, California, United States of America.}, abstract = {A genome-wide association study (GWAS) seeks to identify genetic variants that contribute to the development and progression of a specific disease. Over the past 10 years, new approaches using mixed models have emerged to mitigate the deleterious effects of population structure and relatedness in association studies. However, developing GWAS techniques to accurately test for association while correcting for population structure is a computational and statistical challenge. Using laboratory mouse strains as an example, our review characterizes the problem of population structure in association studies and describes how it can cause false positive associations. We then motivate mixed models in the context of unmodeled factors}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
3. | Sul, Jae Hoon; Bilow, Michael; Yang, Wen-Yun Y; Kostem, Emrah; Furlotte, Nick; He, Dan; Eskin, Eleazar Accounting for Population Structure in Gene-by-Environment Interactions in Genome-Wide Association Studies Using Mixed Models. Journal Article In: PLoS Genet, 12 (3), pp. e1005849, 2016, ISSN: 1553-7404. @article{Sul:PlosGenet:2016, title = {Accounting for Population Structure in Gene-by-Environment Interactions in Genome-Wide Association Studies Using Mixed Models.}, author = {Jae Hoon Sul and Michael Bilow and Wen-Yun Y. Yang and Emrah Kostem and Nick Furlotte and Dan He and Eleazar Eskin}, url = {http://dx.doi.org/10.1371/journal.pgen.1005849}, issn = {1553-7404}, year = {2016}, date = {2016-01-01}, journal = {PLoS Genet}, volume = {12}, number = {3}, pages = {e1005849}, address = {United States}, abstract = {Although genome-wide association studies (GWASs) have discovered numerous novel genetic variants associated with many complex traits and diseases, those genetic variants typically explain only a small fraction of phenotypic variance. Factors that account for phenotypic variance include environmental factors and gene-by-environment interactions (GEIs). Recently, several studies have conducted genome-wide gene-by-environment association analyses and demonstrated important roles of GEIs in complex traits. One of the main challenges in these association studies is to control effects of population structure that may cause spurious associations. Many studies have analyzed how population structure influences statistics of genetic variants and developed several statistical approaches to correct for population structure. However, the impact of population structure on GEI statistics in GWASs has not been extensively studied and nor have there been methods designed to correct for population structure on GEI statistics. 
In this paper, we show both analytically and empirically that population structure may cause spurious GEIs and use both simulation and two GWAS datasets to support our finding. We propose a statistical approach based on mixed models to account for population structure on GEI statistics. We find that our approach effectively controls population structure on statistics for GEIs as well as for genetic variants}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
4. | Joo, Jong Wha J; Hormozdiari, Farhad; Han, Buhm; Eskin, Eleazar Multiple testing correction in linear mixed models. Journal Article In: Genome Biol, 17 (1), pp. 62, 2016, ISSN: 1474-760X. @article{Joo:GenomeBiol:2016, title = {Multiple testing correction in linear mixed models.}, author = {Jong Wha J. Joo and Farhad Hormozdiari and Buhm Han and Eleazar Eskin}, url = {http://dx.doi.org/10.1186/s13059-016-0903-6}, issn = {1474-760X}, year = {2016}, date = {2016-01-01}, journal = {Genome Biol}, volume = {17}, number = {1}, pages = {62}, address = {England}, abstract = {BACKGROUND: Multiple hypothesis testing is a major issue in genome-wide association studies (GWAS), which often analyze millions of markers. The permutation test is considered to be the gold standard in multiple testing correction as it accurately takes into account the correlation structure of the genome. Recently, the linear mixed model (LMM) has become the standard practice in GWAS, addressing issues of population structure and insufficient power. However, none of the current multiple testing approaches are applicable to LMM. RESULTS: We were able to estimate per-marker thresholds as accurately as the gold standard approach in real and simulated datasets, while reducing the time required from months to hours. We applied our approach to mouse, yeast, and human datasets to demonstrate the accuracy and efficiency of our approach. CONCLUSIONS: We provide an efficient and accurate multiple testing correction approach for linear mixed models. We further provide an intuition about the relationships between per-marker threshold, genetic relatedness, and heritability, based on our observations in real data}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
5. | Joo, Jong Wha J; Kang, Eun Yong; Org, Elin; Furlotte, Nick; Parks, Brian; Hormozdiari, Farhad; Lusis, Aldons J; Eskin, Eleazar Efficient and Accurate Multiple-Phenotype Regression Method for High Dimensional Data Considering Population Structure. Journal Article In: Genetics, 204 (4), pp. 1379-1390, 2016, ISSN: 1943-2631. @article{Joo:Genetics:2016, title = {Efficient and Accurate Multiple-Phenotype Regression Method for High Dimensional Data Considering Population Structure.}, author = { Jong Wha J. Joo and Eun Yong Kang and Elin Org and Nick Furlotte and Brian Parks and Farhad Hormozdiari and Aldons J. Lusis and Eleazar Eskin}, url = {http://dx.doi.org/10.1534/genetics.116.189712}, issn = {1943-2631}, year = {2016}, date = {2016-01-01}, journal = {Genetics}, volume = {204}, number = {4}, pages = {1379-1390}, address = {United States}, organization = {Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, California.}, abstract = {A typical genome-wide association study tests correlation between a single phenotype and each genotype one at a time. However, single-phenotype analysis might miss unmeasured aspects of complex biological networks. Analyzing many phenotypes simultaneously may increase the power to capture these unmeasured aspects and detect more variants. Several multivariate approaches aim to detect variants related to more than one phenotype, but these current approaches do not consider the effects of population structure. As a result, these approaches may result in a significant amount of false positive identifications. Here, we introduce a new methodology, referred to as GAMMA for generalized analysis of molecular variance for mixed-model analysis, which is capable of simultaneously analyzing many phenotypes and correcting for population structure. 
In a simulated study using data implanted with true genetic effects, GAMMA accurately identifies these true effects without producing false positives induced by population structure. In simulations with this data, GAMMA is an improvement over other methods which either fail to detect true effects or produce many false positive identifications. We further apply our method to genetic studies of yeast and gut microbiome from mice and show that GAMMA identifies several variants that are likely to have true biological mechanisms}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
6. | Schweiger, Regev; Kaufman, Shachar; Laaksonen, Reijo; Kleber, Marcus E; März, Winfried; Eskin, Eleazar; Rosset, Saharon; Halperin, Eran Fast and Accurate Construction of Confidence Intervals for Heritability. Journal Article In: Am J Hum Genet, 98 (6), pp. 1181-92, 2016, ISSN: 1537-6605. @article{Schweiger:AmJHumGenet:2016, title = {Fast and Accurate Construction of Confidence Intervals for Heritability.}, author = { Regev Schweiger and Shachar Kaufman and Reijo Laaksonen and Marcus E. Kleber and Winfried März and Eleazar Eskin and Saharon Rosset and Eran Halperin}, url = {http://dx.doi.org/10.1016/j.ajhg.2016.04.016}, issn = {1537-6605}, year = {2016}, date = {2016-01-01}, journal = {Am J Hum Genet}, volume = {98}, number = {6}, pages = {1181-92}, address = {United States}, abstract = {Estimation of heritability is fundamental in genetic studies. Recently, heritability estimation using linear mixed models (LMMs) has gained popularity because these estimates can be obtained from unrelated individuals collected in genome-wide association studies. Typically, heritability estimation under LMMs uses the restricted maximum likelihood (REML) approach. Existing methods for the construction of confidence intervals and estimators of SEs for REML rely on asymptotic properties. However, these assumptions are often violated because of the bounded parameter space, statistical dependencies, and limited sample size, leading to biased estimates and inflated or deflated confidence intervals. Here, we show that the estimation of confidence intervals by state-of-the-art methods is inaccurate, especially when the true heritability is relatively low or relatively high. We further show that these inaccuracies occur in datasets including thousands of individuals. Such biases are present, for example, in estimates of heritability of gene expression in the Genotype-Tissue Expression project and of lipid profiles in the Ludwigshafen Risk and Cardiovascular Health study. 
We also show that often the probability that the genetic component is estimated as 0 is high even when the true heritability is bounded away from 0, emphasizing the need for accurate confidence intervals. We propose a computationally efficient method, ALBI (accurate LMM-based heritability bootstrap confidence intervals), for estimating the distribution of the heritability estimator and for constructing accurate confidence intervals. Our method can be used as an add-on to existing methods for estimating heritability and variance components, such as GCTA, FaST-LMM, GEMMA, or EMMAX}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
7. | Furlotte, Nicholas A; Eskin, Eleazar Efficient Multiple Trait Association and Estimation of Genetic Correlation Using the Matrix-Variate Linear Mixed-Model. Journal Article In: Genetics, 200 (1), pp. 59-68, 2015, ISSN: 1943-2631. @article{Furlotte:Genetics:2015b, title = {Efficient Multiple Trait Association and Estimation of Genetic Correlation Using the Matrix-Variate Linear Mixed-Model.}, author = { Nicholas A. Furlotte and Eleazar Eskin}, url = {http://dx.doi.org/10.1534/genetics.114.171447}, issn = {1943-2631}, year = {2015}, date = {2015-01-01}, journal = {Genetics}, volume = {200}, number = {1}, pages = {59-68}, address = {United States}, abstract = {Multiple trait association mapping, in which multiple traits are used simultaneously in the identification of genetic variants affecting those traits, has recently attracted interest. One class of approaches for this problem builds on classical variance component methodology, utilizing a multi-trait version of a linear mixed-model. These approaches both increase power and provide insights into the genetic architecture of multiple traits. In particular, it is possible to estimate the genetic correlation which is a measure of the portion of the total correlation between traits that is due to additive genetic effects. Unfortunately, the practical utility of these methods is limited since they are computationally intractable for large sample sizes. In this paper, we introduce a reformulation of the multiple trait association mapping approach by defining the matrix-variate linear mixed model. Our approach reduces the computational time necessary to perform maximum-likelihood inference in a multiple trait model by utilizing a data transformation. By utilizing a well-studied human cohort, we show that our approach provides more than a 10-fold speed up, making multiple trait association feasible in a large population cohort on the genome-wide scale. 
We take advantage of the efficiency of our approach to analyze gene expression data. By decomposing gene coexpression into a genetic and environmental component, we show that our method provides fundamental insights into the nature of co-expressed genes. An implementation of this method is available at http://genetics.cs.ucla.edu/mvLMM}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
8. | Kostem, Emrah; Eskin, Eleazar Improving the accuracy and efficiency of partitioning heritability into the contributions of genomic regions. Journal Article In: Am J Hum Genet, 92 (4), pp. 558-64, 2013, ISSN: 1537-6605. @article{Kostem:AmJHumGenet:2013, title = {Improving the accuracy and efficiency of partitioning heritability into the contributions of genomic regions.}, author = { Emrah Kostem and Eleazar Eskin}, url = {http://dx.doi.org/10.1016/j.ajhg.2013.03.010}, issn = {1537-6605}, year = {2013}, date = {2013-01-01}, journal = {Am J Hum Genet}, volume = {92}, number = {4}, pages = {558-64}, address = {United States}, organization = {Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA. Electronic address: ekostem@cs.ucla.edu.}, abstract = {Quantifying heritability, the amount of genetic contribution in a complex trait, has been of fundamental interest to geneticists for decades. Recently, partitioning the heritability accounted for by common variants into the contributions of genomic regions has received a lot of attention given its important applications for understanding the genetic architecture of complex traits. Current methods partition the total heritability by jointly estimating the contributions of all regions. However, these methods are computationally intractable and can be inaccurate when the number of regions is large. In this paper, we present an alternative approach that partitions the total heritability into the contributions of an arbitrary number of regions. We demonstrate by using simulations that our approach is more accurate and computationally efficient than current approaches. 
Using a data set from a genome-wide association study on human height, we demonstrate the utility of our method by estimating the heritability contributions of chromosomes and subchromosomal regions}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
9. | Sul, Jae Hoon; Eskin, Eleazar Mixed models can correct for population structure for genomic regions under selection. Journal Article In: Nat Rev Genet, 2013, ISSN: 1471-0064. @article{Sul:NatRevGenet:2013, title = {Mixed models can correct for population structure for genomic regions under selection.}, author = { Jae Hoon Sul and Eleazar Eskin}, url = {http://dx.doi.org/10.1038/nrg2813-c1}, issn = {1471-0064}, year = {2013}, date = {2013-01-01}, journal = {Nat Rev Genet}, organization = {Computer Science Department, University of California, Los Angeles, California 90095, USA.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
10. | Sul, Jae Hoon; Han, Buhm; Ye, Chun; Choi, Ted; Eskin, Eleazar Effectively Identifying eQTLs from Multiple Tissues by Combining Mixed Model and Meta-analytic Approaches Journal Article In: PLoS Genet, 9 (6), pp. e1003491, 2013, ISSN: 1553-7404. @article{10.1371/journal.pgen.1003491, title = {Effectively Identifying eQTLs from Multiple Tissues by Combining Mixed Model and Meta-analytic Approaches}, author = {Jae Hoon Sul and Buhm Han and Chun Ye and Ted Choi and Eleazar Eskin}, url = {http://dx.doi.org/10.1371%2Fjournal.pgen.1003491}, issn = {1553-7404}, year = {2013}, date = {2013-01-01}, journal = {PLoS Genet}, volume = {9}, number = {6}, pages = {e1003491}, publisher = {Public Library of Science}, address = {United States}, abstract = {Author Summary The combination of gene expression and genetic variation data has enabled the identification of genetic variants that affect gene expression levels. It has been shown that some variants influence gene expression in only one tissue while others influence gene expression in multiple tissues. However, an analysis of multiple tissue data using traditional statistical methods typically fails to identify those variants that affect multiple tissues because each tissue is treated independently and due to low statistical power, the effect in a given tissue may be missed. Building on recent advances in statistical methods for meta-analysis and mixed models, we present a novel method that combines information from multiple tissues to identify genetic variation that affects multiple tissues. We show that our method detects more genetic variation that influences multiple tissues than traditional statistical methods both on simulated and real data.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
11. | Listgarten, Jennifer; Lippert, Christoph; Kadie, Carl M; Davidson, Robert I; Eskin, Eleazar; Heckerman, David Improved linear mixed models for genome-wide association studies. Journal Article In: Nat Methods, 9 (6), pp. 525-6, 2012, ISSN: 1548-7105. @article{Listgarten:NatMethods:2012, title = {Improved linear mixed models for genome-wide association studies.}, author = {Jennifer Listgarten and Christoph Lippert and Carl M. Kadie and Robert I. Davidson and Eleazar Eskin and David Heckerman}, url = {http://dx.doi.org/10.1038/nmeth.2037}, issn = {1548-7105}, year = {2012}, date = {2012-01-01}, journal = {Nat Methods}, volume = {9}, number = {6}, pages = {525-6}, address = {United States}, organization = {1] Microsoft Research, Los Angeles, California, USA. [2].}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
12. | Kenny, Eimear E; Kim, Minseung; Gusev, Alexander; Lowe, Jennifer K; Salit, Jacqueline; Smith, J Gustav; Kovvali, Sirisha; Kang, Hyun Min; Newton-Cheh, Christopher; Daly, Mark J; Stoffel, Markus; Altshuler, David M; Friedman, Jeffrey M; Eskin, Eleazar; Breslow, Jan L; Pe'er, Itsik Increased power of mixed models facilitates association mapping of 10 loci for metabolic traits in an isolated population. Journal Article In: Hum Mol Genet, 20 (4), pp. 827-39, 2010, ISSN: 1460-2083. @article{Kenny:HumMolGenet:2010, title = {Increased power of mixed models facilitates association mapping of 10 loci for metabolic traits in an isolated population.}, author = {Eimear E. Kenny and Minseung Kim and Alexander Gusev and Jennifer K. Lowe and Jacqueline Salit and J. Gustav Smith and Sirisha Kovvali and Hyun Min Kang and Christopher Newton-Cheh and Mark J. Daly and Markus Stoffel and David M. Altshuler and Jeffrey M. Friedman and Eleazar Eskin and Jan L. Breslow and Itsik Pe'er}, url = {http://dx.doi.org/10.1093/hmg/ddq510}, issn = {1460-2083}, year = {2010}, date = {2010-01-01}, journal = {Hum Mol Genet}, volume = {20}, number = {4}, pages = {827-39}, address = {England}, organization = {Department of Computer Science, Columbia University, 505 Computer Science Building, 1214 Amsterdam Ave.: Mailcode 0401, New York, NY 10027-7003, USA.}, abstract = {The potential benefits of using population isolates in genetic mapping, such as reduced genetic, phenotypic and environmental heterogeneity, are offset by the challenges posed by the large amounts of direct and cryptic relatedness in these populations confounding basic assumptions of independence. 
We have evaluated four representative specialized methods for association testing in the presence of relatedness; (i) within-family (ii) within- and between-family and (iii) mixed-models methods, using simulated traits for 2906 subjects with known genome-wide genotype data from an extremely isolated population, the Island of Kosrae, Federated States of Micronesia. We report that mixed models optimally extract association information from such samples, demonstrating 88% power to rank the true variant as among the top 10 genome-wide with 56% achieving genome-wide significance, a >80% improvement over the other methods, and demonstrate that population isolates have similar power to non-isolate populations for observing variants of known effects. We then used the mixed-model method to reanalyze data for 17 published phenotypes relating to metabolic traits and electrocardiographic measures, along with another 8 previously unreported. We replicate nine genome-wide significant associations with known loci of plasma cholesterol, high-density lipoprotein, low-density lipoprotein, triglycerides, thyroid stimulating hormone, homocysteine, C-reactive protein and uric acid, with only one detected in the previous analysis of the same traits. Further, we leveraged shared identity-by-descent genetic segments in the region of the uric acid locus to fine-map the signal, refining the known locus by a factor of 4. Finally, we report a novel association for height (rs17629022, P < 2.1 $\times$ 10$^{-8}$).}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
13. | Kang, Hyun Min; Sul, Jae Hoon; Service, Susan K; Zaitlen, Noah A; Kong, Sit-Yee Y; Freimer, Nelson B; Sabatti, Chiara; Eskin, Eleazar Variance component model to account for sample structure in genome-wide association studies. Journal Article In: Nat Genet, 42 (4), pp. 348-54, 2010, ISSN: 1546-1718. @article{Kang:NatGenet:2010, title = {Variance component model to account for sample structure in genome-wide association studies.}, author = {Hyun Min Kang and Jae Hoon Sul and Susan K. Service and Noah A. Zaitlen and Sit-Yee Y. Kong and Nelson B. Freimer and Chiara Sabatti and Eleazar Eskin}, url = {http://dx.doi.org/10.1038/ng.548}, issn = {1546-1718}, year = {2010}, date = {2010-01-01}, journal = {Nat Genet}, volume = {42}, number = {4}, pages = {348-54}, address = {United States}, organization = {Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA.}, abstract = {Although genome-wide association studies (GWASs) have identified numerous loci associated with complex traits, imprecise modeling of the genetic relatedness within study samples may cause substantial inflation of test statistics and possibly spurious associations. Variance component approaches, such as efficient mixed-model association (EMMA), can correct for a wide range of sample structures by explicitly accounting for pairwise relatedness between individuals, using high-density markers to model the phenotype distribution; but such approaches are computationally impractical. We report here a variance component approach implemented in publicly available software, EMMA eXpedited (EMMAX), that reduces the computational time for analyzing large GWAS data sets from years to hours. We apply this method to two human GWAS data sets, performing association analysis for ten quantitative traits from the Northern Finland Birth Cohort and seven common diseases from the Wellcome Trust Case Control Consortium. 
We find that EMMAX outperforms both principal component analysis and genomic control in correcting for sample structure.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
14. | Kang, Hyun Min; Ye, Chun; Eskin, Eleazar Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots. Journal Article In: Genetics, 180 (4), pp. 1909-25, 2008, ISSN: 0016-6731. @article{Kang:Genetics:2008b, title = {Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots.}, author = {Hyun Min Kang and Chun Ye and Eleazar Eskin}, url = {http://dx.doi.org/10.1534/genetics.108.094201}, issn = {0016-6731}, year = {2008}, date = {2008-01-01}, journal = {Genetics}, volume = {180}, number = {4}, pages = {1909-25}, address = {United States}, organization = {Department of Human Genetics, University of California, Los Angeles, California 90095, USA.}, abstract = {In genomewide mapping of expression quantitative trait loci (eQTL), it is widely believed that thousands of genes are trans-regulated by a small number of genomic regions called "regulatory hotspots," resulting in "trans-regulatory bands" in an eQTL map. As several recent studies have demonstrated, technical confounding factors such as batch effects can complicate eQTL analysis by causing many spurious associations including spurious regulatory hotspots. Yet little is understood about how these technical confounding factors affect eQTL analyses and how to correct for these factors. Our analysis of data sets with biological replicates suggests that it is this intersample correlation structure inherent in expression data that leads to spurious associations between genetic loci and a large number of transcripts inducing spurious regulatory hotspots. We propose a statistical method that corrects for the spurious associations caused by complex intersample correlation of expression measurements in eQTL mapping. 
Applying our intersample correlation emended (ICE) eQTL mapping method to mouse, yeast, and human identifies many more cis associations while eliminating most of the spurious trans associations. The concordances of cis and trans associations have consistently increased between different replicates, tissues, and populations, demonstrating the higher accuracy of our method to identify real genetic effects.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |