Accounting for Population Structure in Gene-by-Environment Interactions in Genome-Wide Association Studies Using Mixed Models

This year, our group published a paper in PLOS Genetics that describes our efforts to better understand and correct for population structure when computing gene-by-environment (GEI) statistics in genome-wide association studies (GWASs). We use simulated and actual GWAS datasets to demonstrate that population structure, the relatedness of individuals within a cohort, inflates test statistics for both GEIs and genetic variants. We present a novel mixed model method capable of improving accuracy when computing GEI statistics in GWAS. This method can be efficiently applied to GWAS datasets containing thousands of individuals and hundreds of thousands of SNPs.

GWASs have discovered many genetic variants associated with complex traits and diseases, yet these genetic variants explain only a small fraction of phenotypic variance in the human genome. Other sources of phenotypic variance include discrete environmental factors and GEIs, complex interactions between an individual’s genetic material and environmental factors. Recent GEI association analyses have demonstrated the importance of GEIs in complex traits and disease development. Identification of these causal GEIs would provide insight into disease pathways, particularly the effects of environmental factors in disease risk, and guide development of novel diagnostic tools and personalized therapies.

Several methodological challenges have limited successful identification of causal GEIs. As with standard GWAS approaches, GxE GWASs are prone to produce an inflated number of associations due to population structure. Unlike standard GWASs, we lack a method designed to avoid detection of these spurious associations when computing GEI statistics. Accounting for genetic similarity with a standard GWAS approach does control inflation of test statistics for causal SNPs, but does not control inflation of associated GEIs. Simultaneously accounting for both similarities would control both types of population structure known to confound GWASs—false associations caused by SNPs under selection and those caused by the remaining SNPs.

Our linear mixed model approach introduces two random effects and takes into account two types of similarities between individuals: overlap in the genome itself and overlap in genetic expression caused by complex interactions between genes and environment. We use a pair of kinship matrices corresponding to the two types of similarity to include these two random effects in the model and correct for population structure.

In order to better understand false associations in GxE GWASs, we compare our approach to two standard approaches. We apply the three methods to two large genomic datasets, one human and one mouse, that are known to contain population structure and have many quantitative phenotypes to test effect of GEIs. We use a standard GWAS method that does not correct for population structure (defined as “OLS” in our paper) and an approach that performs population structure correction for only SNP statistics (“One RE”). The last approach is our proposed mixed model approach that uses both genetic and GxE kinship to correct for population structure on both SNP and GEI statistics (“Two RE”).


Distribution of inflation factors of GEI statistics on HMDP GxE GWAS data. (A) Inflation factor for each phenotype with no population structure correction (OLS), population structure correction for SNP statistics (One RE), and population structure correction for both SNP and GEI statistics (Two RE). (B) QQ plot of one of the phenotypes (free fatty acids, ffa), showing the distributions of p-values of GEI statistics for the three methods.

In both datasets, even a moderate amount of population structure causes spurious GEIs when using standard approaches for identifying GEI in GWAS. While the One RE approach reduces inflation of test statistics on SNPs (see Supplement S1 Figure), it has almost the same or slightly higher inflation factors on GxE statistics when compared to OLS. Results from both datasets suggest that our approach effectively controls population structure when computing statistics for GEIs and genetic variants. We hope our method is useful advancing our understanding of how life-history influences an individual’s disease risk.

This project was led by Jae Hoon Sul and involved Michael Bilow. The article is available at:

The full citation to our paper is: 

Sul, Jae Hoon; Bilow, Michael; Yang, Wen-Yun Y; Kostem, Emrah; Furlotte, Nick; He, Dan; Eskin, Eleazar

Accounting for Population Structure in Gene-by-Environment Interactions in Genome-Wide Association Studies Using Mixed Models. Journal Article

In: PLoS Genet, 12 (3), pp. e1005849, 2016, ISSN: 1553-7404.

Abstract | Links | BibTeX

This approach uses our PyLMM software package available for download at:

Imputing Phenotypes for Genome-wide Association Studies


Pairwise correlation between each phenotype pair in the NFBC dataset.

In genome-wide association studies (GWAS), investigators identify variants that are significantly associated with the phenotype by collecting and performing statistical tests on genotypes and phenotypes from a set of individuals. Recently, GWAS samples have increased in size to include tens or hundreds of thousands of variants. Studies working with such large datasets have recently discovered hundreds of variants involved in multiple common diseases (Schunkert et al. 2011; Voight et al. 2010). For the most part, identified variants have very small effect sizes, suggesting that larger association studies are capable of implicating more variants.

Increasing the size of GWAS samples is a shared goal among bioinformatics researchers. Unfortunately, some phenotypes are either logistically difficult or very expensive to collect. For these phenotypes, it is impractical to perform GWAS with tens or hundreds of thousands of individuals. Examples of these difficult-to-collect phenotypes include those that require obtaining an inaccessible tissue (such as brain expression), using a complex intervention (such as a response to diet), and re-contacting individuals simply because they were unmeasured in the original cohort. For these phenotypes, an investigator finds it difficult to collect samples large enough to discover variants with small effect sizes. As a result, it is unlikely that GWAS will perform effectively on these phenotypes.

To address this issue, we developed a novel approach we call phenotype imputation. In our method, we estimate and leverage the correlation structure between multiple phenotypes to impute the uncollected phenotype. A paper presenting our approach was accepted by and is in press with the American Journal of Human Genetics.

In order to leverage the correlation structure between multiple phenotypes, we first estimate the correlation structure from a complete dataset that includes all phenotypes. We then use the conditional distribution based on the multivariate normal (MVN) statistical framework to impute the uncollected phenotypes in an incomplete dataset. Our approach uses only phenotypic—not genetic—information, enabling subsequent use of these imputed phenotypes for association testing without incurring data re-use. For GWAS including both complete and incomplete datasets, we provide an optimal meta-analysis strategy that accounts for imputation uncertainties by combining association results from both collected and imputed phenotypes. Further, our paper demonstrates that phenotype imputation can be performed using summary statistics. This result makes our method applicable to datasets where we only have access to the summary statistics and not the raw genotypes and phenotypes.

In our forthcoming AJHG paper, we use the Northern Finland Birth Cohort (NFBC) data to assess the performance of our novel method. The NFBC dataset consists of 10 phenotypes collected from 5,327 individuals. The 10 phenotypes are triglycerides (TG), highdensity lipoproteins (HDL), low-density lipoproteins (LDL), glucose (GLU), insulin (INS), body mass index (BMI), C-reactive protein (CRP) as a measure of inflammation, systolic blood pressure (SBP), diastolic blood pressure (DBP), and height. The genotype data consists of 331,476 SNPs.

Imputing the TG, BMI, and SBP phenotypes enable us to recover most of the significantly associated loci in the original data at the nominal significance level, as shown in the above figure. This result demonstrates that the imputed phenotype can effectively be used for replication purposes, even though it might not provide sufficient power for discovery purposes due to imputation uncertainties.

Our approach allows us to know the exact distribution of the imputed phenotype due to our parametric assumptions. We can directly use the mean value of this distribution as the imputed value. Furthermore, we utilize the variance of the missing phenotype in our analysis of the statistical power. The primary advantage of our framework is that it increases the power of GWASs on phenotypes that are difficult to collect. Analytical power computation is provided that allows investigators to determine the benefit of the imputation for a given dataset prospectively. Another advantage of this method is that it allows the use of summary statistics when the raw genotypes are not available.

This project was led by Farhad Hormozdiari and involved Michael Bilow. The article is available at:

The full citation to our paper is:

Hormozdiari, Farhad ; Kang, Eun Yong ; Bilow, Michael ; Ben-David, Eyal ; Vulpe, Chris ; McLachlan, Stela ; Lusis, Aldons J; Han, Buhm ; Eskin, Eleazar

Imputing Phenotypes for Genome-wide Association Studies. Journal Article

In: Am J Hum Genet, 99 (1), pp. 89-103, 2016, ISSN: 1537-6605.

Abstract | Links | BibTeX