Gene-Gene Interactions Detection Using a Two-stage Model

Jerry Wang and Jae Hoon Sul, two lab alumni, published a paper introducing new software that implements a two-stage model for detecting associations between traits and pairs of SNPs: the threshold-based efficient pairwise association approach (TEPAA). The method is significantly faster than the traditional approach of performing an association test on all pairs of SNPs. In the first stage, the method performs the single marker test on all individual SNPs and selects the subset of SNPs that exceed a SNP-specific, predetermined significance threshold for further consideration. In the second stage, the SNPs selected in the first stage are paired with each other, and the pairwise association test is performed on those pairs.
The key insight of the approach is that the joint distribution between the association statistics of single SNPs and the association statistics of pairs of SNPs can be derived. This joint distribution guarantees that the statistical power of the two-stage approach closely approximates that of the brute force approach: the analytical power of the two-stage model can be computed accurately and compared to the power of the brute force approach (see the figure). Hence, the method chooses as few SNPs as possible in the first stage while achieving almost the same power as the brute force approach.
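The two stages described above can be sketched in a few lines of Python. This is only an illustration of the filtering logic, not the TEPAA software itself: the function name two_stage_scan, the dictionary inputs, and the pair_test callable are all hypothetical, and the statistics are assumed to be already computed.

```python
from itertools import combinations


def two_stage_scan(single_stats, thresholds, pair_test, pair_threshold):
    """Return (first-stage survivors, significant pairs).

    single_stats   : dict mapping SNP id -> single-marker association statistic
    thresholds     : dict mapping SNP id -> first-stage threshold for that SNP
    pair_test      : callable (snp_a, snp_b) -> pairwise statistic (hypothetical)
    pair_threshold : threshold applied to the pairwise statistic
    """
    # Stage 1: keep only SNPs whose single-marker statistic clears
    # their own SNP-specific threshold.
    survivors = sorted(
        snp for snp, stat in single_stats.items()
        if abs(stat) >= thresholds[snp]
    )

    # Stage 2: run the pairwise test only on pairs of survivors,
    # instead of on every pair of SNPs.
    hits = []
    for a, b in combinations(survivors, 2):
        stat = pair_test(a, b)
        if abs(stat) >= pair_threshold:
            hits.append((a, b, stat))
    return survivors, hits
```

With m SNPs, the brute force approach runs about m²/2 pairwise tests; if only k SNPs survive the first stage, the second stage runs about k²/2, which is where the savings come from.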
The power loss region of the threshold-based efficient pairwise association approach (TEPAA). The contour lines represent the probability density function of the multivariate normal distribution (MVN). T1 is the threshold for the first stage. Any SNP with a higher significance than T1 will be passed on to the second stage. T2 is the threshold for significance of the pairwise test. The area surrounded by the red rectangle corresponds to the power loss region.
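To get a feel for the power loss region, one can estimate it by simulation. The sketch below is a deliberately simplified stand-in for the analytical computation: it assumes a single first-stage statistic and a single pairwise statistic that are jointly normal with unit variances and correlation rho, uses one-sided thresholds, and all parameter names and values are made up for illustration.

```python
import math
import random


def simulate_power_loss(mean_single, mean_pair, rho, t1, t2, n=100_000, seed=0):
    """Monte Carlo estimate of (brute force power, two-stage power, relative loss).

    Assumes a bivariate normal with unit variances and correlation rho.
    The power loss region is: pairwise statistic above t2 (brute force would
    find the pair) but single-SNP statistic not above t1 (stage 1 drops it).
    """
    rng = random.Random(seed)
    brute = 0      # draws where the pairwise statistic exceeds t2
    two_stage = 0  # draws that also pass the first-stage threshold t1
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        s_single = mean_single + z1
        # Mixing z1 and z2 this way gives correlation rho with s_single.
        s_pair = mean_pair + rho * z1 + math.sqrt(1.0 - rho * rho) * z2
        if s_pair > t2:
            brute += 1
            if s_single > t1:
                two_stage += 1
    loss = (brute - two_stage) / brute if brute else 0.0
    return brute / n, two_stage / n, loss
```

Lowering t1 shrinks the power loss region but lets more SNPs through to the second stage, which is the trade-off the thresholds are chosen to balance.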

Jerry and Jae Hoon demonstrate the utility of TEPAA by applying it to the Northern Finland Birth Cohort (Rantakallio, 1969; Jarvelin et al., 2004). From their analysis, they observe that the thresholds that control the power loss of the two-stage approach depend on the minor allele frequency (MAF) of the SNPs. In particular, more common SNPs can be filtered out with less significant thresholds than rare SNPs. To implement TEPAA efficiently with MAF-dependent thresholds for each pair, they group the SNPs into bins based on their MAFs so that the correct thresholds can be applied to each possible pair. After disregarding rare variants with MAF < 0.05, they categorize all common SNPs into nine bins according to their MAF, with step size 0.05. Each pair of SNPs has two thresholds, one for each SNP in the first stage. The first-stage thresholds are precomputed for each combination of two MAFs so that the power loss is 1% while the cost savings remain high. To implement the first stage efficiently, the SNPs within each bin are sorted by their association statistics, and binary search rapidly retrieves the set of SNPs above a given threshold.
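The binning and binary search step might look roughly like the following. This is a minimal sketch under our own assumptions, not the published implementation: the helper names (maf_bin, build_index, snps_above) and the input layout are hypothetical, and the precomputed threshold table for each pair of bins is assumed to exist elsewhere.

```python
import bisect
import math
from collections import defaultdict

N_BINS = 9  # MAF 0.05 to 0.50 in steps of 0.05


def maf_bin(maf):
    """Map a MAF to a bin index 0..8, or None for rare variants (MAF < 0.05)."""
    # The small epsilon guards against floating-point error at bin edges.
    idx = math.floor(maf * 20 + 1e-9) - 1
    if idx < 0:
        return None
    return min(idx, N_BINS - 1)  # MAF of exactly 0.50 goes in the last bin


def build_index(snps):
    """snps: iterable of (snp_id, maf, statistic).

    Returns a dict: bin -> (ascending |statistic| list, SNP ids in the same order).
    """
    by_bin = defaultdict(list)
    for snp_id, maf, stat in snps:
        b = maf_bin(maf)
        if b is not None:
            by_bin[b].append((abs(stat), snp_id))
    index = {}
    for b, items in by_bin.items():
        items.sort()
        index[b] = ([s for s, _ in items], [i for _, i in items])
    return index


def snps_above(index, b, threshold):
    """All SNPs in bin b whose |statistic| is at least threshold."""
    stats, ids = index.get(b, ([], []))
    # Binary search finds the cut point in O(log n); everything after it passes.
    return ids[bisect.bisect_left(stats, threshold):]
```

For a pair of bins (i, j), one would look up the two precomputed thresholds and call snps_above once per bin, then pair the two resulting sets in the second stage.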

Read our full paper here:

Wang, Zhanyong; Sul, Jae Hoon; Snir, Sagi; Lozano, Jose A; Eskin, Eleazar

Gene-Gene Interactions Detection Using a Two-stage Model. Journal Article

In: J Comput Biol, 22 (6), pp. 563-76, 2015, ISSN: 1557-8666.

Genes, Environments and Meta-Analysis

Figure 1. Application of Meta-GxE to Apoa2 locus. The forest plot (A) shows heterogeneity in the effect sizes across different studies. The PM-plot (B) predicts that 7 studies have an effect at this locus, even though only 1 study (HMDP-chow(M)) is genome-wide significant with P-value. doi:10.1371/journal.pgen.1004022.g001

It is well known that both genetic factors and environmental factors contribute to traits and specifically disease risk. In addition, an area of great interest in the research community is the interaction between genetic factors and environmental factors and their contribution to disease risk and other traits. Genetic variants that are involved in gene-by-environment interactions (denoted GxE) have a different effect on the trait depending on the environment. For example, some variants can have an effect on cholesterol levels only in the presence of a high-fat diet. Discovering variants involved in GxE has been tremendously difficult: even though thousands of variants have been implicated in disease-related traits using genome-wide association studies, very few variants have been implicated in GxEs. Part of the difficulty in detecting GxEs is that the traditional approach requires analyzing studies which contain individuals in multiple environments.

We have recently published a paper with the A. Jake Lusis group in PLoS Genetics which presents a novel approach to discovering GxEs. In our approach, many different studies, each of which was performed in a different environment, are combined to identify GxEs. The key idea is that if variants have a different genetic effect in different environments, then these variants are candidates for being involved in GxEs. Combining studies together is a statistical technique called meta-analysis, which has been a major focus of our lab over the past few years. We show in the paper that, mathematically, searching for GxEs using the traditional approach and a type of meta-analysis framework called the random effects model (21565292) are very closely related.

We applied our approach to identify GxEs affecting mouse HDL cholesterol by combining 17 mouse studies collected by A. Jake Lusis’ group containing almost 5,000 animals. Our approach discovered 26 loci involved in HDL, many of which appear to be involved in GxE. Virtually none of these loci had previously been discovered in any of the individual studies, but many of them map to genes known to affect HDL. Our approach also includes a visualization framework called a PM-plot, which helps interpret the associated loci and identify GxE interactions (22396665).

From the paper:

Discovering environmentally-specific loci using meta-analysis
The Meta-GxE strategy uses a meta-analytic approach to identify gene-by-environment interactions by combining studies that collect the same phenotype under different conditions. Our method consists of four steps. First, we apply a random effects model meta-analysis (RE) to identify loci associated with a trait considering all of the studies together. The RE method explicitly models the fact that loci may have different effects in different studies due to gene-by-environment interactions. Second, we apply a heterogeneity test to identify loci with significant gene-by-environment interactions. Third, we compute the m-value of each study to identify in which studies a given variant has an effect and in which it does not. Fourth, we visualize the result through a forest plot and PM-plot to understand the underlying nature of gene-by-environment interactions.
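As an illustration of the first two steps (this sketch is ours and is not taken from the paper), the following computes a textbook random effects estimate together with Cochran's Q heterogeneity statistic from per-study effect sizes and variances. It uses the classical DerSimonian-Laird estimator of the between-study variance; the RE method and heterogeneity test used in the paper may differ, and the function name and return layout are hypothetical.

```python
import math


def random_effects_meta(effects, variances):
    """Textbook DerSimonian-Laird random effects meta-analysis.

    effects   : per-study effect size estimates
    variances : per-study variances (squared standard errors)
    Returns the pooled effect, its standard error, Cochran's Q with its
    degrees of freedom, and the estimated between-study variance tau2.
    """
    weights = [1.0 / v for v in variances]
    w_sum = sum(weights)

    # Fixed effects (inverse-variance weighted) mean, needed to compute Q.
    fixed = sum(w * b for w, b in zip(weights, effects)) / w_sum

    # Cochran's Q: weighted squared deviations from the fixed effects mean.
    # Large Q relative to df suggests heterogeneity between studies.
    q = sum(w * (b - fixed) ** 2 for w, b in zip(weights, effects))
    df = len(effects) - 1

    # Method-of-moments estimate of the between-study variance.
    c = w_sum - sum(w * w for w in weights) / w_sum
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random effects weights add tau2 to each study's variance.
    re_weights = [1.0 / (v + tau2) for v in variances]
    re_sum = sum(re_weights)
    pooled = sum(w * b for w, b in zip(re_weights, effects)) / re_sum
    return {"pooled": pooled, "se": math.sqrt(1.0 / re_sum),
            "q": q, "df": df, "tau2": tau2}
```

Under the null hypothesis of no heterogeneity, Q is approximately chi-square distributed with df degrees of freedom, so a P-value can be obtained from any chi-square survival function.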
We illustrate our methodology by examining a well-known region on mouse chromosome 1 harboring the Apoa2 gene, which is known to be strongly associated with HDL cholesterol (8332912). Figure 1 shows the results of applying our method to this locus. We first compute the effect size and its standard deviation for each of the 17 studies. These results are shown as a forest plot in Figure 1 (a). Second, we compute the P-value for each individual study, also shown in Figure 1 (a). If we were to follow traditional methodology and evaluate each study separately, we would declare an effect present in a study if the P-value exceeds a predefined genome-wide significance threshold (P < 1.0×10^-6). In this case, we would only identify the locus as associated in a single study, HMDP-chow(M) (P = 6.84×10^-9). On the other hand, in our approach, we combine all studies to compute a single P-value for each locus taking into account heterogeneity between studies. This approach leads to increased power over the simple approach considering each study separately. The combined meta P-value for the Apoa2 locus is very significant (4.41×10^-22), which is consistent with the fact that the largest individual study only has 749 animals compared to 4,965 in our combined study.
We visualize the results through a PM-plot, in which P-values are simultaneously visualized with the m-values, which estimate the posterior probability of an effect being present in a study given the observations from all other studies, at each tested locus. These plots allow us to identify in which studies a given variant has an effect and in which it does not. M-values for a given variant have the following interpretation: a study with a small m-value (≤ 0.1) is predicted not to be affected by the variant, while a study with a large m-value (≥ 0.9) is predicted to be affected by the variant.
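The interpretation rule for m-values is simple enough to write down directly. The helper below is our own illustration (the function name and labels are hypothetical) and only applies the two cutoffs quoted above; it does not compute the m-values themselves.

```python
def classify_m_value(m, low=0.1, high=0.9):
    """Label a study by its m-value using the cutoffs quoted in the text."""
    if not 0.0 <= m <= 1.0:
        raise ValueError("an m-value is a probability and must lie in [0, 1]")
    if m <= low:
        return "no effect"   # predicted not to be affected by the variant
    if m >= high:
        return "effect"      # predicted to be affected by the variant
    return "ambiguous"       # the data do not settle the question
```

Studies falling between the two cutoffs are the ones whose prediction remains ambiguous.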
The PM-plot for the Apoa2 locus is shown in Figure 1 (b). If we only look at the separate study P-values (y-axis), we can conclude that this locus only has an effect in HMDP-chow(M). However, if we look at m-value (x-axis), then we find 8 studies (HMDPxB-ath(M), HMDPxB-ath(F), HMDP-chow(M), HMDP-fat(M), HMDP-fat(F), BxD-db-5(M), BxH-apoe(M), BxH-apoe(F)), where we predict that the variation has an effect, while in 3 studies (BxD-db-12(F), BxD-db-5(F), BxH-wt(M)) we predict there is no effect. The predictions for the remaining 6 studies are ambiguous.
From Figure 1, we observe that differences in effect sizes among the studies are remarkably consistent when considering the environmental factors of each study as described in Table 1. For example, when comparing study 1 – 4, the effect size of the locus decreases in both the male and female HMDPxB studies in the chow diet (chow study) relative to the fat diet (ath study). Thus we can see that when the mice have the Leiden/CETP transgene, which causes high total cholesterol and high LDL cholesterol levels, the effect size of this locus on the HDL cholesterol level in blood is affected by the fat level of the diet. Similarly, when comparing study 12 – 15, the knockout of the Apoe gene affects the effect sizes for both male and female BxH crosses. However, in the BxD cross (study 8 – 11), where each animal is homozygous for a mutation causing a deficiency of the leptin receptor, the effect of the locus is very strong in the young male animals, while as animals get older and become fatter, the effect becomes weaker. In the case of female mice, however, the effect of the locus is nearly absent at both 5 and 12 weeks of age. Thus we can see that sex plays an important role in affecting HDL when the leptin receptor activity is deficient.

The full citation of our paper is:

Kang, Eun Yong; Han, Buhm; Furlotte, Nicholas; Joo, Jong Wha J; Shih, Diana; Davis, Richard C; Lusis, Aldons J; Eskin, Eleazar

Meta-Analysis Identifies Gene-by-Environment Interactions as Demonstrated in a Study of 4,965 Mice Journal Article

In: PLoS Genet, 10 (1), pp. e1004022, 2014, ISSN: 1553-7404.
