Gene-Gene Interactions Detection Using a Two-stage Model

Jerry Wang and Jae Hoon Sul, two lab alumni, published a paper introducing new software that uses a two-stage model to detect associations between traits and pairs of SNPs: the threshold-based efficient pairwise association approach (TEPAA). The method is significantly faster than the traditional approach of performing an association test on all pairs of SNPs. In the first stage, the method performs the single-marker test on all individual SNPs and selects the subset of SNPs that exceed a SNP-specific, predetermined significance threshold for further consideration. In the second stage, the SNPs selected in the first stage are paired with each other, and the pairwise association test is performed on those pairs.
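The two stages described above can be sketched in a few lines of Python. This is only an illustration, not the published TEPAA software: the correlation-based statistic, the product-of-genotypes interaction test, the single shared first-stage threshold t1 (the method itself uses SNP-specific thresholds), and the names z_score and two_stage_scan are all assumptions made for the example.

```python
import math
from itertools import combinations


def z_score(x, y):
    """Stand-in association statistic: Pearson correlation times sqrt(n)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no association signal
    return sxy / math.sqrt(sxx * syy) * math.sqrt(n)


def two_stage_scan(genotypes, phenotype, t1, t2):
    """genotypes: dict of SNP id -> list of allele counts (0, 1, 2).

    t1 and t2 are hypothetical thresholds on the absolute statistic.
    """
    # Stage 1: single-marker test on every SNP, keep those above t1.
    single = {snp: z_score(g, phenotype) for snp, g in genotypes.items()}
    survivors = [snp for snp, z in single.items() if abs(z) > t1]

    # Stage 2: pairwise test only on pairs of surviving SNPs.
    hits = []
    for a, b in combinations(survivors, 2):
        product = [u * v for u, v in zip(genotypes[a], genotypes[b])]
        z = z_score(product, phenotype)
        if abs(z) > t2:
            hits.append((a, b, z))
    return survivors, hits
```

With m SNPs, the brute force approach tests m(m-1)/2 pairs, while the second stage here tests only the pairs formed from the survivors of the first stage, which is where the savings come from.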
The key insight of the approach is the derivation of the joint distribution between the association statistics of single SNPs and the association statistics of pairs of SNPs. This joint distribution provides guarantees that the statistical power of the two-stage approach closely approximates that of the brute force approach, because it makes it possible to accurately compute the analytical power of the two-stage model and compare it to the power of the brute force approach (see the figure). Hence, the method chooses as few SNPs as possible in the first stage while achieving almost the same power as the brute force approach.
The power loss region of the threshold-based efficient pairwise association approach (TEPAA). The contour lines represent the probability density function of the multivariate normal distribution (MVN). T1 is the threshold for the first stage. Any SNP with a higher significance than T1 will be passed on to the second stage. T2 is the threshold for significance of the pairwise test. The area surrounded by the red rectangle corresponds to the power loss region.
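One way to read the figure is that power is lost whenever the pairwise statistic exceeds T2 but the single-SNP statistic falls below T1, so the pair never reaches the second stage. The sketch below estimates that probability by simulation rather than by the analytical computation used in the paper. It assumes a simplified setting with one single-SNP statistic and one pairwise statistic that are jointly normal with unit variances; the means mu1 and mu2, the correlation rho, the one-sided thresholds, and the function name estimate_power are all hypothetical.

```python
import math
import random


def estimate_power(mu1, mu2, rho, t1, t2, draws=100_000, seed=0):
    """Monte Carlo estimate of brute force power, two-stage power, and loss.

    (s1, s2) is drawn from a bivariate normal with means (mu1, mu2),
    unit variances, and correlation rho (all assumed for illustration).
    """
    rng = random.Random(seed)
    brute = two_stage = 0
    for _ in range(draws):
        e1 = rng.gauss(0.0, 1.0)
        e2 = rng.gauss(0.0, 1.0)
        s1 = mu1 + e1
        # Mixing e1 into s2 gives the pair correlation rho.
        s2 = mu2 + rho * e1 + math.sqrt(1.0 - rho ** 2) * e2
        if s2 > t2:            # brute force would detect this pair
            brute += 1
            if s1 > t1:        # the pair also survives the first stage
                two_stage += 1
    brute_power = brute / draws
    two_stage_power = two_stage / draws
    # Loss is the mass with s2 > t2 but s1 <= t1.
    return brute_power, two_stage_power, brute_power - two_stage_power
```

Lowering t1 shrinks the loss at the price of passing more SNPs to the second stage, which is the trade-off the first-stage threshold has to balance.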

Jerry and Jae Hoon demonstrate the utility of TEPAA by applying it to the Northern Finland Birth Cohort (Rantakallio, 1969; Jarvelin et al., 2004). From their analysis, they observe that the thresholds that control the power loss of the two-stage approach depend on the minor allele frequency (MAF) of the SNPs. In particular, more common SNPs can be filtered out with less significant thresholds than rare SNPs. To implement TEPAA efficiently with MAF-dependent thresholds, they group the SNPs into bins based on their MAFs so that the correct thresholds can be applied to each possible pair. After disregarding rare variants with MAF < 0.05, they categorize all common SNPs into nine bins according to their MAF, with a step size of 0.05. Each pair of SNPs has two thresholds, one for each SNP in the first stage. The first-stage thresholds are precomputed for each combination of two MAF bins so that power loss is held to 1% while still achieving high cost savings. To implement the first stage efficiently, the SNPs within each bin are sorted by their association statistics, and binary search is used to rapidly obtain the set of SNPs above a given threshold.
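The binning and binary search described above might look roughly like the following. This is a sketch under stated assumptions rather than the authors' implementation: the function names, the use of absolute statistics, and the layout of the thresholds table (a dictionary keyed by a pair of bin numbers) are hypothetical, and any threshold values passed in are made up by the caller.

```python
import bisect
from itertools import combinations

MIN_MAF = 0.05    # variants below this MAF are disregarded
BIN_WIDTH = 0.05  # step size between bins
NUM_BINS = 9      # covers MAF from 0.05 up to 0.5


def maf_bin(maf):
    """Map a MAF in [0.05, 0.5] to a bin number 0..8."""
    # The small epsilon guards against floating-point error at bin edges.
    index = int((maf - MIN_MAF) / BIN_WIDTH + 1e-9)
    return min(index, NUM_BINS - 1)


def build_index(snps):
    """snps: iterable of (snp_id, maf, statistic) tuples.

    Returns, per bin, a pair of parallel lists (stats, ids) sorted by
    ascending absolute statistic.
    """
    bins = [[] for _ in range(NUM_BINS)]
    for snp_id, maf, stat in snps:
        if maf < MIN_MAF:
            continue
        bins[maf_bin(maf)].append((abs(stat), snp_id))
    index = []
    for entries in bins:
        entries.sort()
        index.append(([s for s, _ in entries], [i for _, i in entries]))
    return index


def snps_above(index, bin_number, threshold):
    """Binary search for the SNPs in one bin whose statistic exceeds threshold."""
    stats, ids = index[bin_number]
    start = bisect.bisect_right(stats, threshold)
    return ids[start:]


def candidate_pairs(index, thresholds, bin_a, bin_b):
    """thresholds[(bin_a, bin_b)] holds one first-stage threshold per SNP."""
    t_a, t_b = thresholds[(bin_a, bin_b)]
    left = snps_above(index, bin_a, t_a)
    if bin_a == bin_b:
        # Within one bin, assume both thresholds are equal and avoid duplicates.
        return list(combinations(left, 2))
    right = snps_above(index, bin_b, t_b)
    return [(a, b) for a in left for b in right]
```

Because each bin is sorted once, every lookup for a given bin and threshold costs a single binary search instead of a scan over all SNPs in the bin.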

Thesis Defense: Dr. Zhanyong (Jerry) Wang

Jerry Wang defended his thesis on September 8, 2014 in 4760 Boelter Hall.

His thesis topic was "Efficient Statistical Models for Detection and Analysis of Human Genetic Variations." The video of his full defense can be viewed on the ZarlabUCLA YouTube page.

Abstract: 

In recent years, the advent of genotyping and sequencing technologies has enabled human genetics to discover numerous genetic variants. Genetic variations between individuals can range from Single Nucleotide Polymorphisms (SNPs) to differences in large segments of DNA, which are referred to as Structural Variations (SVs), including insertions, deletions, and copy number variations (CNVs).

Jerry first proposed a probabilistic model, CNVeM, to detect CNVs from High-Throughput Sequencing (HTS) data. Experiments showed that CNVeM can estimate the copy numbers and boundaries of copied regions more precisely than previous methods.

Genome-wide association studies (GWAS) have discovered numerous individual SNPs involved in genetic traits. However, it is likely that complex traits are influenced by interactions among multiple SNPs. In his thesis, Jerry proposed a two-stage statistical model, TEPAA, that greatly reduces computational time while maintaining almost identical power to the brute force approach, which considers all combinations of SNP interactions. An experiment on the Northern Finland Birth Cohort data showed that TEPAA achieved a 63-fold speedup.

Another drawback of GWAS is that rare causal variants will not be identified. Rare causal variants are likely to have been introduced into a population recently and are likely to lie in shared Identity-By-Descent (IBD) segments. Jerry proposed a new test statistic to detect IBD segments associated with quantitative traits and made a connection between the proposed statistic and linear models, so that it does not require permutations to assess the significance of an association. In addition, the method can control for population structure by utilizing linear mixed models.
