Gene-Gene Interactions Detection Using a Two-stage Model

Jerry Wang and Jae Hoon Sul, two lab alumni, published a paper introducing a new two-stage method, the threshold-based efficient pairwise association approach (TEPAA), for detecting associations between traits and pairs of SNPs. The method is significantly faster than the traditional approach of performing an association test on all pairs of SNPs. In the first stage, the method performs the single-marker test on all individual SNPs and selects the subset of SNPs that exceed a predetermined, SNP-specific significance threshold for further consideration. In the second stage, the SNPs selected in the first stage are paired with each other, and the pairwise association test is performed on those pairs only.
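To make the two stages concrete, here is a minimal Python sketch. It is not the published TEPAA software: the function name, the toy statistics, and the use of a single first-stage threshold `t1` for every SNP (the real method uses SNP-specific thresholds) are all simplifying assumptions made for illustration.

```python
from itertools import combinations


def two_stage_pairs(single_stats, pair_stat, t1, t2):
    """Hypothetical two-stage filter.

    single_stats: dict mapping SNP name -> single-marker statistic.
    pair_stat:    function (snp_a, snp_b) -> pairwise statistic.
    t1, t2:       first-stage and second-stage thresholds (assumed values).
    """
    # Stage 1: keep only SNPs whose single-marker statistic exceeds t1.
    survivors = [snp for snp, stat in single_stats.items() if abs(stat) > t1]

    # Stage 2: run the pairwise test only on pairs of survivors.
    hits = []
    for a, b in combinations(sorted(survivors), 2):
        if abs(pair_stat(a, b)) > t2:
            hits.append((a, b))
    return hits


# Toy data, invented for this example.
toy_stats = {"rs1": 3.1, "rs2": 0.4, "rs3": -2.8, "rs4": 2.5}


def toy_pair_stat(a, b):
    # Stand-in for a real pairwise association test.
    return toy_stats[a] + toy_stats[b]


print(two_stage_pairs(toy_stats, toy_pair_stat, t1=2.0, t2=5.0))
# prints [('rs1', 'rs4')]
```

In this toy run only three SNPs survive the first stage, so three pairwise tests are performed instead of the six that testing every pair would require.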
The key insight of the approach is that the joint distribution of the single-SNP association statistics and the pairwise association statistics can be derived. This joint distribution guarantees that the statistical power of the two-stage approach closely approximates that of the brute-force approach: we can accurately compute the analytical power of our two-stage model and compare it to the power of the brute-force approach (see the figure). Hence, the method chooses as few SNPs as possible in the first stage while achieving almost the same power as the brute-force approach.
The power loss region of the threshold-based efficient pairwise association approach (TEPAA). The contour lines represent the probability density function of the multivariate normal distribution (MVN). T1 is the threshold for the first stage. Any SNP with a higher significance than T1 will be passed on to the second stage. T2 is the threshold for significance of the pairwise test. The area surrounded by the red rectangle corresponds to the power loss region.
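The power loss region in the figure can be illustrated numerically. The sketch below is a simplification rather than the paper's analytical computation: it assumes one first-stage statistic and one pairwise statistic that follow a bivariate normal distribution with unit variances, and it estimates the probabilities by Monte Carlo sampling. The function name, the means `mu1` and `mu2`, the correlation `rho`, and the one-sided thresholds are all assumptions.

```python
import math
import random


def estimate_power(mu1, mu2, rho, t1, t2, n=100_000, seed=0):
    """Monte Carlo estimate of (brute-force power, two-stage power, power loss).

    s1 is the single-SNP statistic and s2 the pairwise statistic, assumed
    jointly normal with means mu1, mu2, unit variances, and correlation rho.
    """
    rng = random.Random(seed)
    brute = 0      # pair is significant: s2 > t2
    two_stage = 0  # SNP passes stage 1 AND pair is significant
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        s1 = mu1 + z1
        # Build s2 so that corr(s1, s2) == rho.
        s2 = mu2 + rho * z1 + math.sqrt(1.0 - rho * rho) * z2
        if s2 > t2:
            brute += 1
            if s1 > t1:
                two_stage += 1
    # Power loss: s2 > t2 but s1 <= t1 (the red rectangle in the figure).
    return brute / n, two_stage / n, (brute - two_stage) / n


brute, two_stage, loss = estimate_power(mu1=2.0, mu2=5.0, rho=0.6, t1=1.0, t2=4.0)
print(f"brute-force power: {brute:.3f}")
print(f"two-stage power:   {two_stage:.3f}")
print(f"power loss:        {loss:.3f}")
```

Lowering `t1` shrinks the power loss but lets more SNPs through to the second stage, which is the trade-off the method's thresholds are chosen to balance.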


Jerry and Jae Hoon demonstrate the utility of TEPAA by applying it to the Northern Finland Birth Cohort (Rantakallio, 1969; Jarvelin et al., 2004). From their analysis, they observe that the thresholds that control the power loss of the two-stage approach depend on the minor allele frequency (MAF) of the SNPs. In particular, more common SNPs can be filtered out with less significant thresholds than rare SNPs. To implement TEPAA efficiently with MAF-dependent thresholds, they group the SNPs into bins based on their MAFs so that the correct thresholds can be applied to each possible pair. After disregarding rare variants with MAF < 0.05, they categorize all common SNPs into nine bins according to their MAF, with a step size of 0.05. Each pair of SNPs has two first-stage thresholds, one for each SNP. The first-stage thresholds are precomputed for each combination of two MAFs so as to limit the power loss to 1% while still achieving high cost savings. To implement the first stage efficiently, the SNPs within each bin are sorted by their association statistics, and binary search rapidly retrieves the set of SNPs above a given threshold.
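The binning and binary-search step can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names, the toy SNPs, and the threshold value are hypothetical, and the bin edges simply follow the nine MAF bins of width 0.05 described above.

```python
import bisect

# Upper edges separating the nine MAF bins: [0.05, 0.10), ..., [0.45, 0.50].
BIN_EDGES = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45]


def maf_bin(maf):
    """Return the bin index (0..8) for a common SNP with 0.05 <= maf <= 0.5."""
    return bisect.bisect_right(BIN_EDGES, maf)


def build_bins(snps):
    """snps: iterable of (name, maf, statistic) tuples.

    Returns nine (sorted_stats, names) pairs, one per MAF bin, with SNPs
    ordered by ascending statistic. Rare variants (MAF < 0.05) are skipped.
    """
    raw = [[] for _ in range(len(BIN_EDGES) + 1)]
    for name, maf, stat in snps:
        if maf < 0.05:
            continue  # rare variants are disregarded
        raw[maf_bin(maf)].append((stat, name))
    bins = []
    for entries in raw:
        entries.sort()
        bins.append(([s for s, _ in entries], [n for _, n in entries]))
    return bins


def snps_above(sorted_bin, threshold):
    """Binary search for all SNPs in one bin whose statistic exceeds threshold."""
    stats, names = sorted_bin
    return names[bisect.bisect_right(stats, threshold):]


# Toy SNPs, invented for this example.
toy_snps = [("rs1", 0.07, 1.2), ("rs2", 0.08, 3.4),
            ("rs3", 0.09, 2.6), ("rs4", 0.32, 4.0), ("rs5", 0.01, 9.9)]
bins = build_bins(toy_snps)
print(snps_above(bins[0], 2.0))
# prints ['rs3', 'rs2']
```

Because each bin is sorted once, looking up the SNPs that pass any of the precomputed thresholds for that bin costs only a logarithmic-time search plus the size of the result.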

Read our full paper here:

Wang, Zhanyong; Sul, Jae Hoon; Snir, Sagi; Lozano, Jose A; Eskin, Eleazar

Gene-Gene Interactions Detection Using a Two-stage Model. Journal Article

In: J Comput Biol, 22 (6), pp. 563-76, 2015, ISSN: 1557-8666.

Writing Tips: Methods Overview

What are the interesting computational ideas underlying a new computational method? What are the intuitions behind the method? How is the method related to other methods? These are the key questions that papers describing new computational methods should answer.
Unfortunately, most papers describing new computational methods don’t explicitly address these questions because of the constraints of journal styles. The Introduction of a methods paper often has only a few sentences about the method. The Methods section typically has many more details but very little discussion of the underlying ideas. Understanding what is interesting about a method is left completely to the reader’s imagination. Often, journals request that the Results section precede the Methods section, which makes the results very difficult to understand unless the reader reads the sections of the paper out of order. Authors can appeal to the journal to have the Methods section first, but this is not a good solution either, since the Methods contain many details, such as descriptions of the datasets, that take away from the flow of the paper.
To avoid these problems, in our papers we make the first subsection of the Results section a “Methods Overview.” In this section, we describe the method in terms of its high-level ideas and typically include, as a figure, a small example that we use to help the reader understand the method. The goal of this section is to give enough detail that readers can follow the rest of the Results section without having to look at the Methods section. A well-written Methods Overview also makes it much easier for the reader to follow the actual Methods section.
These sections and examples are designed to be self-contained and should be written in language appropriate for a general audience. In fact, some of our blog posts are almost verbatim copies of the Methods Overview sections of some of our recent papers. For example, see these blog posts on GRAT and Genome Reassembly.
Another way to think of what to put in the Methods Overview section is what you would explain in a talk about the method.  Often presentations on computational methods have excellent slides showing intuitions and very clear examples.  The place to put that kind of material is in the Methods Overview.  Remember, in your paper you must give a compelling argument as to WHY your method is interesting. If your readers don’t understand the intuitions underlying your work, they will never appreciate it.
I’m sure you may be asking, “Isn’t this a little redundant?” What I’m proposing here may be a bit repetitive, with a methods overview section and a methods section later in the paper.  But they serve different purposes.  With a well written Methods Overview section, a reader can stop after the Results section and understand most of your paper.  The Methods section then only becomes important for someone who wants to understand all of the details.

Writing Tips: Introduction

In this blog post, I would like to “introduce” you to our introduction style. Writing the introduction is the most daunting part of the paper writing process, especially for students who are not native English speakers. To help structure the introduction writing process, in our lab we have developed a standard style or template for writing introductions. Since the majority of the papers that we write describe new computational methods, many of our papers naturally fit into this style. We usually publish our papers in genetics journals, which have very high standards of writing and are read by researchers with a wide range of backgrounds. The difference between a paper getting accepted and rejected is often determined by the clarity of the writing.

Our introduction style is a very specific formula that works for us but obviously there are other ways to structure an introduction and each experienced writer will have their own style. However, the truth is, you NEVER start out as a good writer and new writers need to start somewhere. It takes practice, consistency and effort to write well. If you are a new writer apprehensive about writing an introduction, we hope that this structure can help you.

Our introductions are typically four paragraphs long with each paragraph serving a specific role:
1. Context – First, it is important to explain the context of the research topic. Why is the general topic important? What is happening in the field today that makes this a valid topic of research?
2. Problem – Secondly, you present the problem. We typically start this paragraph with a “However,” phrase. Simple example: We have this awesome discovery in XYZ… However, using former methods it will take us 10 years to run the data. Each sentence in this paragraph should have a negative tone.
3. Solution – By this point, your readers should sympathize with how terrible this problem is and how there MUST be a solution (maybe a little dramatic, but you get my point). Paragraph three always starts with “In this paper” and a description of what the paper proposes and how it solves the problem presented in the second paragraph.
4. Implication – The last paragraph in your introduction is the implication, which describes why your solution is important and how it moves the field forward. Typically, this paragraph is where you summarize the experimental results and how they demonstrate that the solution solves the problem. This paragraph should answer the reader’s question of “so what?”.

An example of the four-paragraph introduction style is in the following paper:

Mangul, Serghei; Wu, Nicholas C; Mancuso, Nicholas; Zelikovsky, Alex; Sun, Ren; Eskin, Eleazar

Accurate viral population assembly from ultra-deep sequencing data. Journal Article

In: Bioinformatics, 30 (12), pp. i329-i337, 2014, ISSN: 1367-4811.

Most of our other papers in their final form do not follow this format exactly.  But many of them in earlier drafts used this template and then during the revision process, added a paragraph or two expanding one of the paragraphs in the template.  For example, this paper expanded the implication to two paragraphs:

Kang, Eun Yong; Han, Buhm; Furlotte, Nicholas; Joo, Jong Wha J; Shih, Diana; Davis, Richard C; Lusis, Aldons J; Eskin, Eleazar

Meta-Analysis Identifies Gene-by-Environment Interactions as Demonstrated in a Study of 4,965 Mice Journal Article

In: PLoS Genet, 10 (1), pp. e1004022, 2014, ISSN: 1553-7404.

and this paper expanded both the context and problem to two paragraphs each:

Sul, Jae Hoon; Han, Buhm ; Ye, Chun ; Choi, Ted ; Eskin, Eleazar

Effectively Identifying eQTLs from Multiple Tissues by Combining Mixed Model and Meta-analytic Approaches Journal Article

In: PLoS Genet, 9 (6), pp. e1003491, 2013, ISSN: 1553-7404.

For methods papers, sometimes what we are proposing is an incremental improvement over another solution. In this case, moving from the context to the problem is very difficult without explaining the other solution. For this scenario, we suggest the following six-paragraph structure:
1. Context
2. Problem 1 (the BIG problem)
3. Solution 1 (the previous method)
4. Problem 2 (Why does the previous method fall short?)
5. Solution 2 (“In this paper” you are going to improve Solution 1)
6. Implication

An example of a six-paragraph introduction in which the third and fourth paragraphs were merged is:

Furlotte, Nicholas A; Kang, Eun Yong; Nas, Atila Van; Farber, Charles R; Lusis, Aldons J; Eskin, Eleazar

Increasing Association Mapping Power and Resolution in Mouse Genetic Studies Through the Use of Meta-analysis for Structured Populations. Journal Article

In: Genetics, 191 (3), pp. 959-67, 2012, ISSN: 1943-2631.

There it is… the beginning to a great paper (at least we like to think so!). Will this work for you? Have other ideas? Let us know in the comments below!