Dear Colleagues,

I am happy to announce the UCLA Computational Genomics Summer Institute, which is a new National Institutes of Health funded program at UCLA jointly hosted with the Institute of Pure and Applied Mathematics (IPAM). The program will take place each summer for one month. The dates for 2016 are July 18th – August 12th.

The program focuses on providing training in methodology development for genomics. We hope that it will be of interest to researchers at all levels. Our program builds upon a successful program hosted by IPAM in 2011 on “Mathematical and Computational Approaches in High Throughput Biology.” IPAM is a national math institute funded by the National Science Foundation.

The program consists of two parts. The first part (July 18th – July 22nd) is the Short Program, which takes the format of a short course consisting of lectures from leading researchers in computational genomics. The Short Program is appropriate for researchers at all levels, including both researchers actively involved in methodology development and researchers who want to incorporate a methodology development aspect into their research program.

The second part (July 21st – August 12th) is the Long Program, which is a continuation of the Short Program. The program is in the style of a typical long program hosted at IPAM, where participants have the opportunity to interact and collaborate with each other as well as with the leading researchers who will serve as program faculty. The program is targeted at senior trainees, such as senior students and post-docs, through established researchers.

Researchers at all levels — students, post-docs, staff researchers, as well as junior and senior faculty — are encouraged to participate in the program. Funding is available to support faculty and participant costs during the program. Because space is limited in the program, we are requiring interested participants and potential program faculty to apply as soon as possible.

Application materials are available on the program website (http://computationalgenomics.bioinformatics.ucla.edu). For questions about the program, interested individuals should email uclacgsi@gmail.com.

The UCLA CGSI Organizing Committee
Eleazar Eskin, UCLA, CGSI Director
Russel Caflisch, UCLA, IPAM Director
Eran Halperin, Tel Aviv University
John Novembre, University of Chicago
Ben Raphael, Brown University


A couple of years ago I was asked to write a review article on the progress of my field (computational genetics) targeted toward computer scientists. My article “Discovering Genes Involved in Disease and the Mystery of Missing Heritability” was just published on the cover of the Communications of the ACM. This article is written as an introduction to the field and describes the rapid progress over the past decade in terms of the discovery of a large number of variants involved in common human diseases. The article assumes no background in biology and is designed to be accessible to researchers and students outside the field. I hope that it will encourage other computational researchers to get involved in genetics. The journal also made a video highlighting this article, which is available here:

Discovering Genes Involved in Disease and the Mystery of Missing Heritability from CACM on Vimeo.

The full citation to the article is:
Eskin, Eleazar (2015): Discovering Genes Involved in Disease and the Mystery of Missing Heritability. In: Commun. ACM, 58 (10), pp. 80-87, 2015, ISSN: 0001-0782.

Although genome-wide association studies (GWAS) have identified thousands of variants associated with common diseases and complex traits, only a handful of these variants are validated to be causal. We consider ‘causal variants’ as variants which are responsible for the association signal at a locus. As opposed to association studies that benefit from linkage disequilibrium (LD), the main challenge in identifying causal variants at associated loci lies in distinguishing among the many closely correlated variants due to LD. This is particularly important for model organisms such as inbred mice, where LD extends much further than in human populations, resulting in large stretches of the genome with significantly associated variants. Furthermore, these model organisms are highly structured and require correction for population structure to remove potential spurious associations.

In our recently published work, we propose CAVIAR-Gene (CAusal Variants Identification in Associated Regions), a novel method that is able to operate across large LD regions of the genome while also correcting for population structure. A key feature of our approach is that it provides as output a minimally sized set of genes that captures the genes harboring causal variants with probability q. Through extensive simulations, we demonstrate that our method not only speeds up computation but also achieves an average of 10% higher recall than existing approaches. We validate our method using real mouse high-density lipoprotein (HDL) data and show that CAVIAR-Gene is able to identify Apoa2 (a gene known to harbor causal variants for HDL) while reducing the number of genes that need to be tested for functionality by a factor of 2.

In the context of association studies, the genetic variants which are responsible for the association signal at a locus are referred to in the genetics literature as the ‘causal variants.’ Causal variants have a biological effect on the phenotype.

CAVIAR-Gene provides better ranking of the causal genes for Outbred, F2, and HMDP datasets. Panels a and b illustrate the results for Outbred genotypes for the cases of one and two causal genes, respectively. Panels c and d illustrate the results for F2 genotypes for the cases of one and two causal genes, respectively. Panels e and f illustrate the results for HMDP genotypes for the cases of one and two causal genes, respectively.


Generally, variants can be categorized into three main groups. The first group is the causal variants which have a biological effect on the phenotype and are responsible for the association signal. The second group is the variants which are statistically associated with the phenotype due to LD with a causal variant. Even though association tests for these variants may be statistically significant, under our definition, they are not causal variants. The third group is the variants which are not statistically associated with the phenotype and are not causal.

CAVIAR-Gene is a statistical method for fine mapping that addresses two main limitations of existing methods. First, as opposed to existing approaches that focus on individual variants, we propose to search only over the space of gene combinations that explain the statistical association signal, and thus drastically reduce runtime. Second, CAVIAR-Gene extends the existing framework for fine mapping to account for population structure. The output of our approach is a minimal set of genes that will contain the true causal gene at a pre-specified significance level. This gene set, together with each gene’s individual probability of causality, provides a natural way of prioritizing genes for functional testing (e.g., knockout strategies) in model organisms. Through extensive simulations, we demonstrate that CAVIAR-Gene is superior to existing methodologies, requiring the smallest set of genes to follow up in order to capture the true causal gene(s).

Building off our previous work with CAVIAR, CAVIAR-Gene takes as input the marginal statistics for each variant at a locus, an LD matrix consisting of pairwise Pearson correlations computed between the genotypes of pairs of genetic variants, a partitioning of the set of variants in a locus into genes, and a kinship matrix which indicates the genetic similarity between each pair of individuals. Marginal statistics are computed using methods that correct for population structure. We consider a variant to be causal when the variant is responsible for the association signal at a locus, and we aim to discriminate these variants from ones that are correlated with them due to LD.
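The LD-matrix input described above is simply the matrix of pairwise Pearson correlations between genotype vectors. A minimal numpy sketch (the function name and toy genotypes are our own illustration, not code from CAVIAR-Gene):

```python
import numpy as np

def ld_matrix(genotypes):
    """Pairwise Pearson correlations between variant genotype vectors.

    genotypes: (n_individuals, n_variants) array of 0/1/2 allele counts.
    Returns an (n_variants, n_variants) LD matrix.
    """
    # np.corrcoef treats rows as variables, so transpose to put
    # variants in rows.
    return np.corrcoef(genotypes.T)

# Toy data: 6 individuals, 3 variants; the first two variants are
# identical, so their correlation (LD) is ~1.
g = np.array([
    [0, 0, 2],
    [1, 1, 1],
    [2, 2, 0],
    [0, 0, 1],
    [1, 1, 2],
    [2, 2, 0],
])
R = ld_matrix(g)
print(R[0, 1])  # ~1.0: perfect LD between the duplicated variants
```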

In model organisms, the large stretches of LD result in a large number of associated variants in each region, making CAVIAR computationally infeasible. Instead of producing a ρ causal set of SNPs, CAVIAR-Gene detects a ‘q causal gene set’: a set of genes in the locus that will contain the actual causal genes with probability at least q.
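As a toy illustration of the ‘q causal gene set’ idea (a deliberate simplification of the actual CAVIAR-Gene inference: we assume per-gene posterior probabilities of causality have already been computed and sum to one for a single-causal-gene locus), one can greedily add the highest-posterior genes until the cumulative probability reaches q:

```python
def q_causal_gene_set(posteriors, q=0.9):
    """Smallest set of genes whose cumulative posterior probability of
    containing the causal gene is at least q.

    posteriors: dict mapping gene name -> posterior probability of
    harboring the causal variant (hypothetical values below).
    """
    selected, total = [], 0.0
    # Take genes in decreasing order of posterior probability.
    for gene, p in sorted(posteriors.items(), key=lambda kv: -kv[1]):
        selected.append(gene)
        total += p
        if total >= q:
            break
    return selected

# Hypothetical HDL locus where Apoa2 dominates the posterior;
# the other gene names and probabilities are invented.
post = {"Apoa2": 0.62, "GeneB": 0.21, "GeneC": 0.09,
        "GeneD": 0.05, "GeneE": 0.03}
print(q_causal_gene_set(post, q=0.9))  # ['Apoa2', 'GeneB', 'GeneC']
```

With q = 0.9 the first three genes suffice (0.62 + 0.21 + 0.09 ≥ 0.9), so only three of five genes need functional follow-up.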

For further details of our new method, CAVIAR-Gene, read our full paper here:

Studies carried out over the last decade have revealed that gut microbiota contribute to a variety of common disorders, including obesity and diabetes (Musso et al. 2011), colitis (Devkota et al. 2012), atherosclerosis (Wang et al. 2011), rheumatoid arthritis (Vaahtovuo et al. 2008), and cancer (Yoshimoto et al. 2013). The evidence for metabolic interactions is particularly strong, as a large body of data now supports the conclusion that gut microbiota influence the energy harvest from dietary components, particularly complex carbohydrates, and that metabolites such as the short chain fatty acids produced by gut bacteria can perturb metabolic traits, including adiposity and insulin resistance (Turnbaugh et al. 2006; Backhed et al. 2007; Wen et al. 2008; Turnbaugh et al. 2009; Ridaura et al. 2013).

Gut microbiota communities are assembled anew in each generation, influenced by maternal seeding, environmental factors, host genetics, and age, resulting in substantial variation in composition among individuals in human populations (Eckburg et al. 2005; Costello et al. 2009; Huttenhower and Consortium 2012; Goodrich et al. 2014). Most experimental studies of host-gut microbiota interactions have employed large perturbations, such as comparisons of germ-free versus conventional mice, and the significance of common variations in gut microbiota composition for disease susceptibility is still poorly understood. Furthermore, while studies with germ-free mice have clearly implicated microbiota in clinically relevant traits, it has proven difficult to identify the responsible taxa of bacteria.

We now report a population-based analysis of host-gut microbiota interactions in the mouse. One of the issues we explore is the role of host genetics. Although some evidence is consistent with significant heritability of gut microbiota composition, the extent to which the host controls microbiota composition under controlled environmental conditions is unclear. We also examine the role of common variations in gut microbiota in metabolic traits such as obesity and insulin resistance. We performed our study using a resource termed the Hybrid Mouse Diversity Panel (HMDP), consisting of about 100 inbred strains of mice that have been either sequenced or subjected to high-density genotyping (Bennett et al. 2010). The resource has several advantages for genetic analysis as compared to traditional genetic crosses. First, it allows high-resolution mapping by association rather than linkage analysis, and it has now been used for the identification of a number of novel genes underlying complex traits (Farber et al. 2011; Lavinsky et al. 2015; Parks et al. 2015; Rau et al. 2015). Second, since the strains are permanent, the data from separate studies can be integrated, allowing the development of large, publicly available databases of physiological and molecular traits relevant to a variety of clinical disorders (systems.genetics.ucla.edu and phenome.jax.org). Third, the panel is ideal for examining gene-by-environment interactions, since it is possible to examine individuals of a particular genotype under a variety of conditions (Orozco et al. 2012; Parks et al. 2013).

Genetics provides a potentially powerful approach to dissect host-gut microbiota interactions. Using a SNP-based approach with a linear mixed model, we estimated the heritability of microbiota composition. We conclude that in a controlled environment the genetic background accounts for a significant fraction of the abundance of most common microbiota. The mice were previously studied for response to a high-fat, high-sucrose diet, and we hypothesized that the dietary response was determined in part by gut microbiota composition. We tested this using a cross-fostering strategy in which a strain showing a modest response, SWR, was seeded with microbiota from a strain showing a strong response, AxB19. Consistent with a role of microbiota in dietary response, the cross-fostered SWR pups exhibited a significantly increased response in weight gain. To examine specific microbiota contributing to the response, we identified various genera whose abundance correlated with dietary response. In an effort to further understand host-microbiota interactions, we mapped loci controlling microbiota composition and prioritized candidate genes. Our publicly available data provide a resource for future studies.

In our study, we concluded:

– Across a total of 599 mice, 75% shared the same 17 abundant genera

– These 17 genera accounted for 68% of reads

– Consistent with previous studies, changing diet drastically changes gut microbiota composition, and these shifts are strongly dependent on the genetic background of the mice

– Gut microbiota contribute to dietary responsiveness

– Several gut microbiota (known and novel to this study) contribute to obesity and metabolic phenotypes

– Seven genome-wide significant loci (P < 4 × 10⁻⁶) were found to be associated with common genera

– We estimated heritability using a linear mixed model approach, assuming an additive effect, based on the proportion of phenotypic variance accounted for by genetic relationships among the strains.
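As a rough sketch of the idea behind this heritability estimate, the following uses Haseman-Elston regression on simulated genotypes as a simple stand-in (the paper itself uses a REML-based linear mixed model, and every number below is simulated, not from the HMDP data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 1000                       # individuals, SNPs (simulated)
maf = rng.uniform(0.05, 0.5, m)
G = rng.binomial(2, maf, size=(n, m)).astype(float)
G = (G - G.mean(0)) / G.std(0)         # standardize each SNP
K = G @ G.T / m                        # genetic relationship (kinship) matrix

h2_true = 0.5                          # simulated additive heritability
beta = rng.normal(0.0, np.sqrt(h2_true / m), m)
y = G @ beta + rng.normal(0.0, np.sqrt(1.0 - h2_true), n)
y = (y - y.mean()) / y.std()

# Haseman-Elston regression: for standardized y, E[y_i * y_j] = h2 * K_ij,
# so the slope of y_i * y_j on K_ij over all off-diagonal pairs
# estimates the proportion of variance explained by the kinship.
iu = np.triu_indices(n, k=1)
prod = (y[:, None] * y[None, :])[iu]
k_od = K[iu]
h2_est = float(k_od @ prod) / float(k_od @ k_od)
print(round(h2_est, 2))                # roughly h2_true, up to sampling noise
```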

We began our study with the hypothesis that the dietary response was dictated in part by differences in gut microbiota. We showed that different inbred strains of mice differ strikingly in the composition of gut microbiota and provided evidence that the variation is determined in part by the host genetic background. Consistent with our hypothesis, we showed that cross-fostering between two strains of mice affected dietary response to the high fat, high sucrose diet. By correlating microbiota composition with dietary response among the HMDP inbred strains, we were able to identify several candidate microbiota influencing dietary response.

For all the details of our research and our methods, read our paper here.

Recently Zarlab hosted the first-ever Undergraduate Bioinformatics Speaker Series. Our lab has been steadily growing as our undergraduate research program becomes more robust, and we decided it was time we gave the undergrads an outlet of their own. Recently, the Computational Genetics Student Group (CGSG) was formed to serve the research, networking and extracurricular educational needs of the bioinformatics students (and those potentially interested in bioinformatics) at UCLA.

For our first event, we chose to explore the field of forensics and learn how bioinformatics and statistics can be used to solve crimes by analyzing DNA. Associate professor Kirk Lohmueller and Jill Licht, senior criminalist with the LA County Sheriff’s Department, gave insights into murder investigations where they served as expert witnesses. Kirk spoke about a case that was overturned by the judge because key forensic evidence had been overlooked. At the second trial, Kirk was able to testify to a potential second suspect whose blood was found at the crime scene. However, even with the additional DNA evidence, the jury still convicted the primary suspect based on a child’s eyewitness account!

Jill was able to provide stories of what the day-to-day life of a forensic biologist is like. At least one week every month, she has to remain alert and ready to drive to the scene of a crime 24 hours a day. Sometimes she’ll get the call at 2 a.m. and have to drive an hour to get to the location. She explained that the Los Angeles Police Department only has jurisdiction in the city of Los Angeles, while the sheriff’s department oversees the rest of LA County. That means she could be called anywhere from Pasadena to Long Beach. Although she is squeamish at the sight of blood in everyday life, Jill says she is able to handle it at work. The ultimate goal is to determine the story behind the scene, and she must stay focused in order to do her best work. Could you handle working with blood and brains?

If you are interested in this and future talks, leave us a comment below.

Over the past few years, genome-wide association studies (GWAS) have been used to find genetic variants that are involved in disease and other traits by testing for correlations between these traits and genetic variants across the genome. A typical GWAS examines the correlation of a single phenotype and each genotype one at a time. Recently, large amounts of genomic data such as expression data have been collected from GWAS cohorts. This data often contains thousands of phenotypes per individual. The standard approach to analyze this type of data is to perform a GWAS on each phenotype individually, a single-phenotype analysis.

A major flaw of analyzing phenotypes independently is that this strategy is underpowered. For example, unmeasured aspects of complex biological networks, such as protein mediators, could be captured by analyzing many phenotypes together but might be missed with a single phenotype or a few phenotypes. Previous methods are based on the assumption that the phenotypes of the individuals are independently and identically distributed (i.i.d.). Unfortunately, as has been shown in GWAS, this assumption is not valid due to a phenomenon referred to as population structure.

As we recently presented at the RECOMB 2015 conference, we propose a method called GAMMA (Generalized Analysis of Molecular variance for Mixed model Analysis) that efficiently analyzes large numbers of phenotypes while simultaneously considering population structure. Recently, the linear mixed model (LMM) has become a popular approach for GWAS, as it can correct for population structure. The LMM incorporates genetic similarities between all pairs of individuals, known as the kinship, into the model and corrects for population structure.

In figure 3 of our paper (shown above), we apply GAMMA to yeast data and compare it to another popular method, MDMR. The x-axis corresponds to SNP locations and the y-axis to gene locations; the accompanying panels plot −log10 of the p-value. Blue stars above each plot show putative hotspots reported in a previous study of the yeast data.


Unlike the traditional univariate analysis that tests an association between each phenotype and each genotype, our goal is to identify SNPs that are associated with multiple phenotypes. However, in GWAS it is widely known that genetic relatedness, referred to as population structure, complicates the analysis by creating spurious associations. The linear model does not account for population structure, and assuming it may induce many false positive identifications. Moreover, this causes an even more significant problem in multiple-phenotype analysis, because the bias accumulates for each phenotype as the test statistics are summed over the phenotypes (see details in Materials and Methods). Recently, the linear mixed model has emerged as a powerful tool for GWAS because it can correct for population structure.
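To make the accumulation concrete, here is a toy version of a naive multiple-phenotype statistic under the plain linear model, i.e., the uncorrected approach this paragraph cautions against, not the GAMMA test itself (the function name and data are hypothetical):

```python
import numpy as np
from scipy import stats

def naive_multi_phenotype_stat(snp, phenos):
    """Sum of squared per-phenotype association z-scores for one SNP.

    snp: (n,) genotype vector; phenos: (n, P) phenotype matrix.
    Under the null with i.i.d. phenotypes the sum is ~ chi2 with P df;
    population structure breaks that assumption, and the bias in each
    per-phenotype statistic accumulates in the sum.
    """
    n, P = phenos.shape
    x = (snp - snp.mean()) / snp.std()
    Y = (phenos - phenos.mean(0)) / phenos.std(0)
    z = (x @ Y) / np.sqrt(n)           # per-phenotype z-scores
    s = float(z @ z)
    return s, float(stats.chi2.sf(s, df=P))

rng = np.random.default_rng(1)
snp = rng.binomial(2, 0.3, 400).astype(float)
Y = rng.normal(size=(400, 50))         # 50 simulated null phenotypes
s, p = naive_multi_phenotype_stat(snp, Y)
print(round(s, 1), round(p, 3))        # s should be near 50 under the null
```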

For more information on GAMMA, check out our full paper here:

Jerry Wang and Jae Hoon Sul, two lab alumni, published a paper introducing new software for detecting associations between traits and pairs of SNPs using a threshold-based efficient pairwise association approach (TEPAA). The method is significantly faster than the traditional approach of performing an association test on all pairs of SNPs. In the first stage, the method performs the single-marker test on all individual SNPs and selects a subset of SNPs that exceed a SNP-specific predetermined significance threshold for further consideration. In the second stage, the SNPs selected in the first stage are paired with each other, and the pairwise association test is performed on those pairs.
The key insight of the approach is that the joint distribution between the association statistic of a single SNP and the association statistic of a pair of SNPs can be derived. This joint distribution guarantees that the statistical power of the two-stage approach closely approximates that of the brute-force approach. The analytical power of the two-stage model can then be computed accurately and compared to the power of the brute-force approach (see the figure). Hence, the method chooses as few SNPs as possible in the first stage while achieving almost the same power as the brute-force approach.
The power loss region of the threshold-based efficient pairwise association approach (TEPAA). The contour lines represent the probability density function of the multivariate normal distribution (MVN). T1 is the threshold for the first stage. Any SNP with a higher significance than T1 will be passed on to the second stage. T2 is the threshold for significance of the pairwise test. The area surrounded by the red rectangle corresponds to the power loss region.

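The two-stage flow can be sketched as follows (an illustrative skeleton only: the thresholds and the toy pair statistic are placeholders, not TEPAA’s calibrated values or its actual pairwise test):

```python
def two_stage_pairwise(stats_single, pair_stat, t1, t2):
    """Stage 1: keep SNPs whose single-marker |statistic| exceeds t1.
    Stage 2: run the pairwise test only on pairs of surviving SNPs,
    reporting pairs whose |pair statistic| exceeds t2.
    """
    survivors = [i for i, s in enumerate(stats_single) if abs(s) > t1]
    hits = []
    for a in range(len(survivors)):
        for b in range(a + 1, len(survivors)):
            i, j = survivors[a], survivors[b]
            if abs(pair_stat(i, j)) > t2:
                hits.append((i, j))
    return hits

# Toy example: the pair statistic is just the sum of the two
# single-SNP statistics (a placeholder, not the real pairwise test).
z = [0.2, 3.1, -0.5, 2.8, 1.0]
pairs = two_stage_pairwise(z, lambda i, j: z[i] + z[j], t1=2.0, t2=5.0)
print(pairs)  # [(1, 3)]
```

Only SNPs 1 and 3 survive the first-stage cutoff, so a single pair is tested instead of all ten, which is where the speedup comes from.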

Jerry and Jae Hoon demonstrate the utility of TEPAA by applying it to the Northern Finland Birth Cohort (Rantakallio, 1969; Jarvelin et al., 2004). From their analysis, they observe that the thresholds that control the power loss of the two-stage approach depend on the minor allele frequency (MAF) of the SNPs. In particular, more common SNPs can be filtered out with less significant thresholds than rare SNPs. To efficiently implement TEPAA with MAF-dependent thresholds for each pair, they group the SNPs into bins based on their MAFs so that the correct thresholds can be applied to each possible pair. After disregarding rare variants with MAF < 0.05, they categorize all common SNPs into nine bins according to their MAF, with step size 0.05. Each pair of SNPs has two thresholds, one for each SNP in the first stage. They precompute the first-stage thresholds for each combination of two MAFs to achieve 1% power loss while achieving high cost savings. To efficiently implement the first stage, they sort the SNPs within each bin by their association statistics and use binary search to rapidly obtain the set of SNPs above a given threshold.
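The binning-plus-binary-search step can be sketched as follows (the MAF cutoff, nine 0.05-wide bins, and sorted-bin lookup follow the text; the data layout and function names are our own):

```python
import bisect
from collections import defaultdict

def bin_by_maf(snps, step=0.05):
    """Group SNPs into MAF bins of width `step`, discarding rare
    variants (MAF < 0.05), with each bin sorted by association
    statistic so that threshold lookups can use binary search.

    snps: iterable of (snp_id, maf, statistic) tuples.
    """
    bins = defaultdict(list)
    for snp_id, maf, stat in snps:
        if maf < 0.05:
            continue
        # Small epsilon guards against float rounding at bin edges.
        bins[int((maf - 0.05) / step + 1e-9)].append((stat, snp_id))
    for entries in bins.values():
        entries.sort()                 # ascending by statistic
    return bins

def snps_above(bin_entries, threshold):
    """All SNP ids in a sorted bin with statistic > threshold."""
    values = [s for s, _ in bin_entries]
    i = bisect.bisect_right(values, threshold)
    return [snp_id for _, snp_id in bin_entries[i:]]

# Hypothetical SNPs: (id, MAF, first-stage statistic).
snps = [("rs1", 0.30, 4.2), ("rs2", 0.32, 1.1),
        ("rs3", 0.02, 9.9), ("rs4", 0.31, 2.5)]
bins = bin_by_maf(snps)                # rs3 is dropped as rare
print(snps_above(bins[5], 2.0))        # ['rs4', 'rs1']
```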

Read our full paper here:

What are the interesting computational ideas underlying a new computational method? What are the intuitions behind the method? How is the method related to other methods? These are the key questions that papers describing new computational methods should be answering.
Unfortunately, most papers describing new computational methods don’t explicitly address these questions due to constraints of journal styles. Introductions of methods papers often have only a few sentences about the method. The Methods section typically has many more details but very little discussion of the underlying ideas. Understanding what is interesting about a method is left completely to the reader’s imagination. Often, journals request that the Results section precede the Methods section, which makes understanding the results very difficult unless the reader reads the sections of the paper out of order. Authors can appeal to the journal to have the Methods section first, but this is also not a good solution, since the Methods section contains many details, such as descriptions of the datasets, which take away from the flow of the paper.
In order to avoid these problems, in our papers we make the first subsection of the Results section a “Methods Overview.” In this section, we describe the method in terms of its high-level ideas, and we typically include as a figure a small example which we use to help the reader understand the method. The goal of this section is to give enough detail that readers can follow the rest of the Results section without needing to look at the Methods section. A well-written Methods Overview will also make it much easier for the reader to follow the actual Methods section.
These sections and examples are designed to be self-contained and should be written in language appropriate for a general audience. In fact, some of our blog posts are almost verbatim copies of the Methods Overview sections of some of our recent papers. For example, see these blog posts on GRAT and Genome Reassembly.
Another way to think of what to put in the Methods Overview section is what you would explain in a talk about the method.  Often presentations on computational methods have excellent slides showing intuitions and very clear examples.  The place to put that kind of material is in the Methods Overview.  Remember, in your paper you must give a compelling argument as to WHY your method is interesting. If your readers don’t understand the intuitions underlying your work, they will never appreciate it.
I’m sure you may be asking, “Isn’t this a little redundant?” What I’m proposing here may be a bit repetitive, with a methods overview section and a methods section later in the paper.  But they serve different purposes.  With a well written Methods Overview section, a reader can stop after the Results section and understand most of your paper.  The Methods section then only becomes important for someone who wants to understand all of the details.

In this blog post, I would like to “introduce” you to our introduction style. Writing the introduction is the most daunting part of the paper writing process, especially for students who are not native English speakers. To help structure the introduction writing process, in our lab we have developed a standard style or template for writing introductions. Since the majority of the papers that we write describe new computational methods, many of our papers naturally fit into this style. We usually publish our papers in genetics journals, which have very high standards of writing and are read by researchers with a wide range of backgrounds. The difference between a paper getting accepted and rejected is often determined by the clarity of the writing.

Our introduction style is a very specific formula that works for us but obviously there are other ways to structure an introduction and each experienced writer will have their own style. However, the truth is, you NEVER start out as a good writer and new writers need to start somewhere. It takes practice, consistency and effort to write well. If you are a new writer apprehensive about writing an introduction, we hope that this structure can help you.

Our introductions are typically four paragraphs long with each paragraph serving a specific role:
1. Context – First, it is important to explain the context of the research topic. Why is the general topic important? What is happening in the field today that makes this a valid topic of research?
2. Problem – Secondly, you present the problem. We typically start this paragraph with a “However,” phrase. A simple example: We have this awesome discovery in XYZ… However, using former methods it would take us 10 years to process the data. Each sentence in this paragraph should have a negative tone.
3. Solution – By this point, your readers should sympathize with how terrible this problem is and how there MUST be a solution (maybe a little dramatic, but you get my point). Paragraph three always starts with “In this paper” and a description of what the paper proposes and how it solves the problem from the second paragraph.
4. Implication – The last paragraph in your introduction is the implication, which describes why your solution is important and moves the field forward. Typically, this paragraph is where you summarize the experimental results and how they demonstrate that the solution solves the problem. This paragraph should answer the reader’s question of “so what?”

An example of the 4 paragraph introduction style is in the following paper:

Mangul, Serghei; Wu, Nicholas; Mancuso, Nicholas; Zelikovsky, Alex; Sun, Ren; Eskin, Eleazar (2014): Accurate viral population assembly from ultra-deep sequencing data. In: Bioinformatics, 30 (12), pp. i329-i337, 2014, ISSN: 1367-4811.

Most of our other papers in their final form do not follow this format exactly.  But many of them in earlier drafts used this template and then during the revision process, added a paragraph or two expanding one of the paragraphs in the template.  For example, this paper expanded the implication to two paragraphs:

Kang, Eun Yong; Han, Buhm; Furlotte, Nicholas; Joo, Jong Wha; Shih, Diana; Davis, Richard; Lusis, Aldons; Eskin, Eleazar (2014): Meta-Analysis Identifies Gene-by-Environment Interactions as Demonstrated in a Study of 4,965 Mice. In: PLoS Genet, 10 (1), pp. e1004022, 2014, ISSN: 1553-7404.

and this paper expanded both the context and problem to two paragraphs each:

Sul, Jae Hoon; Han, Buhm; Ye, Chun; Choi, Ted; Eskin, Eleazar (2013): Effectively Identifying eQTLs from Multiple Tissues by Combining Mixed Model and Meta-analytic Approaches. In: PLoS Genet, 9 (6), pp. e1003491, 2013, ISSN: 1553-7404.

For methods papers, sometimes what we are proposing is an incremental improvement over another solution. In this case, moving from the context to the problem is very difficult without explaining the other solution. For this scenario, we suggest the following six-paragraph structure:
Context (as in the four-paragraph template)
Problem 1 (the BIG problem)
Solution 1 (the previous method)
Problem 2 (Why does the previous method fall short?)
Solution 2 (“In this paper” you are going to improve Solution 1)
Implication (again as in the four-paragraph template)

An example of a six-paragraph introduction where the 3rd and 4th paragraphs were merged is:

Furlotte, Nicholas; Kang, Eun Yong; Nas, Atila Van; Farber, Charles; Lusis, Aldons; Eskin, Eleazar (2012): Increasing Association Mapping Power and Resolution in Mouse Genetic Studies Through the Use of Meta-analysis for Structured Populations. In: Genetics, 191 (3), pp. 959-67, 2012, ISSN: 1943-2631.

There it is… the beginning of a great paper (at least we like to think so!). Will this work for you? Have other ideas? Let us know in the comments below!

This is an example of our edits. The red marks are direct edits and the blue are high-level comments.


In our last writing post, we talked about how our group of a dozen undergrads, four PhDs and three postdocs (not to mention our many collaborators) stays organized. This week we would like to focus on our paper writing process, and more specifically, how we edit.

Believe it or not, each one of our papers goes through at least 30 rounds of edits before it’s submitted for publication. You read that right… 30 rounds of edits. Each round is very fast, with usually a day or two of writing, and we try to give back comments within a few hours of getting the draft. Because we are doing so many iterations, the changes from round to round often affect only a small portion of the paper. The writing process begins in week one of the project. No matter how early we start writing, at the end of the project our bottleneck is that the paper is not finished even though all of the experiments are complete. For that reason, starting to write the paper BEFORE the experiments are finished (or even started) leads to the paper being submitted much earlier. Some people feel that they shouldn’t write the paper until they know how the experiments turn out so they know what to say. I completely disagree with this position. I think it is better to start at least with the introduction, the overview of the methods, the methods section, the references, etc. If the experimental results are unexpected, the paper can be adapted to the results later. However, getting an early start on the writing substantially reduces the overall time it takes to complete the paper.

To jump-start the students’ writing, I sometimes ask them to send me a draft every day. We call these “5 p.m. drafts.” Just like we mentioned in our very first writing tips post, the best way to overcome writer’s block is to make writing a habit. What I find is that whether a student’s draft represents one day of work or a week of work, it still needs the same amount of editing. This is what motivates our many, many, many iterations.

This is an early edit where we did a lot of rewording. For this, we use notes or text boxes.


Editing in our lab is certainly not done in red ink on paper. That would be WAY too difficult to coordinate the logistics. The way we do it is via a PDF emailed from the students. I edit it on my iPad using the GoodReader app, which can make notes, include text in callouts, draw diagrams and highlight directly on the document. GoodReader also lets me email the marked PDF back to the students directly. It typically takes 30 minutes to an hour to make a round of edits. This inexpensive iPad app has increased our workflow and decreased our edit turnaround significantly. Keep in mind that I don’t always need to make a full pass on the paper, but just give enough comments to keep the student busy during the next writing period (which can be one day).

Since my edits are marked on the PDF, the students need to enter the edits into the paper. This is great for them: they get to see the edits, which improves their writing. Previously, when I would make edits to the paper directly, they wouldn’t be able to see them. When I edit, I make direct changes in red and general comments in blue.

Like our method? Let us know!

