Writing Tips: How we Edit

This is an example of our edits. The red marks are direct edits and the blue marks are high-level comments.

In our last writing post, we talked about how our group of a dozen undergrads, four PhDs and three postdocs (not to mention our many collaborators) stays organized. This week we would like to focus on our paper writing process, and more specifically, how we edit.

Believe it or not, each one of our papers goes through at least 30 rounds of edits before it’s submitted for publication. You read that right… 30 rounds of edits. Each round is very fast, usually a day or two of writing, and we try to give back comments within a few hours of getting the draft. Because we do so many iterations, the changes from round to round often affect only a small portion of the paper.

The writing process begins in week one of the project. This is because no matter how early we start writing, our bottleneck at the end of the project is that the paper is not finished even though all of the experiments are complete. For that reason, starting to write the paper BEFORE the experiments are finished (or even started) leads to the paper being submitted much earlier. Some people feel that they shouldn’t write the paper until they know how the experiments turn out, so that they know what to say. I completely disagree with this position. I think it is better to start with at least the introduction, the overview of the methods, the methods section, the references, etc. If the experimental results are unexpected, the paper can be adapted to them later. Getting an early start on the writing substantially reduces the overall time it takes to complete the paper.

To jump-start the students’ writing, I sometimes ask them to send me a draft every day. We call these “5 p.m. drafts.” Just like we mentioned in our very first writing tips post, the best way to overcome writer’s block is to make writing a habit. What I find is that whether a draft represents one day or one week of a student’s work, it still needs about the same amount of editing. This is what motivates us to do so many iterations.

This is an early edit where we did a lot of rewording. For this, we use notes or text boxes.

Editing in our lab is certainly not done in red ink on paper. The logistics would be WAY too difficult to coordinate. Instead, the students email me a PDF. I edit it on my iPad using the GoodReader app, which can make notes, include text in callouts, draw diagrams and highlight directly on the document. GoodReader also lets me email the marked PDF back to the students directly. It typically takes 30 minutes to an hour to make a round of edits. This inexpensive iPad app has sped up our workflow and decreased our edit turnaround significantly. Keep in mind that I don’t always need to make a full pass on the paper; I just need to give enough comments to keep the student busy during the next writing period (which can be one day).

Since my edits are marked on the PDF, the students need to enter the edits into the paper themselves. This is great for them because they get to see every edit, which improves their writing. Previously, when I made edits directly in the paper, they couldn’t see what I had changed. When I edit, I make direct changes in red and general comments in blue.

Like our method? Let us know!

Simultaneous Genetic Analysis of more than One Trait

Most methods that try to understand the relationship between an individual’s genetics and traits analyze one trait at a time. Our lab recently published a paper on analyzing multiple traits together. Analyzing multiple traits jointly can discover more genetic variants that affect traits, and it both increases power and provides insights into the genetic architecture of multiple traits. In particular, it makes it possible to estimate the genetic correlation, which is a measure of the portion of the total correlation between traits that is due to additive genetic effects. However, the analysis methods are challenging and often computationally very inefficient. This is especially the case for mixed-model methods, which take into account the relatedness among individuals in the study.

In our recent paper, we aim to solve this problem by introducing a technique that can be used to assess genome-wide association quickly, reducing analysis time from hours to seconds. Our method is called the Matrix-Variate Linear Mixed Model (mvLMM) and is similar to the method recently developed by Matthew Stephens’ group ((22706312)). An implementation of the method is available at http://genetics.cs.ucla.edu/mvLMM/. It works together with pylmm, the mixed-model software we are developing, which is available at http://genetics.cs.ucla.edu/pylmm/.

We demonstrate the efficacy of our method by analyzing correlated traits in the Northern Finland Birth Cohort ((19060910)). Compared to a standard approach ((22843982); (22902788)), our method results in more than a 10-fold time reduction for a pair of correlated traits, taking the analysis time from about 35 minutes to about 2.5 minutes for the cubic operations plus another 12 seconds for the iterative part of the algorithm. In addition, the result of the cubic operations can be saved so that it does not have to be recalculated when analyzing other traits in the same cohort. Finally, we demonstrate how this method can be used to analyze gene expression data. Using a well-studied yeast dataset ((18416601)), we show how estimating the genetic and environmental components of correlation between pairs of genes allows us to understand the relative contribution of genetics and environment to coexpression.
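The timing figures above can be checked with a few lines of arithmetic. The snippet below is only a back-of-the-envelope sketch using the approximate numbers quoted in this post. The last line additionally assumes that a later trait pair in the same cohort would cost only the 12-second iterative step and that the standard approach would again take about 35 minutes; neither figure is a measured result.

```python
# Approximate timings quoted in the post, converted to seconds.
standard_seconds = 35 * 60      # standard approach, one pair of traits
cubic_seconds = 2.5 * 60        # one-time cubic operations in our method
iterative_seconds = 12          # iterative part of the algorithm

# First pair of traits: pay for both the cubic and the iterative parts.
first_pair_seconds = cubic_seconds + iterative_seconds
first_pair_speedup = standard_seconds / first_pair_seconds

# Assumption: with the cubic result saved, a later pair costs only the
# iterative part, and the standard approach still takes about 35 minutes.
later_pair_speedup = standard_seconds / iterative_seconds

print(f"first pair: {first_pair_seconds:.0f} s, about {first_pair_speedup:.1f}-fold faster")
print(f"later pairs (assumed): about {later_pair_speedup:.0f}-fold faster")
```

The first figure comes out at roughly 13-fold, which is consistent with the "more than 10-fold" statement.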

One of the key ideas of our approach is to represent the multiple phenotypes as a matrix where the rows are individuals and the columns are traits. We then assume the data follow a “matrix variate normal” distribution, in which we define a covariance structure among the rows (individuals) and among the columns (traits). The use of the matrix variate normal is the key to making our algorithm efficient.
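To make the row-and-column covariance idea concrete, here is a minimal pure-Python sketch of the standard separable structure of a matrix variate normal: the covariance between trait j of individual i and trait l of individual k is the product of a row (individual) term and a column (trait) term. The matrices `U` and `V` and the helper names are made up for illustration, and the sketch uses a single separable covariance rather than the full model from the paper.

```python
def kron(V, U):
    """Kronecker product V (x) U of two square matrices given as nested lists."""
    p, n = len(V), len(U)
    return [[V[j][l] * U[i][k] for l in range(p) for k in range(n)]
            for j in range(p) for i in range(n)]

def cov_entry(U, V, i, j, k, l):
    """Covariance between Y[i][j] and Y[k][l] under the separable structure."""
    return U[i][k] * V[j][l]

# Hypothetical covariance among 3 individuals (rows), e.g. relatedness.
U = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.25],
     [0.0, 0.25, 1.0]]
# Hypothetical covariance among 2 traits (columns).
V = [[1.0, 0.3],
     [0.3, 2.0]]

n, p = len(U), len(V)

# Covariance of the phenotype matrix stacked column by column (trait by trait).
full = kron(V, U)

# Every entry of the big matrix is recoverable from the two small ones.
for i in range(n):
    for j in range(p):
        for k in range(n):
            for l in range(p):
                assert full[j * n + i][l * n + k] == cov_entry(U, V, i, j, k, l)

small_storage = n * n + p * p      # entries in U and V
full_storage = (n * p) ** 2        # entries in the stacked covariance
print(small_storage, full_storage)
```

Because the two small matrices carry all the information, an algorithm can work with an n-by-n and a p-by-p matrix instead of one (n·p)-by-(n·p) matrix, which gives some intuition for why this structure helps with efficiency.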

The full paper about mvLMM is below:

Furlotte, Nicholas A; Eskin, Eleazar. Efficient Multiple Trait Association and Estimation of Genetic Correlation Using the Matrix-Variate Linear Mixed-Model. In: Genetics, 200 (1), pp. 59-68, 2015, ISSN: 1943-2631.

**Update** Since publishing, it has been brought to our attention that there is related work we did not cite, published by Karin Meyer in 1985 (which cited earlier work by Robin Thompson from 1976). If our method interests you, please also take a moment to review the following paper:

Meyer, K. Maximum Likelihood Estimation of Variance Components for a Multivariate Mixed Model with Equal Design Matrices. In: Biometrics, 41 (1), pp. 153-165, 1985, ISSN: 0006-341X.

US-Israel Binational Science Foundation and Gilbert Foundation Renew Support

We are very happy to announce that the US-Israel Binational Science Foundation (BSF), in partnership with the Gilbert Foundation, is renewing support of our collaboration with Eran Halperin’s group at Tel Aviv University. This is our lab’s oldest active collaboration, which began in 2001 when Professor Eskin met Eran Halperin at the RECOMB conference.

Our first joint project in genetics was a collaboration in 2003 with Eran Halperin (who was in Berkeley, CA at the time) on a problem called haplotype phasing, which led to the software HAP ((14988101)). That led us to become involved in the first whole-genome map of human variation, which was published on the cover of Science in 2005 ((15718463)). We have continued to work closely and publish together because we have very complementary backgrounds. We came from machine learning and Eran comes from theory. We have many joint projects, regular conference calls and visits, and collaborations between our students. One of my Ph.D. students was a postdoc in Professor Halperin’s group and one of his postdocs was recruited to UCLA as a faculty member.

Many of our most important research contributions have been jointly authored papers. These include our work on characterizing genetic diversity using spatial ancestry analysis (SPA) ((22610118)) and on genotyping common and rare variants in very large population studies using overlapping pool sequencing, which can be used for the detection of cancer fusion genes from RNA sequences ((21989232)).

Thanks to the additional funding from BSF, we are expanding our current goals to address the problem of analyzing genetic data in conjunction with other data types such as epigenetic data (changes to the DNA over one’s lifetime) and RNA expression. There is strong evidence that these additional signals can provide more insights into the mechanisms of disease. For example, epigenetic changes have been shown to be strongly related to certain diseases and environmental effects.

Further, the project enables an exchange of ideas and collaborations not only between Eran and me but also between our students. Everyone involved benefits from this collaboration of Israeli and American scientists. This is our first BSF project and we are very grateful for the support of our collaboration.

To read the full article on our collaboration and the BSF, please click here.
