Up to this point, I’m sure most of you are saying, “That’s great, but what about YOUR lab? What do YOU do?”

Following the advice in the book How to Write a Lot by Paul Silvia (see our blog entry about the book here), I (and everyone else in the lab) set aside time exclusively for writing.  Given that at any time the lab has over a dozen papers in various states of completion, how to allocate that time across the different projects is not obvious.  This piece of advice is probably more appropriate for someone running their own lab than for a student.
What I do is inspired by the book's advice to create a priority list of writing projects based on how close each paper is to being completed.  Our lab has been tracking our papers and projects in Evernote every month since October 2012.  This approach, along with setting aside dedicated time for writing, has significantly increased our lab's overall productivity: we finish papers much faster and spend less time "stuck" without making progress for long periods.
Here is exactly how we organize our Evernote notebook.  Each month I create a new note (this month's note was called "Paper Organization February 2015").  It starts as a copy of the previous month's note and is updated as things change throughout the month.
The Evernote note has several lists of papers, in order of how close they are to being completed.  Each paper entry in a list has a short title as well as the key student authors working on the paper.
          Submitted Papers:
These papers are currently under review.  They are in this list because we don't need to do any actual writing work, but periodically we should check with the editors to see what is going on with the review process.  In the note, I keep track of where the paper is submitted.  Even when a paper is accepted, I still keep it on this list until it appears in print and in PubMed.  This way we can keep track of the paper through the proofing process, uploading copyright forms, etc.  The reason these papers are listed first is that it only takes a few minutes to check whether anything needs to be done with any of them, but if something does need to be done, it is usually urgent.
            Revise and Resubmit Papers:
This category tracks papers that have come back from review.  Regardless of whether the paper was accepted or rejected, and whether or not the journal is willing to review another version, we need to revise the paper taking the feedback into account and resubmit it as quickly as possible.  If the journal is willing to take a revision, then we also need to write the response to reviewers.  Since these papers are so close to being completed and published, any paper in this category takes priority over those in the remaining categories. During my allocated writing time, I usually write the response to the reviewers and help the students organize which edits need to be made to address the reviews.
              Active Papers:
This category keeps track of any paper that is currently being written by someone in the lab as their primary project.  I check in on these papers regularly, and hopefully whenever my scheduled writing time comes around I have a draft from one of the students working on these papers, so I can make a pass over the paper and send back the edits.  If I don't have any drafts to edit, I have the list of students I can remind to send them.
                Future Papers:
This category keeps track of papers in the lab that we plan to work on, or that were being worked on before but are no longer being pushed forward by the student responsible for them.  We keep them separate from the Active Papers category so they don't distract us when we are setting our writing priorities.  Nothing in this category is being actively pursued.
A few other categories that we have experimented with over the years are "Collaborator's Papers," where we are involved in the analysis; "Grants" that we are writing; and "Collaborator's Grants," where we are responsible for contributing sections.
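For readers who want something more concrete, below is a minimal sketch of how one month's note could be represented programmatically. This is only an illustration: our actual note is plain text in Evernote, and the paper titles, student names, and helper function here are hypothetical.

```python
from copy import deepcopy

# Hypothetical sketch of one month's "Paper Organization" note.
# Categories are listed in priority order, mirroring the lists described above.
note = {
    "month": "February 2015",
    "Submitted Papers": [
        {"title": "Fine mapping method", "students": ["Farhad"], "journal": "Genetics"},
    ],
    "Revise and Resubmit Papers": [],
    "Active Papers": [
        {"title": "Viral population assembly", "students": ["Serghei"]},
    ],
    "Future Papers": [],
}

def start_new_month(previous_note, month):
    """Each month's note starts as a copy of the previous month's note."""
    new_note = deepcopy(previous_note)
    new_note["month"] = month
    return new_note

march = start_new_month(note, "March 2015")
print(march["month"], len(march["Active Papers"]))
```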
Our lab is pretty big right now: we currently have eight submitted papers, seven papers we are revising after reviews, and 14 papers that are actively being written by a student. Many of these papers will be completed and published in the next six months, but for a select few we may be working on them for the next two years. Unfortunately, this is typical; a paper that was just published from our lab was originally submitted for the first time in December 2012.  Keeping track of these papers in this way helps us stay organized and prioritize our efforts.
                      Have any methods that work for you? Would you like to comment on what you’ve read so far? We’d love to hear from you!


                      In our last post we wrote about how to overcome writer’s block and the fear of writing. So now you’re on a schedule, and you’re ready to tackle this “writing thing.” You wake up, coffee and computer in tow, but there’s just one problem: You still can’t write! What gives?!

In his book How to Write a Lot, Paul Silvia, PhD acknowledges that academic writing doesn't get easy the moment you get on a schedule (Silvia, 2007). Before, you were full of adrenaline and motivated by impending deadlines. Now that you are writing a few times a week, you aren't in that anxiety-laden "write or be written off" state anymore. According to Silvia, there are three steps to getting your writing juices flowing:

                      1.  Set goals.
                      2.  Determine priorities.
                      3.  Track your progress.

Let’s start with goals. Clear and concise goals should be motivating in themselves. Goals give you a plan of action, a sense of direction and a deadline. What do you want to write about? What projects are you working on? Are there some papers that need revising? First, make an exhaustive list of everything you would like to accomplish. Then, organize it into a list you can conquer. Break this plan of action into monthly, weekly and daily goals.

This takes us to Silvia’s second phase of finding your motivation: determine priorities. Some writing projects have no set deadlines. Our lab is constantly developing new software and performing research projects. Some projects take weeks, months or even get revised over a period of a few years! The research most often comes before the writing. There are, however, those moments when we write grant proposals: if we miss the deadline, we get none of the funding. Writing assignments like these take higher priority the closer we get to the due date.

The third and final step to finding your motivation is tracking your progress. What better way to see how far you’ve come and the work ahead than to keep an inventory of your writing? Behavioral research shows that self-observation alone can produce the desired behaviors (Korotitsch & Nelson-Gray, 1999), in this instance writing. If you keep yourself accountable, whether that means in your planner, on your phone or with a wonderful spreadsheet we all love so much (only slightly kidding: every plan deserves a good spreadsheet), you are more likely to stick to your schedule and meet your goals.
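If a spreadsheet feels like overkill, even a tiny script can serve as a self-monitoring log. Here is a minimal, hypothetical sketch (the file name and fields are made up, not something our lab actually uses) that appends one writing session per row to a CSV file.

```python
import csv
from datetime import date
from pathlib import Path

LOG = Path("writing_log.csv")  # hypothetical log file

def record_session(project, words_written, minutes):
    """Append one writing session to the log; create the file with a header if needed."""
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["date", "project", "words", "minutes"])
        writer.writerow([date.today().isoformat(), project, words_written, minutes])

record_session("response to reviewers", 450, 60)
```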

                      In our lab, we do this through a systematic process which we will reveal in our next blog post. We have records that date back nearly three years of every project we have ever started, finished and everything in between. We have sections for published works, active papers, grants, collaborations and future research projects.

                      Check us out next week for an outline of how our lab has reached writing success.

                      Hope this helps! Give it a shot and let us know what you think in the comments below.

                      Cited publications:

                      Korotitsch, W.J., & Nelson-Gray, R. O. (1999). An overview of self-monitoring research in assessment and treatment. Psychological Assessment, 11, 415-425.

Silvia, P.J. (2007). How to Write a Lot: A Practical Guide to Productive Academic Writing. Washington, DC: American Psychological Association, pp. 29-40.

                      Interested in obtaining a copy? Here’s a link to Amazon.


                      Many who write regularly know what it’s like to be at a loss for words. Some days we can churn out ten pages and others we struggle to write ten sentences. Writing is hard, which is why it is intimidating to a lot of people, whether you’re a student or you’ve been publishing papers for years. There can be a dozen reasons why we can’t find the right words: can’t find the time, don’t feel inspired, too many distractions…

                      The key to developing great writing is all in the habit of writing frequently. Writing must be intentional. If you wait for the world to provide you with the perfect conditions to write (Spring Break, perhaps?), you won’t be doing much writing at all. Instead of finding time to write, you must MAKE time to write. Create a schedule and make writing a productive part of your day. A draft is never perfect the first, second or even eighth time it is written, but I can assure you it gets better every time.

                      This may sound like a stretch, I know. The rebuttals are already coming to mind: I really don’t have time. I have a busy schedule. I need to escape my routine to write. My life is sooo unpredictable.

What’s the worst thing that can happen if you give this a shot for the next three weeks? Set aside a time, at least a few times a week, to focus and write. Making writing intentional has to be a better option than writing your paper at the last minute on a Saturday, skipping meals and running on no sleep, right?

                      So here is my challenge to you: get off the Internet, silence your phone and start writing!


                      Jerry Wang defended his thesis on September 8, 2014 in 4760 Boelter Hall.

                      His thesis topic was Efficient Statistical Models For Detection And Analysis Of Human Genetic Variations. The video of his full defense can be viewed on the ZarlabUCLA YouTube page here.

                      Abstract: 

                      In recent years, the advent of genotyping and sequencing technologies has enabled human genetics to discover numerous genetic variants. Genetic variations between individuals can range from Single Nucleotide Polymorphisms (SNPs) to differences in large segments of DNA, which are referred to as Structural Variations (SVs), including insertions, deletions, and copy number variations (CNVs).

In his thesis, Jerry first proposed a probabilistic model, CNVeM, to detect CNVs from High-Throughput Sequencing (HTS) data. Experiments showed that CNVeM can estimate the copy numbers and boundaries of copied regions more precisely than previous methods.

Genome-wide association studies (GWAS) have discovered numerous individual SNPs involved in genetic traits. However, it is likely that complex traits are influenced by interactions of multiple SNPs. In his thesis, Jerry proposed a two-stage statistical model, TEPAA, which greatly reduces the computational time while maintaining almost identical power to the brute-force approach that considers all combinations of SNP interactions. Experiments on the Northern Finland Birth Cohort data showed that TEPAA achieved a 63-fold speedup.

                      Another drawback of GWAS is that rare causal variants will not be identified. Rare causal variants are likely to be introduced in a population recently and are likely to be in shared Identity-By-Descent (IBD) segments. Jerry proposed a new test statistic to detect IBD segments associated with quantitative traits and made a connection between the proposed statistic and linear models so that it does not require permutations to assess the significance of an association. In addition, the method can control population structure by utilizing linear mixed models.

                       

                      The full paper on topics covered in Jerry’s thesis defense can be found below:

Wang, Zhanyong; Sul, Jae Hoon; Snir, Sagi; Lozano, Jose; Eskin, Eleazar (2014): Gene-Gene Interactions Detection Using a Two-Stage Model. In: Research in Computational Molecular Biology, pp. 340-355, Springer International Publishing.

                      Tags: ,

                      Figure 1 (A and B) Simulated data for two regions with different LD patterns that contain 35 SNPs. A and B are obtained by considering the 100 kbp upstream and downstream of rs10962894 and rs4740698, respectively, from the Wellcome Trust Case–Control Consortium study for coronary artery disease (CAD). (C and D) The rank of the causal SNP in additional simulations for the regions in A and B, respectively. We obtain these histograms from simulation data by randomly generating GWAS statistics using multivariate normal distribution. We apply the simulation 1000 times.


Our group, in collaboration with our UCLA colleague Bogdan Pasaniuc’s group, recently published two papers focusing on “statistical fine mapping”. We published a paper on a method called CAVIAR in the journal Genetics, and Bogdan’s lab published a method called PAINTOR in PLoS Genetics. The software is available at http://genetics.cs.ucla.edu/caviar/ and http://bogdan.bioinformatics.ucla.edu/software/PAINTOR/.

Although genome-wide association studies have successfully identified thousands of regions of the genome that contain genetic variation involved in disease, only a handful of the biologically causal variants responsible for these associations have been successfully identified. Because of the correlation structure of genetic variants, in each region there are many variants that are associated with disease. The process of predicting which subset of the genetic variants is actually responsible for the association is referred to as statistical fine mapping.

                      Current statistical methods for identifying causal variants at risk loci either use the strength of association signal in an iterative conditioning framework, or estimate probabilities for variants to be causal. A main drawback of existing methods is that they rely on the simplifying assumption of a single causal variant at each risk locus which is typically invalid at many risk loci. In our papers, we propose a new statistical framework that allows for the possibility of an arbitrary number of causal variants when estimating the posterior probability of a variant being causal. A direct benefit of our approach is that we predict a set of variants for each locus that under reasonable assumptions will contain all of the true causal variants with a high confidence level (e.g. 95%) even when the locus contains multiple causal variants. We use simulations to show that our approach provides 20-50% improvement in our ability to identify the causal variants compared to the existing methods at loci harboring multiple causal variants.

                      Figure 2 Simulated association with two causal SNPs. (A) The 100-kbp region around the rs10962894 SNP and simulated statistics at each SNP generated assuming two SNPs are causal. In this example SNP25 and SNP29 are considered as the causal SNPs. However, the most significant SNP is the SNP27. (B) The causal set selected by CAVIAR (our method) and the top k SNPs method. We ranked the selected SNPs based on the association statistics. The gray bars indicate the selected SNPs by both methods, the green bars indicate the selected SNPs by the top k SNPs method only, and the blue bars indicate the selected SNPs by CAVIAR only. The CAVIAR set consists of SNP17, SNP20, SNP21, SNP25, SNP26, SNP28, and SNP29. For the top k SNPs method to capture the two causal SNPs we have to set k to 11, as one of the causal SNPs is ranked 11th based on its significant score. Unfortunately, knowing the value of k beforehand is not possible. Even if the value of k is known, the causal set selected by our method excludes SNP30–SNP35 from the follow-up studies and reduces the cost of follow-up studies by 30% compared to the top k method.


                      From the CAVIAR paper:
                      Overview of statistical fine mapping

Our approach, CAVIAR, takes as input the association statistics for all of the SNPs (variants) at the locus together with the correlation structure between the variants obtained from a reference data set such as the HapMap (Gibbs et al. 2003; Frazer et al. 2007) or 1000 Genomes project (Abecasis et al. 2010) data. Using this information, our method predicts a subset of the variants that has the property that all the causal SNPs are contained in this set with the probability r (we term this set the “r causal set”). In practice we set r to values close to 100%, typically ≥95%, and let CAVIAR find the set with the fewest number of SNPs that contains the causal SNPs with probability at least r. The causal set can be viewed as a confidence interval. We use the causal set in the follow-up studies by validating only the SNPs that are present in the set. While in this article we discuss SNPs for simplicity, our approach can be applied to any type of genetic variants, including structural variants.
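To make the idea of the r causal set concrete, here is a minimal, hypothetical sketch of the selection step under a strong simplifying assumption: that a per-SNP posterior probability of being causal has already been computed (CAVIAR actually obtains these quantities by summing over causal configurations under a multivariate normal likelihood). Greedily adding SNPs until the accumulated probability reaches r illustrates how the smallest set meeting the confidence threshold is chosen; the posteriors below are invented.

```python
import numpy as np

def causal_set(posterior_probs, rho=0.95):
    """Return indices of the smallest set of SNPs whose summed posterior
    probability reaches rho (a simplification of CAVIAR's set selection)."""
    order = np.argsort(posterior_probs)[::-1]      # most likely causal first
    cumulative = np.cumsum(posterior_probs[order])
    k = int(np.searchsorted(cumulative, rho)) + 1  # smallest prefix reaching rho
    return sorted(order[:k].tolist())

# Toy example: 10 SNPs with per-SNP posteriors that sum to 1.
probs = np.array([0.02, 0.05, 0.40, 0.03, 0.25, 0.05, 0.10, 0.04, 0.03, 0.03])
print(causal_set(probs, rho=0.95))
```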

We used simulations to show the effect of LD on the resolution of fine mapping. We selected two risk loci (with large and small LD) to showcase the effect of LD on fine mapping (see Figure 1, A and B). The first region is obtained by considering 100 kbp upstream and downstream of the rs10962894 SNP from the coronary artery disease (CAD) case–control study. As shown in the Figure 1A, the correlation between the significant SNP and the neighboring SNPs is high. We simulated GWAS statistics for this region by taking advantage of the fact that the statistics follow a multivariate normal distribution, as shown in Han et al. (2009) and Zaitlen et al. (2010) (see Materials and Methods). CAVIAR selects the true causal SNP, which is SNP8, together with six additional variants (Figure 1A). Thus, when following up this locus, we have only to consider these SNPs to identify the true causal SNPs. The second region showcases loci with lower LD (see Figure 1B). In this region only the true causal SNP is selected by CAVIAR (SNP18). As expected, the size of the r causal set is a function of the LD pattern in the locus and the value of r, with higher values of r resulting in larger sets (see Table S1 and Table S2).
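As a rough illustration of that simulation strategy (not the code used in the paper), marginal association statistics at a locus can be drawn from a multivariate normal whose covariance is the LD matrix and whose mean is the LD matrix times the causal SNPs' non-centrality parameters. The LD structure and effect size below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10  # number of SNPs at the toy locus

# Hypothetical AR(1)-style LD matrix: correlation decays with distance.
idx = np.arange(m)
ld = 0.7 ** np.abs(idx[:, None] - idx[None, :])

# Non-centrality parameters: only SNP 4 is causal in this toy example.
ncp = np.zeros(m)
ncp[4] = 5.0

# Marginal statistics follow MVN(ld @ ncp, ld), so LD propagates the causal
# signal to neighboring tag SNPs.
stats = rng.multivariate_normal(mean=ld @ ncp, cov=ld, size=1000)
print(stats.mean(axis=0).round(2))
```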

We also showcase the scenario of multiple causal variants (see Figure 2). We simulated data as before and considered SNP25 and SNP29 as the causal SNPs. Interestingly, the most significant SNP (SNP27, see Figure 2) tags the true causal variants but it is not itself causal, making the selection based on strength of association alone under the assumption of a single causal or iterative conditioning highly suboptimal. To capture both causal SNPs at least 11 SNPs must be selected in ranking based on P-values or probabilities estimated under a single causal variant assumption. As opposed to existing approaches, CAVIAR selects both SNPs in the 95% causal set together with five additional variants. The gain in accuracy of our approach comes from accurately disregarding SNP30–SNP35 from consideration since their effects can be captured by other SNPs.

                      PAINTOR extended the CAVIAR model to also take into account the function of the genetic variation.

                      The full citations for the two papers are:

Hormozdiari, Farhad; Kostem, Emrah; Kang, Eun Yong; Pasaniuc, Bogdan; Eskin, Eleazar (2014): Identifying causal variants at loci with multiple signals of association. In: Genetics, 198 (2), pp. 497-508, ISSN: 1943-2631.
Kichaev, Gleb; Yang, Wen-Yun; Lindstrom, Sara; Hormozdiari, Farhad; Eskin, Eleazar; Price, Alkes; Kraft, Peter; Pasaniuc, Bogdan (2014): Integrating functional data to prioritize causal variants in statistical fine-mapping studies. In: PLoS Genet, 10 (10), pp. e1004722, ISSN: 1553-7404.

                      Overview of high-fidelity sequencing protocol. (A) DNA material from a viral population is cleaved into sequence fragments using any suitable restriction enzyme. (B) Individual barcode sequences are attached to the fragments. Each tagged fragment is amplified by the polymerase chain reaction (PCR). (C) Amplified fragments are then sequenced. (D) Reads are grouped according to the fragment of origin based on their individual barcode sequence. An error-correction protocol is applied for every read group, correcting the sequencing errors inside the group and producing corrected consensus reads. (E) Error-corrected reads are mapped to the population consensus. (F) SNVs are detected and assembled into individual viral genomes. The ordinary protocol lacks steps (B) and (D)


                      Viral populations change rapidly throughout the course of an infection. Due to this, drugs that initially control an infection can rapidly become ineffective as the viral drug target mutates. For better drug design, however, we first must develop techniques to be able to detect and quantify the presence of various rare viral variants from a given sample.

Currently, next-generation sequencing technologies are employed to better understand and quantify viral population diversity. The existing technologies, however, have difficulty distinguishing between rare viral variants and sequencing errors even when sequencing with high coverage. To overcome this problem, our lab has proposed a two-step solution in a recent paper by Serghei Mangul.

                      In his paper, Serghei suggested we first use a high-fidelity protocol known as Safe-SeqS with high coverage. This method employs the use of small individual barcodes that are attached to sequencing fragments before undergoing amplification by polymerase chain reaction (PCR) and being sequenced. By comparing and taking a consensus of amplicons from the same initial sequence fragment, we can easily eliminate some sequencing errors from our data.
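Here is a minimal, hypothetical sketch of the consensus idea behind such barcoded protocols (not the actual Safe-SeqS pipeline): reads carrying the same barcode are assumed to come from the same original fragment, so a per-position majority vote across them removes most isolated sequencing errors. The barcodes and reads below are invented.

```python
from collections import Counter, defaultdict

def consensus_reads(reads_with_barcodes):
    """Group reads by barcode and take a per-position majority vote.
    reads_with_barcodes: iterable of (barcode, read) pairs; reads in a group
    are assumed to be the same length and aligned to the same fragment."""
    groups = defaultdict(list)
    for barcode, read in reads_with_barcodes:
        groups[barcode].append(read)
    consensus = {}
    for barcode, reads in groups.items():
        consensus[barcode] = "".join(
            Counter(bases).most_common(1)[0][0] for bases in zip(*reads)
        )
    return consensus

# Toy example: three reads from one fragment, one of which carries an error.
reads = [("AAGT", "ACGTACGT"), ("AAGT", "ACGTACGT"), ("AAGT", "ACGAACGT")]
print(consensus_reads(reads))  # {'AAGT': 'ACGTACGT'}
```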

                      These consensus reads then are assembled using an accurate viral assembly method Serghei developed known as the Viral Genome Assembler (VGA). This software uses read overlapping, SNV detection, and a conflict graph to distinguish and reconstruct genome variants in the population. Finally, an expectation-maximization algorithm is used to estimate abundances of assembled viral variants.
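The conflict-graph idea can be sketched roughly as follows (a toy illustration, not VGA itself, using networkx's greedy coloring rather than a minimal coloring): reads that overlap and disagree at a variant position are connected by an edge, and each color class then corresponds to a set of mutually consistent reads. The reads below are invented.

```python
import networkx as nx

# Toy reads: (start position, sequence) aligned to a consensus;
# two haplotypes differ at position 2.
reads = [(0, "ACGT"), (2, "GTAA"), (0, "ACTT"), (2, "TTAA")]

def conflict(r1, r2):
    """Two reads conflict if they disagree at any overlapping position."""
    (s1, seq1), (s2, seq2) = r1, r2
    lo, hi = max(s1, s2), min(s1 + len(seq1), s2 + len(seq2))
    return any(seq1[p - s1] != seq2[p - s2] for p in range(lo, hi))

g = nx.Graph()
g.add_nodes_from(range(len(reads)))
for i in range(len(reads)):
    for j in range(i + 1, len(reads)):
        if conflict(reads[i], reads[j]):
            g.add_edge(i, j)

# Each color is an independent set of non-conflicting reads, i.e. reads that
# can be assembled into the same genome variant.
colors = nx.coloring.greedy_color(g)
print(colors)
```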

                      In the paper, this approach was applied to both simulated and real data and found to outperform current state-of-the-art methods. Additionally, this viral assembly method is the first of its kind to scale to millions of sequencing reads.

                      The Viral Genome Assembler tool is freely available here: http://genetics.cs.ucla.edu/vga/

                      From the paper:
Advances in NGS and the ability to generate deep coverage data in the form of millions of reads provide exceptional resolution for studying the underlying genetic diversity of complex viral populations. However, errors produced by most sequencing protocols complicate distinguishing between true biological mutations and technical artifacts that confound detection of rare mutations and rare individual genome variants. A common approach is to use post-sequencing error correction techniques able to partially correct the sequencing errors. In contrast to clonal samples, the post-sequencing error correction methods are not well suited for mixed viral samples and may lead to filtering out true biological mutations. For this reason, current viral assembly methods are able to detect only highly abundant SNVs, thus limiting the discovery of rare viral genomes.

Additional difficulty arises from the genomic architectures of viruses. Long common regions shared across the viral population (known as conserved regions) introduce ambiguity in the assembly process. Conserved regions may be due to a low-diversity population or to recombination with multiple cross-overs. In contrast to repeats in genome assembly, conserved regions may be phased based on relative abundances of viral variants. Low-diversity viral populations, in which all pairs of individual genomes within a viral population have a small genetic distance from each other, may represent additional challenges for the assembly procedure.

                      We apply a high-fidelity sequencing protocol to study viral population structure (Fig. 1). This protocol is able to eliminate errors from sequencing data by attaching individual barcodes during the library preparation step. After the fragments are sequenced, the barcodes identify clusters of reads that originated from the same fragment, thus facilitating error correction. Given that many reads are required to sequence each fragment, we are trading off an increase in sequence coverage for a reduction in error rate. Prior to assembly, we utilize the de novo consensus reconstruction tool, Vicuna (Yang et al., 2012), to produce a linear consensus directly from the sequence data. This approach offers more flexibility for samples that do not have ‘close’ reference sequences available. Traditional assembly methods (Gnerre et al., 2011; Luo et al., 2012; Zerbino and Birney, 2008) aim to reconstruct a linear consensus sequence and are not well-suited for assembling a large number of highly similar but distinct viral genomes. We instead take our ideas from haplotype assembly methods (Bansal and Bafna, 2008; Yang et al., 2013), which aim to reconstruct two closely related haplotypes. However, these methods are not applicable for assembly of a large (a priori unknown) number of individual genomes. Many existing viral assemblers estimate local population diversity and are not well suited for assembling full-length quasi-species variants spanning the entire viral genome. Available genome-wide assemblers able to reconstruct full-length quasi-species variants are originally designed for low throughput and are impractical for high throughput technologies containing millions of sequencing reads.

Overview of VGA. (A) The algorithm takes as input paired-end reads that have been mapped to the population consensus. (B) The first step in the assembly is to determine pairs of conflicting reads that share different SNVs in the overlapping region. Pairs of conflicting reads are connected in the ‘conflict graph’. Each read has a node in the graph, and an edge is placed between each pair of conflicting reads. (C) The graph is colored into a minimal set of colors to distinguish between genome variants in the population. Colors of the graph correspond to independent sets of non-conflicting reads that are assembled into genome variants. In this example, the conflict graph can be minimally colored with four colors (red, green, violet and turquoise), each representing individual viral genomes. (D) Reads of the same color are then assembled into individual viral genomes. Only fully covered viral genomes are reported. (E) Reads are assigned to assembled viral genomes. Reads may be shared across two or more viral genomes. VGA infers relative abundances of viral genomes using the expectation–maximization algorithm. (F) Long conserved regions are detected and phased based on expression profiles. In this example the red and green viral genomes share a long conserved region (colored in black). There is no direct evidence of how the viral sub-genomes across the conserved region should be connected; in this example four possible phasings are valid. VGA uses the expression information of every sub-genome to resolve ambiguous phasing.


We introduce a viral population assembly method (Fig. 2) working on highly accurate sequencing data, able to detect rare variants and tolerate conserved regions shared across the population. Our method is coupled with post-assembly procedures able to detect and resolve ambiguity arising from long conserved regions using expression profiles (Fig. 2F). After a consensus has been reconstructed directly from the sequence data, our method detects SNVs from the aligned sequencing reads. Read overlapping is used to link individual SNVs and distinguish between genome variants in the population. The viral population is condensed in a conflict graph built from aligned sequencing data: two reads originate from different viral genomes if they share different SNVs in the overlapping region. Viral variants are identified from the graph as independent sets of non-conflicting reads. Non-continuous coverage of rare viral variants may limit assembly capacity, indicating that an increase in coverage is required to improve assembly accuracy. Frequencies of identified variants are then estimated using an expectation–maximization algorithm. Compared with existing approaches, we are able to detect rare population variants while achieving high assembly accuracy.
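For the abundance-estimation step, here is a minimal, hypothetical sketch of an expectation-maximization loop of the kind described above (not VGA's implementation): each read carries a compatibility with the assembled genomes, and variant frequencies are alternately re-estimated from the expected read assignments. The compatibility matrix below is invented.

```python
import numpy as np

def em_abundances(compat, n_iter=200):
    """Estimate relative abundances of assembled viral genomes.
    compat[i, j] = 1 if read i is compatible with genome j, else 0."""
    n_reads, n_genomes = compat.shape
    freqs = np.full(n_genomes, 1.0 / n_genomes)
    for _ in range(n_iter):
        # E-step: expected assignment of each read across compatible genomes.
        weights = compat * freqs
        weights /= weights.sum(axis=1, keepdims=True)
        # M-step: new frequencies are the average expected assignments.
        freqs = weights.mean(axis=0)
    return freqs

# Toy example: 6 reads, 2 genomes; the last 2 reads are shared by both genomes.
compat = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [1, 1], [1, 1]], dtype=float)
print(em_abundances(compat).round(3))
```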

                      The full citation of our paper is:

Mangul, Serghei; Wu, Nicholas; Mancuso, Nicholas; Zelikovsky, Alex; Sun, Ren; Eskin, Eleazar (2014): Accurate viral population assembly from ultra-deep sequencing data. In: Bioinformatics, 30 (12), pp. i329-i337, ISSN: 1367-4811.

                      We teach a course called “Computational Genetics” each year at UCLA. This course is taken by both graduate and undergraduate students from both the Computer Science department and the many biology and medical school programs. In this course we cover both topics related to genome wide association studies (GWAS) and topics related to next generation sequencing studies. One lecture that is given each year is an introductory lecture to sequencing and read mapping. The video of this lecture is available here. Please excuse the poor cinematography. This lecture was recorded from the back of the classroom.


Eun Yong Kang in our group defended his thesis on Monday, November 25th, 2013, from 2:30pm to 4:30pm in 4760 Boelter Hall.

                      The title of his defense was “Computational Genetic Approaches for Understanding the Genetic Architecture of Complex Traits”. The video of this defense is now available here. Fortunately for the lab, Eun is now a post-doc in the group.

                      The abstract of his thesis defense was:
Recent advances in genotyping and sequencing technology have enabled researchers to collect an enormous amount of high-dimensional genotype data. These large-scale genomic data provide an unprecedented opportunity for researchers to study and analyze the genetic factors of human complex traits. One of the major challenges in analyzing these high-dimensional genomic data is the need for effective and efficient computational methodologies. In this talk, I will focus on three problems that I have worked on. First, I will introduce a method for inferring biological networks from high-throughput data containing both genetic variation and gene expression profiles from genetically distinct strains of an organism. For this problem, I use causal inference techniques to infer the presence or absence of causal relationships between yeast gene expressions in the framework of graphical causal models. Second, I introduce an efficient pairwise identity-by-descent (IBD) association mapping method, which utilizes importance sampling to improve efficiency and enable approximation of extremely small p-values. Using the WTCCC type 1 diabetes data, I show that Fast-Pairwise can successfully pinpoint a gene known to be associated with the disease within the MHC region. Finally, I introduce a novel meta-analytic approach (Meta-GxE) to identify gene-by-environment interactions by aggregating multiple studies with varying environmental conditions. The Meta-GxE approach jointly analyzes multiple studies with varying environmental conditions using a meta-analytic approach based on a random effects model to identify loci involved in gene-by-environment interactions. This approach is motivated by the observation that methods for discovering gene-by-environment interactions are closely related to random effects models for meta-analysis. We show that interactions can be interpreted as heterogeneity and can be detected without utilizing the traditional uni- or multi-variate approaches for discovery of gene-by-environment interactions. Application of this approach to 17 mouse studies identifies 26 significant loci involved in high-density lipoprotein (HDL) cholesterol, many of which show significant evidence of involvement in gene-by-environment interactions.

                      Eun’s talk covered the following papers:

Kang, Eun Yong; Han, Buhm; Furlotte, Nicholas; Joo, Jong Wha; Shih, Diana; Davis, Richard; Lusis, Aldons; Eskin, Eleazar (2014): Meta-Analysis Identifies Gene-by-Environment Interactions as Demonstrated in a Study of 4,965 Mice. In: PLoS Genet, 10 (1), pp. e1004022, ISSN: 1553-7404.
Han, Buhm; Kang, Eun Yong; Raychaudhuri, Soumya; de Bakker, Paul; Eskin, Eleazar (2013): Fast Pairwise IBD Association Testing in Genome-wide Association Studies. In: Bioinformatics, ISSN: 1367-4811.
Kang, Eun Yong; Ye, Chun; Shpitser, Ilya; Eskin, Eleazar (2010): Detecting the presence and absence of causal relationships between expression of yeast genes with very few samples. In: J Comput Biol, 17 (3), pp. 533-46, ISSN: 1557-8666.


Our DNA can tell us a lot about who our relatives are. Several companies, including 23andMe and AncestryDNA, now provide services where they collect DNA from individuals and then match the DNA to a database of the DNA of other people to identify relatives. Relatives are then informed by the company that their DNAs match. Our lab was interested in whether we could perform this same type of service without involving a company, and more generally without involving any third party. One way to do this would be to have individuals obtain their own DNA sequences and then share them directly with each other. Unfortunately, DNA sequences are considered medical information, and it is inappropriate to share them in this way.

Through a collaboration between our lab and the UCLA cryptography group, we recently published a paper combining cryptography and genetics that describes an approach for identifying relatives without compromising privacy. Our paper was published in the April 2014 issue of Genome Research. The key idea is that individuals release an encrypted version of their DNA information. Another individual can download this encrypted version and then use their own DNA information to try to decrypt it. If they are related to each other, their DNA sequences will be close enough that the decryption will work, telling the individual that they are related; if they are unrelated, the decryption will fail. What is important in this approach is that individuals who are not related do not obtain any information about each other’s DNA sequences.

                      The intuitive idea behind the approach is the following. Individuals each release a copy of their own genomes encrypted with a key that is based on the genome itself. Other users then download this encrypted information and try to decrypt it using their own genomes as the key. The encryption scheme is designed to allow for decryption if the encrypting key and decrypting key are “close enough”. Since related individuals share a portion of their genomes, we set the threshold for “close enough” to be exactly the threshold of relatedness that we want to detect.
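As a toy illustration of the "close enough" idea only (real fuzzy extractors involve error-correcting codes and give actual cryptographic guarantees; the sketch below provides none and is not the scheme in the paper), imagine reducing each genome to a bit vector of variant calls and declaring a match whenever the fraction of agreeing positions reaches a relatedness threshold. The genomes and threshold are invented.

```python
import numpy as np

def genomes_match(key_bits, probe_bits, threshold=0.75):
    """Toy stand-in for 'decryption succeeds iff the keys are close enough':
    match when the fraction of agreeing variant calls reaches the threshold.
    Unlike a real fuzzy extractor, this comparison offers no privacy protection."""
    return np.mean(key_bits == probe_bits) >= threshold

rng = np.random.default_rng(1)
genome_a = rng.integers(0, 2, size=1000)

# A relative shares most positions; an unrelated genome is essentially independent.
flip = rng.random(1000) < 0.1
relative = np.where(flip, 1 - genome_a, genome_a)
stranger = rng.integers(0, 2, size=1000)

print(genomes_match(genome_a, relative))   # True: ~90% agreement
print(genomes_match(genome_a, stranger))   # False: ~50% agreement
```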

                      Our approach uses a relatively new type of cryptographic technique called Fuzzy Extractors which were pioneered by our co-authors on this study, Amit Sahai and Rafail Ostrovsky. This type of technique allows for encryption and decryption with keys that match inexactly. Students in our group who were involved are Dan He, Nick Furlotte, Farhad Hormozdiari, and Jong Wha (Joanne) Joo. This research was supported by National Science Foundation grant 1065276.

                      The full citation of our paper is here:

He, Dan; Furlotte, Nicholas; Hormozdiari, Farhad; Joo, Jong Wha; Wadia, Akshay; Ostrovsky, Rafail; Sahai, Amit; Eskin, Eleazar (2014): Identifying genetic relatives without compromising privacy. In: Genome Res, ISSN: 1549-5469.


I recently gave a talk on mixed models and confounding factors, a long-time interest of our research group, at a workshop that is part of the Evolutionary Biology and the Theory of Computing program held at the Simons Institute on the UC Berkeley campus. The talk was held on February 21st. This talk spans many years of work in our group, including work by Hyun Min Kang (now at Michigan), Noah Zaitlen (now at UCSF), and Jimmie Ye (now at Harvard), as well as a sneak peek at very recent work by Joanne Joo, Jae-Hoon Sul and Buhm Han.

                      The video of the talk is available here and is also on our YouTube Channel ZarlabUCLA.

The papers covered in the talk include the EMMA and ICE papers published in 2008, the EMMAX paper published in 2010, and a very new paper that should be coming out soon. The key papers from the talk are:

Kang, Hyun Min; Sul, Jae Hoon; Service, Susan; Zaitlen, Noah; Kong, Sit-Yee; Freimer, Nelson; Sabatti, Chiara; Eskin, Eleazar (2010): Variance component model to account for sample structure in genome-wide association studies. In: Nat Genet, 42 (4), pp. 348-54, ISSN: 1546-1718.
Kang, Hyun Min; Zaitlen, Noah; Wade, Claire; Kirby, Andrew; Heckerman, David; Daly, Mark; Eskin, Eleazar (2008): Efficient control of population structure in model organism association mapping. In: Genetics, 178 (3), pp. 1709-23, ISSN: 0016-6731.
Kang, Hyun Min; Ye, Chun; Eskin, Eleazar (2008): Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots. In: Genetics, 180 (4), pp. 1909-25, ISSN: 0016-6731.


