Long Single-Molecule Reads Can Resolve the Complexity of the Influenza Virus Composed of Rare, Closely Related Mutant Variants

RNA viruses cause the majority of emerging and re-emerging diseases that pose a significant risk to global health, including influenza, hantaviruses, Ebola virus, and Nipah virus. Compared to DNA viruses, RNA viruses are especially adaptable and evolvable because of their high mutation rates and rapid replication cycles. Developing novel medications for the prevention and treatment of these diseases requires an understanding of the mutant variants that drive an RNA virus’s resistance mechanisms. The long read length offered by single-molecule sequencing technologies allows each mutant variant to be sequenced in a single pass. However, complete profiling of all viral genomes within a mutant spectrum is not yet possible because of the high error rate inherent in these sequencing protocols.

In collaboration with Alexander Artyomenko (Georgia State University), Alex Zelikovsky (Georgia State University), Nicholas Wu (The Scripps Research Institute), and Ren Sun (UCLA), Serghei Mangul and Eleazar Eskin developed a novel method for accurately reconstructing viral variants from single-molecule reads. This approach, two Single Nucleotide Variants (2SNV), tolerates the high error rate of the single-molecule protocol and uses linkage between single nucleotide variations to efficiently distinguish true mutant variants from read errors.

Overview of the 2SNV method. For more information, see our book chapter.

Any method for reconstructing viral variants from single-molecule reads must overcome the low volume and high error rate of the sequencing data, combined with the very high similarity and very low frequency of the viral variants. This challenge is similar to extracting an extremely weak signal from a very noisy background, with a signal-to-noise ratio approaching zero. However impossible this task may seem, a satisfactory solution can be based on distinguishing the randomness of the noise from the systematic repetition of the signal. With its high sensitivity and accuracy, 2SNV is anticipated to facilitate not only viral quasispecies reconstruction, but also other biological applications that require detection of rare haplotypes, such as studying genetic diversity in cancer cell populations and monitoring B-cell and T-cell receptor repertoires.
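The idea of separating random noise from systematic repetition can be illustrated with a toy linkage test. This is a minimal sketch, not the 2SNV implementation: it assumes reads are already aligned to a consensus of equal length, treats any mismatch as a candidate variant, and uses a plain binomial tail with a hypothetical threshold `alpha`. The function names are illustrative only.

```python
from itertools import combinations
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), computed via the complement."""
    below = sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k))
    return max(0.0, 1.0 - below)


def linked_snv_pairs(reads, consensus, alpha=0.01):
    """Return (pos_i, pos_j, count) for position pairs whose mismatches
    co-occur in more reads than independent errors would predict.

    Assumption: reads are pre-aligned strings as long as the consensus.
    """
    n = len(reads)
    # For each position, the set of read indices with a non-consensus base.
    variant_reads = {
        pos: {r for r, read in enumerate(reads) if read[pos] != consensus[pos]}
        for pos in range(len(consensus))
    }
    linked = []
    for i, j in combinations(range(len(consensus)), 2):
        both = len(variant_reads[i] & variant_reads[j])
        if both < 2:
            continue  # a single co-occurrence is indistinguishable from noise
        # Co-occurrence probability if the two mismatches were independent.
        p_independent = (len(variant_reads[i]) / n) * (len(variant_reads[j]) / n)
        if binomial_tail(n, both, p_independent) < alpha:
            linked.append((i, j, both))
    return linked


# Toy data: 5 reads carry a two-SNV variant, 5 reads carry isolated errors.
consensus = "AAAAAAAA"
reads = (
    ["AAAAAAAA"] * 90
    + ["ACAAAAGA"] * 5
    + ["ATAAAAAA", "AAAAAATA", "AAATAAAA", "AAAAATAA", "TAAAAAAA"]
)
pairs = linked_snv_pairs(reads, consensus)
print(pairs)
```

In this toy input, the isolated mismatches never co-occur in the same read, while the two variant positions appear together in five reads, far more often than two independent errors at those rates would.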

We present 2SNV in a chapter of the conference proceedings of the 2016 RECOMB meeting. To benchmark the sensitivity of 2SNV, we performed a single-molecule sequencing experiment on a sample containing a titrated level of known viral mutant variants. We tested 2SNV on a dataset composed of PacBio reads from 10 independent clones that carry from 1 to 13 mutations. These 10 clones were mixed at a geometric ratio, with a two-fold difference in occurrence frequency between consecutive clones, starting with a maximum frequency of 50% and ending with a minimum frequency of 0.1%. Our method is able to accurately reconstruct a clone with a frequency of 0.2% and to distinguish clones that differ in only two nucleotides located far apart on the genome. 2SNV outperforms existing methods for full-length viral mutant reconstruction.
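The mixing scheme is simple arithmetic and can be checked with a short sketch. The function name and the sequencing depth below are hypothetical and used only for illustration; only the 50% starting frequency, the two-fold steps, and the 10 clones come from the description above.

```python
def dilution_series(n_clones=10, top_fraction=0.5, fold=2):
    """Nominal fraction of each clone: top_fraction, top_fraction/fold, ..."""
    return [top_fraction / fold**k for k in range(n_clones)]


fractions = dilution_series()
for rank, fraction in enumerate(fractions, start=1):
    print(f"clone {rank:2d}: {fraction:.3%}")

# Hypothetical sequencing depth, for illustration only.
total_reads = 30_000
expected_reads_rarest = total_reads * fractions[-1]
print(f"expected reads from the rarest clone: {expected_reads_rarest:.1f}")
```

Under this scheme the ninth clone sits at 0.5 / 256, about 0.2%, and the tenth at 0.5 / 512, about 0.1%, which matches the frequencies quoted above. The nominal fractions sum to slightly less than 100%.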

For more information, see our book chapter, which is available for download through Springer Publications: http://link.springer.com/chapter/10.1007%2F978-3-319-31957-5_12.

In addition, the open source implementation of 2SNV, which was developed by Alexander Artyomenko, is freely available for download at http://alan.cs.gsu.edu/NGS/?q=content/2snv.

The full citation to our paper is: 

Artyomenko, Alexander; Wu, Nicholas C; Mangul, Serghei; Eskin, Eleazar; Sun, Ren; Zelikovsky, Alex

Long Single-Molecule Reads Can Resolve the Complexity of the Influenza Virus Composed of Rare, Closely Related Mutant Variants Book Chapter

In: Research in Computational Molecular Biology, pp. 164-175, Springer International Publishing, 2016.


Overview of results using the 2SNV method. (a) 2SNV (orange) outperforms existing haplotype reconstruction tools (blue) in viral variant reconstruction. For the PacBio reads from 10 IAV clones, (b) shows the pairwise edit distance between clones as a heat map and (c) shows the occurrence frequency of each clone type.

Total RNA Sequencing reveals microbial communities in human blood and disease specific effects

Together with researchers from UC San Francisco, UC Davis, Oregon State University, and the Netherlands, our group recently published a paper on bioRxiv that presents a new method capable of identifying different microbes present in human blood. Our paper is featured in the Stanford University digest of microbiome papers, as well as an international science and technology news source. Here, we demonstrate the potential use of total RNA to study the relationship of specific diseases with microbes that inhabit the human body.

A growing body of evidence suggests that the human microbiome plays an important role in health and disease. To investigate the specific ways that microbes may influence disease development, we developed a novel ‘lost and found’ pipeline that uses whole blood RNA sequencing (RNA-Seq) reads to detect a variety of microbial organisms. The pipeline treats high-quality reads that fail to map to the human genome as candidate microbial reads. Since RNA-Seq has become a widely used technology in recent years, with many large datasets available, we believe that our pipeline has great potential for application across tissues and disease types.
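The filtering step described above can be sketched in a few lines. This is an illustrative sketch only, not the pipeline's code: the `Read` record, the function name, and both thresholds are assumptions, and a real workflow would read these fields from alignment files rather than in-memory objects.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Read:
    name: str
    sequence: str
    phred: list            # per-base Phred quality scores
    mapped_to_human: bool  # result of aligning to the human reference


def candidate_microbial_reads(reads, min_mean_phred=30, min_length=50):
    """Keep reads that failed to map to human and pass basic quality checks.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    return [
        r
        for r in reads
        if not r.mapped_to_human
        and len(r.sequence) >= min_length
        and mean(r.phred) >= min_mean_phred
    ]


# Toy example with short reads, so the length cutoff is lowered.
toy_reads = [
    Read("r1", "ACGTACGT", [35] * 8, mapped_to_human=True),   # human, dropped
    Read("r2", "GGCATTCA", [36] * 8, mapped_to_human=False),  # kept
    Read("r3", "TTGACCGA", [12] * 8, mapped_to_human=False),  # low quality
    Read("r4", "ACG", [38] * 3, mapped_to_human=False),       # too short
]
kept = candidate_microbial_reads(toy_reads, min_length=8)
print([r.name for r in kept])
```

Only the candidates that survive this filter would then be passed on to taxonomic classification.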

We applied our ‘lost and found’ pipeline to study the composition of blood microbiome in almost two hundred individuals, including healthy control individuals and patients with schizophrenia, bipolar disorder, and amyotrophic lateral sclerosis (also known as ALS or Lou Gehrig’s disease). Using this pipeline, we detected bacterial and archaeal phyla in blood using RNA sequencing (RNA-Seq) data. Our analyses of these data, including examination of positive and negative control datasets, suggest that detected phyla are in fact representative of the actual microbial communities in the individuals’ blood.

In total, we observed 23 distinct microbial phyla with on average 4.1 ± 2.0 phyla per individual. Phylogenetic classification is performed using Phylosift, which assigns the filtered candidate microbial reads to the microbial genes from 23 distinct taxa on the phylum level. The large majority of taxa that were observed in our sample are not universally present in all individuals, except for Proteobacteria, which dominate all samples with 73.4% ± 18.3% relative abundance (shown in dark green). The figure shows the genomic abundances of microbial taxa at the phylum level of classification for each of the four groups.

Further, in comparison to individuals in the other three groups, we observed a significantly increased microbial diversity in blood samples from individuals with schizophrenia. We replicated this finding with an independent schizophrenia case-control study. The increased microbial diversity observed in schizophrenia could be part of the disease etiology (i.e., causing schizophrenia) or may be a secondary effect of disease status. In the absence of a direct link with genetic susceptibility and the reported correlation with the immune system, we hypothesize that the observed effect in schizophrenia is secondary to disease. This phenomenon may be a consequence of lifestyle differences of schizophrenia patients, including cigarette smoking, drug use, or other environmental exposures. Future targeted and/or longitudinal studies with larger sample sizes, detailed clinical phenotypes, and more in-depth sequencing are needed to corroborate this hypothesis.
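Per-sample summaries such as relative abundance, number of phyla, and diversity can be computed from phylum-level read counts. The sketch below is illustrative: the counts are made up, and the Shannon index is used only as a common example of a diversity measure, since this post does not state which measure was used in the study.

```python
from math import log


def relative_abundance(counts):
    """Fraction of classified reads assigned to each observed phylum."""
    total = sum(counts.values())
    return {phylum: c / total for phylum, c in counts.items() if c > 0}


def shannon_diversity(counts):
    """Shannon index H = -sum(p * ln p) over observed phyla.

    Assumption: Shannon is a stand-in; the study's measure may differ.
    """
    return -sum(p * log(p) for p in relative_abundance(counts).values())


# Made-up read counts for one individual, for illustration only.
sample = {"Proteobacteria": 730, "Firmicutes": 150, "Actinobacteria": 120, "Chlamydiae": 0}
abundance = relative_abundance(sample)
n_phyla = len(abundance)
diversity = shannon_diversity(sample)
print(abundance["Proteobacteria"], n_phyla, round(diversity, 3))
```

A sample dominated by a single phylum yields a low index, while a sample with many evenly represented phyla yields a high one, so group comparisons can be made on one number per individual.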

We hope that our finding of increased diversity in schizophrenia will ultimately lead to a better understanding of the functional mechanisms underlying the connection between immune system, blood microbiome, and disease etiology. With the increasing availability of large scale RNA-Seq datasets collected from different phenotypes and tissue types, we anticipate that the application of our ‘lost and found’ pipeline will lead to the generation of a range of novel hypotheses, ultimately aiding our understanding of the role of the microbiome in health and disease.

This project was led by Serghei Mangul and Loes Olde Loohuis (Roel Ophoff group). This was a joint project with the Roel Ophoff group at the Center for Neurobehavioral Genetics at the Semel Institute for Neuroscience and Human Behavior, University of California, Los Angeles, CA, USA.

The article is available at: http://biorxiv.org/content/early/2016/06/07/057570.

The full citation to our paper is:

Mangul, Serghei; Loohuis, Loes Olde M; Ori, Anil; Jospin, Guillaume; Koslicki, David; Yang, Harry Taegyun; Wu, Timothy; Boks, Marco P; Lomen-Hoerth, Catherine; Wiedau-Pazos, Martina; Cantor, Rita; de Vos, Willem M; Kahn, Rene S; Eskin, Eleazar; Ophoff, Roel A

Total RNA Sequencing reveals microbial communities in human blood and disease specific effects. Journal Article

In: BioRxiv, (057570), 2016.


Dumpster diving in RNA-seq to find the source of every last read

Our group recently developed the Read Origin Protocol (ROP) method to discover the source of all reads in an RNA-seq experiment. Reads originate from complex RNA molecules, recombinant antibodies and microbial communities. ROP accounts for 98.8% of all reads across poly(A) and ribo-depletion protocols, compared to 83.8% by conventional reference-based protocols. We find that the vast majority of unmapped reads are human in origin and originate from diverse sources, including repetitive elements, non-co-linear elements or recombined B and T cell receptors (BCR/TCR). In addition to human RNA, a large number of reads were microbial in origin, often occurring in sufficient numbers to study the taxonomic composition of microbial communities.
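The notion of "accounting for" reads amounts to assigning each read an origin and reporting the fraction with a known source. The following is a minimal sketch under stated assumptions: the category labels and the toy input are hypothetical, and ROP itself assigns origins through a series of alignment and profiling steps that are not reproduced here.

```python
from collections import Counter


def accounted_fraction(labels, unknown="unaccounted"):
    """Return (fraction of reads with a known origin, per-category tally)."""
    tally = Counter(labels)
    total = sum(tally.values())
    return (total - tally[unknown]) / total, tally


# Hypothetical origin labels for ten reads, for illustration only.
labels = (
    ["reference"] * 6
    + ["repeat", "bcr_tcr", "microbial"]
    + ["unaccounted"]
)
fraction, tally = accounted_fraction(labels)
print(f"{fraction:.1%} of reads accounted for")
print(tally.most_common())
```

Reporting the tally alongside the overall fraction makes it clear which sources contribute the reads that a reference-only analysis would discard.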




The majority of RNA-Seq analyses begin by mapping each experimentally produced sequence (i.e., read) to a set of annotated reference sequences for the organism of interest. For both biological and technical reasons, a significant fraction of reads remains unmapped. Our study is the first that systematically accounts for almost all reads in RNA-seq studies. We demonstrate the value of analyzing unmapped reads present in the RNA-seq data to better understand the functional mechanisms underlying the connection between immune system, microbiome, human gene expression, and disease etiology.

We applied our method to RNA-seq data from 53 asthmatic cases and 33 controls collected from three tissues, using both poly(A) selection and ribo-depletion libraries. Using the ROP pipeline, we show that the immune profiles of asthmatic individuals are significantly different from those of the controls, with decreased T-cell/B-cell receptor diversity, and that immune diversity is inversely correlated with microbial load. This case study highlights the potential for novel discoveries without additional TCR/BCR or microbiome sequencing when the information in RNA-seq data is fully leveraged by incorporating the analysis of unmapped reads.

ROP can not only help researchers make the best use of sequencing data, but will also enable additional scientific questions to be answered at no additional cost. For example, one can now interrogate additional features of the immune system without expensive additional TCR/BCR sequencing. The ‘dumpster diving’ profile of unmapped reads output by our method is not limited to RNA-Seq technology and may be applied to whole-exome and whole-genome sequencing. We anticipate that ‘dumpster diving’ profiling will find broad future applications in studies involving different tissue and disease types.

This project was led by Serghei Mangul and involved Harry (Taegyun) Yang, both of whom developed the protocol as open source software. This was a joint project with the Noah Zaitlen group (http://zaitlenlab.ucsf.edu/) at the University of California, San Francisco.

ROP is available at https://sergheimangul.wordpress.com/rop/.

The article is available at: http://biorxiv.org/content/early/2016/05/13/053041.

The full citation to our paper is:

Mangul, Serghei; Yang, Harry Taegyun; Strauli, Nicolas; Gruhl, Franziska; Daley, Timothy; Christenson, Stephanie; Andersen, Agata Wesolowska; Spreafico, Roberto; Rios, Cydney; Eng, Celeste; Smith, Andrew D; Hernandez, Ryan D; Ophoff, Roel A; Santana, Jose Rodriguez; Woodruff, Prescott G; Burchard, Esteban; Seibold, Max A; Shifman, Sagiv; Eskin, Eleazar; Zaitlen, Noah

Dumpster diving in RNA-sequencing to find the source of every last read. Journal Article

In: BioRxiv, 2016.
