Long Single-Molecule Reads Can Resolve the Complexity of the Influenza Virus Composed of Rare, Closely Related Mutant Variants

RNA viruses, including influenza, hantaviruses, Ebola virus, and Nipah virus, cause the majority of emerging and re-emerging diseases that pose a significant risk to global health. When compared to DNA viruses, RNA viruses are especially adaptable and evolvable due to their high mutation rates and rapid replication cycles. Developing novel medications for the prevention and treatment of these diseases requires an understanding of the mutant variants that drive an RNA virus's resistance mechanisms. The long read length offered by single-molecule sequencing technologies allows each mutant variant to be sequenced in a single pass. However, complete profiling of all viral genomes within a mutant spectrum is not yet possible due to the high error rate embedded in analytical protocols.

In collaboration with Alexander Artyomenko (Georgia State University), Alex Zelikovsky (Georgia State University), Nicholas Wu (The Scripps Research Institute), and Ren Sun (UCLA), Serghei Mangul and Eleazar Eskin developed a novel method for accurately reconstructing viral variants from single-molecule reads. This approach, two Single Nucleotide Variants (2SNV), tolerates the high error rate of the single-molecule protocol and uses linkage between single nucleotide variations to efficiently distinguish true mutant variants from read errors.

Overview of the 2SNV method. For more information, see our book chapter.

Any method for reconstructing viral variants from single-molecule reads must overcome the low volume and high error rate of the sequencing data, combined with the very high similarity and very low frequency of viral variants. This challenge is similar to extracting an extremely weak signal from a very noisy background, with a signal-to-noise ratio approaching zero. However impossible this task may seem, a satisfactory solution can be based on distinguishing the randomness of the noise from systematic repetition of the signal. With its high sensitivity and accuracy, 2SNV is anticipated to facilitate not only viral quasispecies reconstruction, but also other biological applications that require detection of rare haplotypes, such as characterizing genetic diversity in cancer cell populations and monitoring B-cell and T-cell receptor repertoires.
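The idea of separating random noise from a repeated signal can be illustrated with a toy example. The sketch below is not the 2SNV algorithm; it only assumes that sequencing errors at two positions occur independently, so a pair of mismatches that co-occurs on the same reads far more often than chance predicts suggests a real linked variant. The function name `linked_pairs`, the thresholds `min_ratio` and `min_count`, and the example reads are all hypothetical.

```python
from itertools import combinations


def linked_pairs(reads, consensus, min_ratio=5.0, min_count=3):
    """Return (pos_a, pos_b, observed, expected) for position pairs whose
    mismatches co-occur more often than independent errors would predict.

    Toy illustration only: thresholds are arbitrary assumptions.
    """
    n = len(reads)
    # For each position, the set of read indices that differ from the consensus.
    mismatch = [
        {i for i, read in enumerate(reads) if read[pos] != consensus[pos]}
        for pos in range(len(consensus))
    ]
    hits = []
    for a, b in combinations(range(len(consensus)), 2):
        observed = len(mismatch[a] & mismatch[b])
        # Expected co-occurrence if mismatches at a and b were independent.
        expected = len(mismatch[a]) * len(mismatch[b]) / n
        if observed >= min_count and observed > min_ratio * expected:
            hits.append((a, b, observed, expected))
    return hits


# Hypothetical data: 90 error-free reads, 5 reads from a rare variant that
# differs at positions 1 and 6, and 5 reads with one isolated error each.
consensus = "ACGTACGT"
variant = "ATGTACAT"
noisy = ["AGGTACGT", "ACGAACGT", "ACGTACCT", "ACGTCCGT", "ACGTGCGT"]
reads = [consensus] * 90 + [variant] * 5 + noisy

for a, b, observed, expected in linked_pairs(reads, consensus):
    print(f"positions {a} and {b}: observed {observed}, expected {expected:.2f}")
```

In this example only the pair of positions carried by the rare variant is reported, because the isolated errors never appear together on the same read.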

We present 2SNV in a chapter of conference proceedings from the 2016 RECOMB meeting. To benchmark the sensitivity of 2SNV, we performed a single-molecule sequencing experiment on a sample containing a titrated level of known viral mutant variants. We tested 2SNV on a dataset composed of PacBio reads from 10 independent clones, ranging from 1 to 13 mutations. These 10 clones were mixed at a geometric ratio with a two-fold difference in occurrence frequency between consecutive clones, starting with a maximum frequency of 50% and ending with a minimum frequency of 0.1%. Our method is able to accurately reconstruct clones with a frequency as low as 0.2% and to distinguish clones that differ in only two nucleotides located far apart on the genome. 2SNV outperforms existing methods for full-length viral mutant reconstruction.
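The titration design described above can be checked with a few lines of arithmetic. This is a minimal sketch that takes only the three figures stated in the text (10 clones, a top frequency of 50%, and a two-fold step); the function name `titration_frequencies` is hypothetical.

```python
def titration_frequencies(n_clones=10, top=0.5, fold=2.0):
    """Nominal frequency of each clone in a geometric (fold-wise) dilution."""
    return [top / fold**i for i in range(n_clones)]


freqs = titration_frequencies()
for rank, freq in enumerate(freqs, start=1):
    print(f"clone {rank:2d}: {freq:.3%}")

# The nominal frequencies do not sum to exactly 100%.
print(f"total: {sum(freqs):.3%}")
```

Under these assumptions the ninth clone sits at roughly 0.2% and the tenth at roughly 0.1%, which matches the rounded figures quoted in the text.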

For more information, see our book chapter, which is available for download through Springer Publications: http://link.springer.com/chapter/10.1007%2F978-3-319-31957-5_12.

In addition, the open source implementation of 2SNV, which was developed by Alexander Artyomenko, is freely available for download at http://alan.cs.gsu.edu/NGS/?q=content/2snv.

The full citation to our paper is: 

Artyomenko, Alexander; Wu, Nicholas C; Mangul, Serghei; Eskin, Eleazar; Sun, Ren; Zelikovsky, Alex

Long Single-Molecule Reads Can Resolve the Complexity of the Influenza Virus Composed of Rare, Closely Related Mutant Variants Book Chapter

In: Research in Computational Molecular Biology, pp. 164-175, Springer International Publishing, 2016.


Overview of results using the 2SNV method. (a) 2SNV (orange) outperforms existing haplotype reconstruction tools (blue) in viral variant reconstruction. For the PacBio reads from 10 IAV clones: (b) the pairwise edit distance between clones, shown as a heat map, and (c) the occurrence frequency of each clone type.

Profiling adaptive immune repertoires across multiple human tissues by RNA Sequencing

In a project led by Serghei Mangul, members of our lab recently developed and tested a novel computational method that uses regular RNA-Seq data to rapidly and accurately profile the human immune system. Mangul and his collaborators, including UCLA graduate student Harry (Taegyun) Yang and 2016 B. I. G. Summer undergraduate participants Jeremy Rotman, Benjamin Statz, and Will Van Der Wey, recently published their results in a paper on bioRxiv.

Discoveries in human immunology and advances in the development of treatments for many common human diseases depend on detailed reconstructions of the adaptive immune repertoire. The “adaptive” immune repertoire recognizes pathogens and toxins that the “innate” defense system misses. Assay-based genetic studies provide a detailed view of these adaptive systems by profiling the gene expression and repertoires of B and T cell receptors. Assay-based approaches have accurately characterized the immune repertoire of peripheral blood.

However, these methods are expensive and smaller in scale when compared to standard RNA sequencing (RNA-Seq). Characterizing the immunological repertoires of other tissues, including barrier tissues like skin and mucosae, requires large-scale study. RNA-Seq can capture the entire cellular population of a sample, including B and T cells and their receptors.

ImReP is the first method to efficiently extract B and T cell receptor derived reads from RNA-Seq data, accurately assemble CDR3 sequences, the most variable regions of these receptors, and determine their antigen specificity. Mangul and his team used simulated data to test the feasibility of using RNA-Seq to study the adaptive immune repertoire. ImReP is able to identify 99% of CDR3-derived reads from the RNA-Seq mixture, suggesting it is a powerful tool for profiling RNA-Seq samples of immune-related tissues.

They also compared methods and investigated the sequencing depth and read length required to reliably assemble B and T cell receptor sequences from RNA-Seq data. ImReP consistently outperformed existing methods in both recall and precision rates for the majority of simulated parameters. Notably, ImReP was the only method with acceptable performance at 50bp read length, reconstructing significantly more CDR3 clonotypes at a higher precision rate.
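For readers unfamiliar with the two metrics, the sketch below shows one common way to score assembled CDR3 clonotypes against a simulated ground truth. It assumes that a clonotype counts as correct only when its amino acid sequence matches a true one exactly; the benchmark in the paper may define a match differently. The function name `precision_recall` and the example sequences are hypothetical.

```python
def precision_recall(assembled, truth):
    """Precision and recall of assembled clonotypes, using exact matching."""
    assembled, truth = set(assembled), set(truth)
    true_positives = len(assembled & truth)
    # Precision: fraction of assembled clonotypes that are real.
    precision = true_positives / len(assembled) if assembled else 0.0
    # Recall: fraction of real clonotypes that were assembled.
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical simulated repertoire and assembler output.
truth = {"CASSLGQGYEQYF", "CARDYYGSSYFDYW", "CQQYNSYPLTF", "CASSPTGGNTEAFF"}
assembled = {"CASSLGQGYEQYF", "CQQYNSYPLTF", "CARGGYFDYW"}

precision, recall = precision_recall(assembled, truth)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Here two of the three assembled clonotypes are real, and two of the four true clonotypes were recovered.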

Mangul and his team applied ImReP to 8,555 samples across 544 individuals from 53 tissues obtained from the Genotype-Tissue Expression study (GTEx v6). The data were derived from 38 solid organ tissues, 11 brain subregions, whole blood, and three cell lines. ImReP identified over 26 million reads overlapping 3.8 million distinct CDR3 sequences that originate from diverse human tissues.

Using ImReP, they created a systematic atlas of immunological sequences for B and T cell repertoires across a broad range of tissue types, most of which had not previously been studied for B and T cell repertoires. They also examined the compositional similarities of clonal populations between tissues to track the flow of B and T cell clonotypes across immune-related tissues, including secondary lymphoid organs and organs encompassing mucosal, exocrine, and endocrine sites.
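One simple way to quantify how much two tissues share is the Jaccard index of their clonotype sets. The text does not say which similarity measure the study used, so the sketch below is only a stand-in; the function names and the example repertoires are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared clonotypes divided by all clonotypes seen in either tissue."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_sharing(repertoires):
    """Jaccard index for every pair of tissues, keyed by sorted tissue names."""
    return {
        (t1, t2): jaccard(repertoires[t1], repertoires[t2])
        for t1, t2 in combinations(sorted(repertoires), 2)
    }


# Hypothetical clonotype identifiers per tissue.
repertoires = {
    "spleen": {"c1", "c2", "c3"},
    "blood": {"c2", "c3", "c4"},
    "skin": {"c5"},
}

for (t1, t2), score in pairwise_sharing(repertoires).items():
    print(f"{t1} vs {t2}: {score:.2f}")
```

A set-based index ignores clonotype abundance; a weighted measure would be needed to account for how many reads support each clonotype.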

Advantages of using RNA-Seq to study immune repertoires include the ability to capture both B and T cell clonotype populations in a single run, to simultaneously detect overall transcriptional responses of the adaptive immune system, and to scale up the atlas of B and T cell receptors, which will provide valuable insights into immune responses across various autoimmune diseases, allergies, and cancers.

Read more about ImReP in the full article, which is available for download on bioRxiv: http://biorxiv.org/content/early/2016/11/22/089235.article-metrics

ImReP was created by Igor Mandric and Serghei Mangul. ImReP is freely available at: https://sergheimangul.wordpress.com/imrep/

The atlas of T and B cell receptors, the largest collection of CDR3 sequences and tissue types, is freely available at https://sergheimangul.wordpress.com/atlas-immune-repertoires/. This resource has potential to enhance future studies in areas such as immunology and advance development of therapies for human diseases.

The full citation to our paper is:

Mangul, S., Mandric, I., Yang, H.T., Strauli, N., Montoya, D., Rotman, J., Van Der Wey, W., Ronas, J.R., Statz, B., Zelikovsky, A. and Spreafico, R., 2016. Profiling adaptive immune repertoires across multiple human tissues by RNA Sequencing. bioRxiv, p.089235.


Figure 1. Overview of ImReP. (See full paper for details.)


Figure 6. Flow of T and B cell clonotypes across diverse human tissues. (See full paper for details.)


Discovering SNPs Regulating Human Gene Expression Using Allele Specific Expression from RNA-Seq data

Analyses of expression quantitative trait loci (eQTL), genomic loci that contribute to variation in gene expression levels, are essential to understanding the mechanisms of human disease. These studies identify regulators of gene expression as either cis-acting factors that regulate nearby genes, or trans-acting factors that affect unlinked genes through various functions. Traditional eQTL studies treat expression as a quantitative trait and associate it with genetic variation. This approach has identified many loci involved in the genetic regulation of common, complex diseases.

Standard eQTL methods are limited in power and accuracy by several phenomena common to genomic datasets. First, the correlation structure of genetic variation in the genome, known as linkage disequilibrium (LD), limits the ability of these methods to differentiate between the regulatory variant and neighboring variants that are in LD. Second, like other quantitative traits, the total expression of a gene is influenced by multiple genetic and environmental factors. The effect size for any given variant is therefore small, and standard methods require a large sample size to identify the effect.


ASE example and corresponding mathematical representation of three individuals (1, 2, 3). We assume that the third SNP is the causal SNP site affecting the differential gene expression level (Allele A/ Allele T).

Our forthcoming paper in Genetics presents a new method that improves the accuracy and computational power of eQTL mapping by incorporating allele specific expression (ASE) analysis. Our novel method uses genome sequencing, alongside measurements of ASE from RNA-Seq data, to identify cis-acting regulatory variants.

In standard eQTL studies, the analysis of ASE is influenced by LD structure and the amount of allelic heterogeneity present in the genome. Individual effects appear weak since the effect of a variant is modest when compared to the variance of total expression. In our approach, the genotypes of each individual with ASE provide information useful for determining which variants are causal for the observed ASE. Our approach leverages the relationship between LD and variant identification to map the variants affecting expression. Thus, analysis of ASE is advantageous over analysis of total expression levels, the standard approach to eQTL mapping.

We demonstrate the utility of our method by analyzing RNA-Seq data from 77 unrelated northern and western European individuals (CEU). To map each gene, we simultaneously compare ASE measurements across a set of sequenced individuals. We then identify genetic variants that are in proximity to those genes and capable of explaining observed patterns of ASE. Here, we characterize the efficacy of this method by its “reduction rate”, which is based on the ratio between the number of candidate regulatory SNPs and the total number of SNPs in the proximal region of the gene: the fraction of proximal SNPs that the method rules out as candidates.
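The step of finding variants "capable of explaining observed patterns of ASE" can be sketched as a consistency filter. The sketch below is not the published implementation. It assumes the simplest rule: a cis-acting causal SNP should be heterozygous in every individual who shows ASE for the gene and homozygous in every individual who does not, with an optional tolerance for mismatching individuals. The function name `candidate_snps`, the parameter `max_errors`, and the example data are hypothetical.

```python
def candidate_snps(het, has_ase, max_errors=0):
    """Indices of SNPs whose heterozygosity pattern matches the ASE pattern.

    het[i][j] is True if individual i is heterozygous at SNP j.
    has_ase[i] is True if individual i shows ASE for the gene.
    max_errors is the number of mismatching individuals tolerated per SNP.
    """
    n_snps = len(het[0])
    candidates = []
    for j in range(n_snps):
        # Count individuals whose genotype at SNP j contradicts their ASE status.
        mismatches = sum(het[i][j] != has_ase[i] for i in range(len(has_ase)))
        if mismatches <= max_errors:
            candidates.append(j)
    return candidates


# Hypothetical example: three individuals, four SNPs, third SNP (index 2) causal.
has_ase = [True, False, True]
het = [
    [True, False, True, True],     # individual 1, shows ASE
    [True, False, False, False],   # individual 2, no ASE
    [False, False, True, False],   # individual 3, shows ASE
]

print(candidate_snps(het, has_ase))                # strict matching
print(candidate_snps(het, has_ase, max_errors=1))  # tolerate one mismatch
```

With strict matching only the third SNP survives; tolerating one mismatch also admits the fourth SNP, which illustrates why allowing errors enlarges the candidate set.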

When applied to the CEU dataset, our method reduced the set of candidate SNPs from ten to two (a reduction rate of 80%). Allowing for one error increases the number of candidate SNPs to five and decreases the reduction rate to 50%. We also observe that the relationship between LD and variant identification has a different quality in ASE mapping when compared to eQTL studies, and produces different types of information useful to eQTL mapping studies.
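The figures quoted above (ten SNPs reduced to two is 80%, ten reduced to five is 50%) are consistent with reading the reduction rate as the fraction of proximal SNPs eliminated from consideration. The sketch below uses that reading as an assumption; the function name `reduction_rate` is hypothetical.

```python
def reduction_rate(n_candidates, n_total):
    """Fraction of proximal SNPs ruled out as candidate regulatory variants."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return (n_total - n_candidates) / n_total


print(f"{reduction_rate(2, 10):.0%}")  # no errors allowed
print(f"{reduction_rate(5, 10):.0%}")  # one error allowed
```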

ASE studies are a powerful approach to identifying associations between genetic variation and gene expression. Accurate measurement of ASE can identify cis-acting regulatory variants associated with common diseases. Our novel method for ASE mapping is based on a robust and computationally efficient non-parametric approach, and we hope it advances our understanding of functional risk alleles and facilitates development of new hypotheses for the causes and treatment of common diseases.

This project used software developed by Jennifer Zou, which is available for download at: http://genetics.cs.ucla.edu/ase/

This project was led by Eun Yong Kang and involved Serghei Mangul, Buhm Han, and Sagiv Shifman. The article is available at: http://www.genetics.org/content/204/3/1057

The full citation to our paper is:

Kang, Eun Yong; Martin, Lisa; Mangul, Serghei; Isvilanonda, Warin; Zou, Jennifer; Ben-David, Eyal; Han, Buhm; Lusis, Aldons J; Shifman, Sagiv; Eskin, Eleazar

Discovering SNPs Regulating Human Gene Expression Using Allele Specific Expression from RNA-Seq Data. Journal Article

In: Genetics, 2016, ISSN: 1943-2631.
