Reference-free comparison of microbial communities via de Bruijn graphs

Microbial communities inhabiting the human body exhibit significant variability across individuals and tissues and are likely to play an important role in health and disease. Serghei Mangul and David Koslicki (Oregon State University) recently published a paper presenting a novel approach for characterizing microbial communities in metatranscriptomics studies. Koslicki developed this tool, which may help scientists explore the role microbiota play in disease development, especially when comparing the microbiomes of healthy and diseased subjects.

Identifying and characterizing the relative abundance of microbiota in different tissues is essential to better understanding the role of microbial communities in human health. Current approaches use reference databases to identify, classify, and compare the microbial communities present in an individual host. However, existing databases are incomplete and rely on a limited compendium of reference genomes. As a result, reference-based approaches cannot resolve microbial composition as finely as the high-resolution data produced by today’s high-throughput sequencing technology would allow.

Framework of the study. For more information, download our paper.

Ideally, comparison of microbial communities across samples could circumvent this limiting classification step. Mangul and Koslicki recently developed EMDeBruijn, a reference-free approach that uses all available non-host microbial reads, not just those classified in reference databases, to compare microbial communities.

First, EMDeBruijn translates sequencing data into a de Bruijn graph, which represents overlaps between symbols in sequences. De Bruijn graphs are commonly used in de novo assembly of short reads into a genome, but have not yet been applied in a reference-free approach. EMDeBruijn then uses properties of the de Bruijn graph to compare microbiome composition across individuals. The comparison is reduced to a single number using the Earth Mover’s Distance (EMD), a statistic that measures the distance between two probability distributions over a region.
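The pipeline described above can be illustrated with a small Python sketch. It is not the EMDeBruijn implementation: the function names are hypothetical, the graph edges are treated as undirected (an assumption made so that the ground distance is symmetric), reads are assumed to contain only A, C, G, and T, and the EMD is solved with a simple successive-shortest-path routine that is only practical for very small k.

```python
import math
from collections import Counter, deque
from fractions import Fraction
from itertools import product


def kmer_profile(reads, k):
    """Relative k-mer frequencies of a sample, as exact fractions."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    total = sum(counts.values())
    return {kmer: Fraction(c, total) for kmer, c in counts.items()}


def debruijn_distances(k, alphabet="ACGT"):
    """All-pairs shortest-path lengths on the de Bruijn graph of k-mers.

    Assumption: edges are treated as undirected so the distance is symmetric.
    """
    nodes = ["".join(p) for p in product(alphabet, repeat=k)]
    adj = {u: set() for u in nodes}
    for u in nodes:
        for a in alphabet:
            v = u[1:] + a  # v overlaps u in k-1 symbols
            if v != u:
                adj[u].add(v)
                adj[v].add(u)
    dist = {}
    for s in nodes:  # breadth-first search from every node
        d = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist[s] = d
    return dist


def emd(p, q, dist):
    """Exact Earth Mover's Distance between two profiles that each sum to 1."""
    diff = {u: p.get(u, 0) - q.get(u, 0) for u in set(p) | set(q)}
    supply = {u: d for u, d in diff.items() if d > 0}   # surplus mass
    demand = {u: -d for u, d in diff.items() if d < 0}  # missing mass
    flow = {}  # (supply node, demand node) -> mass moved
    while any(supply.values()):
        # Bellman-Ford on the residual graph, starting from nodes with surplus.
        best = {u: 0 for u, s in supply.items() if s > 0}
        prev = {}
        for _ in range(len(supply) + len(demand)):
            for u in supply:
                for v in demand:
                    c = dist[u][v]
                    if u in best and best[u] + c < best.get(v, math.inf):
                        best[v], prev[v] = best[u] + c, u  # forward edge
                    if (flow.get((u, v), 0) > 0 and v in best
                            and best[v] - c < best.get(u, math.inf)):
                        best[u], prev[u] = best[v] - c, v  # backward edge
        end = min((v for v in demand if demand[v] > 0), key=lambda v: best[v])
        path = [end]
        while path[-1] in prev:
            path.append(prev[path[-1]])
        path.reverse()
        steps = list(zip(path, path[1:]))
        undo = [flow[(b, a)] for a, b in steps if a in demand]
        amount = min([supply[path[0]], demand[end]] + undo)
        for a, b in steps:
            if a in supply:
                flow[(a, b)] = flow.get((a, b), 0) + amount
            else:
                flow[(b, a)] -= amount
        supply[path[0]] -= amount
        demand[end] -= amount
    return sum(m * dist[u][v] for (u, v), m in flow.items())


dist = debruijn_distances(2)
s1 = kmer_profile(["AAAA"], 2)  # all mass on AA
s2 = kmer_profile(["AACC"], 2)  # mass split over AA, AC, CC
print(emd(s1, s2, dist))  # 1/3 moves one edge (AA to AC), 1/3 moves two (AA to CC)
```

In this toy case, two thirds of the mass leaves AA: one third travels one edge and one third travels two, for a total cost of 1. For realistic k the graph has 4^k nodes, so a dedicated flow or linear-programming solver would be needed.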

In their recent paper, Mangul and Koslicki applied EMDeBruijn to study the composition and abundance levels of the microbial communities present in blood samples from coronary artery calcification (CAC) patients and controls. EMDeBruijn uses candidate microbial reads to differentiate between case (CAC-affected) and control (healthy) samples, and a filtered set of non-host reads is used to determine the composition of the blood microbiome. Hierarchical clustering using the EMDeBruijn metric successfully identifies several large clusters unique to samples from either the case or the control group.
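To make the clustering step concrete, here is a minimal sketch of agglomerative clustering on a precomputed distance matrix. The post does not say which linkage was used, so average linkage is an assumption, and the function name and the toy distances are hypothetical.

```python
def average_linkage(dist, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    dist is a symmetric matrix (list of lists) of pairwise sample distances.
    """
    clusters = [[i] for i in range(len(dist))]

    def gap(a, b):
        # Average pairwise distance between members of two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = [(x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters))]
        x, y = min(pairs, key=lambda xy: gap(clusters[xy[0]], clusters[xy[1]]))
        merged = clusters.pop(y)  # y > x, so index x stays valid
        clusters[x].extend(merged)
    return clusters


# Toy distances: samples 0 and 1 are close, as are samples 2 and 3.
toy = [
    [0.0, 0.1, 1.0, 1.0],
    [0.1, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.2],
    [1.0, 1.0, 0.2, 0.0],
]
groups = average_linkage(toy, 2)
```

With these toy values the samples separate into the groups {0, 1} and {2, 3}. In practice the matrix would hold the pairwise EMDeBruijn distances between samples.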

This study indicates the presence of a disease-specific microbial community structure in CAC patients and points to the need for additional investigation of potentially causal relationships between the microbiome and CAC.

Using the same data set, Mangul and Koslicki compared the results of EMDeBruijn with those of current approaches. Existing computational methods, including MetaPhlAn and RDP’s NBC, discovered various microbial communities across the case and control samples. However, neither of these methods was able to identify any disease-specific patterns in the microbiome or to separate the samples into disease and healthy groups.

EMDeBruijn provides a powerful, species-independent way to assess microbial diversity across individuals and subjects. For more information, see our paper, which was published in the Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics: http://dl.acm.org/citation.cfm?id=2975174.

Code implementing this method is available at: https://github.com/dkoslicki/EMDeBruijn.

Visualization of the EMDeBruijn Distance. a) Pictorial representation of 2-mer frequencies for two hypothetical samples, S1 and S2. b) The 2-mer frequencies overlaid on the de Bruijn graph B2(A). c) Representation of the flow used to compute EMD2(S1; S2); dark arrows denote mass moved from the initial node to the terminal node. d) Result of applying the flow to the 2-mer frequencies of S1.

This project was a collaboration that started at the Mathematical and Computational Approaches in High-Throughput Genomics program held in Fall 2011 at the Institute for Pure and Applied Mathematics (IPAM). Our ongoing Computational Genomics Summer Institute (CGSI; also co-organized by IPAM) was inspired by the 2011 program. Check out the 2017 CGSI website for a preview of this summer’s programs. The deadline for applications is February 1, 2017!

The full citation to our paper is:

Mangul S, Koslicki D. Reference-free comparison of microbial communities via de Bruijn graphs. In Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. 2016 Oct 2 (pp. 68-77). Association for Computing Machinery, New York.

Long Single-Molecule Reads Can Resolve the Complexity of the Influenza Virus Composed of Rare, Closely Related Mutant Variants

RNA viruses represent the majority of emerging and re-emerging diseases that pose a significant risk to global health, including influenza, hantaviruses, Ebola virus, and Nipah virus. Compared to DNA viruses, RNA viruses have especially robust adaptability and evolvability due to their high mutation rates and rapid replication cycles. Developing novel medications for the prevention and treatment of these diseases requires an understanding of the mutant variants that drive an RNA virus’s resistance mechanisms. The long read length offered by single-molecule sequencing technologies allows each mutant variant to be sequenced in a single pass. However, complete profiling of all viral genomes within a mutant spectrum is not yet possible due to the high error rate embedded in analytical protocols.

In collaboration with Alexander Artyomenko (Georgia State University), Alex Zelikovsky (Georgia State University), Nicholas Wu (The Scripps Research Institute), and Ren Sun (UCLA), Serghei Mangul and Eleazar Eskin developed a novel method for accurately reconstructing viral variants from single-molecule reads. This approach, two Single Nucleotide Variants (2SNV), tolerates the high error rate of the single-molecule protocol and uses linkage between single nucleotide variations to efficiently distinguish true mutant variants from read errors.

Overview of the 2SNV method. For more information, see our book chapter.

Any method for reconstructing viral variants from single-molecule reads must overcome the low volume and high error rate of the sequencing data, combined with the very high similarity and very low frequency of viral variants. This challenge is similar to extracting an extremely weak signal from a very noisy background, with a signal-to-noise ratio approaching zero. However impossible this task may seem, a satisfactory solution can be based on distinguishing the randomness of the noise from systematic repetition of the signal. With its high sensitivity and accuracy, 2SNV is anticipated to facilitate not only viral quasispecies reconstruction but also other applications that require detection of rare haplotypes, such as studying genetic diversity in cancer cell populations and monitoring B-cell and T-cell receptor repertoires.
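The contrast between random noise and systematic repetition can be sketched with a co-occurrence count over pairs of mismatches. This is only an illustration of the linkage idea and not the published 2SNV algorithm: the function name, the thresholds `min_ratio` and `min_count`, and the simple observed-versus-expected comparison are all assumptions, and the reads are assumed to be already aligned to a consensus of the same length.

```python
from collections import Counter
from itertools import combinations


def linked_snv_pairs(reads, consensus, min_ratio=5.0, min_count=3):
    """Return pairs of mismatches that co-occur far more often than
    independent sequencing errors would predict (hypothetical criterion)."""
    n = len(reads)
    single = Counter()  # (position, base) -> number of reads carrying it
    pair = Counter()    # ((pos1, base1), (pos2, base2)) -> co-occurrence count
    for read in reads:
        mismatches = [(i, b) for i, (b, c) in enumerate(zip(read, consensus))
                      if b != c]
        single.update(mismatches)
        pair.update(combinations(mismatches, 2))
    linked = []
    for (a, b), observed in pair.items():
        # Expected co-occurrences if the two mismatches were independent.
        expected = single[a] * single[b] / n
        if observed >= min_count and observed > min_ratio * expected:
            linked.append((a, b))
    return linked


consensus = "AAAAAAAAAA"
reads = (
    ["AAGAAAAGAA"] * 5     # rare variant carrying two linked SNVs
    + ["AAAAAAAAAA"] * 92  # error-free reads of the dominant variant
    + ["AAGAAAAAAA"]       # isolated error at position 2
    + ["AAAAAAATAA"]       # isolated error at position 7
    + ["CAAAAAAAAT"]       # one read with two unrelated errors
)
result = linked_snv_pairs(reads, consensus)
```

In this toy data set only the pair at positions 2 and 7 is reported: it appears in 5 of 100 reads while independence would predict about 0.3, whereas the two errors in the last read co-occur only once.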

We present 2SNV in a chapter of the conference proceedings from the 2016 RECOMB meeting. To benchmark the sensitivity of 2SNV, we performed a single-molecule sequencing experiment on a sample containing a titrated level of known viral mutant variants. We tested 2SNV on a dataset composed of PacBio reads from 10 independent clones, which differed from one another by 1 to 13 mutations. These 10 clones were mixed at a geometric ratio, with a two-fold difference in occurrence frequency between consecutive clones, starting with a maximum frequency of 50% and ending with a minimum frequency of 0.1%. Our method is able to accurately reconstruct a clone with a frequency of 0.2% and to distinguish clones that differ in only two nucleotides located far apart on the genome. 2SNV outperforms existing methods for full-length viral mutant reconstruction.
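The titration series follows directly from the stated two-fold ratio. The short calculation below is only arithmetic on the numbers given above (10 clones, 50% maximum, halving at each step); the function name is hypothetical.

```python
def titration_frequencies(n_clones=10, top=50.0, ratio=2.0):
    """Percent frequency of each clone in a geometric dilution series."""
    return [top / ratio**i for i in range(n_clones)]


freqs = titration_frequencies()
for rank, f in enumerate(freqs, start=1):
    print(f"clone {rank}: {f:.4f}%")
```

Under these assumptions the tenth clone sits at about 0.098% (the 0.1% minimum) and the ninth at about 0.195%, which corresponds to the 0.2% clone mentioned above. The ten frequencies sum to roughly 99.9%.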

For more information, see our book chapter, which is available for download through Springer Publications: http://link.springer.com/chapter/10.1007%2F978-3-319-31957-5_12.

In addition, the open source implementation of 2SNV, which was developed by Alexander Artyomenko, is freely available for download at http://alan.cs.gsu.edu/NGS/?q=content/2snv.

The full citation to our paper is: 

Artyomenko, Alexander; Wu, Nicholas C; Mangul, Serghei; Eskin, Eleazar; Sun, Ren; Zelikovsky, Alex. Long Single-Molecule Reads Can Resolve the Complexity of the Influenza Virus Composed of Rare, Closely Related Mutant Variants. In: Research in Computational Molecular Biology, pp. 164-175, Springer International Publishing, 2016.


Overview of results using the 2SNV method. (a) 2SNV (orange) outperforms existing haplotype reconstruction tools (blue) in viral variant reconstruction. Using PacBio reads from 10 IAV clones, (b) the pairwise edit distance between clones given in a heat-map and (c) occurring frequency of clone types.

Profiling adaptive immune repertoires across multiple human tissues by RNA Sequencing

In a project led by Serghei Mangul, members of our lab recently developed and tested a novel computational method that uses regular RNA-Seq data to rapidly and accurately profile the human immune system. Mangul and his collaborators, including UCLA graduate student Harry (Taegyun) Yang and 2016 B. I. G. Summer undergraduate participants Jeremy Rotman, Benjamin Statz, and Will Van Der Wey, recently published their results in a paper on bioRxiv.

Discoveries in human immunology and advancements in development of treatments for many common human diseases depend on detailed reconstructions of the adaptive immune repertoire. The “adaptive” immune repertoire recognizes pathogens and toxins that the “innate” defense system misses. Assay-based genetic studies provide a detailed view of these adaptive systems by profiling the genetic expression and repertoires of B and T cell receptors. Assay-based approaches have accurately characterized the immune repertoire of peripheral blood.

However, these methods are expensive and smaller in scale than standard RNA sequencing (RNA-Seq). Characterizing the immunological repertoires of other tissues, including barrier tissues like skin and mucosae, requires large-scale study. RNA-Seq can capture the entire cellular population of a sample, including B and T cells and their receptors.

ImReP is the first method to efficiently extract B and T cell receptor-derived reads from RNA-Seq data, accurately assemble CDR3 sequences (the most variable regions of these receptors), and determine their antigen specificity. Mangul and his team used simulated data to test the feasibility of using RNA-Seq to study the adaptive immune repertoire. ImReP is able to identify 99% of CDR3-derived reads from the RNA-Seq mixture, suggesting it is a powerful tool for profiling RNA-Seq samples of immune-related tissues.
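To give a feel for what locating a CDR3 involves, here is a deliberately simplified sketch. It is not ImReP’s procedure, which is described in the paper. It relies on the commonly cited rule of thumb that the CDR3 junction runs from a conserved cysteine to a conserved phenylalanine or tryptophan that is followed by a G-x-G motif; the pattern, the length bounds, the function name, and the example sequence are all assumptions for illustration.

```python
import re

# Conserved C, 5-30 residues, then F or W followed by G-x-G (assumed motif).
CDR3_PATTERN = re.compile(r"C[A-Z]{5,30}?[FW]G[A-Z]G")


def find_cdr3(protein):
    """Return the first junction-like substring (C ... F/W), or None."""
    match = CDR3_PATTERN.search(protein)
    if match is None:
        return None
    return match.group(0)[:-3]  # drop the trailing G-x-G, keep the F/W


# Hypothetical translated read fragment, for illustration only.
fragment = "DSAVYLCASSLGQGNTEAFFGQGTRLTVV"
cdr3 = find_cdr3(fragment)
```

On this made-up fragment the function returns CASSLGQGNTEAFF. A real tool would also need to handle translation frames, sequencing errors, and reads that cover only part of the region.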

They also compared methods and investigated the sequencing depth and read length required to reliably assemble B and T cell receptor sequences from RNA-Seq data. ImReP consistently outperformed existing methods in both recall and precision for the majority of simulated parameters. Notably, ImReP was the only method with acceptable performance at a 50 bp read length, reconstructing significantly more CDR3 clonotypes with a higher precision rate.

Mangul and his team applied ImReP to 8,555 samples across 544 individuals from 53 tissues obtained from the Genotype-Tissue Expression study (GTEx v6). The data were derived from 38 solid organ tissues, 11 brain subregions, whole blood, and three cell lines. ImReP identified over 26 million reads overlapping 3.8 million distinct CDR3 sequences that originate from diverse human tissues.

Using ImReP, they created a systematic atlas of immunological sequences for B and T cell repertoires across a broad range of tissue types, most of which had not previously been studied for B and T cell repertoires. They also examined the compositional similarities of clonal populations between tissues to track the flow of B and T clonotypes across immune-related tissues, including secondary lymphoid organs and organs encompassing mucosal, exocrine, and endocrine sites.
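One simple way to quantify how much two tissues share is the overlap of their clonotype sets. The post does not state which similarity measure was used, so the Jaccard index below is an assumption, and the tissue names and CDR3 strings are made-up placeholders rather than data from the atlas.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared clonotypes divided by all clonotypes seen in either tissue."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tissue_similarity(repertoires):
    """Pairwise Jaccard index for a dict of tissue -> set of CDR3 strings."""
    return {
        (s, t): jaccard(repertoires[s], repertoires[t])
        for s, t in combinations(sorted(repertoires), 2)
    }


# Hypothetical repertoires, for illustration only.
repertoires = {
    "blood": {"CASSLGQGNTEAFF", "CARDYW"},
    "spleen": {"CASSLGQGNTEAFF", "CARDYW", "CASSPGTF"},
    "lung": {"CAVRDF"},
}
similarity = tissue_similarity(repertoires)
```

Here blood and spleen share two of three clonotypes, while lung shares none with either. Tissue pairs with a high index would be candidates for clonotype flow between them.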

Advantages of using RNA-Seq to study immune repertoires include the ability to capture both B and T cell clonotype populations in a single run, to simultaneously detect overall transcriptional responses of the adaptive immune system, and to scale up the atlas of B and T cell receptors, which will provide valuable insights into immune responses across various autoimmune diseases, allergies, and cancers.

Read more about ImReP in the full article, which is available for download on bioRxiv: http://biorxiv.org/content/early/2016/11/22/089235.article-metrics

ImReP was created by Igor Mandric and Serghei Mangul. ImReP is freely available at: https://sergheimangul.wordpress.com/imrep/

The atlas of T and B cell receptors, the largest collection of CDR3 sequences and tissue types, is freely available at https://sergheimangul.wordpress.com/atlas-immune-repertoires/. This resource has potential to enhance future studies in areas such as immunology and advance development of therapies for human diseases.

The full citation to our paper is:

Mangul, S., Mandric, I., Yang, H.T., Strauli, N., Montoya, D., Rotman, J., Van Der Wey, W., Ronas, J.R., Statz, B., Zelikovsky, A. and Spreafico, R., 2016. Profiling adaptive immune repertoires across multiple human tissues by RNA Sequencing. bioRxiv, p.089235.

 

Figure 1. Overview of ImReP. (See full paper for details.)

 

Figure 6. Flow of T and B cell clonotypes across diverse human tissues. (See full paper for details.)