Reference-free comparison of microbial communities via de Bruijn graphs

Microbial communities inhabiting the human body exhibit significant variability across different individuals and tissues and are likely to play an important role in health and disease. Serghei Mangul and David Koslicki (Oregon State University) recently published a paper presenting a novel approach for characterizing microbial communities in metatranscriptomics studies. Koslicki developed this tool, which may help scientists explore the role microbiota play in disease development, especially when comparing microbiomes of healthy and diseased subjects.

Identifying and characterizing the relative abundance of microbiota in different tissues is essential to better understanding the role of microbial communities in human health. Current approaches use reference databases to identify, classify, and compare the microbial communities present in an individual host. However, existing databases are incomplete and rely on a limited compendium of reference genomes. As a result, current reference-based approaches cannot determine microbial compositions as accurately as the high resolution of today’s high-throughput sequencing data would otherwise allow.

Framework of the study. For more information, download our paper.

Ideally, comparison of microbial communities across samples could circumvent this limiting classification step. Mangul and Koslicki recently developed EMDeBruijn, a reference-free approach that uses all available non-host microbial reads, not just those classified in reference databases, to compare microbial communities.

First, EMDeBruijn translates sequencing data to a de Bruijn graph, which represents overlaps between short subsequences (k-mers) of the reads. De Bruijn graphs are commonly used in de novo assembly of short read sequences into a genome, but have not yet been applied in a reference-free comparison approach. EMDeBruijn then uses properties of the de Bruijn graph to compare microbiome composition across individuals. The comparison is summarized by the Earth Mover’s Distance (EMD), a statistic that measures the distance between two probability distributions over a region.
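As a rough illustration of the idea (not the published implementation), the sketch below computes k-mer frequencies for two samples and the Earth Mover's Distance between them, using shortest-path lengths on the de Bruijn graph as the cost of moving mass. All function names are hypothetical, and the choice of an undirected shortest-path ground distance over the alphabet ACGT is an assumption made for this example; the paper's exact definition may differ.

```python
from collections import Counter, deque

def kmer_counts(reads, k):
    """Count every k-mer occurring in a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def graph_distances(source, alphabet="ACGT"):
    """BFS distances from one k-mer on the de Bruijn graph.
    Assumption: edges are treated as undirected (shift left or right)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for c in alphabet:
            for v in (u[1:] + c, c + u[:-1]):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
    return dist

def min_cost_flow(n, edges, s, t, need):
    """Successive shortest paths on integer capacities and costs."""
    graph = [[] for _ in range(n)]
    for u, v, cap, cost in edges:
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])
    total = 0
    while need > 0:
        dist = [float("inf")] * n
        dist[s] = 0
        prev = [None] * n
        queued = [False] * n
        queue = deque([s])
        while queue:  # Bellman-Ford style relaxation
            u = queue.popleft()
            queued[u] = False
            for i, (v, cap, cost, _) in enumerate(graph[u]):
                if cap > 0 and dist[u] + cost < dist[v]:
                    dist[v] = dist[u] + cost
                    prev[v] = (u, i)
                    if not queued[v]:
                        queue.append(v)
                        queued[v] = True
        if prev[t] is None:
            raise ValueError("cannot route the requested flow")
        push, v = need, t
        while v != s:  # bottleneck capacity along the path
            u, i = prev[v]
            push = min(push, graph[u][i][1])
            v = u
        v = t
        while v != s:  # update residual capacities
            u, i = prev[v]
            graph[u][i][1] -= push
            graph[v][graph[u][i][3]][1] += push
            v = u
        need -= push
        total += push * dist[t]
    return total

def emd_debruijn(reads1, reads2, k):
    """EMD between the k-mer frequency distributions of two samples."""
    c1, c2 = kmer_counts(reads1, k), kmer_counts(reads2, k)
    n1, n2 = sum(c1.values()), sum(c2.values())
    src, dst = list(c1), list(c2)
    s, t = 0, 1 + len(src) + len(dst)
    edges = []
    # Scale counts so both sides carry the same integer mass n1 * n2.
    for i, u in enumerate(src):
        edges.append((s, 1 + i, c1[u] * n2, 0))
        d = graph_distances(u)
        for j, v in enumerate(dst):
            edges.append((1 + i, 1 + len(src) + j, n1 * n2, d[v]))
    for j, v in enumerate(dst):
        edges.append((1 + len(src) + j, t, c2[v] * n1, 0))
    return min_cost_flow(t + 1, edges, s, t, n1 * n2) / (n1 * n2)
```

For instance, `emd_debruijn(["AAA"], ["AAC"], 2)` compares the 2-mer profiles {AA: 2} and {AA: 1, AC: 1}; half of the mass has to move one step along the graph. The sketch assumes reads contain only A, C, G, and T and that k is small enough for the graph to be enumerated.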

In their recent paper, Mangul and Koslicki applied EMDeBruijn to study the composition and abundance levels of the microbial communities present in blood samples from coronary artery calcification (CAC) patients and controls. EMDeBruijn uses candidate microbial reads to differentiate between case (CAC-affected) and control (healthy) samples, and a filtered set of non-host reads is used to determine the composition of the blood microbiome. Hierarchical clustering using the EMDeBruijn metric successfully identifies several large clusters unique to samples from either the case or the control group.

This study indicates the presence of a disease-specific microbial community structure in CAC patients, and points to the need for additional investigation of potentially causal relationships between the microbiome and CAC.

Using the same data set, Mangul and Koslicki compared the results of EMDeBruijn with those of current approaches. Existing computational methods, including MetaPhlAn and RDP’s NBC, discovered various microbial communities across the case and control samples. However, neither of these methods was able to identify any disease-specific patterns in the microbiome or to separate the samples into disease and healthy groups.

EMDeBruijn provides a powerful, species-independent way to assess microbial diversity across individuals and subjects. For more information, see our paper, which was published in the Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics:

Code implementing this method is available at:

Visualization of the EMDeBruijn Distance. a) Pictorial representation of 2-mer frequencies for two hypothetical samples, S1 and S2. b) The 2-mer frequencies overlaid on the de Bruijn graph B2(A). c) Representation of the flow used to compute EMD2(S1, S2); dark arrows denote mass moved from the initial node to the terminal node. d) Result of applying the flow to the 2-mer frequencies of S1.

This project was a collaboration that started at the Mathematical and Computational Approaches in High-Throughput Genomics program held in Fall 2011 at the Institute of Pure and Applied Mathematics (IPAM). Our on-going Computational Genomics Summer Institute (CGSI; also co-organized by IPAM) was inspired by the 2011 program. Check out the 2017 CGSI website for a preview of this summer’s programs – the deadline for applications is February 1, 2017!

The full citation to our paper is:

Mangul S, Koslicki D. Reference-free comparison of microbial communities via de Bruijn graphs. In Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. 2016 Oct 2 (pp. 68-77). Association for Computing Machinery, New York.

Long Single-Molecule Reads Can Resolve the Complexity of the Influenza Virus Composed of Rare, Closely Related Mutant Variants

RNA viruses represent the majority of emerging and re-emerging diseases that pose a significant risk to global health, including influenza, hantaviruses, Ebola virus, and Nipah virus. Compared to DNA viruses, RNA viruses are especially adaptable and evolvable due to their high mutation rates and rapid replication cycles. Developing novel medications for the prevention and treatment of these diseases requires an understanding of the mutant variants that drive an RNA virus’s resistance mechanisms. The long read length offered by single-molecule sequencing technologies allows each mutant variant to be sequenced in a single pass. However, complete profiling of all viral genomes within a mutant spectrum is not yet possible due to the high error rate embedded in analytical protocols.

In collaboration with Alexander Artyomenko (Georgia State University), Alex Zelikovsky (Georgia State University), Nicholas Wu (The Scripps Research Institute), and Ren Sun (UCLA), Serghei Mangul and Eleazar Eskin developed a novel method for accurately reconstructing viral variants from single-molecule reads. This approach, two Single Nucleotide Variants (2SNV), tolerates the high error rate of the single-molecule protocol and uses linkage between single nucleotide variations to efficiently distinguish true mutant variants from read errors.
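To make the linkage idea concrete, here is a deliberately simplified sketch, not the statistical test used by 2SNV itself. It flags pairs of alignment columns whose minor alleles co-occur in the same reads far more often than independent sequencing errors would predict. The function names and the thresholds `min_ratio` and `min_support` are hypothetical choices for illustration, and the reads are assumed to be already aligned to equal length.

```python
from collections import Counter
from itertools import combinations

def minor_allele_flags(reads):
    """For each alignment column, mark the reads that differ from the
    column's most common base (assumption: reads are equal-length strings)."""
    flags = []
    for column in zip(*reads):
        major = Counter(column).most_common(1)[0][0]
        flags.append([base != major for base in column])
    return flags

def linked_pairs(reads, min_ratio=2.0, min_support=2):
    """Return column pairs whose minor alleles co-occur more often than
    expected if the two columns varied independently (as errors would)."""
    n = len(reads)
    flags = minor_allele_flags(reads)
    pairs = []
    for i, j in combinations(range(len(flags)), 2):
        both = sum(a and b for a, b in zip(flags[i], flags[j]))
        expected = sum(flags[i]) * sum(flags[j]) / n
        if both >= min_support and both > min_ratio * expected:
            pairs.append((i, j))
    return pairs

# Toy data: a dominant variant, a rare variant carrying two linked
# mutations (columns 1 and 6), and two reads with isolated errors.
reads = (["ACGTACGT"] * 14
         + ["ATGTACAT"] * 4
         + ["ACGAACGT", "ACGTAGGT"])
print(linked_pairs(reads))
```

In the toy data the two isolated errors never appear together in a read, so only the pair of columns carried by the rare variant is reported. A real implementation would need a proper significance test and would have to handle alignment gaps and varying coverage.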

Overview of the 2SNV method. For more information, see our book chapter.

Any method for reconstructing viral variants from single-molecule reads must overcome the low volume and high error rate of the sequencing data, combined with the very high similarity and very low frequency of viral variants. This challenge is similar to extracting an extremely weak signal from a very noisy background, with a signal-to-noise ratio approaching zero. However impossible this task may seem, a satisfactory solution can be based on distinguishing the randomness of the noise from systematic repetition of the signal. With its high sensitivity and accuracy, 2SNV is anticipated to help not only with viral quasispecies reconstruction, but also with other biological questions that require detection of rare haplotypes, such as genetic diversity in cancer cell populations and monitoring of B-cell and T-cell receptor repertoires.

We present 2SNV in a chapter of conference proceedings from the 2016 RECOMB meeting. To benchmark the sensitivity of 2SNV, we performed a single-molecule sequencing experiment on a sample containing a titrated level of known viral mutant variants. We tested 2SNV on a dataset composed of PacBio reads from 10 independent clones that differ by 1 to 13 mutations. These 10 clones were mixed at a geometric ratio, with a two-fold difference in occurrence frequency between consecutive clones, starting with a maximum frequency of 50% and ending with a minimum frequency of 0.1%. Our method is able to accurately reconstruct a clone with a frequency of 0.2% and to distinguish clones that differ in only two nucleotides located far apart on the genome. 2SNV outperforms existing methods for full-length viral mutant reconstruction.

For more information, see our book chapter, which is available for download through Springer Publications:

In addition, the open source implementation of 2SNV, which was developed by Alexander Artyomenko, is freely available for download at

The full citation to our paper is: 

Artyomenko A, Wu NC, Mangul S, Eskin E, Sun R, Zelikovsky A. Long Single-Molecule Reads Can Resolve the Complexity of the Influenza Virus Composed of Rare, Closely Related Mutant Variants. In Research in Computational Molecular Biology. 2016 (pp. 164-175). Springer International Publishing.

Overview of results using the 2SNV method. (a) 2SNV (orange) outperforms existing haplotype reconstruction tools (blue) in viral variant reconstruction. Using PacBio reads from 10 IAV clones, (b) the pairwise edit distance between clones given in a heat-map and (c) occurring frequency of clone types.

HapIso: An Accurate Method for the Haplotype-Specific Isoforms Reconstruction from Long Single-Molecule Reads

Recent advances in RNA sequencing technology can generate deep coverage data containing millions of reads. RNA-Seq data are used to identify genetic variants and alternatively spliced isoforms, both of which may play a role in heritable traits and diseases; alternative splicing is a common mechanism for generating diversity from a single gene. Using this type of data, connections can be drawn between gene expression and one of the two parental haplotypes identified in a diploid organism’s transcripts. In other words, we can potentially identify the parent from which an individual inherited a group of genes.

Long single-molecule reads are multi-kilobase in length, longer than most transcripts, and enable sequencing of complete haplotype isoforms. New computational methods are required for efficient analysis of this highly complex data. In a recent paper, we present HapIso (Haplotype-specific Isoform Reconstruction), a comprehensive method that can accurately reconstruct the haplotype-specific isoforms of a diploid cell. Our software package is the first method capable of reconstructing the haplotype-specific isoforms from long single-molecule reads.

HapIso uses splice mapping of long single-molecule reads to partition reads into two parental haplotypes. The single-molecule reads entirely span the RNA transcripts and bridge the single nucleotide variation (SNV) loci across a single gene. To overcome gapped coverage and the splicing structure of the gene, the haplotype reconstruction procedure is applied independently to regions of contiguous coverage, which are defined as transcribed segments. Reads restricted to each transcribed segment are partitioned into two local clusters using 2-means clustering. Using the linkage provided by the long single-molecule reads, we connect the local clusters into two global clusters. An error-correction protocol is then applied to the reads from the same cluster.
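The local clustering step can be pictured with the minimal sketch below. It is not HapIso's code: it assumes each read has already been reduced to a tuple of 0/1 alleles at the SNV loci of one transcribed segment, uses Hamming distance, and seeds the two clusters with the first read and the read farthest from it. All names are hypothetical.

```python
def hamming(a, b):
    """Number of loci at which two allele vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def consensus(reads):
    """Majority allele at each locus (ties resolve to allele 0)."""
    return tuple(1 if 2 * sum(col) > len(col) else 0 for col in zip(*reads))

def two_means(reads, max_iter=20):
    """Partition reads into two clusters around two consensus haplotypes."""
    center0 = reads[0]
    center1 = max(reads, key=lambda r: hamming(r, center0))
    labels = None
    for _ in range(max_iter):
        new_labels = [0 if hamming(r, center0) <= hamming(r, center1) else 1
                      for r in reads]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        group0 = [r for r, lab in zip(reads, labels) if lab == 0]
        group1 = [r for r, lab in zip(reads, labels) if lab == 1]
        if group0:
            center0 = consensus(group0)
        if group1:
            center1 = consensus(group1)
    return labels, center0, center1

# Toy segment with four SNV loci; two reads each carry one sequencing error.
reads = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1),
         (1, 1, 0, 1), (0, 0, 0, 0), (1, 1, 1, 1)]
labels, hap_a, hap_b = two_means(reads)
print(labels, hap_a, hap_b)
```

Taking the per-cluster consensus is one simple way to picture error correction within a cluster: reads with an isolated mismatch are pulled back to their cluster's haplotype. Connecting local clusters across segments, as described above, would be a separate step that this sketch does not cover.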

Discriminating the long reads into parental haplotypes allows HapIso to accurately calculate allele-specific gene expression and identify imprinted genes. Additionally, it has the potential to improve detection of the effect of cis- and trans-regulatory changes on gene expression regulation. Long reads allow access to genetic variation in regions previously unreachable by short read protocols and may lead to new insights into disease heritability.

We applied HapIso to publicly available single-molecule RNA-Seq data from the GM12878 cell line and circular-consensus (CCS) single-molecule reads generated by the Pacific Biosciences platform. Our method discovered novel SNVs in regions that were previously unreachable by standard short read protocols, 53% of which follow Mendelian inheritance. HapIso detected 921 genes with both haplotypes expressed among 9,000 expressed genes. We observed 4,140 heterozygous loci corresponding to positions with non-identical alleles among inferred haplotypes. Additionally, we can theoretically identify recombinations in the transmitted haplotypes by checking the number of recombinations in the inferred haplotypes.

The open source Python implementation of HapIso was developed by Serghei Mangul and Harry (Taegyun) Yang, and the software package is freely available for download at

This paper appears in Proceedings of the International Symposium on Bioinformatics Research and Applications (ISBRA-2016), which can be downloaded here:

Serghei Mangul and Harry Yang led this project, which involved Farhad Hormozdiari. The full citation to our paper is:

Mangul S, Yang H, Hormozdiari F, Tseng E, Zelikovsky A, Eskin E. HapIso: An Accurate Method for the Haplotype-Specific Isoforms Reconstruction from Long Single-Molecule Reads. In Bioinformatics Research and Applications. 2016 (pp. 80-92). Springer International Publishing.


Overview of HapIso.