Reference-free comparison of microbial communities via de Bruijn graphs

Microbial communities inhabiting the human body exhibit significant variability across different individuals and tissues and are likely play an important role in health and disease. Serghei Mangul and David Koslicki (Oregon State University) recently published a paper presenting a novel approach for characterizing microbial communities in metatranscriptomics studies. Koslicki developed this tool, which may help scientists explore the role microbiota play in disease development, especially when comparing microbiomes of healthy and disease subjects.

Identifying and characterizing the relative abundance of microbiota in different tissues is essential to better understanding the role of microbial communities in human health. Current approaches use reference databases to identify, classify, and compare microbial communities present in the individual host. However, existing databases are incomplete and rely on a limited compendium of reference genomes. Current reference-based approaches are unable to accurately determine microbial compositions to the extent that could be possible given the high resolution of data produced by today’s high throughput sequencing technology.

Framework of the study. For more information, download our paper.

Ideally, comparison of microbial communities across samples could circumvent this limiting classification step. Mangul and Koslicki recently developed EMDeBruijn, a reference-free approach that uses all available non-host microbial reads, not just those classified in reference databases, to compare microbial communities.

First, EMDeBruijn translates sequencing data to a de Bruijn graph, which represents overlaps between symbols in sequences. De Bruijn graphs are commonly used in de novo assembly of short read sequences to a genome, but have not yet been applied in a reference-free approach. EMDeBruijn then uses properties of the de Bruijn graphs to compare microbiome composition across individuals. This metric is reduced using the Earth Mover’s Distance (EMD), a statistic that can measure the distance between two probability distributions over a region.

In their recent paper, Mangul and Koslicki applied EMDeBruijn to study the composition and abundance levels of the microbial communities present in blood samples from coronary artery calcification (CAC) patients and controls. EMDeBruijn uses candidate microbial reads to differentiate between case (CAC-affected) and control (healthy) samples, and a filtered set of non-host reads are used to determine the composition of the blood microbiome. Hierarchical clustering using the EMDeBruijn metric successfully identifies several large clusters unique to samples from either health or control groups.

This study indicates the presence of the disease-specific microbial community structure in CAC patients, and points to the need for additional investigation of potentially causal relationships between the microbiome and CAC disease.

Using the same data set, Mangul and Koslicki compare the results of EMDeBruijn with those of current approaches. Existing computational methods, including MetaPhlAn and RDP’s NBC, discovered various microbial communities across the health and control samples. However, neither of these methods were able to identify any disease-specific patterns in the microbiome nor discriminate the samples into disease and healthy groups.

EMDeBruijn provides a powerful, species independent way to assess microbial diversity across individuals and subjects. For more information, see our paper, which was published in the Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics:

Code implementing this method is available at:

Visualization of the EMDeBruijn Distance. a) Pictorial representation of 2-mer frequencies for two hypothetical samples, S1 and S2. b) The 2-mer frequencies overlaid the de Bruijn graph B2(A ). c) Representation of the flow used to compute EMD2(S1; S2); dark arrows denote mass moved from the initial node to the terminal node. d) Result of applying the flow to the 2-mer frequencies of S1.

This project was a collaboration that started at the Mathematical and Computational Approaches in High-Throughput Genomics program held in Fall 2011 at the Institute of Pure and Applied Mathematics (IPAM). Our on-going Computational Genomics Summer Institute (CGSI; also co-organized by IPAM) was inspired by the 2011 program. Check out the 2017 CGSI website for a preview of this summer’s programs – the deadline for applications is February 1, 2017!

The full citation to our paper is:

Mangul S, Koslicki D. Reference-free comparison of microbial communities via de Bruijn graphs. In Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. 2016 Oct 2 (pp. 68-77). Association for Computing Machinery, New York.

Writing Tips: Why we Publish Methods Papers

by Eleazar Eskin

Computational genomics is a field where many diverse academic groups collaborate, each bringing to a project their own distinct academic cultures.  In particular, each academic discipline involved in computational genomics has its own publication strategy in terms of the types of papers they publish and how they package methods and results in these papers.  Publishing papers is extremely important to careers in academia and science, because all scientists are reviewed for tenure or promotion based on our publications records.  An important factor in our review (unfortunately) is the impact factor of the journals that we publish in.  Here, we describe our lab’s publication strategy and the reasoning behind it.

Our lab is a computational lab, and the main contribution of our lab to Bioinformatics is the development of methods for solving important biological problems, particularly in the area of genetics.  These new methods are implemented in software packages that (hopefully) are used by others to enable biological discovery.  Naturally, the key papers our group produces are papers that describe and explain potential applications of these new methods.

Roughly speaking, there are two strategies for publishing methods in our field.  The first is to focus on writing methods papers that are primarily dedicated to describing the computational advances.  The second is to focus on publishing our novel methods as part of more comprehensive papers that present a biological contribution. In this case, our method is primarily described in the supplementary materials. Over the span of my career, I have seen computational researchers receive more pressure to follow the second strategy in order to have papers published in a high impact journal.  Unfortunately, following the second strategy often delays publication (sometimes for years), because peer review often involves applying the method to a new dataset and/or performing extensive functional validation.

Our group primarily follows the first strategy.  In addition, we work with other groups and, as collaborators, publish papers focused on biological contributions.  This strategy works out well for us, and we feel that writing methods-focused papers is the best way for us to make a contribution to science.  We hope that other computational biology groups will follow our example and publish more methods papers.

Here are some of the reasons we feel this is a good strategy:

  1. Doing Justice to our Work. We can fully explain the methods only in papers dedicated to methodology. Since our contribution is methods, the best way to push the science forward is to clearly describe our method and the context of its development and application. In a dedicated paper, we are most likely to have enough space to fully describe the method and explain how the approach works.  Methods papers also have the space (and are typically required) to compare the proposed method with previous methods. This comparison puts the performance of the paper in perspective to the work of others.  Methods papers ideally provide enough details that other groups can build upon our method and compare their results to our published results. Sharing authorship on these papers also allows students who were involved in the development of these methods to demonstrate their strong technical skills.  In my view, computational biologists should be evaluated by the quality and impact of their methodology development and departments when making hiring decisions should consider this impact.  The impact can be measured by the number of users of the software implementing the methods, the number of citations of the papers describing the methods and the discoveries that these methods have enabled.  These factors are more important than the impact factor of the journals where the methods are published.
  1. Self Determination of Publishing. There are no outside bottlenecks preventing us from finishing our papers quickly, and we can control the publication process of our papers. A methods paper is primarily written by members within our lab, and authors evaluate the method using both simulated and established datasets.  This structure means we need not wait for outside collaborators or experiments to finish.  Finishing the paper faster means that have more time to work on new papers.
  1. Increased Number and Improved Quality of Collaborations. The methods paper is a widely-distributed, often freely available, finished product, and many prospective collaborators approach us after reading a paper from our group. More importantly, in our collaborations, we have very little competition over authorship.  Students in the group are happy to work hard on a project just to be in the middle of the collaborative paper, because they already are first author on their own methods papers.  Our methods development students are not competing for credit with the students in the collaborators group.
  1. Project Longevity. Writing a methods paper forces the method to be finished, evaluated, and documented, and publishing the paper forces us to release the software. This process encourages the project to have more longevity. Once the method is fully developed, new students can easily pick up and build upon the previous method.  Once a student leaves the lab, the method can persist with new lab members as it is stable, well-documented, and de-bugged.  Long after they have left the lab, many of the students who wrote methods papers in our group continue to author papers related to applications of their method.

In full disclosure, we do identify one negative aspect of the methods paper publishing strategy.  High impact papers require collaborations, and it is less likely that methods developers can publish high impact journals as a senior or corresponding authors.  While it is less likely to occur, members of our lab do occasionally gain senior authorship in high impact journals through collaboration.  We have found that the combination of methods papers, where you are the senior or first author, and high impact papers, where you have middle authorship and it is clear that your role was the application of the method, is overall a positive outcome and looks good in your publication record.

For example, Eran Halperin and I published a 2004 paper in the lower-impact journal Bioinformatics that described the HAP haplotype phasing method.  The HAP method was later used in a Perlegen-led paper that was published, with Halperin and I as co-authors, in the notably high-impact journal Science. The 2005 Science paper helped me get my job at UCLA; it was clear what my contribution was as I also authored the methods paper in Bioinformatics.

Our lab has produced several other examples of methods papers paired with high-impact collaborations. Kang et al. (2008) presents the EMMA method in Genetics (impact factor of 5.963), and a collaboration with the Jake Lusis group on the HMDP presents results in Genome Research (impact factor of 11.351) (Bennett et al. 2010).  More recently, we published the CAVIAR method (Hormoziari et al., 2014) in Genetics and collaborated with Dan Geschwind’s group in applying the method to a Nature paper (Won et al. 2016).

Citations of papers mentioned in this post:

Won, Hyejung; de la Torre-Ubieta, Luis; Stein, Jason L; Parikshak, Neelroop N; Huang, Jerry; Opland, Carli K; Gandal, Michael J; Sutton, Gavin J; Hormozdiari, Farhad; Lu, Daning; Lee, Changhoon; Eskin, Eleazar; Voineagu, Irina; Ernst, Jason; Geschwind, Daniel H

Chromosome conformation elucidates regulatory relationships in developing human brain. Journal Article

In: Nature, 538 (7626), pp. 523-527, 2016, ISSN: 1476-4687.

Abstract | Links | BibTeX

Hormozdiari, Farhad; Kostem, Emrah ; Kang, Eun Yong ; Pasaniuc, Bogdan ; Eskin, Eleazar

Identifying causal variants at Loci with multiple signals of association. Journal Article

In: Genetics, 198 (2), pp. 497-508, 2014, ISSN: 1943-2631.

Abstract | Links | BibTeX

Bennett, Brian J; Farber, Charles R; Orozco, Luz; Kang, Hyun Min; Ghazalpour, Anatole; Siemers, Nathan; Neubauer, Michael; Neuhaus, Isaac; Yordanova, Roumyana; Guan, Bo; Truong, Amy; Yang, Wen-Pin; He, Aiqing; Kayne, Paul; Gargalovic, Peter; Kirchgessner, Todd; Pan, Calvin; Castellani, Lawrence W; Kostem, Emrah; Furlotte, Nicholas; Drake, Thomas A; Eskin, Eleazar; Lusis, Aldons J

A high-resolution association mapping panel for the dissection of complex traits in mice. Journal Article

In: Genome Res, 20 (2), pp. 281-90, 2010, ISSN: 1549-5469.

Abstract | Links | BibTeX

Kang, Hyun Min; Ye, Chun ; Eskin, Eleazar

Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots. Journal Article

In: Genetics, 180 (4), pp. 1909-25, 2008, ISSN: 0016-6731.

Abstract | Links | BibTeX

Hinds, David A; Stuve, Laura L; Nilsen, Geoffrey B; Halperin, Eran ; Eskin, Eleazar ; Ballinger, Dennis G; Frazer, Kelly A; Cox, David R

Whole-genome patterns of common DNA variation in three human populations. Journal Article

In: Science, 307 (5712), pp. 1072-9, 2005, ISSN: 1095-9203.

Abstract | Links | BibTeX

Halperin, Eran; Eskin, Eleazar

Haplotype reconstruction from genotype data using Imperfect Phylogeny. Journal Article

In: Bioinformatics, 20 (12), pp. 1842-9, 2004, ISSN: 1367-4803.

Abstract | Links | BibTeX

Colocalization of GWAS and eQTL Signals Detects Target Genes

Farhad Hormozdiari recently developed a method for combining genome-wide association studies (GWASs) and quantitative trait loci (eQTL) studies in a statistical framework that quantifies the probability of each variant to be causal while allowing an arbitrary number of causal variants. Together with collaborators at the University of Oxford and Broad Institute of MIT and Harvard, we present a paper in The American Journal of Human Genetics. Here, we describe eQTL and GWAS CAusal Variants Identification in Associated Regions (eCAVIAR). We apply our approach to datasets from several GWASs and eQTL studies in order to assess its accuracy and potential contributions to colocalization and fine-mapping.

Integrating GWASs and eQTL studies is a promising way to explore the mechanism of non-coding variants on diseases. Integration of GWAS and eQTL data is challenging due to the uncertainty induced by linkage disequilibrium (LD), the non-random association of alleles at different loci, and presence of loci that harbor multiple causal variants (allelic heterogeneity). Current methods assume that each locus contains a single causal variant and expect loci to be independent and associated randomly.

eCAVIAR is a novel probabilistic model for integrating GWAS and eQTL data that extends the CAVIAR (Hormozdiari et al. 2014) framework to explicitly estimate the posterior probability of the same variant being causal in both GWAS and eQTL studies, while accounting for allelic heterogeneity and LD. Our approach can quantify the strength between a causal variant and its associated signals in both studies, and it can be used to colocalize variants that pass the genome-wide significance threshold in GWAS. For any given peak variant identified in GWAS, eCAVIAR considers a collection of variants around that peak variant as one single locus.

We apply eCAVIAR to the Meta-Analyses of Glucose and Insulin-related traits Consortium (MAGIC) dataset and GTEx dataset to detect the target gene and most relevant tissue for each GWAS risk locus. When applied to the MAGIC dataset’s 2 phenotypes, eCAVIAR identifies genetic variants that are causal in both eQTL and GWAS. Further, eCAVIAR detects a large number of loci where the GWAS causal variants are clearly distinct from the causal variants in the eQTL data. Interestingly, eCAVIAR also identifies genes that colocalize in one tissue yet can be excluded in others. For the majority of loci in which we identify a single variant causal for both GWAS and eQTL, eCAVIAR implicates more than one causal variant across the 45 tissues.

We observe that eCAVIAR outperforms existing methods even when there are different values of non-colocalization. Using simulated datasets, we compared accuracy, precision, and recall rate of eCAVIAR to RTC (Nica et al. 2010) and COLOC (Giambartolomei et al. 2014), two current methods for eQTL and GWAS colocalization. Our results show that eCAVIAR has high confidence for selecting loci to be colocalized between the GWAS and eQTL data and is conservative in selecting a locus to be colocalized.

We hope that future applications of eCAVIAR will advance identification of specific GWAS loci that share a causal variant with eQTL studies in a tissue, thus providing insight into presently unclear disease mechanisms.


Overview of eCAVIAR.


eCAVIAR was created by Farhad Hormozdiari, Ayellet V. Segre, Martijn van de Bunt, Xiao Li, Jong Wha J Joo, Michael Bilow, Jae Hoon Sul, Bogdan Pasaniuc and Eleazar Eskin. The article is available at:

Visit the following page to download CAVIAR and eCAVIAR:

The full citation to our paper is:

Hormozdiari, Farhad; van de Bunt, Martijn; Segrè, Ayellet V; Li, Xiao; Joo, Jong Wha J; Bilow, Michael; Sul, Jae Hoon; Sankararaman, Sriram; Pasaniuc, Bogdan; Eskin, Eleazar

Colocalization of GWAS and eQTL Signals Detects Target Genes. Journal Article

In: Am J Hum Genet, 2016, ISSN: 1537-6605.

Abstract | Links | BibTeX

Our paper builds upon a method introduced in a previous publication:

Hormozdiari, Farhad; Kostem, Emrah ; Kang, Eun Yong ; Pasaniuc, Bogdan ; Eskin, Eleazar

Identifying causal variants at Loci with multiple signals of association. Journal Article

In: Genetics, 198 (2), pp. 497-508, 2014, ISSN: 1943-2631.

Abstract | Links | BibTeX