Reference-free comparison of microbial communities via de Bruijn graphs

Microbial communities inhabiting the human body exhibit significant variability across different individuals and tissues and likely play an important role in health and disease. Serghei Mangul and David Koslicki (Oregon State University) recently published a paper presenting a novel approach for characterizing microbial communities in metatranscriptomics studies. Koslicki developed this tool, which may help scientists explore the role microbiota play in disease development, especially when comparing microbiomes of healthy and diseased subjects.

Identifying and characterizing the relative abundance of microbiota in different tissues is essential to better understanding the role of microbial communities in human health. Current approaches use reference databases to identify, classify, and compare microbial communities present in the individual host. However, existing databases are incomplete and rely on a limited compendium of reference genomes. As a result, reference-based approaches cannot determine microbial composition as accurately as the high resolution of data produced by today’s high-throughput sequencing technology would allow.

Framework of the study. For more information, download our paper.

Ideally, comparison of microbial communities across samples could circumvent this limiting classification step. Mangul and Koslicki recently developed EMDeBruijn, a reference-free approach that uses all available non-host microbial reads, not just those classified in reference databases, to compare microbial communities.

First, EMDeBruijn translates sequencing data to a de Bruijn graph, which represents overlaps between symbols in sequences. De Bruijn graphs are commonly used in de novo assembly of short-read sequences into a genome, but have not yet been applied in a reference-free approach. EMDeBruijn then uses properties of the de Bruijn graphs to compare microbiome composition across individuals. The comparison is reduced to a single distance using the Earth Mover’s Distance (EMD), a statistic that measures the distance between two probability distributions over a region.
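To make the idea concrete, here is a minimal Python sketch of an EMD computed over k-mer frequencies. It is an illustration only, not the code from the paper: the function names (`kmer_frequencies`, `de_bruijn_distance`, `earth_movers_distance`) are hypothetical, the reads are toy strings, and we assume the cost of moving mass between two k-mers is the length of the shortest directed path between them in the complete de Bruijn graph. The EMD itself is solved exactly with a small successive-shortest-path min-cost flow, which is only practical for tiny k.

```python
from collections import Counter
from fractions import Fraction


def kmer_frequencies(reads, k):
    """Normalized k-mer frequencies over all reads, as exact fractions."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    total = sum(counts.values())
    return {kmer: Fraction(c, total) for kmer, c in counts.items()}


def de_bruijn_distance(u, v):
    """Shortest directed path length from k-mer u to k-mer v (assumed ground cost).

    Each step drops the first symbol and appends one, so j steps suffice
    exactly when the last k-j symbols of u equal the first k-j symbols of v.
    """
    k = len(u)
    return next(j for j in range(k + 1) if u[j:] == v[:k - j])


def earth_movers_distance(p, q, cost):
    """Minimum total cost of moving distribution p onto distribution q."""
    supply, demand = list(p), list(q)
    n = len(supply) + len(demand) + 2
    s, t = n - 2, n - 1
    edges = []                      # [to, residual capacity, cost]; e and e ^ 1 are paired
    graph = [[] for _ in range(n)]

    def add_edge(a, b, cap, c):
        graph[a].append(len(edges)); edges.append([b, cap, c])
        graph[b].append(len(edges)); edges.append([a, 0, -c])

    for i, u in enumerate(supply):
        add_edge(s, i, p[u], 0)
        for j, v in enumerate(demand):
            # Capacity 1 is never binding because the total mass is 1.
            add_edge(i, len(supply) + j, 1, cost(u, v))
    for j, v in enumerate(demand):
        add_edge(len(supply) + j, t, q[v], 0)

    total = 0
    while True:
        # Bellman-Ford shortest path from s in the residual graph.
        dist, prev = [None] * n, [None] * n
        dist[s] = 0
        for _ in range(n - 1):
            changed = False
            for a in range(n):
                if dist[a] is None:
                    continue
                for e in graph[a]:
                    b, cap, c = edges[e]
                    if cap > 0 and (dist[b] is None or dist[a] + c < dist[b]):
                        dist[b], prev[b], changed = dist[a] + c, e, True
            if not changed:
                break
        if dist[t] is None:         # all mass has been moved
            return total
        # Find the bottleneck along the path, then push that much mass.
        push, node = None, t
        while node != s:
            cap = edges[prev[node]][1]
            push = cap if push is None else min(push, cap)
            node = edges[prev[node] ^ 1][0]
        node = t
        while node != s:
            e = prev[node]
            edges[e][1] -= push
            edges[e ^ 1][1] += push
            node = edges[e ^ 1][0]
        total += push * dist[t]


# Toy usage: two "samples" of one read each, compared on 2-mers.
s1 = kmer_frequencies(["AAC"], k=2)
s2 = kmer_frequencies(["ACC"], k=2)
print(earth_movers_distance(s1, s2, de_bruijn_distance))
```

Because the directed path length is not symmetric, this particular choice of cost does not by itself give a symmetric distance; a real implementation would also need a far more scalable solver than this one.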

In their recent paper, Mangul and Koslicki applied EMDeBruijn to study the composition and abundance levels of the microbial communities present in blood samples from coronary artery calcification (CAC) patients and controls. EMDeBruijn uses candidate microbial reads to differentiate between case (CAC-affected) and control (healthy) samples, and a filtered set of non-host reads is used to determine the composition of the blood microbiome. Hierarchical clustering using the EMDeBruijn metric successfully identifies several large clusters unique to samples from either the case or the control group.

This study indicates the presence of a disease-specific microbial community structure in CAC patients, and points to the need for additional investigation of potentially causal relationships between the microbiome and CAC disease.

Using the same data set, Mangul and Koslicki compare the results of EMDeBruijn with those of current approaches. Existing computational methods, including MetaPhlAn and RDP’s NBC, discovered various microbial communities across the case and control samples. However, neither of these methods was able to identify any disease-specific patterns in the microbiome or to separate the samples into disease and healthy groups.

EMDeBruijn provides a powerful, species-independent way to assess microbial diversity across individuals. For more information, see our paper, which was published in the Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics:

Code implementing this method is available at:

Visualization of the EMDeBruijn Distance. a) Pictorial representation of 2-mer frequencies for two hypothetical samples, S1 and S2. b) The 2-mer frequencies overlaid on the de Bruijn graph B2(A). c) Representation of the flow used to compute EMD2(S1, S2); dark arrows denote mass moved from the initial node to the terminal node. d) Result of applying the flow to the 2-mer frequencies of S1.

This project was a collaboration that started at the Mathematical and Computational Approaches in High-Throughput Genomics program held in Fall 2011 at the Institute for Pure and Applied Mathematics (IPAM). Our ongoing Computational Genomics Summer Institute (CGSI; also co-organized by IPAM) was inspired by the 2011 program. Check out the 2017 CGSI website for a preview of this summer’s programs – the deadline for applications is February 1, 2017!

The full citation to our paper is:

Mangul S, Koslicki D. Reference-free comparison of microbial communities via de Bruijn graphs. In Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. 2016 Oct 2 (pp. 68-77). Association for Computing Machinery, New York.

UCLA Bioinformatics: The Philosophy of the Training Environment and Programs

(This post is jointly authored with Alexander Hoffmann, Hilary Coller, Matteo Pellegrini, and Nelson Freimer.)

UCLA has a rich training environment for Bioinformatics that extends beyond the core academic programs.  For structured academic learning, UCLA offers an Undergraduate Bioinformatics Minor and a Bioinformatics Ph.D. Program.  In addition, UCLA coordinates multiple training programs, several of which are open to researchers from other institutions who are at all stages of their careers.  Many of these programs are either hosted or jointly sponsored by the Institute for Quantitative and Computational Biology (QCB) at UCLA, which is directed by Alexander Hoffmann (UCLA).

Over the past 10 years, driven by the ubiquity of genomics throughout the field, biology has become a data science. Every biomedical research institution has been challenged with supporting the analysis of genomic data generated by groups who traditionally have not cultivated substantial computational expertise. Many of our peer institutions delegate genomic data analyses to a specific Bioinformatics core group that operates on a “fee-for-service” model.

The Bioinformatics core “fee-for-service” model poses many problems.  First, complex issues that arise during analysis of genomic data are difficult to predict in advance.  Projects often require much more effort than anticipated by research groups, leading core groups to struggle with insufficient funds to cover the actual time spent on analysis.  Second, research groups utilizing the core often want to move the project in different directions than what was originally proposed.  In the long term, exploring additional aspects of data can be inefficient when data analysis is delegated to a core group on an as-needed basis.

At UCLA we follow a different approach.  We believe that research groups should receive the training and resources to analyze the genomic data that they generate.  This “training and collaboration” model is the best solution for efficiently completing projects and advancing skills in a research group.  Over the past ten years, UCLA has significantly invested in this training and collaboration model.  For example, UCLA’s Bioinformatics programs are explicitly organized to connect research groups with core groups across campus and provide infrastructure and training to students, faculty, and staff working in many different fields.

Bioinformatics training programs held at UCLA include:

    1. The Collaboratory. The Collaboratory of postdoctoral fellows, directed by Matteo Pellegrini (UCLA), provides an experimental and empirical research environment for bioscientists and computational scientists to collaboratively design and conduct experiments. Most bioscience laboratories have limited capabilities in large-scale data analysis. The Collaboratory’s main mission is to advance genomic data analysis by connecting UCLA bioscience faculty with QCB faculty and fellows.  The Collaboratory fellows are a select group of postdocs funded by the Collaboratory to engage in collaborative projects that leverage their specific expertise.

      The Collaboratory fellows are also responsible for organizing intensive tutorials designed to train UCLA students and postdocs in the latest next-generation sequence analysis techniques. In addition to providing computational expertise to bioscience researchers at UCLA, the Collaboratory also sets up and maintains a next-generation sequence data analysis server, and participants develop methodologies to process new types of data. The Collaboratory has a year-round schedule of workshops open to the Bioinformatics community.


    1. Bruins in Genomics Undergraduate Summer Research Program (B.I.G. Summer). B.I.G. Summer is an integrated undergraduate training and research program in genomics and bioinformatics at UCLA. Participants gain an intensive, practical experience in integrating quantitative and biological knowledge while learning how to pursue graduate degrees in the biological, biomedical or health sciences.  The program begins with two weeks of hands-on tutorial workshops that cover fundamental concepts in genomics critical to participation in today’s research.  The remaining weeks are focused on research.  Students work in pairs under the supervision of UCLA faculty mentors and QCB postdoctoral fellows.

      B.I.G. Summer offers unique opportunities that are often not available to undergraduates, including next generation sequencing analysis workshops, weekly science talks by senior researchers, a weekly journal club, professional development seminars, social activities, concluding poster sessions, and a GRE test prep course.  In addition, a special NIH-funded curriculum in neurogenomics, directed by Nelson Freimer and Eleazar Eskin, provides B.I.G. Summer participants with an intensive exposure to this rapidly growing field, in which UCLA is among the leading centers worldwide. B.I.G. Summer is organized by Alexander Hoffmann, Hilary Coller, Tracy Johnson, and Eleazar Eskin. This year, B.I.G. Summer is held from June 19th to August 11th, 2017.  The B.I.G. Summer Program is sponsored by the following generous institutions:

      UCOP for a UC-HBCU partnership Program in Genomics and Systems
      NIH NIBIB for NGS Data Analysis Skills for the Biosciences Pipeline R25EB022364
      NIH NIMH for Undergraduate Research Experience in Neuropsychiatric Genomics R25MH109172


    1. Undergraduate and MS Research Program. One of the best ways for faculty to provide training to undergraduate and graduate students is through mentorship in research labs. A substantial challenge to this approach is the increasing number of undergraduate students who want to get involved in research.  For example, there are many more Computer Science majors interested in research than can be absorbed by the number of faculty presently in the Department of Computer Science.  In order to meet rising undergraduate demand for research opportunities, we created an Undergraduate and Master’s student research program.

      This program connects researchers across campus with interested students from a variety of majors.  In doing so, we leverage UCLA’s strength in Bioinformatics to offer a greater number of research opportunities to undergraduates within and outside of the Department of Computer Science.  Each research opportunity posted on the webpage has a list of requirements, ranging from “one course in Bioinformatics or programming” to “a full year of coursework in programming.”  For students who have completed relevant coursework or are planning their academic schedule, this program provides a clearly defined path to become involved in research projects on campus.


    1. Informatics Center for Neurogenetics and Neurogenomics (ICNN). As with other areas of biomedical science, the post-genome era raises the prospect of transformational advances in neuroscience research. However, neuroscience faces special challenges in analysis, interpretation, and management of the vast quantities of information generated by genetic and genomic technologies. The phenotypic and organizational complexity of the nervous system calls for distinct analytical and informatics strategies and expertise.

      The ICNN, directed by Nelson Freimer and Giovanni Coppola, provides advanced analysis and informatics support to a highly interactive group of neuroscientists at UCLA who conduct basic, clinical, and translational research.  Generally, today’s lack of corresponding resources in analysis and informatics constitutes a bottleneck in their research; ICNN provides these investigators with access to excellent facilities for genetics and genomics experimentation.  ICNN faculty are experts in statistical genetics, gene expression analysis, and bioinformatics, and they oversee the activities of highly trained staff members in accomplishing three goals: (1) Providing expert consultation and analyses for neurogenetics and neurogenomics projects; (2) Developing and maintaining a shared computing resource that is incorporated within the large campus-wide computational cluster for computation-intensive analyses, web-servers, and state-of-the-art software tools for a wide range of applications (including user-friendly versions of public databases, as well as workstations on which ICNN users will be trained to employ these tools); (3) Providing hands-on training in analysis and informatics to group users.


  1. Computational Genomics Summer Institute (CGSI). In 2015, Profs. Eleazar Eskin (UCLA), Eran Halperin (UCLA), John Novembre (The University of Chicago), and Ben Raphael (Princeton University) created CGSI. A collaboration with the Institute for Pure and Applied Mathematics (IPAM), led by Russ Caflisch, CGSI is developing a flexible program for improving education and enhancing collaboration in Bioinformatics research. The goal of this summer research program is to bring together mathematical and computational scientists, sequencing technology developers in both industry and academia, and the biologists who use the instruments for particular research applications.

    CGSI is a unique opportunity for junior and senior scholars in Bioinformatics to foster collaborative relationships, accelerate problem-solving, and unleash the full potential of their projects.  The program facilitates interdisciplinary collaboration and training with a mix of formal and informal events. For example, senior scholars present traditional research talks and tutorials, while junior scholars present mini-presentations and organize journal clubs.  CGSI fosters interactions over an extended period of time and is laying crucial groundwork to advance the mathematical foundations of this exciting field.  This year, CGSI will be held from July 6th-26th, 2017. CGSI is made possible by National Institutes of Health grant GM112625.


“Give a Man a Fish, and You Feed Him for a Day. Teach a Man to Fish, and You Feed Him for a Lifetime.”

Register Now for UCLA Computational Genomics Summer Institute 2017

CGSI brings together mathematical and computational scientists, sequencing technology developers in both industry and academia, and the biologists who use the instruments for particular research applications. Research talks, workshops, journal clubs, and social events provide a unique opportunity to foster interactions between these three communities over an extended period of time and advance the mathematical foundations of this exciting field.

SHORT PROGRAM: July 10 – 14, 2017
LONG PROGRAM: July 6 – 26, 2017
@ UCLA Campus, Los Angeles

Visit our website to learn more:


Register now for this upcoming summer’s Short and Long Courses:
The deadline to register for the 2017 programs is February 1, 2017.


In 2015, Profs. Eleazar Eskin (UCLA), Eran Halperin (UCLA), John Novembre (The University of Chicago), and Ben Raphael (Brown University) created the Computational Genomics Summer Institute (CGSI). A collaboration with the Institute for Pure and Applied Mathematics (IPAM) led by Russ Caflisch, CGSI aims to develop a flexible program for improving education and enhancing collaboration in Bioinformatics research.

Over the past two decades, technological developments have substantially changed research in Bioinformatics. New DNA sequencing technologies can perform large-scale measurements of cellular states at lower cost and with more efficient use of computing time. These improvements have revolutionized the potential application of genomic studies toward clinical research and development of novel diagnostic tools and treatments for human disease.


Eleazar Eskin
University of California, Los Angeles
CGSI Director

Eran Halperin
University of California, Los Angeles
CGSI Director

Russ Caflisch
University of California, Los Angeles
IPAM Director

John Novembre
University of Chicago

Ben Raphael
Brown University

Francesca Chiaromonte
Penn State University

2017 Faculty

Note: This is a list of confirmed faculty for the 2017 programs, as of Jan. 20. We will expand this list in coming weeks.

Kin Fai Au, University of Iowa
Brian Browning, University of Washington
Jason Ernst, UCLA
Eleazar Eskin, UCLA
Jonathan Flint, UCLA
Ilan Gronau, Herzliya Interdisciplinary Center
Eran Halperin, UCLA
Jo Hardin, Pomona College
Fereydoun Hormozdiari, University of California, Davis
David Koslicki, Oregon State University
Jessica (Jingyi) Li, UCLA + IPAM
Jennifer Listgarten, Microsoft Research
Kirk Lohmueller, UCLA
John Novembre, University of Chicago
Lior Pachter, University of California, Berkeley
Bogdan Pasaniuc, UCLA
Ben Raphael, Princeton University
Gunnar Rätsch, Eidgenössische Technische Hochschule Zürich
Saharon Rosset, Tel Aviv University
Cenk Sahinalp, Simon Fraser University
Sriram Sankararaman, UCLA
Alexander Schönhuth, Centrum Wiskunde & Informatica, Amsterdam and UCLA IPAM
Sagi Snir, University of Haifa
Jae-Hoon Sul, UCLA
Fabio Vandin, University of Padova
Daniel Wegmann, Université de Fribourg
William (Xiaoquan) Wen, University of Michigan
Noah Zaitlen, University of California San Francisco
Alex Zelikovsky, Georgia State University
Or Zuk, Hebrew University of Jerusalem

CGSI is made possible by National Institutes of Health grant GM112625.

Read more about CGSI’s 2016 programs at the ZarLab blog: