UCLA Launches CGSI with Inaugural Summer Programs

In 2015, Profs. Eleazar Eskin (UCLA), Eran Halperin (UCLA), John Novembre (The University of Chicago), and Ben Raphael (Brown University) created the Computational Genomics Summer Institute (CGSI). A collaboration with the Institute for Pure and Applied Mathematics (IPAM) led by Russ Caflisch, CGSI aims to develop a flexible program for improving education and enhancing collaboration in Bioinformatics research. In summer 2016, the inaugural program included a five-day short course (July 18-22) followed by a three-week long course (July 22 to August 12).

Over the past two decades, technological developments have substantially changed research in Bioinformatics. New DNA sequencing technologies can perform large-scale measurements of cellular states at lower cost and with more efficient use of computing time. These improvements have revolutionized the potential application of genomic studies toward clinical research and development of novel diagnostic tools and treatments for human disease.

Modern genomic data collection creates an enormous need for mathematical and computational infrastructures capable of analyzing datasets that are increasingly larger in scale and resolution. This poses several unique challenges to researchers in Bioinformatics, an interdisciplinary field that cuts across traditional academic fields of math, statistics, computer science, and biology—and includes private-industry sequence technology developers. Innovation depends on seamless collaboration among scientists with different skill sets, communication styles, and institution-driven career goals. Therefore, impactful Bioinformatics research requires an original framework for doing science that bridges traditional discipline-based academic structures.

The summer 2016 courses combined formal research talks and tutorials with informal interaction and mentorship in order to facilitate exchange among international researchers. Participants in the short course attended five full days packed with lectures, tutorials, and journal clubs covering a variety of cutting-edge techniques. Senior trainees, including advanced graduate students and post-docs, received additional training through the long course’s residence program. The extended program enabled these scientists to interact with leading researchers through a mix of structured training and flexible time for collaboration with fellow participants and other program faculty.

Collaboration on a wide variety of problem types and research themes facilitated cross-disciplinary communication and networking. During both courses, CGSI participants shared technical skills in coding and data analysis relevant to genetic and epigenetic imputation, fine-mapping of complex traits, linear mixed models, and Bayesian statistics in human, canine, mouse, and bacteria datasets. Scholars at different stages of their careers explored application of these methods, among others, to emerging themes such as cancer, neuropsychiatric disorders, evolutionary adaptation, early human origins, and data privacy.

CGSI instructors and participants established mentor-mentee relationships in computational genomics labs at UCLA, including the ZarLab and Bogdan Lab, while tackling practical problems and laying groundwork for future publications. In addition, participants developed camaraderie and professional connections while enjoying a full schedule of social activities, including dinners at classic Los Angeles area restaurants, volleyball tournaments in Santa Monica, bike rides along the beach, morning runs around the UCLA campus, and even an excursion to see a live production of “West Side Story” at the Hollywood Bowl.

CGSI organizers thank the National Institutes of Health grant GM112625, UCLA Clinical and Translational Science Institute grant UL1TR000124, and IPAM for making this unique program possible. We look forward to fostering more collaboration between mathematicians, computer scientists, biologists, and sequencing technology developers in both industry and academia with future CGSI programs.

Visit the CGSI website for an up-to-date archive of program videos, slides, papers, and more:

Enrollment in 2017 CGSI programs opens this fall with a registration deadline of February 1.


Sequencing with DNA Pools

Our group has recently published several papers on sequencing using DNA pools. These include two methods for obtaining genotypes from pools (10.1186/1471-2105-12-S6-S2) (10.1109/ACSSC.2012.6489173), a method for correcting errors introduced when mixing DNA into pools (10.1007/978-3-642-37195-0_4), and a method for performing association testing for rare variants when the sequence data is collected using pools (10.1534/genetics.113.150169).

High-throughput sequencing (HTS) technology has tremendously decreased the cost of sequencing one individual in the past few years. However, to perform genome-wide association studies (GWAS), we need to collect large cohorts of individuals who have the disease (called cases) and of individuals who do not have the disease (called controls). Unfortunately, performing whole-genome sequencing for large cohorts is still very expensive.

The actual cost of sequencing a sample consists of two parts. The first part is the cost of preparing a DNA sample for sequencing, which is referred to as the library preparation cost. Library preparation is also the most labor-intensive part of a sequencing study. The second part is the cost of the actual sequencing, which is proportional to the amount of sequence collected and which we refer to as the per-base sequencing cost. Technological advances are rapidly reducing the per-base cost of sequencing, while library preparation costs are more stable (Figure 1).


The first step, extracting the DNA and making it ready for sequencing, is referred to as library preparation; the second step is generating the DNA sequence from the pool of individuals. Library preparation is the more costly and labor-intensive of the two steps.


Erlich et al. (10.1101/gr.092957.109) introduced the concept of DNA pooling. The basic idea behind this approach is that DNA from multiple individuals is pooled together into a single DNA mixture, which is then prepared as a single library and sequenced. In this approach, the library preparation cost is reduced because one library is prepared per pool instead of one library per sample.
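To make the saving concrete, here is a minimal sketch of the two-part cost described above. The function name, the prices, and the assumption that pooling leaves the total amount of sequence unchanged are all hypothetical and chosen only for illustration; they are not figures from any of the cited papers.

```python
def sequencing_cost(n_samples, pool_size, library_cost, per_base_cost, bases_per_sample):
    """Total cost when samples are grouped into non-overlapping pools.

    pool_size=1 means every sample gets its own library (no pooling).
    Assumes (hypothetically) that the same total amount of sequence is
    collected with or without pooling.
    """
    n_libraries = -(-n_samples // pool_size)  # ceiling division
    library_total = n_libraries * library_cost
    sequencing_total = n_samples * bases_per_sample * per_base_cost
    return library_total + sequencing_total


# Hypothetical prices: $100 per library, $1e-8 per base, 3e9 bases per sample.
unpooled = sequencing_cost(1000, 1, 100.0, 1e-8, 3e9)
pooled = sequencing_cost(1000, 10, 100.0, 1e-8, 3e9)
print(f"one library per sample: ${unpooled:,.0f}")
print(f"pools of 10:            ${pooled:,.0f}")
```

Under these made-up prices, only the library preparation term shrinks with pooling, while the per-base term stays the same.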

Pooling methods can be split into two categories. The first category puts each individual in only one pool, and each pool consists of a fixed number of individuals. These types of methods are referred to as non-overlapping pool methods. The second category puts each individual in multiple pools and uses this information to recover each individual’s genotype. These methods are referred to as overlapping pool methods.
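As a toy illustration of how overlapping pools carry information about individuals, the sketch below arranges individuals on a grid and places each one in a row pool and a column pool. This grid design, the function names, and the noise-free "pool is positive if any member carries the variant" rule are assumptions made for the example; they are not the designs used in the cited papers.

```python
from itertools import product


def grid_pools(n_rows, n_cols):
    """Place individual (r, c) in row pool r and in column pool c."""
    row_pools = [[(r, c) for c in range(n_cols)] for r in range(n_rows)]
    col_pools = [[(r, c) for r in range(n_rows)] for c in range(n_cols)]
    return row_pools, col_pools


def candidate_carriers(row_pools, col_pools, carriers):
    """Return the individuals consistent with the variant-positive pools."""
    # A pool is positive if at least one of its members carries the variant.
    pos_rows = [i for i, pool in enumerate(row_pools) if any(m in carriers for m in pool)]
    pos_cols = [j for j, pool in enumerate(col_pools) if any(m in carriers for m in pool)]
    # Every (positive row, positive column) intersection is a possible carrier.
    return list(product(pos_rows, pos_cols))


row_pools, col_pools = grid_pools(4, 5)  # 20 individuals, 9 pools
one = candidate_carriers(row_pools, col_pools, {(2, 3)})
two = candidate_carriers(row_pools, col_pools, {(0, 1), (2, 3)})
print(one)  # a single carrier is identified uniquely
print(two)  # two carriers leave four candidates, so decoding is ambiguous
```

A single carrier of a rare variant is pinned down by 9 libraries instead of 20, but the ambiguity with two carriers hints at why more elaborate designs and decoding methods are needed for common variants.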

Many studies (10.1101/gr.088559.108) (10.1093/nar/gkq675) (10.1186/1471-2105-12-S6-S2) have shown that using overlapping pools we can recover rare SNPs with high accuracy. In our work, we develop two methods to detect the genotypes of both rare and common variants from pooled sequencing (10.1109/ACSSC.2012.6489173). The idea is to take advantage of genotypes on a subset of the variants, which are often available for these cohorts. Both methods tend to have better accuracy than imputation, which is the standard approach to predicting the genotypes of variants that were not collected.

Pooling has been successful in detecting rare variants, which is the main reason many GWAS have used pooling to detect rare causal SNPs (10.1101/gr.094680.109) (10.1038/ng.952). However, all these methods assume that all individuals have the same abundance level in the pool. The abundance level of an individual is the fraction of the reads in a pool that originated from that individual. We show in our paper (10.1007/978-3-642-37195-0_4) that this simple assumption does not hold, and that ignoring differences in abundance levels among individuals can lead to spurious associations. In our paper, we describe a probabilistic model that can estimate the abundance levels of individuals when genotype data on a subset of the variants is available. Furthermore, we extend the model to the case where the genotype of one of the individuals is missing. We show that leveraging the linkage disequilibrium (LD) pattern decreases the error rate.
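A small worked example shows why unequal abundance levels matter. The function below is a hypothetical helper, not the probabilistic model from the paper: it simply computes the expected fraction of alternate-allele reads in a pool as the abundance-weighted average of each individual's allele dosage, ignoring sequencing error.

```python
def pool_allele_frequency(genotypes, abundances):
    """Expected fraction of pool reads carrying the alternate allele.

    genotypes:  alternate-allele counts (0, 1, or 2) per individual
    abundances: fraction of the pool's reads from each individual (sums to 1)
    """
    return sum(a * g / 2 for g, a in zip(genotypes, abundances))


# Four individuals; only the first is homozygous for the alternate allele.
genotypes = [2, 0, 0, 0]

equal = pool_allele_frequency(genotypes, [0.25, 0.25, 0.25, 0.25])
skewed = pool_allele_frequency(genotypes, [0.70, 0.10, 0.10, 0.10])
print(equal, skewed)
```

The true allele frequency among the four individuals is 0.25, which is what the pool reports when abundances are equal. If the carrier happens to contribute 70% of the reads, the same pool suggests a frequency near 0.7, and a difference of this kind between case and control pools could be mistaken for an association.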

Finally, in another recent paper (10.1534/genetics.113.150169), we extend methods for implicating rare variants in disease to data collected using DNA sequencing pools.

The full citations of our four papers are below.


Navon, Oron; Sul, Jae Hoon ; Han, Buhm ; Conde, Lucia ; Bracci, Paige ; Riby, Jacques ; Skibola, Christine F; Eskin, Eleazar ; Halperin, Eran

Rare Variant Association Testing Under Low-Coverage Sequencing. Journal Article

In: Genetics, 2013, ISSN: 1943-2631.



Eskin, Itamar; Hormozdiari, Farhad ; Conde, Lucia ; Riby, Jacques ; Skibola, Chris ; Eskin, Eleazar ; Halperin, Eran

eALPS: Estimating Abundance Levels in Pooled Sequencing Using Available Genotyping Data Conference

Research in Computational Molecular Biology, Tel-Aviv University Springer Berlin Heidelberg, 2013.



Hormozdiari, Farhad; Wang, Zhanyong ; Yang, Wen-Yun ; Eskin, Eleazar

Efficient genotyping of individuals using overlapping pool sequencing and imputation Conference

2012 Conference Record of the Forty Sixth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), IEEE, 2012, ISBN: 978-1-4673-5051-8.



He, Dan; Zaitlen, Noah ; Pasaniuc, Bogdan ; Eskin, Eleazar ; Halperin, Eran

Genotyping common and rare variation using overlapping pool sequencing. Journal Article

In: BMC Bioinformatics, 12 Suppl 6 , pp. S2, 2011, ISSN: 1471-2105.





Read Mapping Uncertainty and Copy Number Variation (CNV)


Similar copies of a copy number variation (CNV) region exist in the reference genome. “C” and “T” are the only nucleotides that differ between regions A and B. Reads {r1, r2, …, r6} are obtained from the donor genome, as shown in the lower part of the figure. These reads can be mapped to the reference genome, as shown in the upper part of the figure.

Identifying copy number variation from high-throughput sequencing data is a very active research area (10.1038/nrg2958). Typical approaches map short sequence reads from a donor genome to a reference genome and then examine the number of reads that map to each region. The idea is that if few reads map to a region, the corresponding portion of the donor genome was likely deleted, while a large number of reads mapping to a region suggests that the corresponding region is duplicated in the donor genome.
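The read-depth idea can be sketched in a few lines. This is a deliberately naive illustration, not a published caller: the window counts and the expected number of reads per copy are made-up values, and real methods also correct for biases such as GC content and mappability.

```python
def call_copy_number(read_counts, expected_per_copy):
    """Estimate the copy number of each window from its read count.

    expected_per_copy is the (hypothetical) number of reads one copy of a
    window is expected to produce at the given sequencing depth.
    """
    return [round(count / expected_per_copy) for count in read_counts]


# Reads mapped to six consecutive windows of a diploid donor genome.
counts = [98, 103, 51, 49, 152, 100]
copies = call_copy_number(counts, expected_per_copy=50)
print(copies)  # 1 suggests a heterozygous deletion, 3 a duplication
```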

This method works very well when the duplicated or deleted region is unique and reads originating from that region can only map to a single location.  Unfortunately, many copy number variations occur in regions which themselves are duplicated in the genome.  Reads originating from these regions map to multiple positions in the reference.  Incorrect placement of these reads can then result in wildly incorrect copy number predictions.

We recently published a paper on dealing with read mapping uncertainty when predicting copy number variation (10.1089/cmb.2012.0258). Instead of mapping each read to a single location, we keep a probability distribution over all of the locations to which it can map. We then iterate between estimating the copy number and remapping the reads using these estimates. As a result, the few reads that span the small number of differences between the copies (as shown in the figure from the paper) end up being the clues that correctly determine which region was copied.
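The iterate-and-remap idea can be illustrated with a generic expectation-maximization loop for a mixture model. This is a simplified sketch and not the CNVeM model itself: it only estimates the fraction of reads originating from each of two near-identical regions, and the per-read likelihood values are hypothetical (1.0 where the read matches a region, 0.0 where it does not).

```python
def em_region_fractions(read_likelihoods, n_iter=200):
    """Estimate the fraction of reads that originate from each region.

    read_likelihoods[i][k] is the likelihood of read i if it came from
    region k. Ambiguous reads have equal likelihood for several regions.
    """
    n_regions = len(read_likelihoods[0])
    theta = [1.0 / n_regions] * n_regions  # start from equal fractions
    for _ in range(n_iter):
        totals = [0.0] * n_regions
        for lik in read_likelihoods:
            # E-step: soft-assign the read to regions given current fractions.
            weights = [t * l for t, l in zip(theta, lik)]
            norm = sum(weights)
            for k in range(n_regions):
                totals[k] += weights[k] / norm
        # M-step: new fractions are the average soft assignments.
        theta = [t / len(read_likelihoods) for t in totals]
    return theta


# 16 reads match regions A and B equally well; 3 span the differing base
# and match only A; 1 matches only B.
reads = [(1.0, 1.0)] * 16 + [(1.0, 0.0)] * 3 + [(0.0, 1.0)] * 1
theta = em_region_fractions(reads)
print(theta)
```

Even though 16 of the 20 reads are ambiguous, the estimate is driven to roughly three quarters for region A by the four reads that cover the distinguishing position, which mirrors the role those reads play in the figure.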

Ph.D. students Zhanyong Wang, Farhad Hormozdiari, and Wen-Yun Yang worked on this project, which was a collaboration with Eran Halperin.

Full Citation:

Wang, Zhanyong, Farhad Hormozdiari, Wen-Yun Yang, Eran Halperin, and Eleazar Eskin. 2013. CNVeM: Copy number variation detection using uncertainty of read mapping. J Comput Biol doi:10.1089/cmb.2012.0258


Copy number variations (CNVs) are widely known to be an important mediator for diseases and traits. The development of high-throughput sequencing (HTS) technologies has provided great opportunities to identify CNV regions in mammalian genomes. In a typical experiment, millions of short reads obtained from a genome of interest are mapped to a reference genome. The mapping information can be used to identify CNV regions. One important challenge in analyzing the mapping information is the large fraction of reads that can be mapped to multiple positions. Most existing methods either only consider reads that can be uniquely mapped to the reference genome or randomly place a read to one of its mapping positions. Therefore, these methods have low power to detect CNVs located within repeated sequences. In this study, we propose a probabilistic model, CNVeM, that utilizes the inherent uncertainty of read mapping. We use maximum likelihood to estimate locations and copy numbers of copied regions and implement an expectation-maximization (EM) algorithm. One important contribution of our model is that we can distinguish between regions in the reference genome that differ from each other by as little as 0.1%. As our model aims to predict the copy number of each nucleotide, we can predict the CNV boundaries with high resolution. We apply our method to simulated datasets and achieve higher accuracy compared to CNVnator. Moreover, we apply our method to real data from which we detected known CNVs. To our knowledge, this is the first attempt to predict CNVs at nucleotide resolution and to utilize uncertainty of read mapping.