Haplotype Phasing from Sequence Data

The haplotype phasing problem.


Over the past few years, our group has written several papers on inferring haplotypes from sequence data.

The problem of haplotype inference, often referred to as haplotype phasing, has a long history in computational genetics, and the problem itself has had several incarnations.  Genotyping technologies obtain “genotype” information on SNPs, which mixes the genetic information from both chromosomes.  However, many genetic analyses require “haplotype” information, which is the genetic information on each chromosome (see Figure).
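To make the ambiguity concrete, here is a small illustrative sketch (not from any of the papers above; the function name is mine) that enumerates every haplotype pair consistent with a genotype encoded as 0 (homozygous reference), 1 (heterozygous), or 2 (homozygous alternate). A genotype with k heterozygous sites admits 2^(k-1) distinct phasings, which is exactly why phase must be inferred rather than read off the genotypes:

```python
from itertools import product

def consistent_haplotype_pairs(genotype):
    """Enumerate all unordered haplotype pairs consistent with a 0/1/2 genotype.

    Homozygous sites are fixed; each heterozygous site can go either way,
    so the phase at those sites is ambiguous.
    """
    het = [i for i, g in enumerate(genotype) if g == 1]
    pairs = set()
    for bits in product([0, 1], repeat=len(het)):
        # Start from the forced alleles (0 -> 0, 2 -> 1) on haplotype 1.
        h1 = [1 if g == 2 else 0 for g in genotype]
        for idx, i in enumerate(het):
            h1[i] = bits[idx]
        # Haplotype 2 is the complement at heterozygous sites, equal elsewhere.
        h2 = [1 - a if genotype[i] == 1 else a for i, a in enumerate(h1)]
        # frozenset collapses the mirrored pair (h1, h2) / (h2, h1).
        pairs.add(frozenset([tuple(h1), tuple(h2)]))
    return pairs

# Three heterozygous sites -> 2^(3-1) = 4 possible phasings.
print(len(consistent_haplotype_pairs([1, 0, 1, 2, 1])))  # 4
```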

In the early days, before reference datasets were available, methods were applied to large numbers of genotyped individuals and attempted to identify a small number of haplotypes that explained the majority of the individual genotypes.  Methods from this period include PHASE (11254454) and HAP (14988101) (from our group with Eran Halperin).  The figure is actually one of Eran’s slides from around 2002.

Once reference datasets such as the HapMap became available, imputation-based methods such as IMPUTE(10.1038/ng2088) and BEAGLE(10.1016/j.ajhg.2009.01.005) dominated previous phasing approaches because they leveraged information from the carefully curated reference panels.

In principle, haplotype phasing or imputation methods can be applied directly to sequencing data by first calling genotypes in the sequencing data and then applying a phasing or imputation approach.  However, since each read originates from only one chromosome, a read that spans two or more heterozygous sites provides direct information on haplotype phase.  Combining these reads to construct haplotypes is referred to as the “haplotype assembly” problem, which was pioneered by Vikas Bansal and Vineet Bafna(10.1093/bioinformatics/btn298),(10.1101/gr.077065.108).  Dan He in our group developed an optimal method for haplotype assembly which guarantees finding the optimal solution for short reads and reduces the problem of haplotype assembly for longer reads to MaxSAT, which finds the optimal solution for the vast majority of problem instances(10.1093/bioinformatics/btq215). More recently, others have developed methods that can discover optimal solutions for all problem instances(10.1093/bioinformatics/btt349). In his paper, Dan also showed that haplotype assembly will always underperform traditional phasing methods for short read sequencing data because too few of the reads span multiple genotypes.
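Haplotype assembly is commonly formulated as minimum error correction (MEC): find the haplotype pair that requires flipping the fewest read alleles. The sketch below is a toy brute-force MEC solver over the heterozygous sites only, not Dan He's method or any published implementation; MEC is NP-hard in general, so exhaustive search like this is feasible only for tiny instances (real methods use dynamic programming or, as in Dan's work, a MaxSAT reduction):

```python
from itertools import product

def mec_score(h1, reads):
    """Minimum-error-correction cost of a phasing: each read is assigned to
    the haplotype it disagrees with least, and those disagreements are summed.

    Only heterozygous sites are modeled, so haplotype 2 is the complement
    of haplotype 1 at every site.
    """
    h2 = [1 - a for a in h1]
    total = 0
    for read in reads:  # read: {site_index: observed_allele}
        d1 = sum(1 for i, a in read.items() if h1[i] != a)
        d2 = sum(1 for i, a in read.items() if h2[i] != a)
        total += min(d1, d2)
    return total

def phase_by_mec(n_sites, reads):
    """Brute-force search over the 2^(n-1) phasings (site 0 fixed to 0
    to break the mirror symmetry between the two haplotypes)."""
    best = None
    for tail in product([0, 1], repeat=n_sites - 1):
        h1 = (0,) + tail
        score = mec_score(h1, reads)
        if best is None or score < best[0]:
            best = (score, h1)
    return best

# Six error-free read fragments from haplotypes (0,0,1,1) and (1,1,0,0);
# note each read links only adjacent sites, as short reads do.
reads = [{0: 0, 1: 0}, {1: 0, 2: 1}, {2: 1, 3: 1},
         {0: 1, 1: 1}, {1: 1, 2: 0}, {2: 0, 3: 0}]
score, h1 = phase_by_mec(4, reads)
print(score, h1)  # 0 (0, 0, 1, 1)
```

The example also illustrates Dan's point about short reads: if no read spanned more than one heterozygous site, every phasing would score the same and the reads would carry no phase information at all.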

To overcome this issue, Dan extended his methods to jointly perform imputation and haplotype assembly(10.1089/cmb.2012.0091),(10.1016/j.gene.2012.11.093).  These methods outperformed both imputation methods and haplotype assembly methods alone but unfortunately are too slow and memory-intensive to apply in practice.  More recently, in our group, Wen-Yun Yang, Zhanyong Wang, and Farhad Hormozdiari, together with Bogdan Pasaniuc, developed a sampling method which is both fast and accurate for combining haplotype assembly and imputation(10.1093/bioinformatics/btt386).



Read Mapping Uncertainty and Copy Number Variation (CNV)


Similar copies of a copy number variation (CNV) region exist in the reference genome. “C” and “T” are the only nucleotides that differ between regions A and B. Reads {r1, r2, …, r6} are obtained from the donor genome as shown in the lower part of the figure, and these reads can be mapped to the reference genome as shown in the upper part of the figure.

Identifying copy number variation from high-throughput sequencing data is a very active research area(10.1038/nrg2958).  Typical approaches map short sequence reads from a donor genome to a reference genome and then examine the number of reads that map to each region.  The idea is that if few reads map to a region, the corresponding portion of the donor genome was likely deleted, while a large number of reads mapping to a region suggests that the corresponding region is duplicated in the donor genome.
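The read-depth idea can be sketched in a few lines. This is a deliberately minimal toy (my own function name and thresholds, not any published caller): count reads per fixed-size window and flag windows whose depth deviates strongly from the genome-wide median:

```python
import statistics

def depth_based_cnv_calls(read_starts, genome_len, window=100):
    """Toy read-depth CNV scan.

    Counts reads per fixed window and flags windows whose count falls far
    below (candidate deletion) or above (candidate duplication) the
    genome-wide median depth. The 0.5x / 1.5x thresholds are arbitrary
    choices for illustration.
    """
    n_windows = genome_len // window
    counts = [0] * n_windows
    for pos in read_starts:
        w = pos // window
        if w < n_windows:
            counts[w] += 1
    med = statistics.median(counts)
    calls = []
    for w, c in enumerate(counts):
        if c < 0.5 * med:
            calls.append((w * window, (w + 1) * window, "deletion"))
        elif c > 1.5 * med:
            calls.append((w * window, (w + 1) * window, "duplication"))
    return calls

# Simulated donor: uniform coverage, except bases 300-399 are duplicated
# (double the reads) and bases 700-799 are deleted (no reads).
reads = [p for p in range(0, 1000, 5) if not 700 <= p < 800]
reads += list(range(300, 400, 5))
print(depth_based_cnv_calls(reads, genome_len=1000))
# [(300, 400, 'duplication'), (700, 800, 'deletion')]
```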

This method works very well when the duplicated or deleted region is unique and reads originating from that region can only map to a single location.  Unfortunately, many copy number variations occur in regions which themselves are duplicated in the genome.  Reads originating from these regions map to multiple positions in the reference.  Incorrect placement of these reads can then result in wildly incorrect copy number predictions.

We recently published a paper on dealing with read mapping uncertainty when predicting copy number variation(10.1089/cmb.2012.0258).  Instead of mapping each read to a single location, we keep a probability distribution over all of the locations to which it can map.  We then iteratively estimate the copy number and remap the reads using these estimates.  The result is that the few reads spanning the small number of differences between the copies (as shown in the figure from the paper) become the clues that correctly determine which region was copied.
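The iterate-estimate-remap loop is an EM algorithm. The sketch below shows the idea only, under simplifying assumptions (no positional model, a single expected-reads-per-copy constant); it is not the CNVeM implementation, and all names are mine. The E-step distributes each read over its candidate regions in proportion to mapping probability times the current copy-number estimate; the M-step re-estimates copy numbers from the fractional counts. Note how the two uniquely mapping reads are enough to pull the ambiguous mass toward the truly duplicated region:

```python
def em_copy_number(read_maps, regions, expected_per_copy, n_iter=100):
    """EM sketch for copy-number estimation under read-mapping uncertainty.

    read_maps[r] is {region: mapping_probability} for read r, listing every
    location the read could have come from. Copy numbers start at 1.0 and
    are refined by alternating fractional read assignment (E-step) and
    copy-number re-estimation (M-step).
    """
    copy = {reg: 1.0 for reg in regions}
    for _ in range(n_iter):
        # E-step: split each read across its candidate regions in proportion
        # to mapping probability times the current copy-number estimate.
        frac = {reg: 0.0 for reg in regions}
        for maps in read_maps:
            weights = {reg: p * copy[reg] for reg, p in maps.items()}
            z = sum(weights.values())
            for reg, w in weights.items():
                frac[reg] += w / z if z else 0.0
        # M-step: copy number is proportional to the fractional read count.
        copy = {reg: frac[reg] / expected_per_copy for reg in regions}
    return copy

# Two near-identical regions: the donor carries 2 copies of A and 1 of B.
# 27 reads map ambiguously (0.5/0.5); only the 3 reads spanning the single
# differing nucleotide map uniquely -- yet they resolve the ambiguity.
read_maps = ([{"A": 0.5, "B": 0.5}] * 27
             + [{"A": 1.0}] * 2
             + [{"B": 1.0}])
copy = em_copy_number(read_maps, ["A", "B"], expected_per_copy=10)
print(round(copy["A"], 2), round(copy["B"], 2))  # 2.0 1.0
```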

Ph.D. students Zhanyong Wang, Farhad Hormozdiari, and Wen-Yun Yang worked on this project, which was a collaboration with Eran Halperin.

Full Citation:

Wang, Zhanyong, Farhad Hormozdiari, Wen-Yun Yang, Eran Halperin, and Eleazar Eskin. 2013. CNVeM: Copy number variation detection using uncertainty of read mapping. J Comput Biol doi:10.1089/cmb.2012.0258


Copy number variations (CNVs) are widely known to be an important mediator for diseases and traits. The development of high-throughput sequencing (HTS) technologies has provided great opportunities to identify CNV regions in mammalian genomes. In a typical experiment, millions of short reads obtained from a genome of interest are mapped to a reference genome. The mapping information can be used to identify CNV regions. One important challenge in analyzing the mapping information is the large fraction of reads that can be mapped to multiple positions. Most existing methods either only consider reads that can be uniquely mapped to the reference genome or randomly place a read to one of its mapping positions. Therefore, these methods have low power to detect CNVs located within repeated sequences. In this study, we propose a probabilistic model, CNVeM, that utilizes the inherent uncertainty of read mapping. We use maximum likelihood to estimate locations and copy numbers of copied regions and implement an expectation-maximization (EM) algorithm. One important contribution of our model is that we can distinguish between regions in the reference genome that differ from each other by as little as 0.1%. As our model aims to predict the copy number of each nucleotide, we can predict the CNV boundaries with high resolution. We apply our method to simulated datasets and achieve higher accuracy compared to CNVnator. Moreover, we apply our method to real data from which we detected known CNVs. To our knowledge, this is the first attempt to predict CNVs at nucleotide resolution and to utilize uncertainty of read mapping.