Haplotype Phasing from Sequence Data

The classical haplotype phasing problem.

Over the past few years, our group has written several papers on inferring haplotypes from sequence data.

Haplotype inference, also referred to as haplotype phasing, has a long history in computational genetics, and the problem itself has had several incarnations. Genotyping technologies obtain “genotype” information on SNPs, which mixes the genetic information from both chromosomes. However, many genetic analyses require “haplotype” information, which is the genetic information on each individual chromosome (see Figure).
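To make the ambiguity concrete, here is a minimal Python sketch that enumerates the haplotype pairs consistent with one unphased genotype. The 0/1/2 encoding (count of alternate alleles per SNP) and the function name are illustrative assumptions, not part of any of the methods discussed here.

```python
from itertools import product

def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    genotype: list of alternate-allele counts per SNP (0, 1, or 2).
    This encoding is an assumption made for illustration.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    pairs = set()
    for bits in product([0, 1], repeat=len(het_sites)):
        # Homozygous sites are unambiguous: 0 -> allele 0, 2 -> allele 1.
        h1 = [g // 2 for g in genotype]
        h2 = list(h1)
        # Each heterozygous site can be phased in two ways.
        for site, b in zip(het_sites, bits):
            h1[site], h2[site] = b, 1 - b
        pairs.add(tuple(sorted([tuple(h1), tuple(h2)])))
    return sorted(pairs)

# Two heterozygous SNPs give two possible phasings.
for pair in compatible_haplotype_pairs([1, 2, 1, 0]):
    print(pair)
```

A genotype with k heterozygous SNPs is consistent with 2^(k-1) haplotype pairs, so the genotype alone cannot determine the phase once k is at least 2.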

In the early days, before reference datasets were available, methods were applied to large numbers of genotyped individuals and attempted to identify a small number of haplotypes that explained the majority of the individual genotypes. Methods from this period include PHASE (11254454) and HAP (14988101) (from our group with Eran Halperin). The figure is actually one of Eran’s slides from around 2002.

Once reference datasets such as the HapMap became available, imputation-based methods such as IMPUTE (10.1038/ng2088) and BEAGLE (10.1016/j.ajhg.2009.01.005) displaced previous phasing approaches because they leveraged information from the carefully curated reference datasets.

In principle, haplotype phasing or imputation methods can be applied directly to sequencing data by first calling genotypes in the sequencing data and then applying a phasing or imputation approach. However, since each read originates from only one chromosome, a read that spans two genotypes provides some information on haplotype phase. Combining these reads to construct haplotypes is referred to as the “haplotype assembly” problem, which was pioneered by Vikas Bansal and Vineet Bafna (10.1093/bioinformatics/btn298), (10.1101/gr.077065.108). Dan He in our group developed a method for haplotype assembly that guarantees finding the optimal solution for short reads and reduces the problem for longer reads to MaxSAT, which finds the optimal solution for the vast majority of problem instances (10.1093/bioinformatics/btq215). More recently, others have developed methods that can discover optimal solutions for all problem instances (10.1093/bioinformatics/btt349). In his paper, Dan also showed that haplotype assembly will always underperform traditional phasing methods for short-read sequencing data because too few of the reads span multiple genotypes.
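As a rough illustration of how reads carry phase information, the following Python sketch assembles haplotypes by brute force. It assumes every site is heterozygous, represents each read as a hypothetical `{site_index: allele}` dictionary, and minimizes the number of read alleles that must be flipped for every read to agree with one of the two haplotypes (a minimum error correction style objective, used here as an assumption). It is exponential in the number of sites and is not the dynamic programming or MaxSAT approach from the papers above.

```python
from itertools import product

def assemble_haplotypes(reads, n_sites):
    """Brute-force haplotype assembly over heterozygous sites.

    reads: list of dicts mapping site index -> observed allele (0 or 1).
    Returns (haplotype, complement, cost), where cost is the minimum number
    of read alleles that disagree with the haplotype each read is assigned to.
    """
    best_cost, best_h = None, None
    for rest in product([0, 1], repeat=n_sites - 1):
        h = (0,) + rest  # fix the first allele to break the h/complement symmetry
        cost = 0
        for read in reads:
            mismatches = sum(1 for site, allele in read.items() if h[site] != allele)
            # A read comes from either h or its complement; take the better fit.
            cost += min(mismatches, len(read) - mismatches)
        if best_cost is None or cost < best_cost:
            best_cost, best_h = cost, h
    complement = tuple(1 - a for a in best_h)
    return best_h, complement, best_cost

reads = [
    {0: 0, 1: 1}, {1: 1, 2: 0}, {2: 0, 3: 1},  # reads from one chromosome
    {0: 1, 1: 0}, {1: 0, 2: 1}, {2: 1, 3: 0},  # reads from the other chromosome
    {0: 0, 1: 1, 2: 1},                        # a read with one sequencing error
]
print(assemble_haplotypes(reads, 4))
```

Sites that are not linked by any read cannot be phased relative to each other in this formulation, which is one way to see why assembly alone struggles when few reads span multiple genotypes.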

To overcome this issue, Dan extended his methods to jointly perform imputation and haplotype assembly (10.1089/cmb.2012.0091), (10.1016/j.gene.2012.11.093). These methods outperformed both imputation methods and haplotype assembly methods, but unfortunately they are too slow and memory-intensive to apply in practice. More recently, in our group, Wen-Yun Yang, Zhanyong Wang, and Farhad Hormozdiari, together with Bogdan Pasaniuc, developed a sampling method for combining haplotype assembly and imputation that is both fast and accurate (10.1093/bioinformatics/btt386).

Full citations of our papers are here:


He, Dan; Han, Buhm ; Eskin, Eleazar

Hap-seq: An Optimal Algorithm for Haplotype Phasing with Imputation Using Sequencing Data. Journal Article

In: J Comput Biol, 20 (2), pp. 80-92, 2013, ISSN: 1557-8666.



Yang, Wen-Yun Y; Hormozdiari, Farhad ; Wang, Zhanyong ; He, Dan ; Pasaniuc, Bogdan ; Eskin, Eleazar

Leveraging Multi-SNP Reads from Sequencing Data for Haplotype Inference. Journal Article

In: Bioinformatics, 2013, ISSN: 1367-4811.



He, Dan; Eskin, Eleazar

Hap-seqX: Expedite Algorithm for Haplotype Phasing with Imputation using Sequence Data. Journal Article

In: Gene, 2012, ISSN: 1879-0038.



He, Dan; Choi, Arthur ; Pipatsrisawat, Knot ; Darwiche, Adnan ; Eskin, Eleazar

Optimal algorithms for haplotype assembly from whole-genome sequence data. Journal Article

In: Bioinformatics, 26 (12), pp. i183-90, 2010, ISSN: 1367-4811.



Sequencing with DNA Pools

Our group has recently published several papers on sequencing using DNA pools.  These include two methods for obtaining genotypes from pools(10.1186/1471-2105-12-S6-S2)(10.1109/ACSSC.2012.6489173), a method for correcting for errors when mixing the DNA into pools(10.1007/978-3-642-37195-0_4), and a method for performing association for rare variants when the sequence data is collected using pools(10.1534/genetics.113.150169).

High-throughput sequencing (HTS) technology has tremendously decreased the cost of sequencing one individual in the past few years. However, to perform genome-wide association studies (GWAS) we need to collect large cohorts of individuals who have the disease (called cases) and who do not have the disease (called controls). Unfortunately, performing whole-genome sequencing for large cohorts is still very expensive.

The actual cost of sequencing a sample consists of two parts. The first part is the cost of preparing a DNA sample for sequencing, which is referred to as the library preparation cost. Library preparation is also the most labor-intensive part of a sequencing study. The second part is the cost of the actual sequencing, which is proportional to the amount of sequence collected and which we refer to as the per-base sequencing cost. Technological advances are rapidly reducing the per-base cost of sequencing, while library preparation costs are more stable (Figure 1).


The first step, extracting the DNA and making it ready for sequencing, is referred to as library preparation; the second step is to generate the DNA sequence from the pool of individuals. Library preparation is costly and labor-intensive compared to the second step.


Erlich et al. (10.1101/gr.092957.109) introduced the concept of DNA pooling. The basic idea behind this approach is that DNA from multiple individuals is pooled together into a single DNA mixture, which is then prepared as a single library and sequenced. In this approach, the library preparation cost is reduced because one library is prepared per pool instead of one library per sample.
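The saving can be illustrated with a small cost calculation. This is only a sketch: the dollar amounts are made-up placeholders, the function name is hypothetical, and it assumes that the amount of sequence collected per individual stays the same whether or not pooling is used.

```python
import math

def total_cost(n_individuals, pool_size, library_cost, sequencing_cost_per_individual):
    """Total study cost with one library per pool (non-overlapping pools).

    All prices are hypothetical inputs; pool_size=1 means no pooling.
    """
    n_libraries = math.ceil(n_individuals / pool_size)
    return n_libraries * library_cost + n_individuals * sequencing_cost_per_individual

# Placeholder prices: 100 per library, 50 of sequencing per individual.
unpooled = total_cost(1000, 1, 100.0, 50.0)
pooled = total_cost(1000, 10, 100.0, 50.0)
print(unpooled, pooled)
```

Under these assumptions the sequencing term is unchanged, and only the library preparation term shrinks with the pool size.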

Pooling methods can be split into two categories. The first category puts each individual in only one pool, and each pool consists of a fixed number of individuals. These methods are referred to as non-overlapping pool methods. The second category puts each individual in multiple pools and uses this information to recover each individual’s genotype. These methods are referred to as overlapping pool methods.
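The following toy Python sketch shows why overlapping pools can recover individual genotypes. It assigns every individual to a distinct pair of pools and identifies the carrier of a rare variant from the set of pools in which the variant is observed. The pair-based design, the single-carrier assumption, and both function names are simplifications chosen for illustration; they are not the designs used in the papers cited here.

```python
from itertools import combinations

def design_overlapping_pools(n_individuals, n_pools):
    """Assign each individual a distinct pair of pools (illustrative design)."""
    pairs = list(combinations(range(n_pools), 2))
    if n_individuals > len(pairs):
        raise ValueError("not enough pool pairs for this many individuals")
    return {ind: pairs[ind] for ind in range(n_individuals)}

def decode_single_carrier(design, positive_pools):
    """Return the individuals whose pools are exactly the pools showing the variant.

    Assumes at most one carrier, which is plausible only for rare variants.
    """
    positive = set(positive_pools)
    return [ind for ind, pools in design.items() if set(pools) == positive]

design = design_overlapping_pools(10, 5)   # 10 individuals, only 5 libraries
carrier = decode_single_carrier(design, [2, 3])
print(carrier)
```

With several carriers or common variants the observed pools no longer map to a unique individual, which is where additional information such as partial genotype data becomes useful.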

Many studies (10.1101/gr.088559.108), (10.1093/nar/gkq675), (10.1186/1471-2105-12-S6-S2) have shown that with overlapping pools we can recover rare SNPs with high accuracy. In our work, we develop two methods to detect the genotypes of both rare and common variants from pooled sequencing (10.1109/ACSSC.2012.6489173). The idea is to take advantage of genotypes on a subset of the variants, which are often available for these cohorts. Both methods tend to have better accuracy than imputation, which is the standard approach for predicting the genotypes of variants that were not collected.

Pooling has been successful at detecting rare variants, which is the main reason many GWAS have used pooling to detect rare causal SNPs ((10.1101/gr.094680.109), (10.1038/ng.952)). However, all these methods assume that all individuals have the same abundance level in the pool. The abundance level of an individual is the fraction of the reads in a pool that originate from that specific individual. We show in our paper (10.1007/978-3-642-37195-0_4) that this simple assumption does not hold, and that ignoring differences in abundance levels between individuals can lead to spurious associations. In our paper, we describe a probabilistic model that can estimate the abundance levels of individuals when genotype data on a subset of the variants is available. Furthermore, we extend the model to the case where the genotype of one of the individuals is missing. We also show that leveraging the linkage disequilibrium (LD) pattern decreases the error rate.
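A small numeric example shows how unequal abundance distorts what is observed in a pool. The sketch below computes the expected fraction of reads carrying the alternate allele at one variant, directly from the definition of abundance given above. The function name, the genotype encoding (0, 1, or 2 alternate alleles), and the example abundances are assumptions for illustration; this is not the probabilistic model from the paper.

```python
def expected_alt_fraction(genotypes, abundances):
    """Expected fraction of reads carrying the alternate allele at one variant.

    genotypes: alternate-allele counts (0, 1, or 2), one per pooled individual.
    abundances: fraction of the pool's reads coming from each individual.
    """
    if abs(sum(abundances) - 1.0) > 1e-9:
        raise ValueError("abundances must sum to 1")
    # Each individual contributes reads in proportion to its abundance, and
    # a fraction g/2 of those reads carry the alternate allele.
    return sum(a * g / 2 for a, g in zip(abundances, genotypes))

genotypes = [2, 0, 0, 0]  # one homozygous carrier among four individuals
equal = expected_alt_fraction(genotypes, [0.25, 0.25, 0.25, 0.25])
skewed = expected_alt_fraction(genotypes, [0.7, 0.1, 0.1, 0.1])
print(equal, skewed)
```

The true allele frequency in the pool is 0.25, but if the carrier happens to be over-represented the reads suggest a much higher frequency, which is the kind of discrepancy that can produce a spurious case-control difference.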

Finally, in another recent paper(10.1534/genetics.113.150169), we extend methods for implicating rare variants in disease to data which is collected using DNA sequencing pools.

The full citations of our four papers are below.


Navon, Oron; Sul, Jae Hoon ; Han, Buhm ; Conde, Lucia ; Bracci, Paige ; Riby, Jacques ; Skibola, Christine F; Eskin, Eleazar ; Halperin, Eran

Rare Variant Association Testing Under Low-Coverage Sequencing. Journal Article

In: Genetics, 2013, ISSN: 1943-2631.



Eskin, Itamar; Hormozdiari, Farhad ; Conde, Lucia ; Riby, Jacques ; Skibola, Chris ; Eskin, Eleazar ; Halperin, Eran

eALPS: Estimating Abundance Levels in Pooled Sequencing Using Available Genotyping Data Conference

Research in Computational Molecular Biology, Tel-Aviv University Springer Berlin Heidelberg, 2013.



Hormozdiari, Farhad; Wang, Zhanyong ; Yang, Wen-Yun ; Eskin, Eleazar

Efficient genotyping of individuals using overlapping pool sequencing and imputation Conference

2012 Conference Record of the Forty Sixth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), IEEE, 2012, ISBN: 978-1-4673-5051-8.



He, Dan; Zaitlen, Noah ; Pasaniuc, Bogdan ; Eskin, Eleazar ; Halperin, Eran

Genotyping common and rare variation using overlapping pool sequencing. Journal Article

In: BMC Bioinformatics, 12 Suppl 6 , pp. S2, 2011, ISSN: 1471-2105.





RNA Editing Detection Using High-Throughput Sequencing

The central dogma of biology states that a DNA sequence is transcribed into an RNA sequence, which is then translated into protein. Thus, for a long time it was accepted that each base of an RNA sequence corresponds exactly to a base in the DNA sequence. However, Mahendran et al. (10.1038/349434a0) discovered for the first time that this one-to-one relation does not necessarily hold. The phenomenon in which RNA sequences and DNA sequences differ is known as RNA editing (RNA-DNA difference). Although the underlying cause of RNA editing is still unknown, A-to-I editing is known to be the most common form. A-to-I editing occurs when an adenosine (A) is converted to inosine (I), which is read as guanosine (G) during sequencing. Other kinds of RNA editing in mammalian genomes were thought to be rare until recently, when Li et al. (10.1126/science.1207018) reported 10,000 sites of RNA editing in human cancer cell lines, a significant number of which are not A-to-I editing. This study was the first to use high-throughput sequencing (HTS) technologies to detect RNA editing at whole-genome scale. Following this study, a series of works supported the results of Li et al. (10.1126/science.1207018), suggesting that RNA editing is more common than was known before the HTS era. In contrast, another series of works (10.1371/journal.pone.0025842), (10.1126/science.1209658), (10.1126/science.1210484), and (10.1126/science.1210624) indicates that the vast majority of RNA editing observed in HTS data is due to systematic errors in the sequencing process.

We use the mouse as a model organism to study RNA editing in mammalian genomes. We use the F1 cross of C57BL/6 and DBA. Leveraging the power of F1 mice, together with the fact that both strains were deeply sequenced by the Sanger Institute (10.1038/nature10413), provides us with a convenient framework for studying RNA editing in mammalian genomes. Furthermore, to remove technical artifacts we use biological replicates of the same F1 cross, and we consider mRNA from both liver and adipose tissue. In our paper (10.1534/genetics.112.149054) we used a set of stringent conditions to ensure that our results contain no possible sequencing artifacts. Although our stringent conditions may remove some true positives, our goal is to illustrate the existence of sequencing artifacts and to show that RNA editing other than A-to-I exists but is not as common as A-to-I editing. We found 63 RNA editing sites in liver and 216 in adipose tissue.