Our group publishes papers presenting new methodologies, describing the results of studies that use our software, and reviewing current topics in the field of Bioinformatics. Scroll down or click here for a complete list of papers produced by our lab. Since 2013, we write blog posts summarizing new research papers and review articles:
GWAS
- Fine Mapping Causal Variants and Allelic Heterogeneity
- Widespread Allelic Heterogeneity in Complex Traits
- Selection in Europeans on Fatty Acid Desaturases Associated with Dietary Changes
- Incorporating prior information into association studies
- Characterization of Expression Quantitative Trait Loci in Pedigrees from Colombia and Costa Rica Ascertained for Bipolar Disorder
- Simultaneous modeling of disease status and clinical phenotypes to increase power in GWAS
- Efficient and accurate multiple-phenotype regression method for high dimensional data considering population structure
- Review Article: Population Structure in Genetic Studies: Confounding Factors and Mixed Models
- Colocalization of GWAS and eQTL Signals Detects Target Genes
- Chromosome conformation elucidates regulatory relationships in developing human brain
Mouse Genetics
- Review Article: The Hybrid Mouse Diversity Panel
- Genes, Environments and Meta-Analysis
- Review Article: Mixed Models and Population Structure
- Identifying Genes Involved in Blood Cell Traits
- Genes, Diet, and Body Weight (in Mice)
- Review Article: Mouse Genetics
Population Structure
- Efficient and accurate multiple-phenotype regression method for high dimensional data considering population structure
- Review Article: Population Structure in Genetic Studies: Confounding Factors and Mixed Models
- Accounting for Population Structure in Gene-by-Environment Interactions in Genome-Wide Association Studies Using Mixed Models
- Multiple testing correction in linear mixed models
- Identification of causal genes for complex traits (CAVIAR-gene)
- Accurate viral population assembly from ultra-deep sequencing data
- GRAT: Speeding up Expression Quantitative Trail Loci (eQTL) Studies
- Correcting Population Structure using Mixed Models Webcast
- Mixed models can correct for population structure for genomic regions under selection
Review Articles
- Review Article: Population Structure in Genetic Studies: Confounding Factors and Mixed Models
- Review Article: The Hybrid Mouse Diversity Panel
- Review Article: GWAS and Missing Heritability
- Review Article: Mixed Models and Population Structure
- Review Article: Mouse Genetics
Publications
2017 |
Mangul, Serghei; Yang, Harry Taegyun; Hormozdiari, Farhad; Dainis, Alex; Tseng, Elizabeth; Ashley, Euan A; Zelikovsky, Alex; Eskin, Eleazar HapIso : An Accurate Method for the Haplotype-Specific Isoforms Reconstruction from Long Single-Molecule Reads. Journal Article IEEE Trans Nanobioscience, 16 (2), pp. 108-115, 2017, ISSN: 1558-2639. Abstract | Links | BibTeX | Tags: Allele Specific Expression, Haplotype Phasing, Haplotyping from Sequences, RNAseq, Sequence Assembly @article{Mangul:IeeeTransNanobioscience:2017, title = {HapIso : An Accurate Method for the Haplotype-Specific Isoforms Reconstruction from Long Single-Molecule Reads.}, author = { Serghei Mangul and Harry Taegyun Yang and Farhad Hormozdiari and Alex Dainis and Elizabeth Tseng and Euan A. Ashley and Alex Zelikovsky and Eleazar Eskin}, url = {http://dx.doi.org/10.1109/TNB.2017.2675981}, issn = {1558-2639}, year = {2017}, date = {2017-01-01}, journal = {IEEE Trans Nanobioscience}, volume = {16}, number = {2}, pages = {108-115}, address = {United States}, abstract = {Sequencing of RNA provides the possibility to study an individual's transcriptome landscape and determine allelic expression ratios. Single-molecule protocols generate multi-kilobase reads longer than most transcripts allowing sequencing of complete haplotype isoforms. This allows partitioning the reads into two parental haplotypes. While the read length of the single-molecule protocols is long, the relatively high error rate limits the ability to accurately detect the genetic variants and assemble them into the haplotype-specific isoforms. In this paper, we present HapIso (Haplotype-specific Isoform Reconstruction), a method able to tolerate the relatively high error-rate of the single-molecule platform and partition the isoform reads into the parental alleles. Phasing the reads according to the allele of origin allows our method to efficiently distinguish between the read errors and the true biological mutations. HapIso uses a k-means clustering algorithm aiming to group the reads into two meaningful clusters maximizing the similarity of the reads within cluster and minimizing the similarity of the reads from different clusters. Each cluster corresponds to a parental haplotype. We used family pedigree information to evaluate our approach. Experimental validation suggests that HapIso is able to tolerate the relatively high error-rate and accurately partition the reads into the parental alleles of the isoform transcripts. We also applied HapIso to novel clinical single-molecule RNA-Seq data to estimate ASE of genes of interest. Our method was able to correct reads and determine Glu1883Lys point mutation of clinical signifcance validated by GeneDx HCM Panel. Furthermore, our method is the first method able to reconstruct the haplotype-specific isoforms from long single-molecule reads}, keywords = {Allele Specific Expression, Haplotype Phasing, Haplotyping from Sequences, RNAseq, Sequence Assembly}, pubstate = {published}, tppubtype = {article} } Sequencing of RNA provides the possibility to study an individual's transcriptome landscape and determine allelic expression ratios. Single-molecule protocols generate multi-kilobase reads longer than most transcripts allowing sequencing of complete haplotype isoforms. This allows partitioning the reads into two parental haplotypes. While the read length of the single-molecule protocols is long, the relatively high error rate limits the ability to accurately detect the genetic variants and assemble them into the haplotype-specific isoforms. In this paper, we present HapIso (Haplotype-specific Isoform Reconstruction), a method able to tolerate the relatively high error-rate of the single-molecule platform and partition the isoform reads into the parental alleles. Phasing the reads according to the allele of origin allows our method to efficiently distinguish between the read errors and the true biological mutations. HapIso uses a k-means clustering algorithm aiming to group the reads into two meaningful clusters maximizing the similarity of the reads within cluster and minimizing the similarity of the reads from different clusters. Each cluster corresponds to a parental haplotype. We used family pedigree information to evaluate our approach. Experimental validation suggests that HapIso is able to tolerate the relatively high error-rate and accurately partition the reads into the parental alleles of the isoform transcripts. We also applied HapIso to novel clinical single-molecule RNA-Seq data to estimate ASE of genes of interest. Our method was able to correct reads and determine Glu1883Lys point mutation of clinical signifcance validated by GeneDx HCM Panel. Furthermore, our method is the first method able to reconstruct the haplotype-specific isoforms from long single-molecule reads |
2013 |
He, Dan; Han, Buhm ; Eskin, Eleazar Hap-seq: An Optimal Algorithm for Haplotype Phasing with Imputation Using Sequencing Data. Journal Article J Comput Biol, 20 (2), pp. 80-92, 2013, ISSN: 1557-8666. Abstract | Links | BibTeX | Tags: Haplotyping from Sequences, Imputation @article{He:JComputBiol:2013, title = {Hap-seq: An Optimal Algorithm for Haplotype Phasing with Imputation Using Sequencing Data.}, author = { Dan He and Buhm Han and Eleazar Eskin}, url = {http://dx.doi.org/10.1089/cmb.2012.0091}, issn = {1557-8666}, year = {2013}, date = {2013-01-01}, journal = {J Comput Biol}, volume = {20}, number = {2}, pages = {80-92}, address = {United States}, organization = {1 IBM T.J. Watson Research , Yorktown Heights, NY.}, abstract = {Abstract Inference of haplotypes, or the sequence of alleles along each chromosome, is a fundamental problem in genetics and is important for many analyses, including admixture mapping, identifying regions of identity by descent, and imputation. Traditionally, haplotypes are inferred from genotype data obtained from microarrays using information on population haplotype frequencies inferred from either a large sample of genotyped individuals or a reference dataset such as the HapMap. Since the availability of large reference datasets, modern approaches for haplotype phasing along these lines are closely related to imputation methods. When applied to data obtained from sequencing studies, a straightforward way to obtain haplotypes is to first infer genotypes from the sequence data and then apply an imputation method. However, this approach does not take into account that alleles on the same sequence read originate from the same chromosome. Haplotype assembly approaches take advantage of this insight and predict haplotypes by assigning the reads to chromosomes in such a way that minimizes the number of conflicts between the reads and the predicted haplotypes. Unfortunately, assembly approaches require very high sequencing coverage and are usually not able to fully reconstruct the haplotypes. In this work, we present a novel approach, Hap-seq, which is simultaneously an imputation and assembly method that combines information from a reference dataset with the information from the reads using a likelihood framework. Our method applies a dynamic programming algorithm to identify the predicted haplotype, which maximizes the joint likelihood of the haplotype with respect to the reference dataset and the haplotype with respect to the observed reads. We show that our method requires only low sequencing coverage and can reconstruct haplotypes containing both common and rare alleles with higher accuracy compared to the state-of-the-art imputation methods}, keywords = {Haplotyping from Sequences, Imputation}, pubstate = {published}, tppubtype = {article} } Abstract Inference of haplotypes, or the sequence of alleles along each chromosome, is a fundamental problem in genetics and is important for many analyses, including admixture mapping, identifying regions of identity by descent, and imputation. Traditionally, haplotypes are inferred from genotype data obtained from microarrays using information on population haplotype frequencies inferred from either a large sample of genotyped individuals or a reference dataset such as the HapMap. Since the availability of large reference datasets, modern approaches for haplotype phasing along these lines are closely related to imputation methods. When applied to data obtained from sequencing studies, a straightforward way to obtain haplotypes is to first infer genotypes from the sequence data and then apply an imputation method. However, this approach does not take into account that alleles on the same sequence read originate from the same chromosome. Haplotype assembly approaches take advantage of this insight and predict haplotypes by assigning the reads to chromosomes in such a way that minimizes the number of conflicts between the reads and the predicted haplotypes. Unfortunately, assembly approaches require very high sequencing coverage and are usually not able to fully reconstruct the haplotypes. In this work, we present a novel approach, Hap-seq, which is simultaneously an imputation and assembly method that combines information from a reference dataset with the information from the reads using a likelihood framework. Our method applies a dynamic programming algorithm to identify the predicted haplotype, which maximizes the joint likelihood of the haplotype with respect to the reference dataset and the haplotype with respect to the observed reads. We show that our method requires only low sequencing coverage and can reconstruct haplotypes containing both common and rare alleles with higher accuracy compared to the state-of-the-art imputation methods |
Yang, Wen-Yun Y; Hormozdiari, Farhad ; Wang, Zhanyong ; He, Dan ; Pasaniuc, Bogdan ; Eskin, Eleazar Leveraging Multi-SNP Reads from Sequencing Data for Haplotype Inference. Journal Article Bioinformatics, 2013, ISSN: 1367-4811. Abstract | Links | BibTeX | Tags: Haplotype Phasing, Haplotyping from Sequences, Imputation @article{Yang:Bioinformatics:2013, title = {Leveraging Multi-SNP Reads from Sequencing Data for Haplotype Inference.}, author = { Wen-Yun Y. Yang and Farhad Hormozdiari and Zhanyong Wang and Dan He and Bogdan Pasaniuc and Eleazar Eskin}, url = {http://dx.doi.org/10.1093/bioinformatics/btt386}, issn = {1367-4811}, year = {2013}, date = {2013-01-01}, journal = {Bioinformatics}, organization = {Department of Computer Science,University of California, Los Angeles.}, abstract = {MOTIVATION: Haplotypes, defined as the sequence of alleles on one chromosome, are crucial for many genetic analyses. Since experimental determination of haplotypes is extremely expensive, haplotypes are traditionally inferred using computational approaches from genotype data, i.e. the mixture of the genetic information from both haplotypes. Best performing approaches for haplotype inference rely on Hidden Markov Models (HMMs), with the underlying assumption that the haplotypes of a given individual can be represented as a mosaic of segments from other haplotypes in the same population. Such algorithms utilize this model to predict the most likely haplotypes that explain the observed genotype data conditional on reference panel of haplotypes. With rapid advances in short read sequencing technologies, sequencing is quickly establishing as a powerful approach for collecting genetic variation information. As opposed to traditional genotyping-array technologies that independently calls genotypes at polymorphic sites, short read sequencing often collects haplotypic information; a read spanning more than one polymorphic locus (multi-SNP read) contains information on the haplotype from which the read originates. However, this information is generally ignored in existing approaches for haplotype phasing and genotype-calling from short read data. RESULTS: In this paper, we propose a novel framework for haplotype inference from short read sequencing that leverages multi-SNP reads together with a reference panel of haplotypes. The basis of our approach is a new probabilistic model that finds the most likely haplotype segments from the reference panel to explain the short read sequencing data for a given individual. We devised an efficient sampling method within a probabilistic model to achieve superior performance than existing methods. Using simulated sequencing reads from real individual genotypes in the HapMap data and the 1000 Genomes projects, we show that our method is highly accurate and computationally efficient. Our haplotype predictions improve accuracy over the basic haplotype copying model by around 20% with comparable computational time, and over another recently proposed approach Hap-SeqX by around 10% with significantly reduced computational time and memory usage. AVAILABILITY: Publicly available software is available at http://genetics.cs.ucla.edu/harsh CONTACT: bpasaniuc@mednet.ucla.edu; eeskin@cs.ucla.edu}, keywords = {Haplotype Phasing, Haplotyping from Sequences, Imputation}, pubstate = {published}, tppubtype = {article} } MOTIVATION: Haplotypes, defined as the sequence of alleles on one chromosome, are crucial for many genetic analyses. Since experimental determination of haplotypes is extremely expensive, haplotypes are traditionally inferred using computational approaches from genotype data, i.e. the mixture of the genetic information from both haplotypes. Best performing approaches for haplotype inference rely on Hidden Markov Models (HMMs), with the underlying assumption that the haplotypes of a given individual can be represented as a mosaic of segments from other haplotypes in the same population. Such algorithms utilize this model to predict the most likely haplotypes that explain the observed genotype data conditional on reference panel of haplotypes. With rapid advances in short read sequencing technologies, sequencing is quickly establishing as a powerful approach for collecting genetic variation information. As opposed to traditional genotyping-array technologies that independently calls genotypes at polymorphic sites, short read sequencing often collects haplotypic information; a read spanning more than one polymorphic locus (multi-SNP read) contains information on the haplotype from which the read originates. However, this information is generally ignored in existing approaches for haplotype phasing and genotype-calling from short read data. RESULTS: In this paper, we propose a novel framework for haplotype inference from short read sequencing that leverages multi-SNP reads together with a reference panel of haplotypes. The basis of our approach is a new probabilistic model that finds the most likely haplotype segments from the reference panel to explain the short read sequencing data for a given individual. We devised an efficient sampling method within a probabilistic model to achieve superior performance than existing methods. Using simulated sequencing reads from real individual genotypes in the HapMap data and the 1000 Genomes projects, we show that our method is highly accurate and computationally efficient. Our haplotype predictions improve accuracy over the basic haplotype copying model by around 20% with comparable computational time, and over another recently proposed approach Hap-SeqX by around 10% with significantly reduced computational time and memory usage. AVAILABILITY: Publicly available software is available at http://genetics.cs.ucla.edu/harsh CONTACT: bpasaniuc@mednet.ucla.edu; eeskin@cs.ucla.edu |
2012 |
He, Dan; Han, Buhm ; Eskin, Eleazar Hap-seq: An Optimal Algorithm for Haplotype Phasing with Imputation Using Sequencing Data Conference Research in Computational Molecular Biology, Univ. of California Springer Berlin Heidelberg, 2012. Abstract | Links | BibTeX | Tags: Haplotyping from Sequences @conference{He:ResearchInComputationalMolecularBiology:2012, title = {Hap-seq: An Optimal Algorithm for Haplotype Phasing with Imputation Using Sequencing Data}, author = { Dan He and Buhm Han and Eleazar Eskin}, url = {http://dx.doi.org/10.1007/978-3-642-29627-7_8}, year = {2012}, date = {2012-01-01}, booktitle = {Research in Computational Molecular Biology}, pages = {64-78}, publisher = {Springer Berlin Heidelberg}, organization = {Univ. of California}, abstract = {Inference of haplotypes, or the sequence of alleles along each chromosome, is a fundamental problem in genetics and is important for many analyses including admixture mapping, identifying regions of identity by descent and imputation. Traditionally, haplotypes were inferred from genotype data obtained from microarrays utilizing information on population haplotype frequencies inferred from either a large sample of genotyped individuals or a reference dataset such as the HapMap. Since the availability of large reference datasets, modern approaches for haplotype phasing along these lines are closely related to imputation methods. When applied to data obtained from sequencing studies, a straightforward way to obtain haplotypes is to first infer genotypes from the sequence data and then apply an imputation method. However, this approach does not take into account that alleles on the same sequence read originate from the same chromosome. Haplotype assembly approaches take advantage of this insight and predict haplotypes by assigning the reads to chromosomes in such a way that minimizes the number of conflicts between the reads and the predicted haplotypes. Unfortunately, assembly approaches require very high sequencing coverage and are usually not able to fully reconstruct the haplotypes. In this work, we present a novel approach, Hap-seq, which is simultaneously an imputation and assembly method which combines information from a reference dataset with the information from the reads using a likelihood framework. Our method applies a dynamic programming algorithm to identify the predicted haplotype which maximizes the joint likelihood of the haplotype with respect to the reference dataset and the haplotype with respect to the observed reads. We show that our method requires only low sequencing coverage and can reconstruct haplotypes containing both common and rare alleles with higher accuracy compared to the state-of-the-art imputation methods.}, keywords = {Haplotyping from Sequences}, pubstate = {published}, tppubtype = {conference} } Inference of haplotypes, or the sequence of alleles along each chromosome, is a fundamental problem in genetics and is important for many analyses including admixture mapping, identifying regions of identity by descent and imputation. Traditionally, haplotypes were inferred from genotype data obtained from microarrays utilizing information on population haplotype frequencies inferred from either a large sample of genotyped individuals or a reference dataset such as the HapMap. Since the availability of large reference datasets, modern approaches for haplotype phasing along these lines are closely related to imputation methods. When applied to data obtained from sequencing studies, a straightforward way to obtain haplotypes is to first infer genotypes from the sequence data and then apply an imputation method. However, this approach does not take into account that alleles on the same sequence read originate from the same chromosome. Haplotype assembly approaches take advantage of this insight and predict haplotypes by assigning the reads to chromosomes in such a way that minimizes the number of conflicts between the reads and the predicted haplotypes. Unfortunately, assembly approaches require very high sequencing coverage and are usually not able to fully reconstruct the haplotypes. In this work, we present a novel approach, Hap-seq, which is simultaneously an imputation and assembly method which combines information from a reference dataset with the information from the reads using a likelihood framework. Our method applies a dynamic programming algorithm to identify the predicted haplotype which maximizes the joint likelihood of the haplotype with respect to the reference dataset and the haplotype with respect to the observed reads. We show that our method requires only low sequencing coverage and can reconstruct haplotypes containing both common and rare alleles with higher accuracy compared to the state-of-the-art imputation methods. |
He, Dan; Eskin, Eleazar Hap-seqX: Expedite Algorithm for Haplotype Phasing with Imputation using Sequence Data. Journal Article Gene, 2012, ISSN: 1879-0038. Abstract | Links | BibTeX | Tags: Haplotyping from Sequences, Imputation @article{He:Gene:2012, title = {Hap-seqX: Expedite Algorithm for Haplotype Phasing with Imputation using Sequence Data.}, author = { Dan He and Eleazar Eskin}, url = {http://dx.doi.org/10.1016/j.gene.2012.11.093}, issn = {1879-0038}, year = {2012}, date = {2012-01-01}, journal = {Gene}, organization = {IBM T.J. Watson Research, Yorktown Heights, NY, USA. Electronic address: dhe@us.ibm.com.}, abstract = {Haplotype phasing is one of the most important problems in population genetics as haplotypes can be used to estimate the relatedness of individuals and to impute genotype information which is a commonly performed analysis when searching for variants involved in disease. The problem of haplotype phasing has been well studied. Methodologies for haplotype inference from sequencing data either combine a set of reference haplotypes and collected genotypes using a hidden Markov model or assemble haplotypes by overlapping sequencing reads. A recent algorithm Hap-seq considers using both sequencing data and reference haplotypes and it is a hybrid of a dynamic programming algorithm and an Hidden Markov Model (HMM), which is shown to be optimal. However, the algorithm requires extremely large amount of memory which is not practical for whole genome datasets. The current algorithm requires saving intermediate results to disk and reads back these results when needed, which significantly affects the practicality of the algorithm. In this work, we proposed the expedited version of the algorithm Hap-seqX, which addressed the memory issue by using a posterior probability to select the records that should be saved in memory. We show that Hap-seqX can save all intermediate results in memory and improves the execution time of the algorithm dramatically. Utilizing the strategy, Hap-seqX is able to predict haplotypes from whole genome sequencing data}, keywords = {Haplotyping from Sequences, Imputation}, pubstate = {published}, tppubtype = {article} } Haplotype phasing is one of the most important problems in population genetics as haplotypes can be used to estimate the relatedness of individuals and to impute genotype information which is a commonly performed analysis when searching for variants involved in disease. The problem of haplotype phasing has been well studied. Methodologies for haplotype inference from sequencing data either combine a set of reference haplotypes and collected genotypes using a hidden Markov model or assemble haplotypes by overlapping sequencing reads. A recent algorithm Hap-seq considers using both sequencing data and reference haplotypes and it is a hybrid of a dynamic programming algorithm and an Hidden Markov Model (HMM), which is shown to be optimal. However, the algorithm requires extremely large amount of memory which is not practical for whole genome datasets. The current algorithm requires saving intermediate results to disk and reads back these results when needed, which significantly affects the practicality of the algorithm. In this work, we proposed the expedited version of the algorithm Hap-seqX, which addressed the memory issue by using a posterior probability to select the records that should be saved in memory. We show that Hap-seqX can save all intermediate results in memory and improves the execution time of the algorithm dramatically. Utilizing the strategy, Hap-seqX is able to predict haplotypes from whole genome sequencing data |
2010 |
He, Dan; Choi, Arthur ; Pipatsrisawat, Knot ; Darwiche, Adnan ; Eskin, Eleazar Optimal algorithms for haplotype assembly from whole-genome sequence data. Journal Article Bioinformatics, 26 (12), pp. i183-90, 2010, ISSN: 1367-4811. Abstract | Links | BibTeX | Tags: Haplotyping from Sequences @article{He:Bioinformatics:2010, title = {Optimal algorithms for haplotype assembly from whole-genome sequence data.}, author = { Dan He and Arthur Choi and Knot Pipatsrisawat and Adnan Darwiche and Eleazar Eskin}, url = {http://dx.doi.org/10.1093/bioinformatics/btq215}, issn = {1367-4811}, year = {2010}, date = {2010-01-01}, journal = {Bioinformatics}, volume = {26}, number = {12}, pages = {i183-90}, address = {England}, organization = {Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095, USA. danhe@cs.ucla.edu}, abstract = {MOTIVATION: Haplotype inference is an important step for many types of analyses of genetic variation in the human genome. Traditional approaches for obtaining haplotypes involve collecting genotype information from a population of individuals and then applying a haplotype inference algorithm. The development of high-throughput sequencing technologies allows for an alternative strategy to obtain haplotypes by combining sequence fragments. The problem of 'haplotype assembly' is the problem of assembling the two haplotypes for a chromosome given the collection of such fragments, or reads, and their locations in the haplotypes, which are pre-determined by mapping the reads to a reference genome. Errors in reads significantly increase the difficulty of the problem and it has been shown that the problem is NP-hard even for reads of length 2. Existing greedy and stochastic algorithms are not guaranteed to find the optimal solutions for the haplotype assembly problem. RESULTS: In this article, we proposed a dynamic programming algorithm that is able to assemble the haplotypes optimally with time complexity O(m x 2(k) x n), where m is the number of reads, k is the length of the longest read and n is the total number of SNPs in the haplotypes. We also reduce the haplotype assembly problem into the maximum satisfiability problem that can often be solved optimally even when k is large. Taking advantage of the efficiency of our algorithm, we perform simulation experiments demonstrating that the assembly of haplotypes using reads of length typical of the current sequencing technologies is not practical. However, we demonstrate that the combination of this approach and the traditional haplotype phasing approaches allow us to practically construct haplotypes containing both common and rare variants.}, keywords = {Haplotyping from Sequences}, pubstate = {published}, tppubtype = {article} } MOTIVATION: Haplotype inference is an important step for many types of analyses of genetic variation in the human genome. Traditional approaches for obtaining haplotypes involve collecting genotype information from a population of individuals and then applying a haplotype inference algorithm. The development of high-throughput sequencing technologies allows for an alternative strategy to obtain haplotypes by combining sequence fragments. The problem of 'haplotype assembly' is the problem of assembling the two haplotypes for a chromosome given the collection of such fragments, or reads, and their locations in the haplotypes, which are pre-determined by mapping the reads to a reference genome. Errors in reads significantly increase the difficulty of the problem and it has been shown that the problem is NP-hard even for reads of length 2. Existing greedy and stochastic algorithms are not guaranteed to find the optimal solutions for the haplotype assembly problem. RESULTS: In this article, we proposed a dynamic programming algorithm that is able to assemble the haplotypes optimally with time complexity O(m x 2(k) x n), where m is the number of reads, k is the length of the longest read and n is the total number of SNPs in the haplotypes. We also reduce the haplotype assembly problem into the maximum satisfiability problem that can often be solved optimally even when k is large. Taking advantage of the efficiency of our algorithm, we perform simulation experiments demonstrating that the assembly of haplotypes using reads of length typical of the current sequencing technologies is not practical. However, we demonstrate that the combination of this approach and the traditional haplotype phasing approaches allow us to practically construct haplotypes containing both common and rare variants. |