Our group publishes papers presenting new methodologies, describing the results of studies that use our software, and reviewing current topics in the field of bioinformatics. Scroll down for a complete list of papers produced by our lab. Since 2013, we have written blog posts summarizing new research papers and review articles:
GWAS
- Fine Mapping Causal Variants and Allelic Heterogeneity
- Widespread Allelic Heterogeneity in Complex Traits
- Selection in Europeans on Fatty Acid Desaturases Associated with Dietary Changes
- Incorporating prior information into association studies
- Characterization of Expression Quantitative Trait Loci in Pedigrees from Colombia and Costa Rica Ascertained for Bipolar Disorder
- Simultaneous modeling of disease status and clinical phenotypes to increase power in GWAS
- Efficient and accurate multiple-phenotype regression method for high dimensional data considering population structure
- Review Article: Population Structure in Genetic Studies: Confounding Factors and Mixed Models
- Colocalization of GWAS and eQTL Signals Detects Target Genes
- Chromosome conformation elucidates regulatory relationships in developing human brain
Mouse Genetics
- Review Article: The Hybrid Mouse Diversity Panel
- Genes, Environments and Meta-Analysis
- Review Article: Mixed Models and Population Structure
- Identifying Genes Involved in Blood Cell Traits
- Genes, Diet, and Body Weight (in Mice)
- Review Article: Mouse Genetics
Population Structure
- Efficient and accurate multiple-phenotype regression method for high dimensional data considering population structure
- Review Article: Population Structure in Genetic Studies: Confounding Factors and Mixed Models
- Accounting for Population Structure in Gene-by-Environment Interactions in Genome-Wide Association Studies Using Mixed Models
- Multiple testing correction in linear mixed models
- Identification of causal genes for complex traits (CAVIAR-gene)
- Accurate viral population assembly from ultra-deep sequencing data
- GRAT: Speeding up Expression Quantitative Trait Loci (eQTL) Studies
- Correcting Population Structure using Mixed Models Webcast
- Mixed models can correct for population structure for genomic regions under selection
Review Articles
- Review Article: Population Structure in Genetic Studies: Confounding Factors and Mixed Models
- Review Article: The Hybrid Mouse Diversity Panel
- Review Article: GWAS and Missing Heritability
- Review Article: Mixed Models and Population Structure
- Review Article: Mouse Genetics
Publications
2017
Mangul, Serghei; Yang, Harry Taegyun; Hormozdiari, Farhad; Dainis, Alex; Tseng, Elizabeth; Ashley, Euan A; Zelikovsky, Alex; Eskin, Eleazar HapIso : An Accurate Method for the Haplotype-Specific Isoforms Reconstruction from Long Single-Molecule Reads. Journal Article IEEE Trans Nanobioscience, 16 (2), pp. 108-115, 2017, ISSN: 1558-2639. Abstract | Links | BibTeX | Tags: Allele Specific Expression, Haplotype Phasing, Haplotyping from Sequences, RNAseq, Sequence Assembly @article{Mangul:IeeeTransNanobioscience:2017, title = {HapIso : An Accurate Method for the Haplotype-Specific Isoforms Reconstruction from Long Single-Molecule Reads.}, author = { Serghei Mangul and Harry Taegyun Yang and Farhad Hormozdiari and Alex Dainis and Elizabeth Tseng and Euan A. Ashley and Alex Zelikovsky and Eleazar Eskin}, url = {http://dx.doi.org/10.1109/TNB.2017.2675981}, issn = {1558-2639}, year = {2017}, date = {2017-01-01}, journal = {IEEE Trans Nanobioscience}, volume = {16}, number = {2}, pages = {108-115}, address = {United States}, abstract = {Sequencing of RNA provides the possibility to study an individual's transcriptome landscape and determine allelic expression ratios. Single-molecule protocols generate multi-kilobase reads longer than most transcripts allowing sequencing of complete haplotype isoforms. This allows partitioning the reads into two parental haplotypes. While the read length of the single-molecule protocols is long, the relatively high error rate limits the ability to accurately detect the genetic variants and assemble them into the haplotype-specific isoforms. In this paper, we present HapIso (Haplotype-specific Isoform Reconstruction), a method able to tolerate the relatively high error-rate of the single-molecule platform and partition the isoform reads into the parental alleles. Phasing the reads according to the allele of origin allows our method to efficiently distinguish between the read errors and the true biological mutations. 
HapIso uses a k-means clustering algorithm aiming to group the reads into two meaningful clusters maximizing the similarity of the reads within cluster and minimizing the similarity of the reads from different clusters. Each cluster corresponds to a parental haplotype. We used family pedigree information to evaluate our approach. Experimental validation suggests that HapIso is able to tolerate the relatively high error-rate and accurately partition the reads into the parental alleles of the isoform transcripts. We also applied HapIso to novel clinical single-molecule RNA-Seq data to estimate ASE of genes of interest. Our method was able to correct reads and determine Glu1883Lys point mutation of clinical significance validated by GeneDx HCM Panel. Furthermore, our method is the first method able to reconstruct the haplotype-specific isoforms from long single-molecule reads.}, keywords = {Allele Specific Expression, Haplotype Phasing, Haplotyping from Sequences, RNAseq, Sequence Assembly}, pubstate = {published}, tppubtype = {article} }
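The clustering idea in the HapIso abstract above (grouping reads into two clusters, one per parental haplotype) can be illustrated with a toy two-means partition. This is only a sketch under stated assumptions, not the HapIso implementation: the read encoding (+1 reference allele, -1 alternate allele, 0 site not covered), the deterministic initialisation, and the function name `two_means_reads` are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def two_means_reads(reads: Sequence[Sequence[int]], iters: int = 20) -> List[int]:
    """Partition reads into two clusters (a toy stand-in for two haplotypes).

    Assumed encoding per variant site: +1 = reference allele,
    -1 = alternate allele, 0 = site not covered by the read.
    """
    n_sites = len(reads[0])

    def dist(read, centre):
        # Squared distance, counting only sites the read actually covers.
        return sum((r - c) ** 2 for r, c in zip(read, centre) if r != 0)

    # Deterministic initialisation: the first read, then the read farthest from it.
    centres = [[float(x) for x in reads[0]]]
    far = max(range(len(reads)), key=lambda i: dist(reads[i], centres[0]))
    centres.append([float(x) for x in reads[far]])

    labels: List[int] = []
    for _ in range(iters):
        # Assignment step: each read joins the nearer centre.
        new_labels = [0 if dist(r, centres[0]) <= dist(r, centres[1]) else 1 for r in reads]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: per-site mean over the reads that cover the site.
        for k in (0, 1):
            members = [r for r, lab in zip(reads, labels) if lab == k]
            for j in range(n_sites):
                covered = [m[j] for m in members if m[j] != 0]
                if covered:
                    centres[k][j] = sum(covered) / len(covered)
    return labels


# Six toy reads over four sites; the last two each carry one simulated error.
toy_reads = [
    [1, -1, 1, -1],
    [1, -1, 0, -1],
    [-1, 1, -1, 1],
    [-1, 1, -1, 0],
    [1, 1, 1, -1],
    [-1, 1, 1, 1],
]
print(two_means_reads(toy_reads))
```

In this toy input the two reads with a single error are still assigned to the cluster of their originating haplotype, which mirrors the abstract's point that phasing helps separate read errors from true variants.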
2013
Yang, Wen-Yun Y; Hormozdiari, Farhad ; Wang, Zhanyong ; He, Dan ; Pasaniuc, Bogdan ; Eskin, Eleazar Leveraging Multi-SNP Reads from Sequencing Data for Haplotype Inference. Journal Article Bioinformatics, 2013, ISSN: 1367-4811. Abstract | Links | BibTeX | Tags: Haplotype Phasing, Haplotyping from Sequences, Imputation @article{Yang:Bioinformatics:2013, title = {Leveraging Multi-SNP Reads from Sequencing Data for Haplotype Inference.}, author = { Wen-Yun Y. Yang and Farhad Hormozdiari and Zhanyong Wang and Dan He and Bogdan Pasaniuc and Eleazar Eskin}, url = {http://dx.doi.org/10.1093/bioinformatics/btt386}, issn = {1367-4811}, year = {2013}, date = {2013-01-01}, journal = {Bioinformatics}, organization = {Department of Computer Science,University of California, Los Angeles.}, abstract = {MOTIVATION: Haplotypes, defined as the sequence of alleles on one chromosome, are crucial for many genetic analyses. Since experimental determination of haplotypes is extremely expensive, haplotypes are traditionally inferred using computational approaches from genotype data, i.e. the mixture of the genetic information from both haplotypes. Best performing approaches for haplotype inference rely on Hidden Markov Models (HMMs), with the underlying assumption that the haplotypes of a given individual can be represented as a mosaic of segments from other haplotypes in the same population. Such algorithms utilize this model to predict the most likely haplotypes that explain the observed genotype data conditional on reference panel of haplotypes. With rapid advances in short read sequencing technologies, sequencing is quickly establishing as a powerful approach for collecting genetic variation information. 
As opposed to traditional genotyping-array technologies that independently calls genotypes at polymorphic sites, short read sequencing often collects haplotypic information; a read spanning more than one polymorphic locus (multi-SNP read) contains information on the haplotype from which the read originates. However, this information is generally ignored in existing approaches for haplotype phasing and genotype-calling from short read data. RESULTS: In this paper, we propose a novel framework for haplotype inference from short read sequencing that leverages multi-SNP reads together with a reference panel of haplotypes. The basis of our approach is a new probabilistic model that finds the most likely haplotype segments from the reference panel to explain the short read sequencing data for a given individual. We devised an efficient sampling method within a probabilistic model to achieve superior performance than existing methods. Using simulated sequencing reads from real individual genotypes in the HapMap data and the 1000 Genomes projects, we show that our method is highly accurate and computationally efficient. Our haplotype predictions improve accuracy over the basic haplotype copying model by around 20% with comparable computational time, and over another recently proposed approach Hap-SeqX by around 10% with significantly reduced computational time and memory usage. AVAILABILITY: Publicly available software is available at http://genetics.cs.ucla.edu/harsh CONTACT: bpasaniuc@mednet.ucla.edu; eeskin@cs.ucla.edu}, keywords = {Haplotype Phasing, Haplotyping from Sequences, Imputation}, pubstate = {published}, tppubtype = {article} }
2006
Marchini, Jonathan; Cutler, David ; Patterson, Nick ; Stephens, Matthew ; Eskin, Eleazar ; Halperin, Eran ; Lin, Shin ; Qin, Zhaohui S; Munro, Heather M; Abecasis, Goncalo R; Donnelly, Peter ; Consortium, International HapMap A comparison of phasing algorithms for trios and unrelated individuals. Journal Article Am J Hum Genet, 78 (3), pp. 437-50, 2006, ISSN: 0002-9297. Abstract | Links | BibTeX | Tags: Haplotype Phasing @article{Marchini:AmJHumGenet:2006, title = {A comparison of phasing algorithms for trios and unrelated individuals.}, author = { Jonathan Marchini and David Cutler and Nick Patterson and Matthew Stephens and Eleazar Eskin and Eran Halperin and Shin Lin and Zhaohui S. Qin and Heather M. Munro and Goncalo R. Abecasis and Peter Donnelly and International HapMap Consortium}, url = {http://dx.doi.org/10.1086/500808}, issn = {0002-9297}, year = {2006}, date = {2006-01-01}, journal = {Am J Hum Genet}, volume = {78}, number = {3}, pages = {437-50}, address = {United States}, organization = {Department of Statistics, University of Oxford, Oxford OX1 3TG, United Kingdom. marchini@stats.ox.ac.uk}, abstract = {Knowledge of haplotype phase is valuable for many analysis methods in the study of disease, population, and evolutionary genetics. Considerable research effort has been devoted to the development of statistical and computational methods that infer haplotype phase from genotype data. Although a substantial number of such methods have been developed, they have focused principally on inference from unrelated individuals, and comparisons between methods have been rather limited. Here, we describe the extension of five leading algorithms for phase inference for handling father-mother-child trios. We performed a comprehensive assessment of the methods applied to both trios and to unrelated individuals, with a focus on genomic-scale problems, using both simulated data and data from the HapMap project. The most accurate algorithm was PHASE (v2.1). 
For this method, the percentages of genotypes whose phase was incorrectly inferred were 0.12%, 0.05%, and 0.16% for trios from simulated data, HapMap Centre d'Etude du Polymorphisme Humain (CEPH) trios, and HapMap Yoruban trios, respectively, and 5.2% and 5.9% for unrelated individuals in simulated data and the HapMap CEPH data, respectively. The other methods considered in this work had comparable but slightly worse error rates. The error rates for trios are similar to the levels of genotyping error and missing data expected. We thus conclude that all the methods considered will provide highly accurate estimates of haplotypes when applied to trio data sets. Running times differ substantially between methods. Although it is one of the slowest methods, PHASE (v2.1) was used to infer haplotypes for the 1 million-SNP HapMap data set. Finally, we evaluated methods of estimating the value of r² between a pair of SNPs and concluded that all methods estimated r² well when the estimated value was ≥0.8.}, keywords = {Haplotype Phasing}, pubstate = {published}, tppubtype = {article} }
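For readers unfamiliar with the r² statistic mentioned at the end of the abstract above, the following is a minimal sketch of the standard definition computed from phased haplotypes: D = p_AB − p_A·p_B and r² = D² / (p_A(1 − p_A)·p_B(1 − p_B)). The function name `r_squared` and the 0/1 allele encoding are assumptions for illustration; the estimation methods evaluated in the paper are not reproduced here.

```python
from typing import Sequence


def r_squared(haplotypes: Sequence[Sequence[int]], i: int, j: int) -> float:
    """Squared correlation between the alleles at sites i and j.

    Each haplotype is a sequence of 0/1 alleles (assumed encoding).
    """
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n   # frequency of allele 1 at site i
    p_b = sum(h[j] for h in haplotypes) / n   # frequency of allele 1 at site j
    p_ab = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either site is monomorphic")
    d = p_ab - p_a * p_b                      # linkage disequilibrium coefficient D
    return d * d / denom


# Two sites in complete linkage disequilibrium give r^2 = 1.
print(r_squared([(0, 0), (0, 0), (1, 1), (1, 1)], 0, 1))
```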
Eskin, Eleazar; Sharan, Roded; Halperin, Eran A note on phasing long genomic regions using local haplotype predictions. Journal Article J Bioinform Comput Biol, 4 (3), pp. 639-47, 2006, ISSN: 0219-7200. Abstract | Links | BibTeX | Tags: Haplotype Phasing @article{Eskin:JBioinformComputBiol:2006, title = {A note on phasing long genomic regions using local haplotype predictions.}, author = { Eleazar Eskin and Roded Sharan and Eran Halperin}, url = {https://www.ncbi.nlm.nih.gov/pubmed/16960967}, issn = {0219-7200}, year = {2006}, date = {2006-01-01}, journal = {J Bioinform Comput Biol}, volume = {4}, number = {3}, pages = {639-47}, address = {England}, organization = {Computer Science and Engineering, University of California, San Diego, USA. eeskin@cs.ucsd.edu}, abstract = {The common approaches for haplotype inference from genotype data are targeted toward phasing short genomic regions. Longer regions are often tackled in a heuristic manner, due to the high computational cost. Here, we describe a novel approach for phasing genotypes over long regions, which is based on combining information from local predictions on short, overlapping regions. The phasing is done in a way, which maximizes a natural maximum likelihood criterion. Among other things, this criterion takes into account the physical length between neighboring single nucleotide polymorphisms. The approach is very efficient and is applied to several large scale datasets and is shown to be successful in two recent benchmarking studies (Zaitlen et al., in press; Marchini et al., in preparation). Our method is publicly available via a webserver at http://research.calit2.net/hap/.}, keywords = {Haplotype Phasing}, pubstate = {published}, tppubtype = {article} }
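The idea of combining local predictions on short, overlapping regions can be illustrated with a deliberately simplified stitching step. The paper describes a maximum likelihood criterion that also accounts for physical distance between SNPs; the sketch below swaps in a much cruder rule (pick the orientation of the right-hand window that disagrees least with the left-hand window on the shared sites). The function name `stitch` and the string encoding of haplotypes are hypothetical.

```python
from typing import Tuple

HapPair = Tuple[str, str]


def stitch(left: HapPair, right: HapPair, overlap: int) -> HapPair:
    """Join two locally phased windows that share `overlap` sites.

    The last `overlap` sites of `left` are assumed to be the same sites
    as the first `overlap` sites of `right`.
    """
    if overlap <= 0:
        raise ValueError("windows must share at least one site")
    l0, l1 = left
    r0, r1 = right

    def mismatches(a: str, b: str) -> int:
        return sum(x != y for x, y in zip(a, b))

    # Cost of keeping the right window as given versus swapping its two haplotypes.
    keep = mismatches(l0[-overlap:], r0[:overlap]) + mismatches(l1[-overlap:], r1[:overlap])
    flip = mismatches(l0[-overlap:], r1[:overlap]) + mismatches(l1[-overlap:], r0[:overlap])
    if flip < keep:
        r0, r1 = r1, r0
    # Append only the sites of the right window that lie beyond the overlap.
    return l0 + r0[overlap:], l1 + r1[overlap:]


# The right window is reported in the opposite orientation and gets swapped.
print(stitch(("0101", "1010"), ("1011", "0100"), overlap=2))
```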
2005
Zaitlen, Noah A; Kang, Hyun Min ; Feolo, Michael L; Sherry, Stephen T; Halperin, Eran ; Eskin, Eleazar Inference and analysis of haplotypes from combined genotyping studies deposited in dbSNP. Journal Article Genome Res, 15 (11), pp. 1594-600, 2005, ISSN: 1088-9051. Abstract | Links | BibTeX | Tags: Haplotype Phasing @article{Zaitlen:GenomeRes:2005, title = {Inference and analysis of haplotypes from combined genotyping studies deposited in dbSNP.}, author = { Noah A. Zaitlen and Hyun Min Kang and Michael L. Feolo and Stephen T. Sherry and Eran Halperin and Eleazar Eskin}, url = {http://dx.doi.org/10.1101/gr.4297805}, issn = {1088-9051}, year = {2005}, date = {2005-01-01}, journal = {Genome Res}, volume = {15}, number = {11}, pages = {1594-600}, address = {United States}, organization = {Bioinformatics Program, University of California, San Diego, La Jolla, California 92093, USA.}, abstract = {In the attempt to understand human variation and the genetic basis of complex disease, a tremendous number of single nucleotide polymorphisms (SNPs) have been discovered and deposited into NCBI's dbSNP public database. More than 2.7 million SNPs in the database have genotype information. This data provides an invaluable resource for understanding the structure of human variation and the design of genetic association studies. The genotypes deposited to dbSNP are unphased, and thus, the haplotype information is unknown. We applied the phasing method HAP to obtain the haplotype information, block partitions, and tag SNPs for all publicly available genotype data and deposited this information into the dbSNP database. We also deposited the orthologous chimpanzee reference sequence for each predicted haplotype block computed using the UCSC BLASTZ alignments of human and chimpanzee. Using dbSNP, researchers can now easily perform analyses using multiple genotype data sets from the same genomic regions. 
Dense and sparse genotype data sets from the same region were combined to show that the number of common haplotypes is significantly underestimated in whole genome data sets, while the predicted haplotypes over the common SNPs are consistent between studies. To validate the accuracy of the predictions, we bench-marked HAP's running time and phasing accuracy against PHASE. Although HAP is slightly less accurate than PHASE, HAP is over 1000 times faster than PHASE, making it suitable for application to the entire set of genotypes in dbSNP.}, keywords = {Haplotype Phasing}, pubstate = {published}, tppubtype = {article} }
2004
Halperin, Eran; Eskin, Eleazar Haplotype reconstruction from genotype data using Imperfect Phylogeny. Journal Article Bioinformatics, 20 (12), pp. 1842-9, 2004, ISSN: 1367-4803. Abstract | Links | BibTeX | Tags: Haplotype Phasing @article{Halperin:Bioinformatics:2004, title = {Haplotype reconstruction from genotype data using Imperfect Phylogeny.}, author = { Eran Halperin and Eleazar Eskin}, url = {http://dx.doi.org/10.1093/bioinformatics/bth149}, issn = {1367-4803}, year = {2004}, date = {2004-01-01}, journal = {Bioinformatics}, volume = {20}, number = {12}, pages = {1842-9}, address = {England}, organization = {CS Division, University of California Berkeley, Berkeley, CA 92093-0114, USA. eran@eecs.berkeley.edu}, abstract = {Critical to the understanding of the genetic basis for complex diseases is the modeling of human variation. Most of this variation can be characterized by single nucleotide polymorphisms (SNPs) which are mutations at a single nucleotide position. To characterize the genetic variation between different people, we must determine an individual's haplotype or which nucleotide base occurs at each position of these common SNPs for each chromosome. In this paper, we present results for a highly accurate method for haplotype resolution from genotype data. Our method leverages a new insight into the underlying structure of haplotypes that shows that SNPs are organized in highly correlated 'blocks'. In a few recent studies, considerable parts of the human genome were partitioned into blocks, such that the majority of the sequenced genotypes have one of about four common haplotypes in each block. Our method partitions the SNPs into blocks, and for each block, we predict the common haplotypes and each individual's haplotype. We evaluate our method over biological data. Our method predicts the common haplotypes perfectly and has a very low error rate (<2% over the data) when taking into account the predictions for the uncommon haplotypes. 
Our method is extremely efficient compared with previous methods such as PHASE and HAPLOTYPER. Its efficiency allows us to find the block partition of the haplotypes, to cope with missing data and to work with large datasets. AVAILABILITY: The algorithm is available via a Web server at http://www.calit2.net/compbio/hap/}, keywords = {Haplotype Phasing}, pubstate = {published}, tppubtype = {article} }
2003
Eskin, Eleazar; Halperin, Eran; Karp, Richard M. Efficient reconstruction of haplotype structure via perfect phylogeny. Journal Article. J Bioinform Comput Biol, 1 (1), pp. 1-20, 2003, ISSN: 0219-7200. Tags: Haplotype Phasing
@article{Eskin:JBioinformComputBiol:2003,
  title = {Efficient reconstruction of haplotype structure via perfect phylogeny.},
  author = {Eleazar Eskin and Eran Halperin and Richard M. Karp},
  url = {https://www.ncbi.nlm.nih.gov/pubmed/15290779},
  issn = {0219-7200},
  year = {2003},
  date = {2003-01-01},
  journal = {J Bioinform Comput Biol},
  volume = {1},
  number = {1},
  pages = {1-20},
  address = {England},
  organization = {Computer Science Department, Columbia University, USA. eeskin@cs.columbia.edu},
  keywords = {Haplotype Phasing},
  pubstate = {published},
  tppubtype = {article}
}
Abstract: Each person's genome contains two copies of each chromosome, one inherited from the father and the other from the mother. A person's genotype specifies the pair of bases at each site, but does not specify which base occurs on which chromosome. The sequence of each chromosome separately is called a haplotype. The determination of the haplotypes within a population is essential for understanding genetic variation and the inheritance of complex diseases. The haplotype mapping project, a successor to the human genome project, seeks to determine the common haplotypes in the human population. Since experimental determination of a person's genotype is less expensive than determining its component haplotypes, algorithms are required for computing haplotypes from genotypes. Two observations aid in this process: first, the human genome contains short blocks within which only a few different haplotypes occur; second, as suggested by Gusfield, it is reasonable to assume that the haplotypes observed within a block have evolved according to a perfect phylogeny, in which at most one mutation event has occurred at any site, and no recombination occurred at the given region.
We present a simple and efficient polynomial-time algorithm for inferring haplotypes from the genotypes of a set of individuals assuming a perfect phylogeny. Using a reduction to 2-SAT we extend this algorithm to handle constraints that apply when we have genotypes from both parents and child. We also present a hardness result for the problem of removing the minimum number of individuals from a population to ensure that the genotypes of the remaining individuals are consistent with a perfect phylogeny. Our algorithms have been tested on real data and give biologically meaningful results. Our webserver (http://www.cs.columbia.edu/compbio/hap/) is publicly available for predicting haplotypes from genotype data and partitioning genotype data into blocks.
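To make the problem in this abstract concrete, here is a minimal Python sketch. It is not the polynomial-time algorithm from the paper. It is an exponential brute-force search, included only to illustrate what "phasing under a perfect phylogeny" means. It assumes binary sites, a hypothetical genotype encoding (0 or 1 for homozygous, 2 for heterozygous), and the standard four-gamete condition as the test for whether a set of binary haplotypes fits a perfect phylogeny. All function names are invented for this illustration.

```python
from itertools import combinations, product


def phasings(genotype):
    """Yield every unordered haplotype pair consistent with one genotype.

    Assumed encoding: 0 or 1 = homozygous for that allele, 2 = heterozygous.
    """
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        h = tuple(genotype)
        yield (h, h)
        return
    # Fix the first heterozygous site to 0 on the first haplotype so that
    # mirrored pairs are not produced twice: 2^(k-1) pairs for k het sites.
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, (0,) + bits):
            h1[site] = b
            h2[site] = 1 - b
        yield (tuple(h1), tuple(h2))


def passes_four_gamete_test(haplotypes):
    """True if no pair of sites shows all four gametes 00, 01, 10, 11."""
    haps = list(haplotypes)
    if not haps:
        return True
    for i, j in combinations(range(len(haps[0])), 2):
        if len({(h[i], h[j]) for h in haps}) == 4:
            return False
    return True


def phase_by_brute_force(genotypes):
    """Return one phasing whose haplotypes pass the test, or None."""
    options = [list(phasings(g)) for g in genotypes]
    for choice in product(*options):
        haps = {h for pair in choice for h in pair}
        if passes_four_gamete_test(haps):
            return list(choice)
    return None


if __name__ == "__main__":
    # The doubly heterozygous individual is ambiguous on its own; the two
    # homozygous individuals force the 01/10 resolution.
    print(phase_by_brute_force([(2, 2), (0, 1), (1, 0)]))
```

The search space grows exponentially with the number of heterozygous sites, which is why an efficient algorithm such as the one the abstract describes is needed for real data.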
Eskin, Eleazar; Halperin, Eran; Karp, Richard. Large scale reconstruction of haplotypes from genotype data. Conference. RECOMB '03: Proceedings of the seventh annual international conference on Research in computational molecular biology, ACM, New York, NY, USA, 2003, ISBN: 1-58113-635-8. Tags: Haplotype Phasing
@conference{640088,
  title = {Large scale reconstruction of haplotypes from genotype data},
  author = {Eleazar Eskin and Eran Halperin and Richard Karp},
  url = {http://dx.doi.org/10.1145/640075.640088},
  isbn = {1-58113-635-8},
  year = {2003},
  date = {2003-01-01},
  booktitle = {RECOMB '03: Proceedings of the seventh annual international conference on Research in computational molecular biology},
  pages = {104-113},
  publisher = {ACM},
  address = {New York, NY, USA},
  keywords = {Haplotype Phasing},
  pubstate = {published},
  tppubtype = {conference}
}
Abstract: Critical to the understanding of the genetic basis for complex diseases is the modeling of human variation. Most of this variation can be characterized by single nucleotide polymorphisms (SNPs) which are mutations at a single nucleotide position. To characterize an individual's variation, we must determine an individual's haplotype or which nucleotide base occurs at each position of these common SNPs for each chromosome. In this paper, we present results for a highly accurate method for haplotype resolution from genotype data. Our method leverages a new insight into the underlying structure of haplotypes which shows that SNPs are organized in highly correlated "blocks". The majority of individuals have one of about four common haplotypes in each block. Our method partitions the SNPs into blocks and for each block, we predict the common haplotypes and each individual's haplotype. We evaluate our method over biological data. Our method predicts the common haplotypes perfectly and has a very low error rate (0.47%) when taking into account the predictions for the uncommon haplotypes.
Our method is extremely efficient compared to previous methods, (a matter of seconds where previous methods needed hours). Its efficiency allows us to find the block partition of the haplotypes, to cope with missing data and to work with large data sets such as genotypes for thousands of SNPs for hundreds of individuals. The algorithm is available via webserver at http://www.cs.columbia.edu/compbio/hap.
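The observation that most individuals carry one of a few common haplotypes in each block can be checked with a simple frequency count. The Python sketch below is only an illustration of that idea, not the block-partitioning or prediction method from the paper. The function name, the `top_k` parameter, and the toy haplotype strings are assumptions made up for this example.

```python
from collections import Counter


def common_haplotype_coverage(haplotypes, top_k=4):
    """Return the top_k most frequent haplotypes in a block and the
    fraction of all observed chromosomes that carry one of them."""
    if not haplotypes:
        raise ValueError("need at least one haplotype")
    top = Counter(haplotypes).most_common(top_k)
    covered = sum(count for _, count in top)
    return [hap for hap, _ in top], covered / len(haplotypes)


if __name__ == "__main__":
    # Toy block of 10 phased chromosomes over 4 SNPs (made-up data).
    block = ["ACGT"] * 5 + ["ACGA"] * 3 + ["TCGT", "TTGT"]
    common, coverage = common_haplotype_coverage(block, top_k=2)
    print(common, coverage)
```

A coverage value close to 1 for a small `top_k` would be consistent with the block structure the abstract describes. A low value would suggest the chosen SNP range spans more than one block.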