Haplotype Phasing from Sequence Data

The classical haplotype phasing problem.

Over the past few years, our group has written several papers on inferring haplotypes from sequence data.

Haplotype inference, also referred to as haplotype phasing, has a long history in computational genetics, and the problem itself has had several incarnations. Genotyping technologies obtain “genotype” information on SNPs, which mixes the genetic information from both chromosomes. However, many genetic analyses require “haplotype” information, which is the genetic information on each individual chromosome (see figure).
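To make the ambiguity concrete, here is a minimal Python sketch that lists every pair of haplotypes consistent with a single genotype. The encoding (0, 1, or 2 copies of the alternate allele per SNP) and the function name are assumptions made for this illustration only.

```python
from itertools import product

def compatible_haplotype_pairs(genotype):
    """List the unordered haplotype pairs that could produce a genotype.

    genotype: sequence of 0/1/2, the number of copies of the alternate
    allele at each SNP (an encoding assumed for this sketch).
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    pairs = set()
    # Try every way of assigning the alternate allele at heterozygous sites.
    for assignment in product((0, 1), repeat=len(het_sites)):
        # Homozygous sites are unambiguous: 0 -> allele 0, 2 -> allele 1.
        h1 = [g // 2 for g in genotype]
        h2 = list(h1)
        for site, allele in zip(het_sites, assignment):
            h1[site] = allele
            h2[site] = 1 - allele
        # Sort the two haplotypes so (a, b) and (b, a) count once.
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

# A genotype with two heterozygous SNPs has two possible phasings.
for pair in compatible_haplotype_pairs([1, 2, 1]):
    print(pair)
```

A genotype with k heterozygous SNPs is consistent with 2^(k-1) haplotype pairs, which is why phasing needs additional information from other individuals, a reference panel, or sequence reads.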

In the early days, before reference datasets were available, methods were applied to large numbers of genotyped individuals and attempted to identify a small number of haplotypes that explained the majority of the individual genotypes. Methods from this period include PHASE (11254454) and HAP (14988101) (from our group, with Eran Halperin). The figure is actually one of Eran’s slides from around 2002.

Once reference datasets such as the HapMap became available, imputation-based methods such as IMPUTE (10.1038/ng2088) and BEAGLE (10.1016/j.ajhg.2009.01.005) dominated previous phasing approaches because they leveraged information from the carefully curated reference datasets.

In principle, haplotype phasing or imputation methods can be applied directly to sequencing data by first calling genotypes in the sequencing data and then applying a phasing or imputation approach. However, since each read originates from only one chromosome, a read that spans two genotypes provides some information on haplotype phase. Combining these reads to construct haplotypes is referred to as the “haplotype assembly” problem, which was pioneered by Vikas Bansal and Vineet Bafna (10.1093/bioinformatics/btn298), (10.1101/gr.077065.108). Dan He in our group developed a method for haplotype assembly that guarantees finding the optimal solution for short reads and reduces the problem for longer reads to MaxSAT, which finds the optimal solution for the vast majority of problem instances (10.1093/bioinformatics/btq215). More recently, others have developed methods that can discover optimal solutions for all problem instances (10.1093/bioinformatics/btt349). In his paper, Dan also showed that haplotype assembly will always underperform traditional phasing methods for short-read sequencing data because too few of the reads span multiple genotypes.
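As a rough illustration of how reads carry phase information, the Python sketch below solves a toy haplotype assembly instance by brute force. It scores each candidate haplotype by the number of read alleles that would have to be flipped to make every read match one of the two chromosomes, a criterion commonly called minimum error correction. This is not the dynamic programming or MaxSAT approach from the papers above; the read representation and function names are assumptions for this example, and exhaustive search is only feasible for a handful of sites.

```python
from itertools import product

def mec_cost(haplotype, reads):
    """Number of read alleles to flip so each read matches the haplotype
    or its complement (the other chromosome at heterozygous sites)."""
    total = 0
    for read in reads:
        mismatches = sum(1 for site, allele in read.items()
                         if haplotype[site] != allele)
        # Assign the read to whichever chromosome it fits better.
        total += min(mismatches, len(read) - mismatches)
    return total

def assemble_brute_force(reads, num_sites):
    """Exhaustively search haplotypes over heterozygous sites.

    reads: list of dicts mapping site index -> observed allele (0 or 1).
    The first allele is fixed to 0 because a haplotype and its complement
    describe the same solution.
    """
    best = None
    for tail in product((0, 1), repeat=num_sites - 1):
        candidate = (0,) + tail
        cost = mec_cost(candidate, reads)
        if best is None or cost < best[0]:
            best = (cost, candidate)
    cost, haplotype = best
    complement = tuple(1 - a for a in haplotype)
    return haplotype, complement, cost

# Toy data: six reads, each spanning two heterozygous sites; one read
# ({1: 0, 2: 0}) contains a sequencing error.
reads = [
    {0: 0, 1: 1}, {1: 1, 2: 0}, {2: 0, 3: 0},
    {0: 1, 1: 0}, {1: 0, 2: 1}, {2: 1, 3: 1},
    {1: 0, 2: 0},
]
print(assemble_brute_force(reads, 4))
# -> ((0, 1, 0, 0), (1, 0, 1, 1), 1)
```

Note that a read covering only one heterozygous site contributes nothing to the cost, which mirrors the point above: short reads rarely span multiple genotypes and so provide little phase information.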

To overcome this issue, Dan extended his methods to jointly perform imputation and haplotype assembly (10.1089/cmb.2012.0091), (10.1016/j.gene.2012.11.093). These methods outperformed both imputation methods and haplotype assembly methods but unfortunately are too slow and memory-intensive to apply in practice. More recently, in our group, Wen-Yun Yang, Zhanyong Wang, and Farhad Hormozdiari, together with Bogdan Pasaniuc, developed a sampling method for combining haplotype assembly and imputation that is both fast and accurate (10.1093/bioinformatics/btt386).

Heterogeneity and Meta-Analysis

Visualizing heterogeneity in meta-analyses of GWAS. The left panel is a forest plot showing the estimated effect size and standard error for each study. The right panel is a PM-plot, which plots each study's p-value on the y-axis against its m-value on the x-axis. M-values are interpreted as follows: a small m-value (e.g., < 0.1) suggests that the study does not have an effect; a large m-value (e.g., > 0.9) suggests that the study does have an effect; otherwise the prediction is ambiguous.

Over the past couple of years, a major focus of our group has been on meta-analysis. These efforts have been led by Buhm Han who is a graduate of our group and now a post-doc at the Broad Institute.

Meta-analysis is a statistical method for combining the results of multiple studies. Its advantage is that the statistical power of the combined studies is much higher than that of any individual study. In fact, the majority of the recently identified genetic variants associated with complex diseases have been discovered using meta-analysis (10.1146/annurev-genom-091212-153520), since the effect sizes of most of these variants are too small to detect at the sample sizes of the individual studies.

Standard meta-analysis techniques assume what is referred to as the “fixed effects model” (FE). In the FE model, the effect size in each study is assumed to be the same. In the case of genetic association studies, this is an unrealistic assumption because the studies are often collected in very different populations that are subject to very different environmental conditions. An alternative is the “random effects model” (RE), where the effect sizes are allowed to differ between studies and are modeled as being drawn from a distribution with an estimated mean and variance. This difference in effect sizes between studies is referred to as “heterogeneity.”
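To make the two models concrete, here is a minimal Python sketch of standard inverse-variance meta-analysis. It assumes each study reports an effect size estimate and its standard error. The function names are ours for illustration, and the DerSimonian-Laird moment estimator used for the between-study variance is a common textbook choice, not necessarily what any particular software package implements.

```python
import math

def fixed_effects(betas, ses):
    """Inverse-variance weighted estimate assuming one shared effect size."""
    weights = [1.0 / se ** 2 for se in ses]
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    return beta, math.sqrt(1.0 / sum(weights))

def between_study_variance(betas, ses):
    """DerSimonian-Laird moment estimate of the between-study variance."""
    weights = [1.0 / se ** 2 for se in ses]
    beta_fe, _ = fixed_effects(betas, ses)
    # Cochran's Q: weighted squared deviations from the FE estimate.
    q = sum(w * (b - beta_fe) ** 2 for w, b in zip(weights, betas))
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    if len(betas) < 2 or c <= 0:
        return 0.0  # heterogeneity cannot be estimated from one study
    return max(0.0, (q - (len(betas) - 1)) / c)

def random_effects(betas, ses):
    """Weighted estimate where each study's variance is inflated by the
    estimated between-study variance (the heterogeneity)."""
    tau2 = between_study_variance(betas, ses)
    weights = [1.0 / (se ** 2 + tau2) for se in ses]
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    return beta, math.sqrt(1.0 / sum(weights)), tau2

# Two equally precise studies whose estimates disagree.
betas, ses = [0.0, 0.5], [0.1, 0.1]
print(fixed_effects(betas, ses))
print(random_effects(betas, ses))
```

In this toy example both models give a combined estimate of 0.25, but the RE standard error (0.25) is much larger than the FE standard error (about 0.07), because the disagreement between the studies is attributed to heterogeneity and added to each study's variance.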

Buhm Han, in our group, made two contributions related to heterogeneity in meta-analysis. In his first paper, he observed that previous approaches to hypothesis testing under the RE model did not correctly model the null hypothesis, which led to a significant loss in power (10.1016/j.ajhg.2011.04.014). His second paper presented a framework for interpreting meta-analysis results by identifying in which studies an effect is present and in which it is not (10.1371/journal.pgen.1002555). One component of this framework is the m-value, which indicates whether an effect is present in a given study. The heterogeneity of the meta-analysis can then be summarized visually with a PM-plot (see figure).
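As a small illustration of how m-values are read off a PM-plot, the Python sketch below applies the example cutoffs from the figure caption (0.1 and 0.9). It does not compute m-values; it only labels values that have already been computed, and the function name and default thresholds are assumptions for this example.

```python
def interpret_m_value(m, low=0.1, high=0.9):
    """Label a precomputed m-value using example cutoffs.

    The defaults follow the example thresholds in the figure caption;
    they are illustrative, not fixed rules.
    """
    if not 0.0 <= m <= 1.0:
        raise ValueError("an m-value must lie between 0 and 1")
    if m < low:
        return "no effect"
    if m > high:
        return "effect"
    return "ambiguous"

# Hypothetical m-values for three studies in one meta-analysis.
for study, m in [("study A", 0.97), ("study B", 0.04), ("study C", 0.55)]:
    print(study, interpret_m_value(m))
```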

The methods are implemented in the software that Buhm developed, METASOFT, available at http://genetics.cs.ucla.edu/meta/.
