Eun Yong Kang in our group defended his thesis on Monday, Nov 25th, 2013, from 2:30pm to 4:30pm in 4760 Boelter Hall.

The title of his defense was “Computational Genetic Approaches for Understanding the Genetic Architecture of Complex Traits”. The video of this defense is now available here. Fortunately for the lab, Eun is now a post-doc in the group.

The abstract of his thesis defense was:
Recent advances in genotyping and sequencing technology have enabled researchers to collect an enormous amount of high-dimensional genotype data. These large scale genomic data provide an unprecedented opportunity for researchers to study and analyze the genetic factors of human complex traits. One of the major challenges in analyzing these high-dimensional genomic data is the need for effective and efficient computational methodologies. In this talk, I will focus on three problems that I have worked on. First, I will introduce a method for inferring biological networks from high-throughput data containing both genetic variation and gene expression profiles from genetically distinct strains of an organism. For this problem, I use causal inference techniques to infer the presence or absence of causal relationships between yeast gene expressions in the framework of graphical causal models. Second, I introduce an efficient pairwise identity by descent (IBD) association mapping method, which utilizes importance sampling to improve efficiency and enable approximation of extremely small p-values. Using the WTCCC type 1 diabetes data, I show that Fast-Pairwise can successfully pinpoint a gene known to be associated with the disease within the MHC region. Finally, I introduce a novel meta-analytic approach (Meta-GxE) to identify gene-by-environment interactions by aggregating multiple studies with varying environmental conditions. The Meta-GxE approach jointly analyzes multiple studies with varying environmental conditions using a meta-analytic approach based on a random effects model to identify loci involved in gene-by-environment interactions. This approach is motivated by the observation that methods for discovering gene-by-environment interactions are closely related to random effects models for meta-analysis. We show that interactions can be interpreted as heterogeneity and can be detected without utilizing the traditional uni- or multi-variate approaches for discovery of gene-by-environment interactions. Application of this approach to 17 mouse studies identifies 26 significant loci involved in high-density lipoprotein (HDL) cholesterol, many of which show significant evidence of involvement in gene-by-environment interactions.

Eun’s talk covered the following papers:

Kang, Eun Yong; Han, Buhm; Furlotte, Nicholas; Joo, Jong Wha; Shih, Diana; Davis, Richard; Lusis, Aldons; Eskin, Eleazar (2014): Meta-Analysis Identifies Gene-by-Environment Interactions as Demonstrated in a Study of 4,965 Mice. In: PLoS Genet, 10 (1), pp. e1004022, 2014, ISSN: 1553-7404.
Han, Buhm; Kang, Eun Yong; Raychaudhuri, Soumya; de Bakker, Paul; Eskin, Eleazar (2013): Fast Pairwise IBD Association Testing in Genome-wide Association Studies. In: Bioinformatics, 2013, ISSN: 1367-4811.
Kang, Eun Yong; Ye, Chun; Shpitser, Ilya; Eskin, Eleazar (2010): Detecting the presence and absence of causal relationships between expression of yeast genes with very few samples. In: J Comput Biol, 17 (3), pp. 533-46, 2010, ISSN: 1557-8666.


Our DNA can tell us a lot about who our relatives are. Several companies, including 23andMe and AncestryDNA, now provide services where they collect DNA from individuals and then match the DNA against a database of other people's DNA to identify relatives. Relatives are then informed by the company that their DNA matches. Our lab was interested in whether we could perform this same type of service without involving a company or, more generally, any third party. One way to do this would be to have individuals obtain their own DNA sequences and then share their DNA sequences directly with each other. Unfortunately, DNA sequences are considered medical information and it is inappropriate to share them in this way.

Through a collaboration between our lab and the UCLA cryptography group, we recently published a paper combining cryptography and genetics that describes an approach for identifying relatives without compromising privacy. Our paper was published in the April 2014 issue of Genome Research. The key idea is that individuals release an encrypted version of their DNA information. Another individual can download this encrypted version and then use their own DNA information to try to decrypt it. If they are related to each other, their DNA sequences will be close enough that the decryption will work, telling the individual that they are related. If they are unrelated, the decryption will fail. What is important in this approach is that individuals who are not related do not obtain any information about each other’s DNA sequences.

The intuitive idea behind the approach is the following. Individuals each release a copy of their own genomes encrypted with a key that is based on the genome itself. Other users then download this encrypted information and try to decrypt it using their own genomes as the key. The encryption scheme is designed to allow for decryption if the encrypting key and decrypting key are “close enough”. Since related individuals share a portion of their genomes, we set the threshold for “close enough” to be exactly the threshold of relatedness that we want to detect.
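To make the "close enough" idea concrete, here is a toy sketch of unlocking data with an inexactly matching key. It is not the construction used in the paper: it assumes genomes are short bit vectors and uses a simple code-offset trick with a repetition code, and the names `BLOCK`, `make_sketch`, and `recover` are hypothetical. The block length plays the role of the relatedness threshold, since it fixes how many mismatches are tolerated.

```python
import hashlib
import secrets

BLOCK = 5  # hypothetical repetition-code block length; tolerates up to 2 mismatches per block


def make_sketch(genome_bits):
    """Return public helper data and a hash tag derived from a random secret."""
    assert len(genome_bits) % BLOCK == 0
    # Random secret, expanded into a codeword by repeating each bit BLOCK times.
    secret = [secrets.randbelow(2) for _ in range(len(genome_bits) // BLOCK)]
    codeword = [bit for bit in secret for _ in range(BLOCK)]
    # The genome masks the codeword; only a similar genome can unmask it.
    helper = [g ^ c for g, c in zip(genome_bits, codeword)]
    tag = hashlib.sha256(bytes(secret)).hexdigest()
    return helper, tag


def recover(helper, tag, other_bits):
    """Return True if other_bits is close enough to the original genome."""
    noisy = [h ^ o for h, o in zip(helper, other_bits)]
    # Majority vote within each block corrects a small number of mismatches.
    decoded = [
        1 if sum(noisy[i:i + BLOCK]) > BLOCK // 2 else 0
        for i in range(0, len(noisy), BLOCK)
    ]
    return hashlib.sha256(bytes(decoded)).hexdigest() == tag
```

A genome that differs in at most two positions per block recovers the secret and matches the tag, while a very different genome decodes to a different secret and fails. This toy is only meant to show the mechanics; with such short secrets it offers no real privacy.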

Our approach uses a relatively new type of cryptographic technique called Fuzzy Extractors which were pioneered by our co-authors on this study, Amit Sahai and Rafail Ostrovsky. This type of technique allows for encryption and decryption with keys that match inexactly. Students in our group who were involved are Dan He, Nick Furlotte, Farhad Hormozdiari, and Jong Wha (Joanne) Joo. This research was supported by National Science Foundation grant 1065276.

The full citation of our paper is here:

He, Dan; Furlotte, Nicholas; Hormozdiari, Farhad; Joo, Jong Wha; Wadia, Akshay; Ostrovsky, Rafail; Sahai, Amit; Eskin, Eleazar (2014): Identifying genetic relatives without compromising privacy. In: Genome Res, 2014, ISSN: 1549-5469.


I recently gave a talk on mixed models and confounding factors, a long-standing interest of our research group, at a workshop that was part of the Evolutionary Biology and the Theory of Computing program held at the Simons Institute on the UC Berkeley campus. The talk was held on February 21st. It spans many years of work in our group, including work by Hyun Min Kang (now at Michigan), Noah Zaitlen (now at UCSF), and Jimmie Ye (now at Harvard), as well as a sneak peek at very recent work by Joanne Joo, Jae-Hoon Sul and Buhm Han.

The video of the talk is available here and is also on our YouTube Channel ZarlabUCLA.

The papers covered in the talk include the EMMA, EMMAX and ICE papers published between 2008 and 2010, as well as a very new paper that should be coming out soon. The key papers from the talk are:

Kang, Hyun Min; Sul, Jae Hoon; Service, Susan; Zaitlen, Noah; Kong, Sit-Yee; Freimer, Nelson; Sabatti, Chiara; Eskin, Eleazar (2010): Variance component model to account for sample structure in genome-wide association studies. In: Nat Genet, 42 (4), pp. 348-54, 2010, ISSN: 1546-1718.
Kang, Hyun Min; Zaitlen, Noah; Wade, Claire; Kirby, Andrew; Heckerman, David; Daly, Mark; Eskin, Eleazar (2008): Efficient control of population structure in model organism association mapping. In: Genetics, 178 (3), pp. 1709-23, 2008, ISSN: 0016-6731.
Kang, Hyun Min; Ye, Chun; Eskin, Eleazar (2008): Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots. In: Genetics, 180 (4), pp. 1909-25, 2008, ISSN: 0016-6731.


Figure 1. Application of Meta-GxE to Apoa2 locus. The forest plot (A) shows heterogeneity in the effect sizes across different studies. The PM- plot (B) predicts that 7 studies have an effect at this locus, even though only 1 study (HMDP-chow(M)) is genome-wide significant with P-value. doi:10.1371/journal.pgen.1004022.g001

It is well known that both genetic factors and environmental factors contribute to traits and specifically disease risk. In addition, an area of great interest in the research community is the interaction between genetic factors and environmental factors and their contribution to disease risk and other traits. Genetic variants that are involved in gene-by-environment interactions (denoted GxE) have a different effect on the trait depending on the environment. For example, some variants can have an effect on cholesterol levels only in the presence of a high fat diet. Discovering variants involved in GxE has been tremendously difficult: even though thousands of variants have been implicated in disease related traits using genome wide association studies, very few variants have been implicated in GxEs. Part of the difficulty in detecting GxEs is that the traditional approach requires analyzing studies which contain individuals from multiple environments.

We have recently published a paper with the A. Jake Lusis group in PLoS Genetics which presents a novel approach to discovering GxEs. In our approach, many different studies, each of which was performed in a different environment, are combined to identify GxEs. The key idea is that if variants have a different genetic effect in different environments, then these variants are candidates for being involved in GxEs. Combining studies together is a statistical technique called meta-analysis, which has been a major focus of our lab over the past few years. We show in the paper that, mathematically, searching for GxEs using the traditional approach and using a type of meta-analysis framework called the random effects model (21565292) are very closely related.
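As a rough illustration of how combining studies can expose differing effects, the sketch below computes a textbook inverse-variance summary effect and Cochran's Q heterogeneity statistic from per-study effect sizes and standard errors. These are standard stand-ins chosen for illustration; the random effects method used in the paper may differ in detail, and the function names are hypothetical.

```python
import math


def inverse_variance_meta(betas, ses):
    """Inverse-variance weighted summary effect and its standard error."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    return beta, math.sqrt(1.0 / total)


def cochran_q(betas, ses):
    """Cochran's Q: weighted squared deviation of each study from the summary effect."""
    beta, _ = inverse_variance_meta(betas, ses)
    return sum(((b - beta) / se) ** 2 for b, se in zip(betas, ses))
```

Under the assumption that all studies share one effect size, Q is approximately chi-square distributed with (number of studies minus 1) degrees of freedom, so a Q far above that value suggests the effect differs between studies, which is the signal interpreted as a possible GxE.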

We applied our approach to identify GxEs affecting mouse HDL cholesterol by combining 17 mouse studies collected by A. Jake Lusis’ group containing almost 5,000 animals. Our approach discovered 26 loci involved in HDL, many of which appear to be involved in GxE. Virtually none of these loci were previously discovered in any of the individual studies, but many of them map to genes known to affect HDL. Our approach also includes a visualization framework called a PM-plot, which helps interpret the associated loci and identify GxE interactions (22396665).

From the paper:

Discovering environmentally-specific loci using meta-analysis
The Meta-GxE strategy uses a meta-analytic approach to identify gene-by-environment interactions by combining studies that collect the same phenotype under different conditions. Our method consists of four steps. First, we apply a random effects model meta-analysis (RE) to identify loci associated with a trait considering all of the studies together. The RE method explicitly models the fact that loci may have different effects in different studies due to gene-by-environment interactions. Second, we apply a heterogeneity test to identify loci with significant gene-by-environment interactions. Third, we compute the m-value of each study to identify in which studies a given variant has an effect and in which it does not. Fourth, we visualize the result through a forest plot and PM-plot to understand the underlying nature of gene-by-environment interactions.
We illustrate our methodology by examining a well-known region on mouse chromosome 1 harboring the Apoa2 gene, which is known to be strongly associated with HDL cholesterol (8332912). Figure 1 shows the results of applying our method to this locus. We first compute the effect size and its standard deviation for each of the 17 studies. These results are shown as a forest plot in Figure 1 (a). Second, we compute the P-value for each individual study, also shown in Figure 1 (a). If we were to follow traditional methodology and evaluate each study separately, we would declare an effect present in a study if the P-value passes a predefined genome-wide significance threshold (P < 1.0×10^-6). In this case, we would only identify the locus as associated in a single study, HMDP-chow(M) (P = 6.84×10^-9). On the other hand, in our approach, we combine all studies to compute a single P-value for each locus taking into account heterogeneity between studies. This approach leads to increased power over the simple approach considering each study separately. The combined meta P-value for the Apoa2 locus is very significant (4.41×10^-22), which is consistent with the fact that the largest individual study only has 749 animals compared to 4,965 in our combined study.
We visualize the results through a PM-plot, in which P-values are simultaneously visualized with the m-values, which estimate the posterior probability of an effect being present in a study given the observations from all other studies, at each tested locus. These plots allow us to identify in which studies a given variant has an effect and in which it does not. M-values for a given variant have the following interpretation: a study with a small m-value (≤ 0.1) is predicted not to be affected by the variant, while a study with a large m-value (≥ 0.9) is predicted to be affected by the variant.
The PM-plot for the Apoa2 locus is shown in Figure 1 (b). If we only look at the separate study P-values (y-axis), we can conclude that this locus only has an effect in HMDP-chow(M). However, if we look at m-value (x-axis), then we find 8 studies (HMDPxB-ath(M), HMDPxB-ath(F), HMDP-chow(M), HMDP-fat(M), HMDP-fat(F), BxD-db-5(M), BxH-apoe(M), BxH-apoe(F)), where we predict that the variation has an effect, while in 3 studies (BxD-db-12(F), BxD-db-5(F), BxH-wt(M)) we predict there is no effect. The predictions for the remaining 6 studies are ambiguous.
From Figure 1, we observe that differences in effect sizes among the studies are remarkably consistent when considering the environmental factors of each study as described in Table 1. For example, when comparing studies 1 to 4, the effect size of the locus decreases in both the male and female HMDPxB studies in the chow diet (chow study) relative to the fat diet (ath study). Thus we can see that when the mice have the Leiden/CETP transgene, which causes high total cholesterol and high LDL cholesterol levels, the effect size of this locus on the HDL cholesterol level in blood is affected by the fat level of the diet. Similarly, when comparing studies 12 to 15, the knockout of the Apoe gene affects the effect sizes for both male and female BxH crosses. However, in the BxD cross (studies 8 to 11), where each animal is homozygous for a mutation causing a deficiency of the leptin receptor, the effect of the locus is very strong in the young male animals, while as animals get older and become fatter, the effect becomes weaker. However, in the case of female mice, the effect of the locus is nearly absent at both 5 and 12 weeks of age. Thus we can see that sex plays an important role in affecting HDL when the leptin receptor activity is deficient.
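The m-value interpretation quoted above (≤ 0.1 predicted no effect, ≥ 0.9 predicted effect, otherwise ambiguous) can be sketched as a small helper. The function names and the example study labels and m-values below are hypothetical and are not taken from the paper; computing the m-values themselves is not shown.

```python
def classify_m_value(m, low=0.1, high=0.9):
    """Map an m-value to the interpretation quoted in the excerpt."""
    if not 0.0 <= m <= 1.0:
        raise ValueError("m-value must lie in [0, 1]")
    if m <= low:
        return "no effect"
    if m >= high:
        return "effect"
    return "ambiguous"


def summarize(m_values):
    """Group study names by their predicted status."""
    groups = {"effect": [], "no effect": [], "ambiguous": []}
    for study, m in m_values.items():
        groups[classify_m_value(m)].append(study)
    return groups


# Made-up m-values for three hypothetical studies.
example = {"study_A": 0.97, "study_B": 0.04, "study_C": 0.55}
print(summarize(example))
```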

The full citation of our paper is:

Kang, Eun Yong; Han, Buhm; Furlotte, Nicholas; Joo, Jong Wha; Shih, Diana; Davis, Richard; Lusis, Aldons; Eskin, Eleazar (2014): Meta-Analysis Identifies Gene-by-Environment Interactions as Demonstrated in a Study of 4,965 Mice. In: PLoS Genet, 10 (1), pp. e1004022, 2014, ISSN: 1553-7404.



Figure. An example of an IBD graph. An IBD detection method provides IBD information (table). We then build a graph where vertices are individuals and edges are IBD relationships.

The standard approach for detecting genetic variants involved in disease is the association study where genetic information is collected from a set of individuals who have the disease and a set of healthy individuals. Any genetic variants which are more common in the set of individuals who have the disease, referred to as “associated variants”, may be involved in the disease.

Our group has just published a paper on an alternative and complementary approach for identifying regions involved in disease from the same genetic data. The basic idea is that we consider the patterns of how the individuals are related in different parts of their genomes and how this relates to their disease status. If a region is involved in disease, individuals who have the disease will likely have more similar DNA sequences in that region than individuals who do not have the disease. Identifying pairs of individuals with similar DNA sequences is called Identity By Descent (IBD) mapping and there are several methods which can identify IBD relations efficiently (18971310), (21310274), (24207118).

In each region of the genome, we build an IBD graph based on which pairs of individuals are related: a vertex in the graph is an individual, and an edge is an IBD relation, which implies that the two individuals have similar DNA sequences at that point. In our graph, individuals who have the disease are red squares (cases) and individuals who are healthy are green circles (controls). Following our intuition, if the region is involved in the disease, we expect more edges between pairs of case individuals than between pairs of control individuals. Our approach simply considers this difference and then applies a permutation test, in which the assignment of case and control status to the individuals is randomized, in order to obtain a significance level. Our approach was not the first method to apply this idea and follows the paper by Thompson and Browning (23733848). The advantage of our paper is that we use a technique called importance sampling to speed up the computation of the significance levels by orders of magnitude. The hope is that this type of approach may be more effective at identifying regions of the genome that are involved in disease through rare variants, which are difficult to detect in association studies.
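The edge-count statistic and the permutation step can be sketched as follows. This is a plain permutation test written for illustration, not the importance sampling procedure from the paper, and the function names, the one-sided comparison, and the default number of permutations are assumptions.

```python
import random


def edge_difference(edges, is_case):
    """Number of case-case IBD edges minus number of control-control IBD edges."""
    case_case = sum(1 for u, v in edges if is_case[u] and is_case[v])
    ctrl_ctrl = sum(1 for u, v in edges if not is_case[u] and not is_case[v])
    return case_case - ctrl_ctrl


def permutation_p_value(edges, is_case, n_perm=10000, seed=0):
    """One-sided p-value obtained by shuffling case/control labels over individuals."""
    rng = random.Random(seed)
    observed = edge_difference(edges, is_case)
    individuals = list(is_case)
    labels = [is_case[i] for i in individuals]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        permuted = dict(zip(individuals, labels))
        if edge_difference(edges, permuted) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Because the smallest p-value this loop can report is about 1 / `n_perm`, estimating very small significance levels by plain permutation requires a very large number of shuffles, which is the cost that a faster sampling scheme aims to avoid.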

The full citation for the paper is:

Han, Buhm; Kang, Eun Yong; Raychaudhuri, Soumya; de Bakker, Paul; Eskin, Eleazar (2013): Fast Pairwise IBD Association Testing in Genome-wide Association Studies. In: Bioinformatics, 2013, ISSN: 1367-4811.



A relatively recent excellent documentary developed by NOVA gives a really nice summary of the research area that we work in and the transformation of medicine due to the development of genome sequencing. It is a great place to start learning about our field.

Cracking Your Genetic Code
We are on the brink of a new era of personalized, gene-based medicine. Are we ready for it? Aired March 28, 2012 on PBS


Emrah Kostem, who graduated this year and is now at Illumina, gave a talk at our retreat this summer about the research he completed in the lab. It is available here and gives a good overview of the goals of our group and some details of the projects that Emrah completed in the lab.

One of the topics he discusses is his recently published work on estimating heritability, which quantifies how much of the variance of a trait is accounted for by genetics. He discusses his work on how to partition heritability into the contributions of genomic regions (10.1016/j.ajhg.2013.03.010).

He also talks about his work which takes advantage of the insight that association statistics follow the multivariate normal distribution and applies this to two problems.  The first is the problem of selecting follow up SNPs using the results of an association study(10.1534/genetics.111.128595).  The second problem is the problem of speeding up eQTL studies using a two stage approach where only a fraction of the association tests are performed but virtually all of the significant associations are still discovered(10.1089/cmb.2013.0087).

Details of what he talked about are in his papers:

Kostem, Emrah; Eskin, Eleazar (2013): Improving the accuracy and efficiency of partitioning heritability into the contributions of genomic regions. In: Am J Hum Genet, 92 (4), pp. 558-64, 2013, ISSN: 1537-6605.
Kostem, Emrah; Eskin, Eleazar (2013): Efficiently Identifying Significant Associations in Genome-wide Association Studies. In: J Comput Biol, 20 (10), pp. 817-30, 2013, ISSN: 1557-8666.
Kostem, Emrah; Lozano, Jose; Eskin, Eleazar (2011): Increasing Power of Genome-wide Association Studies by Collecting Additional SNPs. In: Genetics, 2011, ISSN: 1943-2631.



Dr. Jae Hoon Sul with his committee.


Jae Hoon Sul successfully defended his thesis on Wednesday, September 19th. His talk is posted on our YouTube Channel ZarlabUCLA. Jae Hoon’s talk discusses several projects, including using mixed models to correct for population structure, rare variant association studies, and a meta-analysis approach for detecting multi-tissue eQTLs. Fortunately for the lab, Jae Hoon is staying at UCLA for another year as a post-doc.

More details are available in the papers he discusses:

Sul, Jae Hoon; Han, Buhm; Ye, Chun; Choi, Ted; Eskin, Eleazar (2013): Effectively Identifying eQTLs from Multiple Tissues by Combining Mixed Model and Meta-analytic Approaches. In: PLoS Genet, 9 (6), pp. e1003491, 2013, ISSN: 1553-7404.
Sul, Jae Hoon; Han, Buhm; He, Dan; Eskin, Eleazar (2011): An Optimal Weighted Aggregated Association Test for Identification of Rare Variants Involved in Common Diseases. In: Genetics, 188 (1), pp. 181-188, 2011, ISSN: 1943-2631.
Kang, Hyun Min; Sul, Jae Hoon; Service, Susan; Zaitlen, Noah; Kong, Sit-Yee; Freimer, Nelson; Sabatti, Chiara; Eskin, Eleazar (2010): Variance component model to account for sample structure in genome-wide association studies. In: Nat Genet, 42 (4), pp. 348-54, 2010, ISSN: 1546-1718.


Over the past several years, Genome Wide Association Studies (GWAS) have discovered hundreds of genetic variants involved in complex diseases (10.1056/NEJMra0905980). The vast majority of these variants do not lie in the protein coding regions of genes and thus do not affect what the gene produces, but instead likely affect how the genes are regulated. For this reason, the study of how genetic variation affects gene activity levels (referred to as expression levels) has been a major focus of research for many years. Genetic variants that affect gene expression are referred to as expression quantitative trait loci (eQTLs) (10.1038/nrg2969).

Several studies collect expression from multiple tissues which leads to the question of whether or not the same genetic variants affect expression in multiple tissues(10.1038/ng.2653).  Another way to ask this question is: Are eQTLs tissue specific or not tissue specific?

A challenge in this type of analysis is that an eQTL may affect expression in multiple tissues, but because of small sample sizes, the eQTL will only be detected in one of the tissues.  Thus, traditional techniques for eQTLs will systematically be biased against detecting eQTLs in multiple tissues.

Jae-Hoon Sul and Buhm Han in our group developed a method to address this issue which builds upon recent methods in random effects meta-analysis(10.1016/j.ajhg.2011.04.014),(10.1371/journal.pgen.1002555).  To apply these methods we first analyze each tissue separately and then use the meta-analysis method to combine the results of each tissue.  Since our methods are specifically designed to handle “heterogeneity” which is that the effect size can be different in each study, our method is able to perform well when the effect is present in all of the tissues or just some of the tissues.  More information about our meta-analysis research is here.

The full citation of our paper is here:

Sul, Jae Hoon; Han, Buhm; Ye, Chun; Choi, Ted; Eskin, Eleazar (2013): Effectively Identifying eQTLs from Multiple Tissues by Combining Mixed Model and Meta-analytic Approaches. In: PLoS Genet, 9 (6), pp. e1003491, 2013, ISSN: 1553-7404.

Over the past few years, our group has published several papers on methods for eQTL analysis. Our other papers on eQTL analysis include:

2014

Joo, Jong Wha; Sul, Jae Hoon; Han, Buhm; Ye, Chun; Eskin, Eleazar (2014): Effectively identifying regulatory hotspots while capturing expression heterogeneity in gene expression studies. In: Genome Biol, 15 (4), pp. R61, 2014, ISSN: 1465-6914.

2013

Kostem, Emrah; Eskin, Eleazar (2013): Efficiently Identifying Significant Associations in Genome-Wide Association Studies. Research in Computational Molecular Biology, University of California Springer Berlin Heidelberg, 2013.

2010

Kang, Eun Yong; Ye, Chun; Shpitser, Ilya; Eskin, Eleazar (2010): Detecting the presence and absence of causal relationships between expression of yeast genes with very few samples. In: J Comput Biol, 17 (3), pp. 533-46, 2010, ISSN: 1557-8666.

2009

Ye, Chun; Galbraith, Simon; Liao, James; Eskin, Eleazar (2009): Using network component analysis to dissect regulatory networks mediated by transcription factors in yeast. In: PLoS Comput Biol, 5 (3), pp. e1000311, 2009, ISSN: 1553-7358.

2008

Kang, Hyun Min; Ye, Chun; Eskin, Eleazar (2008): Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots. In: Genetics, 180 (4), pp. 1909-25, 2008, ISSN: 0016-6731.




The classical haplotype phasing problem.

Over the past few years, our group has written several papers on inferring haplotypes from sequence data.

The problem of haplotype inference, also referred to as haplotype phasing, has a long history in computational genetics, and the problem itself has had several incarnations. Genotyping technologies obtain “genotype” information on SNPs, which mixes the genetic information from both chromosomes. However, many genetic analyses require “haplotype” information, which is the genetic information on each chromosome (see Figure).
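A small sketch can show why genotypes are ambiguous about haplotypes. It assumes a common 0/1/2 encoding (the number of copies of one allele at each SNP), which the post does not specify, and the function names are hypothetical.

```python
from itertools import product


def genotype_from_haplotypes(hap_a, hap_b):
    """Genotype at each SNP is the allele count summed over both chromosomes."""
    return [a + b for a, b in zip(hap_a, hap_b)]


def compatible_phasings(genotype):
    """Enumerate unordered haplotype pairs that produce the given genotype."""
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    phasings = set()
    for choice in product([0, 1], repeat=len(het_sites)):
        # Homozygous sites are fixed on both haplotypes (0 -> 0, 2 -> 1).
        hap_a = [g // 2 for g in genotype]
        hap_b = list(hap_a)
        # Heterozygous sites carry one copy; which chromosome has it is unknown.
        for site, allele in zip(het_sites, choice):
            hap_a[site], hap_b[site] = allele, 1 - allele
        phasings.add(tuple(sorted((tuple(hap_a), tuple(hap_b)))))
    return sorted(phasings)
```

For a genotype with k heterozygous SNPs (k at least 1) there are 2^(k-1) distinct haplotype pairs, so the genotype alone cannot say which pair an individual carries; phasing methods use additional information to choose among them.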

In the early days before reference datasets were available, methods would be applied to large numbers of genotyped individuals which would attempt to identify a small number of haplotypes which explained the majority of the individual genotypes.  Methods from this period include PHASE (11254454) and HAP (14988101) (from our group with Eran Halperin).  The figure is actually one of Eran’s slides from around 2002.

Once reference datasets such as the HapMap became available, imputation based methods such as IMPUTE(10.1038/ng2088) and BEAGLE(10.1016/j.ajhg.2009.01.005) dominated previous phasing approaches because they leveraged information from the carefully curated reference datasets.

In principle, haplotype phasing or imputation methods can be applied directly to sequencing data by first calling genotypes in the sequencing data and then applying a phasing or imputation approach. However, since each read originates from only one chromosome, if a read spans two genotypes it provides some information on haplotype phase. Combining these reads to construct haplotypes is referred to as the “haplotype assembly” problem, which was pioneered by Vikas Bansal and Vineet Bafna (10.1093/bioinformatics/btn298), (10.1101/gr.077065.108). Dan He in our group developed a method for haplotype assembly which guarantees finding the optimal solution for short reads and reduces the problem of haplotype assembly for longer reads to MaxSAT, which finds the optimal solution for the vast majority of problem instances (10.1093/bioinformatics/btq215). More recently, others have developed methods that can discover optimal solutions for all problem instances (10.1093/bioinformatics/btt349). In his paper, Dan also showed that haplotype assembly will always underperform traditional phasing methods for short read sequencing data because too few of the reads span multiple genotypes.
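To illustrate how reads that span several heterozygous sites carry phase information, here is a brute-force sketch of haplotype assembly. It assumes the widely used minimum error correction (MEC) objective, which the post does not name, and it enumerates all haplotypes, so it is only usable on tiny examples; it is not the algorithm from any of the cited papers. The read representation (a dict from site index to observed allele) and function names are hypothetical.

```python
from itertools import product


def mec_cost(haplotype, reads):
    """Total corrections needed so each read matches the haplotype or its complement."""
    total = 0
    for read in reads:
        mismatches = sum(1 for site, allele in read.items() if haplotype[site] != allele)
        # At heterozygous sites the other chromosome is the complement, so a read
        # that mismatches at m of its sites mismatches the complement at len - m.
        total += min(mismatches, len(read) - mismatches)
    return total


def best_haplotype(n_sites, reads):
    """Exhaustively search haplotypes over heterozygous sites for the lowest MEC cost."""
    best = None
    # Fix the first allele to 0, since a haplotype and its complement score the same.
    for tail in product([0, 1], repeat=n_sites - 1):
        hap = (0,) + tail
        cost = mec_cost(hap, reads)
        if best is None or cost < best[1]:
            best = (hap, cost)
    return best
```

Reads covering only a single heterozygous site always contribute zero cost under this objective, which reflects why short reads that rarely span two sites provide little phase information.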

To overcome this issue, Dan extended his methods to jointly perform imputation and haplotype assembly(10.1089/cmb.2012.0091),(10.1016/j.gene.2012.11.093).  These methods outperformed both imputation methods and haplotype assembly methods but unfortunately are too slow and memory intensive to apply in practice.  More recently, in our group, Wen-Yun Yang, Zhanyong Wang, Farhad Hormozdiari with Bogdan Pasaniuc developed a sampling method which is both fast and accurate for combining haplotype assembly and imputation(10.1093/bioinformatics/btt386).

Full citations of our papers are here:

1. He, Dan; Han, Buhm; Eskin, Eleazar (2013): Hap-seq: An Optimal Algorithm for Haplotype Phasing with Imputation Using Sequencing Data. In: J Comput Biol, 20 (2), pp. 80-92, 2013, ISSN: 1557-8666.
2. Yang, Wen-Yun; Hormozdiari, Farhad; Wang, Zhanyong; He, Dan; Pasaniuc, Bogdan; Eskin, Eleazar (2013): Leveraging Multi-SNP Reads from Sequencing Data for Haplotype Inference. In: Bioinformatics, 2013, ISSN: 1367-4811.
3. He, Dan; Eskin, Eleazar (2012): Hap-seqX: Expedite Algorithm for Haplotype Phasing with Imputation using Sequence Data. In: Gene, 2012, ISSN: 1879-0038.
4. He, Dan; Choi, Arthur; Pipatsrisawat, Knot; Darwiche, Adnan; Eskin, Eleazar (2010): Optimal algorithms for haplotype assembly from whole-genome sequence data. In: Bioinformatics, 26 (12), pp. i183-90, 2010, ISSN: 1367-4811.



