Farhad Hormozdiari successfully defended his thesis, “Statistical Methods to Understand the Genetic Architecture of Complex Traits,” on Tuesday, May 17, 2016 in Boelter 4760. His talk, which is posted on our YouTube channel ZarlabUCLA, covers three topics. First, it discusses methods for applying CAVIAR to understand the underlying mechanism of GWAS risk loci. Second, it introduces eCAVIAR, a statistical method that computes the probability that the same variant is responsible for both the GWAS and eQTL signals while accounting for complex LD structure. Third, it proposes an approach called phenotype imputation, which allows a GWAS to be performed on a phenotype that is difficult to collect.
More details about Farhad’s research are available in the following papers:
Hormozdiari, Farhad; Kichaev, Gleb; Yang, Wen-Yun Y.; Pasaniuc, Bogdan; Eskin, Eleazar. Identification of causal genes for complex traits. Journal Article. In: Bioinformatics, 31 (12), pp. i206-i213, 2015, ISSN: 1367-4811.
@article{Hormozdiari:Bioinformatics:2015b,
  title = {Identification of causal genes for complex traits.},
  author = {Farhad Hormozdiari and Gleb Kichaev and Wen-Yun Y. Yang and Bogdan Pasaniuc and Eleazar Eskin},
  url = {http://dx.doi.org/10.1093/bioinformatics/btv240},
  issn = {1367-4811},
  year = {2015},
  date = {2015-01-01},
  journal = {Bioinformatics},
  volume = {31},
  number = {12},
  pages = {i206-i213},
  address = {England},
  abstract = {MOTIVATION: Although genome-wide association studies (GWAS) have identified thousands of variants associated with common diseases and complex traits, only a handful of these variants are validated to be causal. We consider 'causal variants' as variants which are responsible for the association signal at a locus. As opposed to association studies that benefit from linkage disequilibrium (LD), the main challenge in identifying causal variants at associated loci lies in distinguishing among the many closely correlated variants due to LD. This is particularly important for model organisms such as inbred mice, where LD extends much further than in human populations, resulting in large stretches of the genome with significantly associated variants. Furthermore, these model organisms are highly structured and require correction for population structure to remove potential spurious associations. RESULTS: In this work, we propose CAVIAR-Gene (CAusal Variants Identification in Associated Regions), a novel method that is able to operate across large LD regions of the genome while also correcting for population structure. A key feature of our approach is that it provides as output a minimally sized set of genes that captures the genes which harbor causal variants with probability $\rho$. Through extensive simulations, we demonstrate that our method not only speeds up computation, but also have an average of 10% higher recall rate compared with the existing approaches. We validate our method using a real mouse high-density lipoprotein data (HDL) and show that CAVIAR-Gene is able to identify Apoa2 (a gene known to harbor causal variants for HDL), while reducing the number of genes that need to be tested for functionality by a factor of 2. AVAILABILITY AND IMPLEMENTATION: Software is freely available for download at genetics.cs.ucla.edu/caviar. CONTACT: eeskin@cs.ucla.edu},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}

Hormozdiari, Farhad; Joo, Jong Wha J.; Wadia, Akshay; Guan, Feng; Ostrosky, Rafail; Sahai, Amit; Eskin, Eleazar. Privacy preserving protocol for detecting genetic relatives using rare variants. Journal Article. In: Bioinformatics, 30 (12), pp. i204-i211, 2014, ISSN: 1367-4811.
@article{Hormozdiari:Bioinformatics:2014,
  title = {Privacy preserving protocol for detecting genetic relatives using rare variants.},
  author = {Farhad Hormozdiari and Jong Wha J. Joo and Akshay Wadia and Feng Guan and Rafail Ostrosky and Amit Sahai and Eleazar Eskin},
  url = {http://dx.doi.org/10.1093/bioinformatics/btu294},
  issn = {1367-4811},
  year = {2014},
  date = {2014-01-01},
  journal = {Bioinformatics},
  volume = {30},
  number = {12},
  pages = {i204-i211},
  address = {England},
  abstract = {MOTIVATION: High-throughput sequencing technologies have impacted many areas of genetic research. One such area is the identification of relatives from genetic data. The standard approach for the identification of genetic relatives collects the genomic data of all individuals and stores it in a database. Then, each pair of individuals is compared to detect the set of genetic relatives, and the matched individuals are informed. The main drawback of this approach is the requirement of sharing your genetic data with a trusted third party to perform the relatedness test. RESULTS: In this work, we propose a secure protocol to detect the genetic relatives from sequencing data while not exposing any information about their genomes. We assume that individuals have access to their genome sequences but do not want to share their genomes with anyone else. Unlike previous approaches, our approach uses both common and rare variants which provide the ability to detect much more distant relationships securely. We use a simulated data generated from the 1000 genomes data and illustrate that we can easily detect up to fifth degree cousins which was not possible using the existing methods. We also show in the 1000 genomes data with cryptic relationships that our method can detect these individuals. Availability: The software is freely available for download at http://genetics.cs.ucla.edu/crypto/. CONTACT: fhormoz@cs.ucla.edu or eeskin@cs.ucla.edu Supplementary information: Supplementary data are available at Bioinformatics online},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}

Hormozdiari, Farhad; Kostem, Emrah; Kang, Eun Yong; Pasaniuc, Bogdan; Eskin, Eleazar. Identifying causal variants at Loci with multiple signals of association. Journal Article. In: Genetics, 198 (2), pp. 497-508, 2014, ISSN: 1943-2631.
@article{Hormozdiari:Genetics:2014,
  title = {Identifying causal variants at Loci with multiple signals of association.},
  author = {Farhad Hormozdiari and Emrah Kostem and Eun Yong Kang and Bogdan Pasaniuc and Eleazar Eskin},
  url = {http://dx.doi.org/10.1534/genetics.114.167908},
  issn = {1943-2631},
  year = {2014},
  date = {2014-01-01},
  journal = {Genetics},
  volume = {198},
  number = {2},
  pages = {497-508},
  address = {United States},
  abstract = {Although genome-wide association studies have successfully identified thousands of risk loci for complex traits, only a handful of the biologically causal variants, responsible for association at these loci, have been successfully identified. Current statistical methods for identifying causal variants at risk loci either use the strength of the association signal in an iterative conditioning framework or estimate probabilities for variants to be causal. A main drawback of existing methods is that they rely on the simplifying assumption of a single causal variant at each risk locus, which is typically invalid at many risk loci. In this work, we propose a new statistical framework that allows for the possibility of an arbitrary number of causal variants when estimating the posterior probability of a variant being causal. A direct benefit of our approach is that we predict a set of variants for each locus that under reasonable assumptions will contain all of the true causal variants with a high confidence level (e.g., 95%) even when the locus contains multiple causal variants. We use simulations to show that our approach provides 20-50% improvement in our ability to identify the causal variants compared to the existing methods at loci harboring multiple causal variants. We validate our approach using empirical data from an expression QTL study of CHI3L2 to identify new causal variants that affect gene expression at this locus. CAVIAR is publicly available online at http://genetics.cs.ucla.edu/caviar/},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}

Eskin, Itamar; Hormozdiari, Farhad; Conde, Lucia; Riby, Jacques; Skibola, Chris; Eskin, Eleazar; Halperin, Eran. eALPS: Estimating Abundance Levels in Pooled Sequencing Using Available Genotyping Data. Journal Article. In: J Comput Biol, 2013, ISSN: 1557-8666.
@article{Eskin:JComputBiol:2013,
  title = {eALPS: Estimating Abundance Levels in Pooled Sequencing Using Available Genotyping Data.},
  author = {Itamar Eskin and Farhad Hormozdiari and Lucia Conde and Jacques Riby and Chris Skibola and Eleazar Eskin and Eran Halperin},
  url = {http://dx.doi.org/10.1089/cmb.2013.0105},
  issn = {1557-8666},
  year = {2013},
  date = {2013-01-01},
  journal = {J Comput Biol},
  organization = {The Blavatnik School of Computer Science, Tel-Aviv University, Tel Aviv, Israel},
  abstract = {The recent advances in high-throughput sequencing technologies bring the potential of a better characterization of the genetic variation in humans and other organisms. In many occasions, either by design or by necessity, the sequencing procedure is performed on a pool of DNA samples with different abundances, where the abundance of each sample is unknown. Such a scenario is naturally occurring in the case of metagenomics analysis where a pool of bacteria is sequenced, or in the case of population studies involving DNA pools by design. Particularly, various pooling designs were recently suggested that can identify carriers of rare alleles in large cohorts, dramatically reducing the cost of such large-scale sequencing projects. A fundamental problem with such approaches for population studies is that the uncertainty of DNA proportions from different individuals in the pools might lead to spurious associations. Fortunately, it is often the case that the genotype data of at least some of the individuals in the pool is known. Here, we propose a method (eALPS) that uses the genotype data in conjunction with the pooled sequence data in order to accurately estimate the proportions of the samples in the pool, even in cases where not all individuals in the pool were genotyped (eALPS-LD). Using real data from a sequencing pooling study of non-Hodgkin's lymphoma, we demonstrate that the estimation of the proportions is crucial, since otherwise there is a risk for false discoveries. Additionally, we demonstrate that our approach is also applicable to the problem of quantification of species in metagenomics samples (eALPS-BCR) and is particularly suitable for metagenomic quantification of closely related species},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}