Our group publishes papers presenting new methodologies, describing the results of studies that use our software, and reviewing current topics in the field of bioinformatics. Scroll down for a complete list of papers produced by our lab. Since 2013, we have written blog posts summarizing new research papers and review articles:
GWAS
- Fine Mapping Causal Variants and Allelic Heterogeneity
- Widespread Allelic Heterogeneity in Complex Traits
- Selection in Europeans on Fatty Acid Desaturases Associated with Dietary Changes
- Incorporating prior information into association studies
- Characterization of Expression Quantitative Trait Loci in Pedigrees from Colombia and Costa Rica Ascertained for Bipolar Disorder
- Simultaneous modeling of disease status and clinical phenotypes to increase power in GWAS
- Efficient and accurate multiple-phenotype regression method for high dimensional data considering population structure
- Review Article: Population Structure in Genetic Studies: Confounding Factors and Mixed Models
- Colocalization of GWAS and eQTL Signals Detects Target Genes
- Chromosome conformation elucidates regulatory relationships in developing human brain
Mouse Genetics
- Review Article: The Hybrid Mouse Diversity Panel
- Genes, Environments and Meta-Analysis
- Review Article: Mixed Models and Population Structure
- Identifying Genes Involved in Blood Cell Traits
- Genes, Diet, and Body Weight (in Mice)
- Review Article: Mouse Genetics
Population Structure
- Efficient and accurate multiple-phenotype regression method for high dimensional data considering population structure
- Review Article: Population Structure in Genetic Studies: Confounding Factors and Mixed Models
- Accounting for Population Structure in Gene-by-Environment Interactions in Genome-Wide Association Studies Using Mixed Models
- Multiple testing correction in linear mixed models
- Identification of causal genes for complex traits (CAVIAR-gene)
- Accurate viral population assembly from ultra-deep sequencing data
- GRAT: Speeding up Expression Quantitative Trait Loci (eQTL) Studies
- Correcting Population Structure using Mixed Models Webcast
- Mixed models can correct for population structure for genomic regions under selection
Review Articles
- Review Article: Population Structure in Genetic Studies: Confounding Factors and Mixed Models
- Review Article: The Hybrid Mouse Diversity Panel
- Review Article: GWAS and Missing Heritability
- Review Article: Mixed Models and Population Structure
- Review Article: Mouse Genetics
Publications
2018
Hormozdiari, Farhad I; Jung, Junghyun; Eskin, Eleazar; Joo, Jong Wha J. Leveraging allelic heterogeneity to increase power of association testing. Journal Article. bioRxiv, pp. 498360, 2018. Tags: Allelic Heterogeneity, Association Study Methods, Multi-SNP Association
@article{Hormozdiari:Biorxiv:2018, title = {Leveraging allelic heterogeneity to increase power of association testing}, author = {Farhad I. Hormozdiari and Junghyun Jung and Eleazar Eskin and Jong Wha J. Joo}, url = {http://dx.doi.org/10.1101/498360}, year = {2018}, date = {2018-01-01}, journal = {bioRxiv}, pages = {498360}, publisher = {Cold Spring Harbor Laboratory}, organization = {Department of Computer Science and Engineering, Dongguk University-Seoul}, abstract = {The standard genome-wide association study (GWAS) detects an association between a single variant and a phenotype of interest. Recently, several studies reported that at many risk loci, there may exist multiple causal variants. For a locus with multiple causal variants with small effect sizes, the standard association test is underpowered to detect the associations. Alternatively, an approach considering effects of multiple variants simultaneously may increase statistical power by leveraging effects of multiple causal variants. In this paper, we propose a new statistical method, Model-based Association test Reflecting causal Status (MARS), that tries to find an association between variants in risk loci and a phenotype, considering the causal status of the variants. One of the main advantages of MARS is that it only requires the existing summary statistics to detect associated risk loci. Thus, MARS is applicable to any association study with summary statistics, even though individual level data is not available for the study. Utilizing extensive simulated data sets, we show that MARS increases the power of detecting true associated risk loci compared to previous approaches that consider multiple variants, while robustly controlling the type I error. Applied to data of 44 tissues provided by the Genotype-Tissue Expression (GTEx) consortium, we show that MARS identifies more eGenes compared to previous approaches in most of the tissues; e.g. MARS identified 16% more eGenes than the ones reported by the GTEx consortium. Moreover, applied to Northern Finland Birth Cohort (NFBC) data, we demonstrate that MARS effectively identifies association loci with improved power (56% more loci found by MARS) in GWAS studies compared to the standard association test.}, keywords = {Allelic Heterogeneity, Association Study Methods, Multi-SNP Association}, pubstate = {published}, tppubtype = {article} }
Kang, Eun Yong; Lee, Cue Hyunkyu; Furlotte, Nicholas A; Joo, Jong Wha J; Kostem, Emrah; Zaitlen, Noah; Eskin, Eleazar; Han, Buhm. An Association Mapping Framework To Account for Potential Sex Difference in Genetic Architectures. Journal Article. Genetics, 2018, ISSN: 1943-2631. Tags: Association Study Methods, Meta-Analysis
@article{Kang:Genetics:2018, title = {An Association Mapping Framework To Account for Potential Sex Difference in Genetic Architectures.}, author = {Eun Yong Kang and Cue Hyunkyu Lee and Nicholas A. Furlotte and Jong Wha J. Joo and Emrah Kostem and Noah Zaitlen and Eleazar Eskin and Buhm Han}, url = {http://dx.doi.org/10.1534/genetics.117.300501}, issn = {1943-2631}, year = {2018}, date = {2018-01-01}, journal = {Genetics}, address = {United States}, organization = {University of California, Los Angeles.}, abstract = {Over the past few years, genome-wide association studies have identified many trait-associated loci that have different effects on females and males, which increased attention to the genetic architecture differences between the sexes. The between-sex differences in genetic architectures can cause a variety of phenomena such as differences in the effect sizes at trait-associated loci, differences in the magnitudes of polygenic background effects, and differences in the phenotypic variances. However, current association testing approaches for dealing with sex, such as including sex as a covariate, cannot fully account for these phenomena and can be suboptimal in statistical power. We present a novel association mapping framework, MetaSex, that can comprehensively account for the genetic architecture differences between the sexes. Through simulations and applications to real data, we show that our framework performs better than previous approaches in association mapping.}, keywords = {Association Study Methods, Meta-Analysis}, pubstate = {published}, tppubtype = {article} }
2015
Eskin, Eleazar. Discovering Genes Involved in Disease and the Mystery of Missing Heritability. Journal Article. Commun. ACM, 58 (10), pp. 80-87, 2015, ISSN: 0001-0782. Tags: Association Study Methods, Heritability, Review
@article{Eskin:2015:DGI:2830674.2817827, title = {Discovering Genes Involved in Disease and the Mystery of Missing Heritability}, author = {Eleazar Eskin}, url = {http://doi.acm.org/10.1145/2817827}, doi = {10.1145/2817827}, issn = {0001-0782}, year = {2015}, date = {2015-01-01}, journal = {Commun. ACM}, volume = {58}, number = {10}, pages = {80-87}, publisher = {ACM}, address = {New York, NY, USA}, abstract = {The challenge of missing heritability offers great contribution options for computer scientists. Key Insights: 1. Over the past several years, thousands of genetic variants that have been implicated in dozens of common diseases have been discovered. 2. Despite this progress, only a fraction of the variants involved in disease have been discovered, a phenomenon referred to as “missing heritability.” 3. Many challenges related to understanding the mystery of missing heritability and discovering the variants involved in human disease require analysis of large datasets that present opportunities for computer scientists.}, keywords = {Association Study Methods, Heritability, Review}, pubstate = {published}, tppubtype = {article} }
2013
Kostem, Emrah; Eskin, Eleazar. Efficiently Identifying Significant Associations in Genome-Wide Association Studies. Conference. Research in Computational Molecular Biology, University of California, Springer Berlin Heidelberg, 2013. Tags: Association Study Methods, Expression QTLs
@conference{Kostem:ResearchInComputationalMolecularBiology:201, title = {Efficiently Identifying Significant Associations in Genome-Wide Association Studies}, author = {Emrah Kostem and Eleazar Eskin}, url = {http://dx.doi.org/10.1007/978-3-642-37195-0_10}, year = {2013}, date = {2013-01-01}, booktitle = {Research in Computational Molecular Biology}, pages = {118-131}, publisher = {Springer Berlin Heidelberg}, organization = {University of California}, abstract = {Over the past several years, genome wide association studies (GWAS) have implicated hundreds of genes in common disease. More recently, the GWAS approach has been utilized to identify regions of the genome which harbor variation affecting gene expression or expression quantitative trait loci (eQTLs). Unlike GWAS applied to clinical traits where only a handful of phenotypes are analyzed per study, in eQTL studies, tens of thousands of gene expression levels are measured and the GWAS approach is applied to each gene expression level. This leads to computing billions of statistical tests and requires substantial computational resources, particularly when applying novel statistical methods such as mixed-models. We introduce a novel two-stage testing procedure that identifies all of the significant associations more efficiently than testing all the SNPs. In the first stage, a small number of informative SNPs, or proxies, across the genome are tested. Based on their observed associations, our approach locates the regions which may contain significant SNPs and only tests additional SNPs from those regions. We show through simulations and analysis of real GWAS datasets that the proposed two-stage procedure increases the computational speed by a factor of 10. Additionally, efficient implementation of our software increases the computational speed relative to state of the art testing approaches by a factor of 75.}, keywords = {Association Study Methods, Expression QTLs}, pubstate = {published}, tppubtype = {conference} }
Kostem, Emrah; Eskin, Eleazar. Efficiently Identifying Significant Associations in Genome-wide Association Studies. Journal Article. J Comput Biol, 20 (10), pp. 817-30, 2013, ISSN: 1557-8666. Tags: Association Study Methods, Expression QTLs
@article{Kostem:JComputBiol:2013, title = {Efficiently Identifying Significant Associations in Genome-wide Association Studies.}, author = {Emrah Kostem and Eleazar Eskin}, url = {http://dx.doi.org/10.1089/cmb.2013.0087}, issn = {1557-8666}, year = {2013}, date = {2013-01-01}, journal = {J Comput Biol}, volume = {20}, number = {10}, pages = {817-30}, address = {United States}, organization = {Computer Science Department, University of California, Los Angeles, California.}, abstract = {Over the past several years, genome-wide association studies (GWAS) have implicated hundreds of genes in common disease. More recently, the GWAS approach has been utilized to identify regions of the genome that harbor variation affecting gene expression or expression quantitative trait loci (eQTLs). Unlike GWAS applied to clinical traits, where only a handful of phenotypes are analyzed per study, in eQTL studies, tens of thousands of gene expression levels are measured, and the GWAS approach is applied to each gene expression level. This leads to computing billions of statistical tests and requires substantial computational resources, particularly when applying novel statistical methods such as mixed models. We introduce a novel two-stage testing procedure that identifies all of the significant associations more efficiently than testing all the single nucleotide polymorphisms (SNPs). In the first stage, a small number of informative SNPs, or proxies, across the genome are tested. Based on their observed associations, our approach locates the regions that may contain significant SNPs and only tests additional SNPs from those regions. We show through simulations and analysis of real GWAS datasets that the proposed two-stage procedure increases the computational speed by a factor of 10. Additionally, efficient implementation of our software increases the computational speed relative to the state-of-the-art testing approaches by a factor of 75.}, keywords = {Association Study Methods, Expression QTLs}, pubstate = {published}, tppubtype = {article} }
2012
Darnell, Gregory; Duong, Dat; Han, Buhm; Eskin, Eleazar. Incorporating prior information into association studies. Journal Article. Bioinformatics, 28 (12), pp. i147-i153, 2012, ISSN: 1367-4811. Tags: Association Priors, Association Study Methods
@article{Darnell:Bioinformatics:2012, title = {Incorporating prior information into association studies.}, author = {Gregory Darnell and Dat Duong and Buhm Han and Eleazar Eskin}, url = {http://dx.doi.org/10.1093/bioinformatics/bts235}, issn = {1367-4811}, year = {2012}, date = {2012-01-01}, journal = {Bioinformatics}, volume = {28}, number = {12}, pages = {i147-i153}, address = {England}, organization = {Department of Computer Science, University of California, Los Angeles, CA 90095, Department of Statistics, University of California, Berkeley, CA 94720 and Department of Human Genetics, University of }, abstract = {SUMMARY: Recent technological developments in measuring genetic variation have ushered in an era of genome-wide association studies which have discovered many genes involved in human disease. Current methods to perform association studies collect genetic information and compare the frequency of variants in individuals with and without the disease. Standard approaches do not take into account any information on whether or not a given variant is likely to have an effect on the disease. We propose a novel method for computing an association statistic which takes into account prior information. Our method improves both power and resolution by 8% and 27%, respectively, over traditional methods for performing association studies when applied to simulations using the HapMap data. Advantages of our method are that it is as simple to apply to association studies as standard methods, the results of the method are interpretable as the method reports p-values, and the method is optimal in its use of prior information in regards to statistical power. AVAILABILITY: The method presented herein is available at http://masa.cs.ucla.edu CONTACT: eeskin@cs.ucla.edu.}, keywords = {Association Priors, Association Study Methods}, pubstate = {published}, tppubtype = {article} }
2011
Kostem, Emrah; Lozano, Jose A; Eskin, Eleazar. Increasing Power of Genome-wide Association Studies by Collecting Additional SNPs. Journal Article. Genetics, 2011, ISSN: 1943-2631. Tags: Association Study Methods
@article{Kostem:Genetics:2011, title = {Increasing Power of Genome-wide Association Studies by Collecting Additional SNPs.}, author = {Emrah Kostem and Jose A. Lozano and Eleazar Eskin}, url = {http://dx.doi.org/10.1534/genetics.111.128595}, issn = {1943-2631}, year = {2011}, date = {2011-01-01}, journal = {Genetics}, organization = {University of California, Los Angeles;}, abstract = {Genome-wide association studies (GWAS) have been effectively identifying the genomic regions associated with a disease trait. In a typical GWAS, an informative subset of the single nucleotide polymorphisms (SNPs), called tag SNPs, are genotyped in case-control individuals. Once the tag SNP statistics are computed, the genomic regions that are in linkage disequilibrium (LD) with the most significantly associated tag SNPs are believed to contain the causal polymorphisms. However, such LD regions are often large and contain many additional polymorphisms. Following up all the SNPs included in these regions is costly and infeasible for biological validation. In this paper we address how to characterize these regions cost-effectively with the goal of providing investigators a clear direction for biological validation. We introduce a follow-up study approach for identifying all untyped associated SNPs by selecting additional SNPs, called follow-up SNPs, from the associated regions and genotyping them in the original case-control individuals. We introduce a novel SNP selection method with the goal of maximizing the number of associated SNPs among the chosen follow-up SNPs. We show how the observed statistics of the original tag SNPs and human genetic variation reference data such as the HapMap Project can be utilized to identify the follow-up SNPs. We use simulated and real association studies based on the HapMap data and the Wellcome Trust Case-Control Consortium to demonstrate that our method performs better than the correlation- and distance-based traditional follow-up SNP selection approaches. Our method is publicly available at http://genetics.cs.ucla.edu/followupSNPs.}, keywords = {Association Study Methods}, pubstate = {published}, tppubtype = {article} }
Han, Buhm; Hackel, Brian M; Eskin, Eleazar. Postassociation cleaning using linkage disequilibrium information. Journal Article. Genet Epidemiol, 35 (1), pp. 1-10, 2011, ISSN: 1098-2272. Tags: Association Study Methods
@article{Han:GenetEpidemiol:2011, title = {Postassociation cleaning using linkage disequilibrium information.}, author = {Buhm Han and Brian M. Hackel and Eleazar Eskin}, url = {http://dx.doi.org/10.1002/gepi.20544}, issn = {1098-2272}, year = {2011}, date = {2011-01-01}, journal = {Genet Epidemiol}, volume = {35}, number = {1}, pages = {1-10}, address = {United States}, organization = {Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California.}, abstract = {In genetic association studies, quality control (QC) filters are applied to remove potentially problematic markers before the markers are tested for statistical associations. However, spurious associations can still occur after QC. We introduce a Post-Association Cleaning (PAC) approach that can complement QC by capturing spurious associations using the information in the post-association results. Specifically, we propose a PAC filter based on the linkage disequilibrium (LD) information. The intuition is that if the association is caused by a true genetic effect, neighboring markers in LD should show comparably significant P-values. If not, it may be evidence of spurious association. Previous studies have applied the same idea but only manually without a formal statistical framework. Our proposed method LD-PAC provides a systematic framework to quantitatively measure the evidence of spurious associations based on the likelihood ratio. Simulations show that LD-PAC can detect spurious associations with a high detection rate (84%). In addition to detecting spurious associations, our method can also be used to "rescue" candidate associations from the supposedly unclean data such as the markers excluded by QC. Although the additional associations must be treated with care, they can suggest interesting regions. The application of our method to the Wellcome Trust Case Control Consortium (WTCCC) data led to the discovery of an additional candidate association for type 1 diabetes among the QC-excluded markers. This locus turns out to be in a region recently identified as significant by a meta-analysis performed after the WTCCC study was published. Genet. Epidemiol. 35:1-10, 2011. copyright 2010 Wiley-Liss, Inc.}, keywords = {Association Study Methods}, pubstate = {published}, tppubtype = {article} }
2010
Santana, Roberto; Mendiburu, Alexander; Zaitlen, Noah; Eskin, Eleazar; Lozano, Jose A. Multi-marker tagging single nucleotide polymorphism selection using estimation of distribution algorithms. Journal Article. Artif Intell Med, 2010, ISSN: 1873-2860. Tags: Association Study Methods
@article{Santana:ArtifIntellMed:2010, title = {Multi-marker tagging single nucleotide polymorphism selection using estimation of distribution algorithms.}, author = {Roberto Santana and Alexander Mendiburu and Noah Zaitlen and Eleazar Eskin and Jose A. Lozano}, url = {http://dx.doi.org/10.1016/j.artmed.2010.05.010}, issn = {1873-2860}, year = {2010}, date = {2010-01-01}, journal = {Artif Intell Med}, organization = {Faculty of Informatics, Universidad Politécnica de Madrid, R. 3306, Campus de Montegancedo, 28660 Boadilla del Monte, Madrid, Spain.}, abstract = {OBJECTIVES: This paper presents an optimization algorithm for the automatic selection of a minimal subset of tagging single nucleotide polymorphisms (SNPs). METHODS AND MATERIALS: The determination of the set of minimal tagging SNPs is approached as an optimization problem in which each tagged SNP can be covered by a single tagging SNP or by a pair of tagging SNPs. The problem is solved using an estimation of distribution algorithm (EDA) which takes advantage of the underlying topological structure defined by the SNP correlations to model the problem interactions. The EDA stochastically searches the constrained space of feasible solutions. It is evaluated across HapMap reference panel data sets. RESULTS: The EDA was compared with a SAT solver, able to find the single-marker minimal tagging sets, and with the Tagger program. The percentage of reduction ranged from 10% to 43% in the number of tagging SNPs of the minimal multi-marker tagging set found by the EDA with respect to the other algorithms. CONCLUSIONS: The introduced algorithm is effective for the identification of minimal multi-marker SNP sets, which considerably reduce the dimension of the tagging SNP set in comparison with single-marker sets. Other variants of the SNP problem can be treated following the same approach.}, keywords = {Association Study Methods}, pubstate = {published}, tppubtype = {article} }
2009 |
Zaitlen, Noah; Kang, Hyun Min; Eskin, Eleazar. Linkage Effects and Analysis of Finite Sample Errors in the HapMap. Journal Article. Hum Hered, 68 (2), pp. 73-86, 2009, ISSN: 1423-0062. Tags: Association Study Methods
@article{Zaitlen:HumHered:2009,
  title = {Linkage Effects and Analysis of Finite Sample Errors in the HapMap.},
  author = {Noah Zaitlen and Hyun Min Kang and Eleazar Eskin},
  url = {http://dx.doi.org/10.1159/000212500},
  issn = {1423-0062},
  year = {2009},
  date = {2009-01-01},
  journal = {Hum Hered},
  volume = {68},
  number = {2},
  pages = {73-86},
  organization = {Bioinformatics Program, University of California, San Diego, Calif., USA.},
  abstract = {The HapMap provides a valuable resource to help uncover genetic variants of important complex phenotypes such as disease risk and outcome. Using the HapMap we can infer the patterns of LD within different human populations. This is a critical step for determining which SNPs to genotype as part of a study, estimating study power, designing a follow-up study to identify the causal variants, 'imputing' untyped SNPs, and estimating recombination rates along the genome. Despite its tremendous importance, the HapMap suffers from the fundamental limitation that at most 60 unrelated individuals are available per population. We present an analytical framework for analyzing the implications of a finite sample HapMap. We present and justify simple approximations for deriving analytical estimates of important statistics such as the square of the correlation coefficient $r^2$ between two SNPs. Finally, we use this framework to show that current HapMap based estimates of $r^2$ and power have significant errors, and that tag sets highly overestimate their coverage. We show that a reasonable increase in the number of individuals, such as that proposed by the 1000 genomes project, greatly reduces the errors due to finite sample size for a large proportion of SNPs.},
  keywords = {Association Study Methods},
  pubstate = {published},
  tppubtype = {article}
} |
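The finite-sample effect described in the abstract above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the analytical framework of the paper: the haplotype frequencies in `POP`, the sample sizes, and every function name are hypothetical and chosen for illustration. It draws haplotypes from a population with a known $r^2$ and measures how far the sample estimate strays from the truth.

```python
import random


def r_squared(p_ab, p_a, p_b):
    """r^2 between two biallelic SNPs from haplotype and allele frequencies."""
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:  # a monomorphic SNP has no defined correlation
        return 0.0
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom


def true_r_squared(hap_freqs):
    """Population r^2 computed directly from the haplotype frequencies."""
    p_a = hap_freqs["AB"] + hap_freqs["Ab"]
    p_b = hap_freqs["AB"] + hap_freqs["aB"]
    return r_squared(hap_freqs["AB"], p_a, p_b)


def sample_r_squared(hap_freqs, n_chrom, rng):
    """Estimate r^2 from n_chrom haplotypes drawn at random from the population."""
    haps = rng.choices(list(hap_freqs), weights=list(hap_freqs.values()), k=n_chrom)
    p_ab = sum(h == "AB" for h in haps) / n_chrom
    p_a = sum(h[0] == "A" for h in haps) / n_chrom
    p_b = sum(h[1] == "B" for h in haps) / n_chrom
    return r_squared(p_ab, p_a, p_b)


def mean_abs_error(hap_freqs, n_chrom, reps=200, seed=0):
    """Average |estimate - truth| over repeated samples of n_chrom chromosomes."""
    rng = random.Random(seed)
    truth = true_r_squared(hap_freqs)
    errors = [abs(sample_r_squared(hap_freqs, n_chrom, rng) - truth) for _ in range(reps)]
    return sum(errors) / reps


# Hypothetical population haplotype frequencies for two SNPs (A/a and B/b).
POP = {"AB": 0.4, "Ab": 0.1, "aB": 0.1, "ab": 0.4}

if __name__ == "__main__":
    # 60 unrelated individuals give 120 chromosomes; 2000 is an arbitrary larger panel.
    for n in (120, 2000):
        print(n, round(mean_abs_error(POP, n), 4))
```

Under these assumptions the average error at 120 chromosomes is visibly larger than at 2000, which is the qualitative point the abstract makes about panel size.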
2008 |
Han, B; Kang, H M; Seo, M S; Zaitlen, N; Eskin, E. Efficient association study design via power-optimized tag SNP selection. Journal Article. Ann Hum Genet, 72 (Pt 6), pp. 834-47, 2008, ISSN: 1469-1809. Tags: Association Study Methods
@article{Han:AnnHumGenet:2008,
  title = {Efficient association study design via power-optimized tag SNP selection.},
  author = {B. Han and H. M. Kang and M. S. Seo and N. Zaitlen and E. Eskin},
  url = {http://dx.doi.org/10.1111/j.1469-1809.2008.00469.x},
  issn = {1469-1809},
  year = {2008},
  date = {2008-01-01},
  journal = {Ann Hum Genet},
  volume = {72},
  number = {Pt 6},
  pages = {834-47},
  address = {England},
  organization = {Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093, USA.},
  abstract = {Discovering statistical correlation between causal genetic variation and clinical traits through association studies is an important method for identifying the genetic basis of human diseases. Since fully resequencing a cohort is prohibitively costly, genetic association studies take advantage of local correlation structure (or linkage disequilibrium) between single nucleotide polymorphisms (SNPs) by selecting a subset of SNPs to be genotyped (tag SNPs). While many current association studies are performed using commercially available high-throughput genotyping products that define a set of tag SNPs, choosing tag SNPs remains an important problem for both custom follow-up studies as well as designing the high-throughput genotyping products themselves. The most widely used tag SNP selection method optimizes the correlation between SNPs ($r^2$). However, tag SNPs chosen based on an $r^2$ criterion do not necessarily maximize the statistical power of an association study. We propose a study design framework that chooses SNPs to maximize power and efficiently measures the power through empirical simulation. Empirical results based on the HapMap data show that our method gains considerable power over a widely used $r^2$-based method, or equivalently reduces the number of tag SNPs required to attain the desired power of a study. Our power-optimized 100k whole genome tag set provides equivalent power to the Affymetrix 500k chip for the CEU population. For the design of custom follow-up studies, our method provides up to twice the power increase using the same number of tag SNPs as $r^2$-based methods. Our method is publicly available via web server at http://design.cs.ucla.edu.},
  keywords = {Association Study Methods},
  pubstate = {published},
  tppubtype = {article}
} |
Choi, Arthur; Zaitlen, Noah; Han, Buhm; Pipatsrisawat, Knot; Darwiche, Adnan; Eskin, Eleazar. Efficient Genome Wide Tagging by Reduction to SAT. Conference. Lecture Notes in Computer Science, 5251/2008, Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2008, ISSN: 0302-9743 (Print) 1611-3349 (Online). Tags: Association Study Methods
@conference{Choi:LectureNotesInComputerScience:2008,
  title = {Efficient Genome Wide Tagging by Reduction to SAT},
  author = {Arthur Choi and Noah Zaitlen and Buhm Han and Knot Pipatsrisawat and Adnan Darwiche and Eleazar Eskin},
  url = {http://dx.doi.org/10.1007/978-3-540-87361-7},
  issn = {0302-9743 (Print) 1611-3349 (Online)},
  year = {2008},
  date = {2008-01-01},
  booktitle = {Lecture Notes in Computer Science},
  volume = {5251/2008},
  pages = {135-147},
  publisher = {Springer Berlin / Heidelberg},
  series = {Lecture Notes in Computer Science},
  abstract = {Whole genome association has recently demonstrated some remarkable successes in identifying loci involved in disease. Designing these studies involves selecting a subset of known single nucleotide polymorphisms (SNPs) or tag SNPs to be genotyped. The problem of choosing tag SNPs is an active area of research and is usually formulated such that the goal is to select the fewest number of tag SNPs which cover the remaining SNPs where cover is defined by some statistical criterion. Since the standard formulation of the tag SNP selection problem is NP-hard, most algorithms for selecting tag SNPs are either heuristics which do not guarantee selection of the minimal set of tag SNPs or are exhaustive algorithms which are computationally impractical. In this paper, we present a set of methods which guarantee discovering the minimal set of tag SNPs, yet in practice are much faster than traditional exhaustive algorithms. We demonstrate that our methods can be applied to discover minimal tag sets for the entire human genome. Our method converts the instance of the tag SNP selection problem to an instance of the satisfiability problem, encoding the instance into conjunctive normal form (CNF). We take advantage of the local structure inherent in human variation, as well as progress in knowledge compilation, and convert our CNF encoding into a representation known as DNNF, from which solutions to our original problem can be easily enumerated. We demonstrate our methods by constructing the optimal tag set for the whole genome and show that we significantly outperform previous exhaustive search-based methods. We also present optimal solutions for the problem of selecting multi-marker tags in which some SNPs are covered by a pair of tag SNPs. Multi-marker tags can significantly decrease the number of tags we need to select; however, discovering the minimal number of multi-marker tags is much more difficult. We evaluate our methods and perform benchmark comparisons to other methods by choosing tag sets using the HapMap data.},
  keywords = {Association Study Methods},
  pubstate = {published},
  tppubtype = {conference}
} |
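The standard formulation mentioned in the abstract above, choosing the fewest tag SNPs that cover all others, can be sketched as a set-cover problem. The code below is not the SAT/DNNF method of the paper. It contrasts the two families the abstract names, a greedy heuristic and a brute-force exhaustive search, on a tiny example. The $r^2$ matrix `R2`, the 0.8 threshold, and all function names are hypothetical.

```python
from itertools import combinations


def cover_sets(r2, threshold=0.8):
    """For each SNP i, the set of SNPs it covers (itself plus any j with r^2 >= threshold)."""
    n = len(r2)
    return [{j for j in range(n) if i == j or r2[i][j] >= threshold} for i in range(n)]


def greedy_tags(cover):
    """Heuristic: repeatedly pick the SNP covering the most still-uncovered SNPs."""
    uncovered = set(range(len(cover)))
    tags = []
    while uncovered:
        best = max(range(len(cover)), key=lambda s: len(cover[s] & uncovered))
        tags.append(best)
        uncovered -= cover[best]
    return tags


def minimal_tags(cover):
    """Exhaustive search over subsets by increasing size; exponential, small inputs only."""
    n = len(cover)
    everything = set(range(n))
    for k in range(1, n + 1):
        for combo in combinations(range(n), k):
            if set().union(*(cover[s] for s in combo)) == everything:
                return list(combo)
    return list(range(n))


# Hypothetical pairwise r^2 values for six SNPs in two LD blocks.
R2 = [
    [1.00, 0.90, 0.30, 0.00, 0.00, 0.00],
    [0.90, 1.00, 0.85, 0.10, 0.00, 0.00],
    [0.30, 0.85, 1.00, 0.20, 0.10, 0.00],
    [0.00, 0.10, 0.20, 1.00, 0.90, 0.40],
    [0.00, 0.00, 0.10, 0.90, 1.00, 0.95],
    [0.00, 0.00, 0.00, 0.40, 0.95, 1.00],
]

if __name__ == "__main__":
    cover = cover_sets(R2)
    print("greedy:", greedy_tags(cover))
    print("minimal:", minimal_tags(cover))
```

On this toy input both approaches select SNPs 1 and 4. In general the greedy result is not guaranteed to be minimal, and the exhaustive search does not scale, which is the gap the abstract describes.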
Eskin, Eleazar. Increasing Power in Association Studies by Using Linkage Disequilibrium Structure and Molecular Function as Prior Information. Conference. Lecture Notes in Computer Science, 4955/2008, Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2008, ISSN: 0302-9743 (Print) 1611-3349 (Online). Tags: Association Priors, Association Study Methods
@conference{Eskin:LectureNotesInComputerScience:2008,
  title = {Increasing Power in Association Studies by Using Linkage Disequilibrium Structure and Molecular Function as Prior Information},
  author = {Eleazar Eskin},
  url = {http://dx.doi.org/10.1007/978-3-540-78839-3},
  issn = {0302-9743 (Print) 1611-3349 (Online)},
  year = {2008},
  date = {2008-01-01},
  booktitle = {Lecture Notes in Computer Science},
  volume = {4955/2008},
  pages = {434},
  publisher = {Springer Berlin / Heidelberg},
  series = {Lecture Notes in Computer Science},
  abstract = {The availability of various types of genomic data provides an opportunity to incorporate this data as prior information in genetic association studies. This information includes knowledge of linkage disequilibrium structure as well as knowledge of which regions are likely to be involved in disease. In this paper, we present an approach for incorporating this information by revisiting how we perform multiple hypothesis correction. In a traditional association study, in order to correct for multiple hypothesis testing, the significance threshold at each marker, $t$, is set to control the total false positive rate. In our framework, we vary the threshold at each marker, $t_i$, and use these thresholds to incorporate prior information. We present a novel Multi-threshold Association Study Analysis (MASA) method for setting these thresholds to maximize the statistical power of the study in the context of the additional information. Intuitively, markers which are correlated with many polymorphisms will have higher thresholds than other markers. The simplest approach for encoding prior information is through assuming a causal probability distribution. In this setting, we assume that the causal polymorphism is chosen from this distribution and only one polymorphism is causal. We refer to the probability that the polymorphism $i$ is causal as its causal probability, $c_i$. Given the causal probabilities, using the approach presented in this paper, we can numerically solve for the marker thresholds which maximize power. By taking advantage of this information, we show how our multi-threshold framework can significantly increase the power of association studies while still controlling the overall false positive rate, $\alpha$, of the study as long as $\sum_i t_i = \alpha$. We present a numerical procedure for solving for thresholds that maximize association study power using prior information. We present the results of benchmark simulation experiments using the HapMap data which demonstrate a significant increase in association study power under this framework. Our optimization algorithm is very efficient and we can obtain thresholds for whole genome associations in minutes. We also present an efficient permutation procedure for correctly adjusting the false positive rate for correlated markers and show how this approach increases computational time only slightly relative to performing permutation tests for traditional association studies. We provide a webserver for performing association studies using this method at http://masa.cs.ucla.edu/. On the website, we provide thresholds optimized for the Affymetrix 500k and Illumina HumanHap 550 chips.},
  keywords = {Association Priors, Association Study Methods},
  pubstate = {published},
  tppubtype = {conference}
} |
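The multi-threshold idea in the abstract above can be made concrete with a toy calculation. This sketch does not reproduce the MASA optimization. It assumes independent markers, exactly one causal marker that is directly genotyped, a two-sided z-test with a common non-centrality parameter `NCP`, and a simple heuristic that sets each threshold proportional to its causal probability. All names and numbers are hypothetical. Both threshold vectors sum to the same overall rate, so they spend the same false-positive budget.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def marker_power(threshold, ncp):
    """Power of a two-sided z-test at significance `threshold` with non-centrality `ncp`."""
    z = _STD.inv_cdf(1 - threshold / 2)
    return _STD.cdf(ncp - z) + _STD.cdf(-ncp - z)


def expected_power(thresholds, causal_probs, ncp):
    """Power averaged over which marker is causal, weighted by its causal probability."""
    return sum(c * marker_power(t, ncp) for t, c in zip(thresholds, causal_probs))


ALPHA = 0.05                   # overall false positive rate to control
CAUSAL = [0.7, 0.1, 0.1, 0.1]  # hypothetical causal probabilities, summing to 1
NCP = 3.0                      # hypothetical non-centrality at the causal marker

# Traditional correction: the same threshold at every marker.
uniform = [ALPHA / len(CAUSAL)] * len(CAUSAL)
# Illustrative heuristic (not the paper's numerical solution): threshold proportional to prior.
weighted = [ALPHA * c for c in CAUSAL]

if __name__ == "__main__":
    print("uniform :", round(expected_power(uniform, CAUSAL, NCP), 3))
    print("weighted:", round(expected_power(weighted, CAUSAL, NCP), 3))
```

With this skewed prior the weighted thresholds give higher expected power than the uniform ones at the same total threshold. The paper's procedure solves for the thresholds numerically and also accounts for correlation between markers, which this sketch ignores.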
Eskin, Eleazar. Increasing power in association studies by using linkage disequilibrium structure and molecular function as prior information. Journal Article. Genome Res, 18 (4), pp. 653-60, 2008, ISSN: 1088-9051. Tags: Association Priors, Association Study Methods
@article{Eskin:GenomeRes:2008,
  title = {Increasing power in association studies by using linkage disequilibrium structure and molecular function as prior information.},
  author = {Eleazar Eskin},
  url = {http://dx.doi.org/10.1101/gr.072785.107},
  issn = {1088-9051},
  year = {2008},
  date = {2008-01-01},
  journal = {Genome Res},
  volume = {18},
  number = {4},
  pages = {653-60},
  address = {United States},
  organization = {Department of Computer Science and Human Genetics, University of California, Los Angeles, Los Angeles, California 90095, USA. eeskin@cs.ucla.edu},
  abstract = {The availability of various types of genomic data provides an opportunity to incorporate this data as prior information in genetic association studies. This information includes knowledge of linkage disequilibrium structure as well as which regions are likely to be involved in disease. In this paper, we present an approach for incorporating this information by revisiting how we perform multiple-hypothesis correction. In a traditional association study, in order to correct for multiple-hypothesis testing, the significance threshold at each marker, $t$, is set to control the total false-positive rate. In our framework, we vary the threshold at each marker, $t_i$, and use these thresholds to incorporate prior information. We present a numerical procedure for solving for thresholds that maximize association study power using prior information. We also present the results of benchmark simulation experiments using the HapMap data, which demonstrate a significant increase in association study power under this framework. We provide a Web server for performing association studies using our method, provide thresholds optimized for the Affymetrix 500k and Illumina HumanHap 550 chips, and demonstrate the application of our framework to the analysis of the Wellcome Trust Case Control Consortium data.},
  keywords = {Association Priors, Association Study Methods},
  pubstate = {published},
  tppubtype = {article}
} |
2007 |
Zaitlen, Noah; Kang, Hyun Min; Eskin, Eleazar; Halperin, Eran. Leveraging the HapMap correlation structure in association studies. Journal Article. Am J Hum Genet, 80 (4), pp. 683-91, 2007, ISSN: 0002-9297. Tags: Association Study Methods
@article{Zaitlen:AmJHumGenet:2007,
  title = {Leveraging the HapMap correlation structure in association studies.},
  author = {Noah Zaitlen and Hyun Min Kang and Eleazar Eskin and Eran Halperin},
  url = {http://dx.doi.org/10.1086/513109},
  issn = {0002-9297},
  year = {2007},
  date = {2007-01-01},
  journal = {Am J Hum Genet},
  volume = {80},
  number = {4},
  pages = {683-91},
  address = {United States},
  organization = {Bioinformatics Program, University of California-San Diego, La Jolla, CA, USA.},
  abstract = {Recent high-throughput genotyping technologies, such as the Affymetrix 500k array and the Illumina HumanHap 550 beadchip, have driven down the costs of association studies and have enabled the measurement of single-nucleotide polymorphism (SNP) allele frequency differences between case and control populations on a genomewide scale. A key aspect in the efficiency of association studies is the notion of "indirect association," where only a subset of SNPs are collected to serve as proxies for the uncollected SNPs, taking advantage of the correlation structure between SNPs. Recently, a new class of methods for indirect association, multimarker methods, has been proposed. Although the multimarker methods are a considerable advancement, current methods do not fully take advantage of the correlation structure between SNPs and their multimarker proxies. In this article, we propose a novel multimarker indirect-association method, WHAP, that is based on a weighted sum of the haplotype frequency differences. In contrast to traditional indirect-association methods, we show analytically that there is a considerable gain in power achieved by our method compared with both single-marker and multimarker tests, as well as traditional haplotype-based tests. Our results are supported by empirical evaluation across the HapMap reference panel data sets, and a software implementation for the Affymetrix 500k and Illumina HumanHap 550 chips is available for download.},
  keywords = {Association Study Methods},
  pubstate = {published},
  tppubtype = {article}
} |
2005 |
Hinds, David A; Stuve, Laura L; Nilsen, Geoffrey B; Halperin, Eran; Eskin, Eleazar; Ballinger, Dennis G; Frazer, Kelly A; Cox, David R. Whole-genome patterns of common DNA variation in three human populations. Journal Article. Science, 307 (5712), pp. 1072-9, 2005, ISSN: 1095-9203. Tags: Association Study Methods, Population Genetics
@article{Hinds:Science:2005,
  title = {Whole-genome patterns of common DNA variation in three human populations.},
  author = {David A. Hinds and Laura L. Stuve and Geoffrey B. Nilsen and Eran Halperin and Eleazar Eskin and Dennis G. Ballinger and Kelly A. Frazer and David R. Cox},
  url = {http://dx.doi.org/10.1126/science.1105436},
  issn = {1095-9203},
  year = {2005},
  date = {2005-01-01},
  journal = {Science},
  volume = {307},
  number = {5712},
  pages = {1072-9},
  address = {United States},
  organization = {Perlegen Sciences Inc., 2021 Stierlin Court, Mountain View, CA 94043, USA.},
  abstract = {Individual differences in DNA sequence are the genetic basis of human variability. We have characterized whole-genome patterns of common human DNA variation by genotyping 1,586,383 single-nucleotide polymorphisms (SNPs) in 71 Americans of European, African, and Asian ancestry. Our results indicate that these SNPs capture most common genetic variation as a result of linkage disequilibrium, the correlation among common SNP alleles. We observe a strong correlation between extended regions of linkage disequilibrium and functional genomic elements. Our data provide a tool for exploring many questions that remain regarding the causal role of common human DNA variation in complex human traits and for investigating the nature of genetic variation within and between human populations.},
  keywords = {Association Study Methods, Population Genetics},
  pubstate = {published},
  tppubtype = {article}
} |