Fine Mapping Causal Variants and Allelic Heterogeneity

On Friday, April 28, 2017, in the CNSI Auditorium, Eleazar Eskin presented ZarLab’s research on fine mapping causal variants and allelic heterogeneity at the 2nd Annual Institute for Quantitative and Computational Biosciences (QCBio) Symposium.

Geneticists use genome-wide association studies (GWAS) to identify genetic variants that influence whether an individual exhibits a particular trait or disease. Typically, a GWAS identifies an association signal, which suggests that genetic variants within a region of the genome, known as a locus, are associated with the condition. The process of identifying the actual variant in the region that has an effect on the disease is referred to as “fine mapping.”

In addition to finding the actual variants affecting a disease, fine mapping also seeks to address broader questions about the genetic basis of disease. First, how many causal variants does a locus contain? A disease could be caused by a single variant or by multiple variants that independently affect disease status. We refer to the latter phenomenon as allelic heterogeneity (AH).

Second, when analyzing results from multiple GWASes, can the same causal variant identified in one study be assumed causal in other studies? A GWAS can identify many variants that are associated with two or more traits; however, this correlation can be induced by linkage disequilibrium (LD), the non-random correlation between nearby variants, which acts as a confounding factor. Colocalization methods seek to distinguish shared causal variants from distinct ones.
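To make the confounding effect of linkage disequilibrium concrete, here is a minimal simulation sketch. It is not taken from any of the software discussed in this post: the function names, the allele frequency, the effect size, and the "copy probability" used to create LD are all illustrative assumptions. Only one SNP affects the trait, yet a neighboring SNP in LD with it also shows an association.

```python
import random
import statistics


def simulate_ld_pair(n=2000, maf=0.3, ld_copy_prob=0.9, effect=0.5, seed=0):
    """Simulate a causal SNP, a non-causal SNP in LD with it, and a trait.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    causal, tag, trait = [], [], []
    for _ in range(n):
        g_causal, g_tag = 0, 0
        # Each genotype (0, 1, or 2) is the sum of two haplotypes.
        for _ in range(2):
            a = 1 if rng.random() < maf else 0
            # The neighboring allele usually copies the causal allele (LD);
            # otherwise it is drawn independently.
            if rng.random() < ld_copy_prob:
                b = a
            else:
                b = 1 if rng.random() < maf else 0
            g_causal += a
            g_tag += b
        causal.append(g_causal)
        tag.append(g_tag)
        # Only the causal SNP influences the trait.
        trait.append(effect * g_causal + rng.gauss(0.0, 1.0))
    return causal, tag, trait


def correlation(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


causal, tag, trait = simulate_ld_pair()
r_ld = correlation(causal, tag)        # LD between the two SNPs
r_causal = correlation(causal, trait)  # association of the causal SNP
r_tag = correlation(tag, trait)        # association induced purely by LD
print(f"LD r = {r_ld:.2f}, causal r = {r_causal:.2f}, non-causal r = {r_tag:.2f}")
```

In this toy setting the non-causal SNP is associated with the trait only because it is correlated with the causal one, which is why association alone cannot tell the two apart.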

Farhad Hormozdiari, a recent alumnus of our group and a post-doc at Harvard University, developed several novel approaches for improving the accuracy and efficiency of fine mapping despite the presence of AH in the study population. Hormozdiari’s software packages, CAVIAR, CAVIAR-Genes, and eCAVIAR, quantify the probability that a variant is causal in GWAS and eQTL studies, while allowing for an arbitrary number of causal variants.
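The idea of assigning each variant a probability of being causal, while allowing more than one causal variant, can be sketched in a few lines. The code below is a simplified illustration and not the CAVIAR software: it assumes that, given a set of causal variants, the z-scores follow a zero-mean multivariate normal distribution with covariance Σ + σ² Σ_c Σ_cᵀ (Σ is the LD matrix, Σ_c its columns for the causal variants), and it enumerates all configurations with at most `max_causal` causal variants. The function names, the prior `prior_causal`, and the effect variance `effect_var` are hypothetical choices made for this example.

```python
import itertools
import math


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def mvn_logpdf_zero_mean(x, cov):
    """Log density of N(0, cov) at x, computed via the Cholesky factor."""
    low = cholesky(cov)
    n = len(x)
    y = []  # forward substitution: solve low @ y = x
    for i in range(n):
        y.append((x[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    quad = sum(v * v for v in y)
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)


def posterior_inclusion(z, ld, max_causal=2, prior_causal=0.01, effect_var=25.0):
    """Posterior probability that each variant is causal (illustrative sketch)."""
    m = len(z)
    log_weights = {}
    for k in range(max_causal + 1):
        for config in itertools.combinations(range(m), k):
            # Covariance of z-scores when the variants in `config` are causal.
            cov = [[ld[i][j] + effect_var * sum(ld[i][c] * ld[c][j] for c in config)
                    for j in range(m)] for i in range(m)]
            log_prior = (k * math.log(prior_causal)
                         + (m - k) * math.log(1.0 - prior_causal))
            log_weights[config] = log_prior + mvn_logpdf_zero_mean(z, cov)
    top = max(log_weights.values())  # subtract the maximum for numerical stability
    weights = {c: math.exp(w - top) for c, w in log_weights.items()}
    total = sum(weights.values())
    return [sum(w for c, w in weights.items() if i in c) / total for i in range(m)]


# Toy locus: SNP 0 and SNP 1 are in strong LD; SNP 2 is independent.
ld = [[1.0, 0.8, 0.0],
      [0.8, 1.0, 0.0],
      [0.0, 0.0, 1.0]]
z = [5.0, 4.2, 0.3]
pips = posterior_inclusion(z, ld)
print([round(p, 3) for p in pips])
```

Exhaustive enumeration grows combinatorially with the number of variants, so this sketch is only practical for very small loci; it is meant to show the shape of the computation rather than an efficient implementation.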

In his presentation, Eskin summarizes the progress on these problems. A video of the presentation may be found on the QCBio website:

More details about our research in fine mapping are available in the following papers:

Hormozdiari, Farhad; van de Bunt, Martijn; Segrè, Ayellet V; Li, Xiao; Joo, Jong Wha J; Bilow, Michael; Sul, Jae Hoon; Sankararaman, Sriram; Pasaniuc, Bogdan; Eskin, Eleazar

Colocalization of GWAS and eQTL Signals Detects Target Genes. Journal Article

In: Am J Hum Genet, 2016, ISSN: 1537-6605.


Hormozdiari, Farhad; Kichaev, Gleb; Yang, Wen-Yun Y; Pasaniuc, Bogdan; Eskin, Eleazar

Identification of causal genes for complex traits. Journal Article

In: Bioinformatics, 31 (12), pp. i206-i213, 2015, ISSN: 1367-4811.


Hormozdiari, Farhad; Kostem, Emrah; Kang, Eun Yong; Pasaniuc, Bogdan; Eskin, Eleazar

Identifying causal variants at Loci with multiple signals of association. Journal Article

In: Genetics, 198 (2), pp. 497-508, 2014, ISSN: 1943-2631.


Hormozdiari F, Zhu A, Kichaev G, Ju CJ, Segrè AV, Joo JW, Won H, Sankararaman S, Pasaniuc B, Shifman S, Eskin E. Widespread allelic heterogeneity in complex traits. The American Journal of Human Genetics. 2017 May 4;100(5):789-802.

Involving undergraduates in genomics research to narrow the education-research gap

Serghei Mangul and Lana Martin, together with Eleazar Eskin, recently wrote a paper describing a model for training undergraduates in bioinformatics. Our paper is available online as a preprint and is under review at a peer-reviewed journal.

The Education-Research Gap in Universities.

While the benefits of undergraduate research experiences (UREs) are recognized for undergraduates, the advantages of UREs for graduate students, post-doctoral scholars, and faculty are not clearly outlined.

Based on our experience mentoring undergraduates in ZarLab, we believe that the analysis of genomic data is particularly well-suited for successful involvement of undergraduates. In computational genomics research, undergraduate trainees who master a particular skill can contribute sufficient work to gain authorship on a peer-reviewed paper.

In our paper, we offer a framework for engaging undergraduates in genomics research while simultaneously improving lab productivity. First, identify particular “low-level” tasks that may take up to a week for an undergraduate to complete. Second, encourage students to “outsource” foundational education needs to workshops, online resources, and review articles. Third, take advantage of department- and campus-wide undergraduate research and training initiatives.

The proposed strategy can be easily reproduced at other institutions, is pedagogically flexible, and is scalable from smaller to larger laboratory sizes. We hope that genomics researchers will involve undergraduates in more computational tasks that benefit both students and senior laboratory members.

Preprint copies of our manuscript are available for download here:

In tandem with this paper, we created an online catalogue of resources and papers aimed at bridging the research-teaching divide in computational genomics:

The full citation of our paper:
Mangul, S., Martin, L. and Eskin, E., 2017. Involving undergraduates in genomics research to narrow the education-research gap. PeerJ Preprints, 5, p.e3149v1.


Benefits of UREs to Research Lab and Undergraduates.

Applying meta-analysis to genotype-tissue expression data from multiple tissues to identify eQTLs and increase the number of eGenes

Dat Duong, a graduate student in our lab, developed a novel method that will help find more eQTLs and eGenes in gene expression data from many tissues. A paper presenting his method will appear in an upcoming issue of Bioinformatics.

Genome-wide association studies (GWAS) seek links between single-nucleotide polymorphisms (SNPs) and traits or diseases. SNPs are the most commonly occurring sources of variation in the human genome. Many SNPs identified by GWAS are located in intergenic regions, stretches of DNA sequences located between genes. SNPs identified in these primarily noncoding regions often do not have an obvious relationship to the disease phenotype. Other lines of evidence, such as gene expression, are required to explore this relationship and learn about disease function.

Gene expression, an intermediate phenotype between a causal SNP and a disease, can be used to interpret positive results produced by a GWAS. Two key concepts are expression quantitative trait loci (eQTLs), which are genetic variants associated with gene expression in particular tissue types, and eGenes, which are genes whose expression levels are associated with genetic variants. Both eQTL studies and GWAS focus on SNPs, but eQTL studies may provide biological insights into the mechanism of disease development. For this reason, we pay special attention to variants that are eQTLs for some eGene and that also have strong association signals identified by GWAS.
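The relationship between eQTLs and eGenes can be illustrated with a short sketch. It assumes that a p-value has already been computed for each SNP-gene pair in one tissue, and it uses a simple per-gene Bonferroni correction to decide significance; real pipelines often use permutation-based corrections instead. The function name, gene names, SNP identifiers, and p-values below are all made up for illustration.

```python
def call_eqtls_and_egenes(pvalues_by_gene, alpha=0.05):
    """Return significant (SNP, gene) pairs and the genes with at least one.

    `pvalues_by_gene` maps a gene name to a dict of {SNP id: p-value}.
    Significance uses a per-gene Bonferroni correction (an assumption of
    this sketch, not a description of any specific published pipeline).
    """
    eqtls = []
    egenes = set()
    for gene, snp_pvalues in pvalues_by_gene.items():
        n_tests = len(snp_pvalues)
        for snp, p in snp_pvalues.items():
            if p * n_tests < alpha:
                eqtls.append((snp, gene))  # this SNP is an eQTL for the gene
                egenes.add(gene)           # so the gene is an eGene
    return eqtls, sorted(egenes)


# Hypothetical association p-values for two genes in one tissue.
example = {
    "GENE_A": {"rs1": 1e-6, "rs2": 0.20, "rs3": 0.04},
    "GENE_B": {"rs4": 0.30, "rs5": 0.08},
}
eqtls, egenes = call_eqtls_and_egenes(example)
print(eqtls, egenes)
```

Here only `rs1` survives correction, so `GENE_A` is the only eGene; `rs3` has a nominal p-value below 0.05 but does not pass once the three tests for that gene are accounted for.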

Multi-tissue gene expression datasets like the Genotype-Tissue Expression (GTEx) data are used to find eQTLs and eGenes. However, these datasets have small sample sizes in some tissues. Many meta-analysis methods have been designed to increase power for finding eQTLs and eGenes by combining gene expression data across many tissues. However, these techniques cannot scale to datasets containing many tissue types, like the GTEx data. Such methods also ignore the biological principle that the same variant may be associated with the same gene across similar tissues.
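As a point of reference for how combining tissues increases power, the sketch below implements a standard fixed-effects, inverse-variance-weighted meta-analysis for a single SNP-gene pair. This is a textbook baseline, not the method developed in our lab, and it treats the per-tissue estimates as independent. The function name and the effect sizes and standard errors in the example are hypothetical.

```python
import math


def fixed_effects_meta(betas, std_errs):
    """Combine per-tissue effect estimates with inverse-variance weights.

    Returns the pooled effect, its standard error, the z-score, and a
    two-sided p-value under a normal approximation.
    """
    weights = [1.0 / se ** 2 for se in std_errs]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return beta, se, z, p


# Hypothetical effect of one SNP on one gene, estimated in three tissues.
betas = [0.20, 0.25, 0.15]
std_errs = [0.10, 0.12, 0.11]
beta, se, z, p = fixed_effects_meta(betas, std_errs)
print(f"pooled beta = {beta:.3f}, se = {se:.3f}, z = {z:.2f}, p = {p:.4f}")
```

None of the three tissues gives a z-score above about 2.1 on its own, but the pooled estimate has a smaller standard error and therefore a stronger signal. The independence assumption is exactly what breaks down when tissues are related, which motivates modeling the covariance between tissues.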


Venn diagram of the numbers of eGenes found by existing methods and RECOV, along with correlation matrices comparing methods. For more information, read our full paper.

To leverage the analytical power of eQTLs and eGenes in association studies, Duong and his team developed a new meta-analysis method named RECOV. Based on the principle that a SNP may have a similar effect on the same gene in related tissues, RECOV can be applied to large gene expression datasets and can analyze all 44 tissues present in the GTEx data.

In our Bioinformatics paper, we use simulated datasets to show that RECOV correctly controls the false positive rate. When applied to real multi-tissue expression data from the GTEx dataset, RECOV detects 3% more eGenes than previous methods. RECOV is a general framework for meta-analysis that can be used with any covariance matrix. We hope this software will be used by other researchers in the scientific community!

RECOV was developed by Dat Duong. The source code for RECOV is freely available at:

Our paper can be downloaded at Bioinformatics:


The full reference for our paper is:
Duong, D., Gai, L., Snir, S., Kang, E.Y., Han, B., Sul, J.H. and Eskin, E., 2017. Applying meta-analysis to Genotype-Tissue Expression data from multiple tissues to identify eQTLs and increase the number of eGenes. Bioinformatics, 33(14), pp.i67-i74.