Our group developed a novel method for detecting eGenes, the genes whose expression levels are associated with variation at a particular genetic variant. Identification of eGenes is increasingly important to studies of expression quantitative trait loci (eQTLs), the genetic variants that affect gene expression. Mapped eGenes help guide eQTL studies of complex human disease. However, standard approaches cannot efficiently detect these complex features in today’s large genomic datasets. Func-eGene, which we describe and test in a recent Bioinformatics paper, significantly increases the statistical power of existing association study methods and detects more eGenes in comparison to standard approaches.
Standard statistical methods for classifying a gene as an eGene first perform association testing at all variants near the gene of interest, then use a permutation test to conduct multiple-testing correction for results. The permutation test effectively corrects for potential biases introduced by multiple testing and obtains a p value for each gene. However, the permutation test is computationally inefficient when processing the increasingly large sample sizes of today’s eQTL datasets and has become a computational bottleneck in eQTL studies.
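The permutation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the paper: the function names (`pearson`, `max_abs_corr`, `permutation_pvalue`), the choice of the maximum absolute correlation as the gene-level statistic, and the simulated data are all assumptions made for the example.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def max_abs_corr(genotypes, expression):
    """Gene-level statistic (assumed here): strongest association over all nearby variants."""
    return max(abs(pearson(variant, expression)) for variant in genotypes)


def permutation_pvalue(genotypes, expression, n_perm=1000, seed=0):
    """Gene-level p-value: how often shuffled expression beats the observed statistic."""
    rng = random.Random(seed)
    observed = max_abs_corr(genotypes, expression)
    shuffled = list(expression)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the genotype-expression link
        if max_abs_corr(genotypes, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


# Simulated toy data: 3 variants, 30 individuals, variant 0 drives expression.
sim = random.Random(1)
genotypes = [[sim.choice([0, 1, 2]) for _ in range(30)] for _ in range(3)]
expression = [2.0 * g + sim.gauss(0, 0.5) for g in genotypes[0]]
p_value = permutation_pvalue(genotypes, expression, n_perm=200)
print(p_value)
```

Every permutation repeats the association test at every variant, so the cost grows with the number of permutations, variants, and samples. This is the bottleneck the paragraph above refers to.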
Our new approach, Func-eGene, incorporates genomic annotation of variants to improve the statistical power of eQTL studies. Variants located near gene transcription start sites (TSSs), or near some histone modifications, often regulate gene expression. Standard approaches do not consider genomic annotations, but we found that annotation of these variants can help locate and associate more causal variants using less time and computing power. To do this, we expand upon the standard multithreshold association test, which specifies a different significance threshold for each variant when correcting for multiple testing. Func-eGene increases power by assigning less stringent significance thresholds to variants that are likely to contribute to gene expression.
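To make the idea of per-variant thresholds concrete, here is a small sketch that uses a weighted Bonferroni correction as a stand-in. It is a simplification and not the multithreshold test from the paper. The function names, the weights, and the p-values are hypothetical. Each variant receives a share of the overall error budget `alpha` in proportion to its prior weight, so a variant with a larger weight (for example, one close to a TSS) gets a p-value cutoff that is easier to pass.

```python
def weighted_thresholds(weights, alpha=0.05):
    """Split the error budget alpha across variants in proportion to prior weights.

    The thresholds sum to alpha, so the union bound keeps the gene-level
    false-positive rate at or below alpha (weighted Bonferroni, an assumption
    made for this sketch rather than the paper's procedure).
    """
    total = sum(weights)
    return [alpha * w / total for w in weights]


def is_egene(pvalues, weights, alpha=0.05):
    """Call the gene an eGene if any variant passes its own threshold."""
    thresholds = weighted_thresholds(weights, alpha)
    return any(p <= t for p, t in zip(pvalues, thresholds))


# Hypothetical per-variant p-values for one gene; variant 0 sits near the TSS.
pvalues = [0.02, 0.30, 0.50, 0.40]

uniform_call = is_egene(pvalues, [1, 1, 1, 1])    # every cutoff is 0.0125
annotated_call = is_egene(pvalues, [5, 1, 1, 1])  # variant 0 cutoff is 0.03125
print(uniform_call, annotated_call)
```

With uniform weights the gene is missed, while up-weighting the variant near the TSS recovers it at the same overall error budget. The gain depends on the weights pointing at the right variants, which is why the choice of prior matters.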
However, this association test still depends on the time-consuming permutation test and requires a known prior, based on annotation, for the genetic variants. Func-eGene addresses both difficulties. To reduce runtime, we replace the permutation test with the multivariate normal (Mvn) sampling procedure described in Sul et al. (2015). To find an appropriate prior, we run a grid search over possible sets of scores assigned to annotation categories. Func-eGene seeks the set of scores that maximizes the number of eGenes and uses a cross-validation strategy to avoid data re-use and over-fitting. There are therefore two ways to apply Func-eGene to eQTL data: Permutation Func-eGene uses the traditional permutation test to calculate the null density of the observed statistic, whereas Mvn Func-eGene relies on the Mvn-sampling procedure.
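The intuition behind replacing permutations with sampling can be sketched as follows. The sketch assumes that, under the null hypothesis, the association z-scores at the variants near a gene jointly follow a multivariate normal distribution with mean zero and covariance equal to the correlation between variants. It is an illustration of that idea only, not the procedure of Sul et al. (2015). The function names and the small correlation matrix are hypothetical, and the Cholesky step assumes the matrix is positive definite.

```python
import random
from math import sqrt


def cholesky(matrix):
    """Lower-triangular factor L with L * L^T = matrix (matrix must be positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def mvn_pvalue(observed_max_z, correlation, n_samples=5000, seed=0):
    """Gene-level p-value from sampled null z-scores instead of permuted data."""
    rng = random.Random(seed)
    lower = cholesky(correlation)
    m = len(correlation)
    hits = 0
    for _ in range(n_samples):
        independent = [rng.gauss(0.0, 1.0) for _ in range(m)]
        # Correlate the independent draws so they mimic null z-scores.
        null_z = [sum(lower[i][k] * independent[k] for k in range(i + 1))
                  for i in range(m)]
        if max(abs(z) for z in null_z) >= observed_max_z:
            hits += 1
    return (hits + 1) / (n_samples + 1)


# Hypothetical correlation between three nearby variants.
correlation = [[1.0, 0.8, 0.1],
               [0.8, 1.0, 0.2],
               [0.1, 0.2, 1.0]]
p_strong = mvn_pvalue(4.0, correlation)  # large observed statistic
p_weak = mvn_pvalue(0.5, correlation)    # small observed statistic
print(p_strong, p_weak)
```

Each null draw costs a single matrix-vector product over the variants and does not touch the individual-level data, so the work per draw does not grow with the number of samples in the study.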
We applied our method to the liver Genotype-Tissue Expression (GTEx) dataset. We used the following genomic annotations of variants: distance from TSSs, DNase hypersensitivity sites, and six histone modifications. Notably, the distance-from-TSS annotation detected the highest number of candidate eGenes; using this annotation, our new method discovered 50% more candidate eGenes than the standard permutation method. Our simulations show that Func-eGene successfully controls the rate of false-positive associations when using either the permutation or the Mvn procedure. However, implementing Func-eGene with a traditional permutation test is inefficient; Mvn sampling yields the same results with a considerably faster runtime.
This project was led by Dat Duong and involved Jennifer Zou, Farhad Hormozdiari, and Jae Hoon Sul. The article is available at: http://bioinformatics.oxfordjournals.org/content/32/12/i156.abstract
FUNC-eGene was developed by Dat Duong and is available for download at: https://github.com/datduong/FUNC-eGene