Applying meta-analysis to genotype-tissue expression data from multiple tissues to identify eQTLs and increase the number of eGenes

Dat Duong, a graduate student in our lab, developed a novel method that will help find more eQTLs and eGenes in gene expression data from many tissues. A paper presenting his method is published in an upcoming issue of Bioinformatics.

Genome-wide association studies (GWAS) seek links between single-nucleotide polymorphisms (SNPs) and traits or diseases. SNPs are the most commonly occurring sources of variation in the human genome. Many SNPs identified by GWAS are located in intergenic regions, stretches of DNA sequences located between genes. SNPs identified in these primarily noncoding regions often do not have an obvious relationship to the disease phenotype. Other lines of evidence, such as gene expression, are required to explore this relationship and learn about disease function.

Gene expression, an intermediate phenotype between a causal SNP and a disease, can be used to interpret positive results produced by a GWAS. Common data types include expression quantitative trait loci (eQTLs), genetic variants associated with gene expression in particular tissue types, and eGenes, genes whose expression levels are associated with genetic variants. Both eQTL studies and GWAS focus on SNPs, but eQTL studies may provide biological insights into the disease development mechanism. For this reason, we pay special attention to the variants that are eQTLs or eGenes and have strong association signals identified by GWAS.

Multi-tissue gene expression datasets like the Genotype-Tissue Expression (GTEx) data are used to find eQTLs and eGenes. However, these datasets have small sample sizes in some tissues. Many meta-analysis methods have been designed to increase power for finding eQTLs and eGenes by combining gene expression data across many tissues. Yet these techniques cannot scale to datasets containing many tissue types, like the GTEx data. Such methods also ignore a biological principle that the same variant may be associated with the same gene across similar tissues.
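As a rough illustration of what "combining data across tissues" means, here is a minimal Python sketch of the classical fixed-effects (inverse-variance) meta-analysis for one SNP-gene pair. This is a textbook baseline, not RECOV and not any specific published multi-tissue method; the function name and inputs are hypothetical, and it assumes the per-tissue estimates are independent.

```python
import math


def fixed_effects_meta(betas, std_errs):
    """Combine per-tissue effect estimates for one SNP-gene pair.

    betas    -- estimated effect of the SNP on expression in each tissue
    std_errs -- standard error of each estimate (same order as betas)

    Assumption: estimates are independent across tissues, which is a
    simplification for real multi-tissue data.
    """
    if not betas or len(betas) != len(std_errs):
        raise ValueError("betas and std_errs must be non-empty and equal length")

    # Each tissue is weighted by the inverse of its variance.
    weights = [1.0 / (se * se) for se in std_errs]
    total_weight = sum(weights)

    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    std_err = math.sqrt(1.0 / total_weight)
    z = beta / std_err

    # Two-sided p-value under a standard normal null.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, std_err, z, p_value


# Two tissues with the same modest effect: the combined test is stronger
# than either tissue alone.
beta, std_err, z, p_value = fixed_effects_meta([0.2, 0.2], [0.1, 0.1])
_, _, _, p_single = fixed_effects_meta([0.2], [0.1])
print(f"combined p = {p_value:.4f}, single-tissue p = {p_single:.4f}")
```

The gain in power comes from the shrinking standard error as tissues are added, which is why pooling helps when individual tissues have small sample sizes.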

 

Venn diagram of the numbers of eGenes found by existing methods and RECOV, along with correlation matrices comparing methods. For more information, read our full paper.

To leverage the analytical power of eQTLs and eGenes in association studies, Duong and his team developed a new meta-analysis method named RECOV. Based on the principle that a SNP may have a similar effect on the same gene in related tissues, RECOV can be applied to large gene expression datasets and can analyze all 44 tissues present in the GTEx data.

In our Bioinformatics paper, we use simulated datasets to show that RECOV has a correct false positive rate. When applied to real multi-tissue expression data from the GTEx dataset, RECOV detects 3% more eGenes than previous methods. RECOV is a general framework for meta-analysis that can be used with any covariance matrix. We hope this software will be used by other researchers in the scientific community!

RECOV was developed by Dat Duong. The source code for RECOV is freely available at: https://github.com/datduong/RECOV.

Our paper can be downloaded at Bioinformatics: https://academic.oup.com/bioinformatics/article/33/14/i67/3953939/Applying-meta-analysis-to-genotype-tissue

 

The full reference for our paper is:
Duong, D., Gai, L., Snir, S., Kang, E.Y., Han, B., Sul, J.H. and Eskin, E., 2017. Applying meta-analysis to Genotype-Tissue Expression data from multiple tissues to identify eQTLs and increase the number of eGenes. Bioinformatics, 33(14), pp.i67-i74.

IPED2: Inheritance Path based Pedigree Reconstruction Algorithm for Complicated Pedigrees

Our group, in an effort led by former UCLA PhD student Dan He, developed an algorithm for reconstructing pedigrees with genotype data. This novel approach is presented in a paper recently published in IEEE/ACM Transactions on Computational Biology and Bioinformatics.

Pedigree inference plays an important role in population genetics. Pedigrees, commonly known as family trees, represent genetic relationships between individuals of a family. A pedigree diagram provides a model to compute the inheritance probability for the observed genotype and encodes all possible inheritance options for an allele in an individual. Pedigree reconstruction methods face several challenges. First, there can be an exponential number of possible pedigree graphs, and, second, the number of unknown ancestors can become very large as the height of the pedigree increases.

Examples of sequentially labeling the half-sibling graph. For more information, see our paper.

Our project uses genotype data to reconstruct pedigrees with computational efficiency despite these challenges. Our previous method, IPED, is the only known algorithm scalable to large pedigrees with reasonable accuracy for cases involving both outbreeding and inbreeding. IPED starts from extant individuals and reconstructs the pedigree generation by generation backwards in time. For each generation, IPED predicts the pairwise relationships between the individuals at the current generation and creates parents for them according to their relationships.
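To make the "create parents according to pairwise relationships" step concrete, here is a toy Python sketch of a single backward step. It is not the IPED implementation: it assumes the pairwise full-sibling calls are already given, ignores inbreeding, and uses hypothetical names throughout. Individuals linked by sibling calls are grouped with a small union-find, and each group receives one new pair of parents.

```python
def reconstruct_one_generation(individuals, sibling_pairs):
    """Toy backward step: group predicted full siblings, then create parents.

    individuals   -- list of individual IDs in the current generation
    sibling_pairs -- iterable of (a, b) pairs predicted to be full siblings
                     (assumed given; IPED infers relationships from genotypes)

    Returns (parents_of, new_generation).
    """
    root = {ind: ind for ind in individuals}

    def find(x):
        # Follow links to the group representative, compressing the path.
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    # Merge the groups of every predicted sibling pair.
    for a, b in sibling_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            root[rb] = ra

    # Collect the members of each sibling group.
    groups = {}
    for ind in individuals:
        groups.setdefault(find(ind), []).append(ind)

    # Create one hypothetical father and mother per sibling group.
    parents_of = {}
    new_generation = []
    for k, members in enumerate(groups.values()):
        father, mother = f"F{k}", f"M{k}"
        new_generation.extend([father, mother])
        for member in members:
            parents_of[member] = (father, mother)
    return parents_of, new_generation


parents_of, new_generation = reconstruct_one_generation(
    ["a", "b", "c", "d"], [("a", "b")]
)
print(parents_of)
print(new_generation)
```

Repeating this step on `new_generation` would walk the pedigree further back in time, provided relationships among the newly created ancestors could be predicted, which is the hard part this sketch leaves out.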

Existing methods, including IPED, only consider pedigrees with simple structure; they cannot handle populations where, for example, two children share only one parent. To improve pedigree reconstruction when populations have complex structure, we proposed the novel method IPED2. Our approach uses a new statistical test to detect half-sibling relationships and a new graph-based algorithm to reconstruct the pedigree when half-siblings are allowed.
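For intuition about why half-siblings are detectable at all, the toy Python sketch below classifies a pair by the fraction of the genome they share identical by descent (IBD). The expected values are standard (about 1/2 for full siblings, 1/4 for half-siblings, 0 for unrelated individuals), but the nearest-expectation rule and all names here are our own simplification, not the statistical test used in IPED2.

```python
# Expected fraction of the genome shared identical by descent (IBD).
EXPECTED_IBD = {
    "unrelated": 0.0,
    "half-siblings": 0.25,
    "full siblings": 0.5,
}


def classify_pair(ibd_fraction):
    """Return the relationship whose expected IBD sharing is closest.

    ibd_fraction -- estimated IBD sharing for a pair, between 0 and 1
                    (assumed to be computed elsewhere from genotype data)
    """
    if not 0.0 <= ibd_fraction <= 1.0:
        raise ValueError("ibd_fraction must be between 0 and 1")
    return min(EXPECTED_IBD, key=lambda label: abs(EXPECTED_IBD[label] - ibd_fraction))


for value in (0.03, 0.27, 0.48):
    print(value, "->", classify_pair(value))
```

A real test has to account for the variance of IBD sharing around these expectations, which is why a dedicated statistical test is needed rather than a fixed cutoff.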

In order to test the performance of our method on complicated pedigrees, we use simulated pedigrees with different parameter settings and, instead of genotype data, we simulate haplotypes directly. Our experiments show that IPED2 outperforms IPED and two other existing approaches for cases where there are half-siblings.

To our knowledge, this is the first method that can, using just genotype data, reconstruct pedigrees with half-siblings and inbreeding. IPED2 is also scalable to large pedigrees. In future work, we would like to consider additional genetic actions, such as insertion, deletion, and replacement, to resolve the conflicts. We also plan to refine IPED2 to consider cases where genotypes of ancestral individuals are known and where genotypes of extant individuals that are not on the lowest generations are known.

For more information, see our paper, which is available for download through IEEE Xplore: http://ieeexplore.ieee.org/abstract/document/7888513/.

In addition, the open source implementation of IPED2, which was developed by Dan He, is freely available for download at http://genetics.cs.ucla.edu/Dan/Software/IPED2.html.

The full citation to our paper is:
He, D., Wang, Z., Parida, L. and Eskin, E., 2017. IPED2: Inheritance path based pedigree reconstruction algorithm for complicated pedigrees. IEEE/ACM Transactions on Computational Biology and Bioinformatics.

Characterization of Expression Quantitative Trait Loci in Pedigrees from Colombia and Costa Rica Ascertained for Bipolar Disorder

Variants regulating gene expression (expression quantitative trait loci, eQTL) occur at a high frequency among SNPs associated with complex traits. Genome-wide characterization of gene expression is an important tool in genetic mapping studies of complex disorders, including many psychiatric disorders. Further, linking eQTL to specific tissue types is key to understanding functional variation in disease development. Our group, in collaboration with Chiara Sabatti (Statistics, Stanford) and Nelson B. Freimer (David Geffen School of Medicine, UCLA), developed a novel approach for analyzing eQTL and applied the method to a dataset from a bipolar disorder study.

Current approaches to implicating eQTL specific to tissues lack sufficient power in large-scale studies of human brain related traits, such as bipolar disorder. Together with the University of California San Francisco, Universidad de Costa Rica, Universidad de Antioquia, Medellín, Colombia, and Tel Aviv University, our group adopted a novel approach to assess the heritability and genetic regulation of gene expression related to bipolar disorder in populations from Costa Rica and Colombia.

This project examines 786 genotyped subjects originally recruited in a study of bipolar disorder, all related within 26 extended families. While the subjects in this study were originally recruited as part of an investigation for severe bipolar disorder (BP1), we found no relationship between the observed gene expression data and BP1. Instead, we use this unique Latin American population to explore the architecture of genetic regulation. Specifically, we estimate heritability, evaluate the relative importance of local vs. distal genomic variation, identify variants with regulatory effects, and analyze the role of multiple associated SNPs in the same region.

Our group adopted a novel hierarchical testing procedure that leads to the analysis of eQTL data in a stage-wise manner with increasing levels of detail. This design allows us to compare estimates of the heritability of gene expression obtained using both traditional and genotype-based methods. First, we apply a multiscale testing strategy to identify SNPs that have regulatory effects on gene expression (eSNPs). Second, we investigate which specific probes are influenced by these eSNPs. This hierarchical testing procedure effectively controls error rates and leverages the heterogeneity across genetic variants to preserve computational power.
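The two-stage idea can be sketched in Python as follows. This is one common way to build a hierarchical test (Simes p-values per SNP, Benjamini-Hochberg across SNPs, then Benjamini-Hochberg within each selected SNP at a level scaled by the fraction of SNPs selected); we present it as an assumption-laden illustration, not as the exact procedure from the paper. All function names and the input layout are hypothetical.

```python
def simes_pvalue(pvalues):
    """Simes combination: one p-value summarizing a family of tests."""
    ordered = sorted(pvalues)
    m = len(ordered)
    return min(1.0, min(m * p / rank for rank, p in enumerate(ordered, start=1)))


def bh_rejections(pvalues, q):
    """Benjamini-Hochberg at level q. pvalues maps a name to a p-value."""
    ordered = sorted(pvalues.items(), key=lambda item: item[1])
    m = len(ordered)
    cutoff = 0
    for rank, (_, p) in enumerate(ordered, start=1):
        if p <= q * rank / m:
            cutoff = rank  # keep the largest rank that passes
    return {name for name, _ in ordered[:cutoff]}


def hierarchical_eqtl_test(pvalues_by_snp, q=0.05):
    """Stage 1: select eSNPs. Stage 2: find the probes each eSNP affects.

    pvalues_by_snp -- {snp: {probe: p-value}} (hypothetical input layout)
    """
    if not pvalues_by_snp:
        return {}

    # Stage 1: one Simes p-value per SNP, then BH across SNPs.
    snp_pvalues = {
        snp: simes_pvalue(probes.values()) for snp, probes in pvalues_by_snp.items()
    }
    selected = bh_rejections(snp_pvalues, q)

    # Stage 2: BH within each selected SNP, at a level shrunk by the
    # fraction of SNPs that survived stage 1.
    q_within = q * len(selected) / len(pvalues_by_snp)
    return {
        snp: sorted(bh_rejections(pvalues_by_snp[snp], q_within))
        for snp in sorted(selected)
    }


example = {
    "snp1": {"probeA": 0.0001, "probeB": 0.5},
    "snp2": {"probeA": 0.4, "probeB": 0.9},
    "snp3": {"probeA": 0.3, "probeB": 0.6},
}
print(hierarchical_eqtl_test(example))
```

Testing SNPs first and probes second means most SNP-probe pairs are never examined individually, which keeps the multiple-testing burden manageable.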

We apply this approach to gene expression measured in lymphoblastoid cell lines (LCLs) from subjects in extended families segregating for BP1. Our results suggest that variation in expression values is heritable and that, at least in samples including related individuals, relying on theoretical kinship coefficients or on realized genotype correlation for estimation of heritability leads to similar results.

Expression heritability and proportion of genetic variance due to local effects. For more information, see our paper.

Variance decomposition approaches suggest that on average 30% of the genetic variance is due to local regulation. In the majority of probes under local regulation in our sample, more than one typed SNP is required to account for expression variation. This finding can be interpreted as the result of heterogeneity, but also could reflect un-typed causal variants that are tracked by more than one typed SNP.
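For readers less familiar with variance decomposition, the quantities discussed here can be written as below. The notation is ours and assumes a standard additive split of expression variance into local genetic, distal genetic, and residual components; the model in the paper may include further terms.

```latex
h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2},
\qquad
\sigma_g^2 = \sigma_{\text{local}}^2 + \sigma_{\text{distal}}^2,
\qquad
\frac{\sigma_{\text{local}}^2}{\sigma_g^2} \approx 0.30 \quad \text{(on average across probes)}
```

Here $h^2$ is the heritability of a probe's expression, $\sigma_g^2$ the total genetic variance, and $\sigma_e^2$ the residual variance.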

The knowledge we acquired by studying the genetic regulatory network within these pedigrees can instead be used to inform our mapping studies: eSNPs might receive a higher prior probability of association, or be assigned a larger portion of the allowed global error rate when using a weighted approach to testing. We will report elsewhere on the results of these investigations.

For more information, see our paper, which is available for download through PLoS Genetics: http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1006046.

The full citation to our paper is: 

Peterson, C.B., Jasinska, A.J., Gao, F., Zelaya, I., Teshiba, T.M., Bearden, C.E., Cantor, R.M., Reus, V.I., Macaya, G., López-Jaramillo, C. and Bogomolov, M., 2016. Characterization of Expression Quantitative Trait Loci in Pedigrees from Colombia and Costa Rica Ascertained for Bipolar Disorder. PLoS Genet, 12(5), p.e1006046.