Characterization of Expression Quantitative Trait Loci in Pedigrees from Colombia and Costa Rica Ascertained for Bipolar Disorder

Variants that regulate gene expression (expression quantitative trait loci, eQTL) occur at high frequency among SNPs associated with complex traits. Genome-wide characterization of gene expression is an important tool in genetic mapping studies of complex disorders, including many psychiatric disorders. Further, linking eQTL to specific tissue types is key to understanding functional variation in disease development. Our group, in collaboration with Chiara Sabatti (Statistics, Stanford) and Nelson B. Freimer (David Geffen School of Medicine, UCLA), developed a novel approach for analyzing eQTL and applied the method to a dataset from a bipolar disorder study.

Current approaches to identifying tissue-specific eQTL lack sufficient power in large-scale studies of human brain-related traits, such as bipolar disorder. Together with the University of California San Francisco, Universidad de Costa Rica, Universidad de Antioquia, Medellín, Colombia, and Tel Aviv University, our group adopted a novel approach to assess the heritability and genetic regulation of gene expression related to bipolar disorder in populations from Costa Rica and Colombia.

This project examines 786 genotyped subjects, all related within 26 extended families. Although the subjects were originally recruited as part of an investigation of severe bipolar disorder (BP1), we found no relationship between the observed gene expression data and BP1. Instead, we use this unique Latin American population to explore the architecture of genetic regulation. Specifically, we estimate heritability, evaluate the relative importance of local vs. distal genomic variation, identify variants with regulatory effects, and analyze the role of multiple associated SNPs in the same region.

Our group adopted a novel hierarchical testing procedure that analyzes eQTL data in a stage-wise manner with increasing levels of detail. This design allows us to compare estimates of the heritability of gene expression obtained using both traditional and genotype-based methods. First, we apply a multiscale testing strategy to identify SNPs that have regulatory effects on gene expression (eSNPs). Second, we investigate which specific probes are influenced by these eSNPs. This hierarchical testing procedure effectively controls error rates and leverages the heterogeneity across genetic variants to preserve statistical power.
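
The two stages described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure from the paper: the choice of a Simes combination p-value per SNP, Benjamini-Hochberg selection across SNPs, and the scaled within-SNP level (q times the fraction of SNPs selected) are assumptions made here, and the function names and the input layout (a dictionary mapping each SNP to its per-probe p-values) are hypothetical.

```python
def simes(pvals):
    """Simes combination p-value for one SNP's family of probe-level tests."""
    m = len(pvals)
    return min(p * m / rank for rank, p in enumerate(sorted(pvals), start=1))


def bh_reject(pvals, q):
    """Benjamini-Hochberg step-up: indices of hypotheses rejected at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    n_reject = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            n_reject = rank  # keep the largest rank that passes
    return set(order[:n_reject])


def hierarchical_eqtl(pvals_by_snp, q=0.05):
    """Stage 1: select eSNPs. Stage 2: find the probes each eSNP influences.

    pvals_by_snp is a hypothetical layout: {snp: {probe: p_value}}.
    """
    snps = list(pvals_by_snp)
    # Stage 1: one combined p-value per SNP, then BH across SNPs.
    snp_level = [simes(list(pvals_by_snp[s].values())) for s in snps]
    selected = bh_reject(snp_level, q)
    # Stage 2: BH within each selected SNP at a level scaled by the
    # fraction of SNPs that passed stage 1 (an assumption of this sketch).
    q_within = q * len(selected) / len(snps)
    result = {}
    for idx in sorted(selected):
        snp = snps[idx]
        probes = list(pvals_by_snp[snp])
        hits = bh_reject([pvals_by_snp[snp][p] for p in probes], q_within)
        result[snp] = [probes[j] for j in sorted(hits)]
    return result
```

Testing SNPs first and probes second means the expensive probe-level multiplicity adjustment is only paid for SNPs that show some evidence of regulatory activity.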

We use this approach to measure gene expression in lymphoblastoid cell lines (LCLs) in subjects from extended families, segregating for BP1. Our results suggest that variation in expression values is heritable and that, at least in samples including related individuals, relying on theoretical kinship coefficients or on realized genotype correlation for estimation of heritability leads to similar results.
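
To make the comparison concrete, here is a minimal sketch of a relatedness-based heritability estimate. It uses Haseman-Elston-style regression, which is a simplification chosen for brevity and is not claimed to be the estimator used in the paper. The function name is hypothetical. The relatedness matrix can hold either twice the theoretical kinship coefficient from the pedigree or a realized genotype correlation, so the same function can be run on both inputs and the results compared.

```python
from itertools import combinations
from statistics import mean, pstdev


def he_heritability(phenotype, relatedness):
    """Haseman-Elston-style estimate of heritability for one expression probe.

    phenotype:   list of expression values, one per individual.
    relatedness: square matrix (list of lists); entry [i][j] is the expected
                 (2 x kinship) or realized genetic relatedness of i and j.
    """
    mu, sd = mean(phenotype), pstdev(phenotype)
    if sd == 0:
        raise ValueError("phenotype has no variance")
    z = [(y - mu) / sd for y in phenotype]

    # For standardized phenotypes, E[z_i * z_j] = h^2 * relatedness[i][j],
    # so h^2 is the slope of a regression through the origin over all pairs.
    sxy = 0.0
    sxx = 0.0
    for i, j in combinations(range(len(z)), 2):
        x = relatedness[i][j]
        sxy += x * z[i] * z[j]
        sxx += x * x
    if sxx == 0:
        raise ValueError("no related pairs in the sample")
    return sxy / sxx
```

On small samples the estimate can fall outside the interval from 0 to 1, which is one reason variance-component models are usually preferred in practice.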

Expression heritability and proportion of genetic variance due to local effects. For more information, see our paper.

Variance decomposition approaches suggest that on average 30% of the genetic variance is due to local regulation. In the majority of probes under local regulation in our sample, more than one typed SNP is required to account for expression variation. This finding can be interpreted as the result of heterogeneity, but also could reflect un-typed causal variants that are tracked by more than one typed SNP.
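
One common way to write such a decomposition is shown below. The notation is an assumption made here for illustration (the symbols K_local and K_distal for relatedness matrices built from local and distal variants, and pi_local for the local proportion, do not come from the original text), and the block assumes the amsmath package.

```latex
\begin{align}
\operatorname{Var}(y) &= \sigma^2_{\text{local}} K_{\text{local}}
                       + \sigma^2_{\text{distal}} K_{\text{distal}}
                       + \sigma^2_{e} I \\
h^2 &= \frac{\sigma^2_{\text{local}} + \sigma^2_{\text{distal}}}
            {\sigma^2_{\text{local}} + \sigma^2_{\text{distal}} + \sigma^2_{e}} \\
\pi_{\text{local}} &= \frac{\sigma^2_{\text{local}}}
                           {\sigma^2_{\text{local}} + \sigma^2_{\text{distal}}}
\end{align}
```

In this notation, the average reported above corresponds to a value of pi_local of about 0.30 across probes.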

The knowledge we acquired by studying the genetic regulatory network within these pedigrees can instead be used to inform our mapping studies: eSNPs might receive a higher prior probability of association, or be assigned a larger portion of the allowed global error rate when using a weighted approach to testing. We will report elsewhere on the results of these investigations.
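
As an illustration of the weighted approach, the sketch below implements a weighted Bonferroni test in which eSNPs receive a larger share of the global error rate. It is a minimal example under stated assumptions: the function name, the is_esnp flags, and the default weight of 4.0 are hypothetical choices made here, not values from the study.

```python
def weighted_bonferroni(pvals, is_esnp, alpha=0.05, esnp_weight=4.0):
    """Weighted Bonferroni test that up-weights eSNPs.

    pvals:       association p-values, one per SNP.
    is_esnp:     booleans marking SNPs previously identified as eSNPs.
    esnp_weight: relative weight of an eSNP versus any other SNP
                 (an arbitrary illustrative value).
    Returns a list of booleans: True where the SNP is declared significant.
    """
    m = len(pvals)
    raw = [esnp_weight if flag else 1.0 for flag in is_esnp]
    # Rescale so the weights average to 1; the per-SNP thresholds then sum
    # to alpha, which keeps the family-wise error rate at most alpha.
    scale = m / sum(raw)
    weights = [w * scale for w in raw]
    return [p <= alpha * w / m for p, w in zip(pvals, weights)]
```

With four SNPs, one of them an eSNP, and the default weight, the eSNP is tested at roughly 0.029 and the others at roughly 0.007, compared with 0.0125 for every SNP under an unweighted correction.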

For more information, see our paper, which is available for download through PLoS Genetics: http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1006046.

The full citation to our paper is: 

Peterson, C.B., Jasinska, A.J., Gao, F., Zelaya, I., Teshiba, T.M., Bearden, C.E., Cantor, R.M., Reus, V.I., Macaya, G., López-Jaramillo, C. and Bogomolov, M., 2016. Characterization of Expression Quantitative Trait Loci in Pedigrees from Colombia and Costa Rica Ascertained for Bipolar Disorder. PLoS Genet, 12(5), p.e1006046.

 

Simultaneous modeling of disease status and clinical phenotypes to increase power in GWAS

Michael Bilow and Eleazar Eskin, together with Fernando Crespo, Zhicheng Pan, and Susana Eyheramendy, recently released a novel method for accurate joint modeling of clinical phenotype and disease status. This approach incorporates a clinical phenotype into case/control studies under the assumption that the genetic variant can affect both.

Genetic case-control association studies have found thousands of associations between genetic variants and disease. Most studies collect data from individuals with and without disease, and they often search for variants with different frequencies between the groups. Jointly modelling clinical phenotype and disease status is a promising way to increase power to detect true associations between genetics and disease. In particular, this method increases potential for discovering genetic variants that are associated with both a clinical phenotype and a disease.

However, standard multivariate techniques fail to solve this problem effectively because case-control status is discrete rather than continuous. In addition, standard approaches to estimating model parameters are biased by the ascertainment in case/control studies. We present a novel method that resolves both of these issues for simultaneous association testing of genetic variants in studies that record both case status and a clinical covariate.
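
The ascertainment problem can be seen in a small simulation. The sketch below is not the method from the paper. It assumes a liability threshold model in which a standard normal liability is correlated with the clinical phenotype and an individual is a case when liability exceeds a threshold set by the disease prevalence. All function names and parameter values are hypothetical.

```python
import random
from statistics import NormalDist, mean


def simulate_population(n, prevalence, rho, seed=0):
    """Draw (is_case, clinical_phenotype) pairs under a liability threshold model."""
    rng = random.Random(seed)
    threshold = NormalDist().inv_cdf(1 - prevalence)
    people = []
    for _ in range(n):
        liability = rng.gauss(0, 1)
        # The clinical phenotype has unit variance and correlation rho
        # with the unobserved liability.
        clinical = rho * liability + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        people.append((liability > threshold, clinical))
    return people


def ascertain(people, n_cases, n_controls, seed=0):
    """Case-control sampling: cases are over-represented relative to prevalence."""
    rng = random.Random(seed)
    cases = [p for p in people if p[0]]
    controls = [p for p in people if not p[0]]
    return rng.sample(cases, n_cases) + rng.sample(controls, n_controls)


def clinical_mean(sample):
    """Mean of the clinical phenotype in a sample of (is_case, clinical) pairs."""
    return mean(clinical for _, clinical in sample)
```

Comparing clinical_mean on the full population with clinical_mean on a balanced case-control sample shows the shift: when rho is positive, over-sampling cases pulls the clinical phenotype mean upward, so parameter estimates that ignore the sampling scheme are biased.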

In our paper, we show the utility of our method using data from the North Finland Birth Cohort (NFBC) dataset. NFBC enrolled almost everyone born in 1966 in Finland’s two most northern provinces. The NFBC dataset consists of 10 phenotypes and genotypes at 331,476 genetic variants measured in 5,327 individuals. We focus our study on the LDL cholesterol and triglyceride levels phenotypes.

Our evaluation strategy analyzes a subset of the NFBC data and compares what we discover there to what was discovered in the full NFBC dataset, which we treat as the gold standard. We compare the performance of our novel approach to three other methods: (1) the single univariate test applied to the disease status, (2) the multivariate approach applied to the disease status and the clinical phenotype modeled as a multivariate normal distribution, and (3) the liability threshold model treating the clinical phenotype as a covariate.

Using the univariate approach, the p-values are much weaker than those observed in the full NFBC dataset. Running the multivariate approaches, which incorporate the triglyceride levels phenotype, increased power (i.e., yielded more significant p-values at these SNPs).

Our method has the highest power in all scenarios. Its advantage is greater when selection bias is substantial than when it is low. Our method is even more powerful when the correlation between the clinical covariate and the disease liability is lower, because we explicitly estimate the underlying liability using all of the data.

For more information, see our paper in Genetics: http://www.genetics.org/content/early/2017/01/27/genetics.116.198473

The software implementing the methods described in this paper was developed by Fernando Crespo and is available at: http://genetics.cs.ucla.edu/multipheno/ and
https://github.com/facrespo/BivariateProbitContinueEM

An illustration of the distribution of liability in a case-control study under selection bias. For more information, read our paper.

The full citation to our paper is:
Bilow, M., Crespo, F., Pan, Z., Eskin, E. and Eyheramendy, S., 2017. Simultaneous Modeling of Disease Status and Clinical Phenotypes to Increase Power in GWAS. Genetics, pp.genetics-116.

 

Efficient and accurate multiple-phenotype regression method for high dimensional data considering population structure

Jong Wha (Joanne) Joo developed an approach to simultaneously analyze multiple phenotypes in a genome-wide association study (GWAS) dataset. She introduces this new methodology, referred to as GAMMA (Generalized Analysis of Molecular variance for Mixed model Analysis), in a paper recently published in Genetics.

GWASs have identified many genetic variants involved in traits and in the development of human diseases by testing for correlation between individual genotypes and a single phenotype, one phenotype at a time. Since the initial development of the standard GWAS approach, GWAS data collection has become larger in scale and higher in resolution. Today’s large-scale datasets include expression data and often contain thousands of phenotypes per individual. Performing the standard single-phenotype analysis on these datasets is slow and potentially fails to detect unmeasured aspects of complex biological networks.

Analyzing many phenotypes simultaneously increases the power to detect more variants and capture previously unmeasured aspects of the genome. However, standard GWAS approaches capable of simultaneously testing multiple phenotypes fail to account for the distorting effects of population structure, a phenomenon present in large cohorts that inevitably contain individuals sharing common ancestry from multiple populations. As a result, standard GWAS approaches either fail to detect true effects or produce many false positive identifications.

GAMMA is an efficient, robust approach capable of simultaneously analyzing many phenotypes while correcting for population structure. GAMMA uses a multiple regression technique to analyze many phenotypes simultaneously and the principles behind existing linear mixed models to correct for population structure.

Joanne’s paper presents the results of testing GAMMA for accuracy in three scenarios: a simulated dataset containing population structure, a yeast dataset containing many trans-regulatory hotspots, and a complex gut microbiome dataset. In the simulated study using data implanted with true population structure effects, GAMMA accurately identifies these true effects without producing false positives. In the test with yeast data, GAMMA successfully corrected for the bias of technical artifacts such as batch effects and identified significant signals on most of the putative hotspots. In the third test, Joanne and her team assess GAMMA’s ability to perform a multiple-phenotype analysis with microbiome data. Here, the results identified nine loci likely to have true biological mechanisms in the taxa.

In each scenario, the results of GAMMA were compared to those of the standard t-test, EMMA, and MDMR. The standard t-test and EMMA failed to identify true variants, because the phenotypic effects in each example are smaller than the amount these methods are powered to detect. MDMR produced no significant signals in the yeast dataset and identified many false associations in the simulated and gut microbiome datasets. Both GAMMA and MDMR have sufficient power to detect small association signals in these complex datasets, but only GAMMA successfully corrects for population structure.

This project was led by Joanne Joo and involved Eun Yong Kang and Farhad Hormozdiari. The article is available at: http://www.genetics.org/content/204/4/1379.

GAMMA was developed by Joanne Joo, Eun Yong Kang, Elin Org, Nick Furlotte, Brian Parks, Aldons J. Lusis, and Eleazar Eskin. Visit the following page to download GAMMA: http://genetics.cs.ucla.edu/GAMMA/

The full citation to our paper is:

Joo, J.W.J., Kang, E.Y., Org, E., Furlotte, N., Parks, B., Hormozdiari, F., Lusis, A.J. and Eskin, E., 2016. Efficient and Accurate Multiple-Phenotype Regression Method for High Dimensional Data Considering Population Structure. Genetics, 204(4), pp.1379-1390. ISSN: 1943-2631.


The results of GAMMA and three standard GWAS methods applied to a simulated dataset. The x-axis shows SNP locations and the y-axis shows the -log10 p-value of associations between each SNP and all the genes. Blue arrows show the location of the true trans-regulatory hotspots.