Characterization of Expression Quantitative Trait Loci in Pedigrees from Colombia and Costa Rica Ascertained for Bipolar Disorder

Variants regulating gene expression (expression quantitative trait loci, eQTL) occur at a high frequency among SNPs associated with complex traits. Genome-wide characterization of gene expression is an important tool in genetic mapping studies of complex disorders, including many psychiatric disorders. Further, linking eQTL to specific tissue types is key to understanding functional variation in disease development. Our group, in collaboration with Chiara Sabatti (Statistics, Stanford) and Nelson B. Freimer (David Geffen School of Medicine, UCLA), developed a novel approach for analyzing eQTL and applied the method to a dataset from a bipolar disorder study.

Current approaches to implicating tissue-specific eQTL lack sufficient power in large-scale studies of human brain-related traits, such as bipolar disorder. Together with the University of California San Francisco, Universidad de Costa Rica, Universidad de Antioquia (Medellín, Colombia), and Tel Aviv University, our group adopted a novel approach to assess the heritability and genetic regulation of gene expression related to bipolar disorder in populations from Costa Rica and Colombia.

This project examines 786 genotyped subjects originally recruited in a study of bipolar disorder, all related within 26 extended families. While the subjects in this study were originally recruited as part of an investigation for severe bipolar disorder (BP1), we found no relationship between the observed gene expression data and BP1. Instead, we use this unique Latin American population to explore the architecture of genetic regulation. Specifically, we estimate heritability, evaluate the relative importance of local vs. distal genomic variation, identify variants with regulatory effects, and analyze the role of multiple associated SNPs in the same region.

Our group adopted a novel hierarchical testing procedure that analyzes eQTL data in a stage-wise manner with increasing levels of detail. The pedigree design also allows us to compare estimates of the heritability of gene expression obtained using both traditional and genotype-based methods. First, we apply a multiscale testing strategy to identify SNPs that have regulatory effects on gene expression (eSNPs). Second, we investigate which specific probes are influenced by these eSNPs. This hierarchical testing procedure effectively controls error rates and leverages the heterogeneity across genetic variants to preserve computational power.
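The two stages described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure from the paper: it assumes that each SNP is summarized by a Simes p-value across probes, that SNPs are selected with the Benjamini-Hochberg (BH) procedure, and that probes within each selected SNP are then tested with BH at a level scaled by the fraction of SNPs selected. The function names and the toy p-values are hypothetical.

```python
def simes(pvals):
    """Simes combination of the p-values for one SNP across all probes (assumed summary)."""
    ordered = sorted(pvals)
    n = len(ordered)
    return min(n * p / rank for rank, p in enumerate(ordered, start=1))


def bh_reject(pvals, q):
    """Indices rejected by the Benjamini-Hochberg procedure at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            cutoff = rank  # largest rank that passes its threshold
    return set(order[:cutoff])


def hierarchical_eqtl(pmat, q=0.05):
    """pmat maps SNP -> {probe: p-value}. Returns {eSNP: [associated probes]}."""
    snps = list(pmat)
    # Stage 1: one p-value per SNP, then select eSNPs.
    snp_p = [simes(list(pmat[s].values())) for s in snps]
    selected = bh_reject(snp_p, q)
    # Stage 2: test probes only within selected SNPs, at a level
    # reduced by the fraction of SNPs that were selected (assumption).
    q_within = q * len(selected) / len(snps)
    result = {}
    for i in sorted(selected):
        probes = list(pmat[snps[i]])
        rejected = bh_reject([pmat[snps[i]][p] for p in probes], q_within)
        result[snps[i]] = [probes[j] for j in sorted(rejected)]
    return result


# Toy input: hypothetical SNP and probe identifiers with made-up p-values.
toy = {
    "rs1": {"probeA": 1e-6, "probeB": 0.5, "probeC": 0.8},
    "rs2": {"probeA": 0.4, "probeB": 0.6, "probeC": 0.9},
}
esnps = hierarchical_eqtl(toy)
```

In this toy input only rs1 passes the first stage, and only probeA is then attributed to it, so most SNP-probe pairs are never tested individually.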

We use this approach to study gene expression measured in lymphoblastoid cell lines (LCLs) in subjects from extended families segregating for BP1. Our results suggest that variation in expression values is heritable and that, at least in samples including related individuals, relying on theoretical kinship coefficients or on realized genotype correlation for estimation of heritability leads to similar results.
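To make the "realized genotype correlation" concrete, here is a small sketch of one common way to compute it: standardize each SNP by its sample allele frequency and average the products across SNPs. The estimator, the function name, and the toy genotypes are assumptions for illustration; the paper may use a different estimator.

```python
def realized_relatedness(genotypes):
    """Genotype-based relatedness matrix.

    genotypes: list of individuals, each a list of allele counts (0, 1, 2) per SNP.
    Each SNP is centered at 2p and scaled by sqrt(2p(1-p)), where p is the
    sample allele frequency; monomorphic SNPs are skipped.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    standardized = [[] for _ in range(n)]
    for s in range(m):
        column = [ind[s] for ind in genotypes]
        p = sum(column) / (2 * n)
        if p == 0.0 or p == 1.0:
            continue  # no variation at this SNP
        scale = (2 * p * (1 - p)) ** 0.5
        for i in range(n):
            standardized[i].append((column[i] - 2 * p) / scale)
    used = len(standardized[0])
    if used == 0:
        raise ValueError("no polymorphic SNPs")
    return [
        [sum(a * b for a, b in zip(standardized[i], standardized[j])) / used
         for j in range(n)]
        for i in range(n)
    ]


# Toy data: four hypothetical individuals typed at two SNPs.
toy_genotypes = [[0, 2], [1, 1], [2, 0], [1, 1]]
G = realized_relatedness(toy_genotypes)
```

The pedigree-based counterpart is twice the kinship coefficient, which is 0.5 for full siblings and for parent-offspring pairs. With many SNPs, the off-diagonal entries of a matrix like `G` are expected to scatter around those theoretical values.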

Expression heritability and proportion of genetic variance due to local effects. For more information, see our paper.

Variance decomposition approaches suggest that on average 30% of the genetic variance is due to local regulation. In the majority of probes under local regulation in our sample, more than one typed SNP is required to account for expression variation. This finding can be interpreted as the result of heterogeneity, but also could reflect un-typed causal variants that are tracked by more than one typed SNP.
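As a sketch of what this decomposition means, the genetic variance of a probe can be split into a local and a distal component. The notation below is assumed here for illustration and is not taken from the paper.

```latex
\sigma^2_{g} = \sigma^2_{\text{local}} + \sigma^2_{\text{distal}}, \qquad
h^2 = \frac{\sigma^2_{g}}{\sigma^2_{g} + \sigma^2_{e}}, \qquad
\frac{\sigma^2_{\text{local}}}{\sigma^2_{g}} \approx 0.30 \ \text{(on average across probes)}
```

Here $\sigma^2_{e}$ is the residual (non-genetic) variance and $h^2$ is the heritability of expression for the probe.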

Although we found no relationship between expression and BP1, the knowledge we acquired by studying the genetic regulatory network within these pedigrees can be used to inform our mapping studies: eSNPs might receive a higher prior probability of association, or be assigned a larger portion of the allowed global error rate when using a weighted approach to testing. We will report elsewhere on the results of these investigations.

For more information, see our paper, which is available for download through PLoS Genetics:

The full citation to our paper is: 

Peterson, C.B., Jasinska, A.J., Gao, F., Zelaya, I., Teshiba, T.M., Bearden, C.E., Cantor, R.M., Reus, V.I., Macaya, G., López-Jaramillo, C. and Bogomolov, M., 2016. Characterization of Expression Quantitative Trait Loci in Pedigrees from Colombia and Costa Rica Ascertained for Bipolar Disorder. PLoS Genet, 12(5), p.e1006046.


Simultaneous modeling of disease status and clinical phenotypes to increase power in GWAS

Michael Bilow and Eleazar Eskin, together with Fernando Crespo, Zhicheng Pan, and Susana Eyheramendy, recently released a novel method for accurate joint modeling of clinical phenotype and disease status. This approach incorporates a clinical phenotype into case/control studies under the assumption that the genetic variant can affect both.

Genetic case-control association studies have found thousands of associations between genetic variants and disease. Most studies collect data from individuals with and without disease, and they often search for variants with different frequencies between the groups. Jointly modelling clinical phenotype and disease status is a promising way to increase power to detect true associations between genetics and disease. In particular, this method increases potential for discovering genetic variants that are associated with both a clinical phenotype and a disease.

However, standard multivariate techniques fail to effectively solve this problem because case-control status is discrete rather than continuous. In addition, standard approaches to estimating model parameters are biased due to the ascertainment in case/control studies. We present a novel method that resolves both of these issues for simultaneous association testing of genetic variants in studies that have both case status and a clinical covariate.

In our paper, we show the utility of our method using data from the Northern Finland Birth Cohort (NFBC) dataset. NFBC enrolled almost everyone born in 1966 in Finland's two northernmost provinces. The NFBC dataset consists of 10 phenotypes and genotypes at 331,476 genetic variants measured in 5,327 individuals. We focus our study on the LDL cholesterol and triglyceride level phenotypes.

Our evaluation strategy analyzes a subset of the NFBC data and compares what we discover here to what was discovered in the full NFBC dataset—which we treat as the gold standard. We compare the performance of our novel approach to three other methods: (1) the single univariate test applied to the disease status, (2) the multivariate approach applied to the disease status and the clinical phenotype modeled as a multivariate normal distribution, and (3) the liability threshold model treating the clinical phenotype as a covariate.

With the univariate approach, the p-values are much weaker than those observed in the full NFBC dataset. The multivariate approaches, which incorporate the triglyceride level phenotype, increased power (i.e., yielded more significant p-values for the associated SNPs).

Our method has the highest power in all scenarios. Its advantage is greater when selection bias is substantial than when it is modest. Our method is even more powerful when the correlation between the clinical covariate and the disease liability is lower, because we explicitly estimate the underlying liability using all of the data.
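The selection bias discussed above can be illustrated with a small simulation. This is a sketch under a simple liability threshold model and is not the method from the paper: the threshold, the correlation `rho` between liability and the clinical phenotype, the sample sizes, and the function names are all assumed values chosen for illustration.

```python
import random


def simulate_population(n, threshold=1.645, rho=0.5, seed=0):
    """Draw (liability, clinical phenotype, is_case) for n individuals.

    Liability is standard normal; an individual is a case when liability
    exceeds the threshold (roughly 5% prevalence at 1.645). The clinical
    phenotype is standard normal with correlation rho to the liability.
    """
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        liability = rng.gauss(0.0, 1.0)
        phenotype = rho * liability + (1 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        people.append((liability, phenotype, liability > threshold))
    return people


def ascertain(people, n_cases, n_controls, seed=1):
    """Case-control sampling: draw fixed numbers of cases and controls."""
    rng = random.Random(seed)
    cases = [p for p in people if p[2]]
    controls = [p for p in people if not p[2]]
    return rng.sample(cases, n_cases) + rng.sample(controls, n_controls)


def average(values):
    values = list(values)
    return sum(values) / len(values)


population = simulate_population(20000)
study = ascertain(population, n_cases=300, n_controls=300)

prevalence = average(1 if p[2] else 0 for p in population)
study_case_fraction = average(1 if p[2] else 0 for p in study)
population_phenotype_mean = average(p[1] for p in population)
study_phenotype_mean = average(p[1] for p in study)
```

Cases make up half of the ascertained study but only a few percent of the population, so the mean of the clinical phenotype in the study is shifted away from its population mean. Estimates that treat the study as a random sample inherit that shift.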

For more information, see our paper in Genetics:

The software implementing the methods described in this paper was developed by Fernando Crespo and is available at: and

An illustration of the distribution of liability in a case-control study under selection bias. For more information, read our paper.

The full citation to our paper is:
Bilow, M., Crespo, F., Pan, Z., Eskin, E. and Eyheramendy, S., 2017. Simultaneous Modeling of Disease Status and Clinical Phenotypes to Increase Power in GWAS. Genetics, pp.genetics-116.


Reference-free comparison of microbial communities via de Bruijn graphs

Microbial communities inhabiting the human body exhibit significant variability across different individuals and tissues and likely play an important role in health and disease. Serghei Mangul and David Koslicki (Oregon State University) recently published a paper presenting a novel approach for characterizing microbial communities in metatranscriptomics studies. Koslicki developed this tool, which may help scientists explore the role microbiota play in disease development, especially when comparing microbiomes of healthy and diseased subjects.

Identifying and characterizing the relative abundance of microbiota in different tissues is essential to better understanding the role of microbial communities in human health. Current approaches use reference databases to identify, classify, and compare microbial communities present in the individual host. However, existing databases are incomplete and rely on a limited compendium of reference genomes. Current reference-based approaches are unable to accurately determine microbial compositions to the extent that could be possible given the high resolution of data produced by today’s high throughput sequencing technology.

Framework of the study. For more information, download our paper.

Ideally, comparison of microbial communities across samples could circumvent this limiting classification step. Mangul and Koslicki recently developed EMDeBruijn, a reference-free approach that uses all available non-host microbial reads, not just those classified in reference databases, to compare microbial communities.

First, EMDeBruijn translates sequencing data to a de Bruijn graph, which represents overlaps between symbols in sequences. De Bruijn graphs are commonly used in de novo assembly of short read sequences to a genome, but have not yet been applied in a reference-free approach. EMDeBruijn then uses properties of the de Bruijn graph to compare microbiome composition across individuals. The comparison is summarized with the Earth Mover's Distance (EMD), a statistic that measures the distance between two probability distributions over a region.
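The idea can be sketched in plain Python for small k. This is an illustration only, not the EMDeBruijn implementation: it assumes that the ground distance between two k-mers is the shortest-path length in the undirected de Bruijn graph, and it computes the EMD between two k-mer frequency profiles with a basic successive-shortest-path minimum-cost flow. All function names are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_counts(reads, k, alphabet="ACGT"):
    """Count k-mers in the reads, skipping k-mers with letters outside the alphabet."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if all(c in alphabet for c in kmer):
                counts[kmer] += 1
    return counts


def emd_de_bruijn(reads_a, reads_b, k=2, alphabet="ACGT"):
    """EMD between k-mer frequency profiles, moving mass along de Bruijn edges."""
    ca, cb = kmer_counts(reads_a, k, alphabet), kmer_counts(reads_b, k, alphabet)
    na, nb = sum(ca.values()), sum(cb.values())
    if na == 0 or nb == 0:
        raise ValueError("each sample needs at least one k-mer")
    nodes = ["".join(t) for t in product(alphabet, repeat=k)]
    index = {u: i for i, u in enumerate(nodes)}
    n = len(nodes)
    source, sink = n, n + 1
    graph = [[] for _ in range(n + 2)]  # arcs: [to, capacity, cost, reverse index]

    def add_arc(u, v, cap, cost):
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])

    big = na * nb  # upper bound on total mass, in integer units of 1/(na*nb)
    for u in nodes:
        for c in alphabet:
            v = u[1:] + c  # de Bruijn edge: suffix of u overlaps prefix of v
            if v != u:
                add_arc(index[u], index[v], big, 1)
                add_arc(index[v], index[u], big, 1)  # undirected (assumption)
        surplus = ca[u] * nb - cb[u] * na  # scaled frequency difference
        if surplus > 0:
            add_arc(source, index[u], surplus, 0)
        elif surplus < 0:
            add_arc(index[u], sink, -surplus, 0)

    cost = 0
    while True:
        # Bellman-Ford shortest path from source in the residual graph.
        dist = [None] * (n + 2)
        parent = [None] * (n + 2)
        dist[source] = 0
        for _ in range(n + 1):
            updated = False
            for u in range(n + 2):
                if dist[u] is None:
                    continue
                for j, (v, cap, c, _rev) in enumerate(graph[u]):
                    if cap > 0 and (dist[v] is None or dist[u] + c < dist[v]):
                        dist[v], parent[v], updated = dist[u] + c, (u, j), True
            if not updated:
                break
        if dist[sink] is None:
            break  # all surplus mass has been moved
        flow, v = big, sink
        while v != source:  # bottleneck capacity along the path
            u, j = parent[v]
            flow, v = min(flow, graph[u][j][1]), u
        v = sink
        while v != source:  # push the flow and update residual capacities
            u, j = parent[v]
            graph[u][j][1] -= flow
            graph[v][graph[u][j][3]][1] += flow
            v = u
        cost += flow * dist[sink]
    return cost / (na * nb)


distance = emd_de_bruijn(["AAAA"], ["CCCC"], k=2)
```

In this toy call every 2-mer of the first sample is AA and every 2-mer of the second is CC. The two nodes are two steps apart in the graph (AA, AC, CC), so all of the mass travels a distance of 2. The graph has 4^k nodes, so this sketch is only practical for small k.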

In their recent paper, Mangul and Koslicki applied EMDeBruijn to study the composition and abundance levels of the microbial communities present in blood samples from coronary artery calcification (CAC) patients and controls. EMDeBruijn uses candidate microbial reads to differentiate between case (CAC-affected) and control (healthy) samples, and a filtered set of non-host reads is used to determine the composition of the blood microbiome. Hierarchical clustering using the EMDeBruijn metric successfully identifies several large clusters unique to samples from either the case or the control group.

This study indicates the presence of the disease-specific microbial community structure in CAC patients, and points to the need for additional investigation of potentially causal relationships between the microbiome and CAC disease.

Using the same data set, Mangul and Koslicki compare the results of EMDeBruijn with those of current approaches. Existing computational methods, including MetaPhlAn and RDP's NBC, discovered various microbial communities across the case and control samples. However, neither of these methods was able to identify any disease-specific patterns in the microbiome or to discriminate the samples into disease and healthy groups.

EMDeBruijn provides a powerful, species-independent way to assess microbial diversity across individuals. For more information, see our paper, which was published in the Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics:

Code implementing this method is available at:

Visualization of the EMDeBruijn Distance. a) Pictorial representation of 2-mer frequencies for two hypothetical samples, S1 and S2. b) The 2-mer frequencies overlaid on the de Bruijn graph B2(A). c) Representation of the flow used to compute EMD2(S1, S2); dark arrows denote mass moved from the initial node to the terminal node. d) Result of applying the flow to the 2-mer frequencies of S1.

This project was a collaboration that started at the Mathematical and Computational Approaches in High-Throughput Genomics program held in Fall 2011 at the Institute for Pure and Applied Mathematics (IPAM). Our ongoing Computational Genomics Summer Institute (CGSI; also co-organized by IPAM) was inspired by the 2011 program. Check out the 2017 CGSI website for a preview of this summer's programs. The deadline for applications is February 1, 2017!

The full citation to our paper is:

Mangul S, Koslicki D. Reference-free comparison of microbial communities via de Bruijn graphs. In Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. 2016 Oct 2 (pp. 68-77). Association for Computing Machinery, New York.