Incorporating prior information into association studies

Genome-wide association studies (GWAS) seek to identify genetic variants involved in specific traits. GWAS are advantageous for linking variants with traits, because they interrogate the genome in a uniform way. In other words, they examine the whole genome without a preconceived notion of where the associations may lie.

However, we now know a lot about the putative function of genetic variants due to tremendous progress in functional genomics. In many cases, we even know which variants are more likely to be involved in disease when compared to others. Advancements in our understanding of functional genomics motivate the strategic incorporation of prior information in GWAS.

Our group has been interested in this problem for many years. One challenge is that the widely used approach for GWAS evaluates an association statistic at each single nucleotide polymorphism (SNP), taking into account only one SNP at a time. The results are then adjusted for multiple testing, and an association is identified if a statistic exceeds a certain threshold. This approach can be described as a frequentist approach. On the other hand, one can incorporate prior information on which SNPs are likely to be the causal variants affecting the trait. This idea is inherently Bayesian. Reconciling these two approaches is not straightforward.

Average power under varying relative risks. For more information, see our paper.

In a 2008 paper published in Genome Research, our group proposed a modification of the multiple testing framework to address this problem. Instead of using the same specific threshold for all of the association statistics, we use a different threshold for each association statistic, where the thresholds are adjusted based on the prior information. Our method takes advantage of the correlation structure by considering multiple markers within a region. In our paper, we demonstrate how to set the thresholds in order to optimally utilize prior information and maximize statistical power.

Using prior information in genetic association studies increases power over traditional association studies while maintaining the same overall false-positive rate. Compared to standard methods, our approach is equally simple to apply to association studies, reports results as interpretable p-values, and is optimal in its use of prior information with respect to statistical power.
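To make the idea of per-SNP thresholds concrete, here is a minimal Python sketch. It is not the optimal threshold-setting procedure from our paper: it uses a simpler prior-weighted Bonferroni rule as a stand-in, and the function names, priors, and p-values are all hypothetical. Each SNP receives a share of the overall significance level in proportion to its prior, so the thresholds still sum to the same overall level.

```python
def prior_weighted_thresholds(priors, alpha=0.05):
    """Split the overall level alpha across SNPs in proportion to their priors.

    Simplified stand-in (weighted Bonferroni), not the optimal thresholds
    described in the paper. The thresholds always sum to alpha.
    """
    total = sum(priors)
    if total <= 0:
        raise ValueError("priors must have a positive sum")
    return [alpha * p / total for p in priors]


def call_associations(p_values, thresholds):
    """Declare a SNP associated when its p-value is at or below its own threshold."""
    return [p <= t for p, t in zip(p_values, thresholds)]


# Hypothetical example: the first SNP has strong prior support.
example_priors = [0.7, 0.1, 0.1, 0.1]
example_p_values = [0.02, 0.02, 0.001, 0.5]

weighted = prior_weighted_thresholds(example_priors)
uniform = prior_weighted_thresholds([1.0] * len(example_priors))

weighted_calls = call_associations(example_p_values, weighted)
uniform_calls = call_associations(example_p_values, uniform)
```

In this toy example the first SNP (p = 0.02) passes its relaxed threshold of about 0.035 but fails the uniform threshold of 0.0125, while the total significance budget is the same in both cases.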

In 2012, we extended this work to use only tag SNPs for the putative causal variant. This project was developed by Gregory Darnell (then UCLA undergraduate, now PhD student at Princeton University), Dat Duong (then UCLA undergraduate, now UCLA PhD student), and Buhm Han.

More recently, we have applied this framework to incorporate functional information in analysis of eQTL data. In this case, incorporating genomic annotation of variants significantly increases the statistical power of existing eQTL methods and detects more eGenes in comparison to standard approaches. Read the blog post on this paper, and download the full article.

For more information on our general approach, see our paper, which is available for download through Bioinformatics:
In addition, the open source implementation of our 2012 paper, MASA, which was developed by Greg Darnell and Dat Duong, is freely available for download at

The full citations to our papers on this topic are:

Darnell, Gregory; Duong, Dat; Han, Buhm; Eskin, Eleazar. Incorporating prior information into association studies. Bioinformatics, 28(12), pp. i147-i153, 2012. ISSN: 1367-4811.


Eleazar Eskin. “Increasing Power in Association Studies by using Linkage Disequilibrium Structure and Molecular Function as Prior Information.” Genome Research. 18(4):653-60 Special Issue Proceedings of the 12th Annual Conference on Research in Computational Biology (RECOMB-2008), 2008.

Simultaneous modeling of disease status and clinical phenotypes to increase power in GWAS

Michael Bilow and Eleazar Eskin, together with Fernando Crespo, Zhicheng Pan, and Susana Eyheramendy, recently released a novel method for accurate joint modeling of clinical phenotype and disease status. This approach incorporates a clinical phenotype into case/control studies under the assumption that the genetic variant can affect both.

Genetic case-control association studies have found thousands of associations between genetic variants and disease. Most studies collect data from individuals with and without disease, and they often search for variants with different frequencies between the groups. Jointly modelling clinical phenotype and disease status is a promising way to increase power to detect true associations between genetics and disease. In particular, this method increases potential for discovering genetic variants that are associated with both a clinical phenotype and a disease.

However, standard multivariate techniques fail to effectively solve this problem because case-control status is discrete rather than continuous. In addition, standard approaches to estimating model parameters are biased due to the ascertainment in case/control studies. We present a novel method that resolves both of these issues for simultaneous association testing of genetic variants in studies that record both case status and a clinical covariate.
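The ascertainment problem can be seen in a small simulation. The sketch below does not implement our method; it only illustrates the bias. It assumes a liability threshold model in which disease status is a thresholded standard normal liability, and a clinical phenotype correlated with that liability. The function names and parameter values (prevalence, correlation `rho`, sample sizes) are hypothetical choices made for illustration.

```python
import random
from statistics import NormalDist, mean


def simulate_population(n, prevalence, rho, seed=0):
    """Return (liability, phenotype, is_case) tuples under a liability threshold model.

    Assumption: liability is standard normal, and a person is a case when
    liability exceeds the cutoff implied by the prevalence. The phenotype is
    standard normal with correlation rho to the liability.
    """
    rng = random.Random(seed)
    cutoff = NormalDist().inv_cdf(1.0 - prevalence)
    people = []
    for _ in range(n):
        liability = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        phenotype = rho * liability + (1.0 - rho ** 2) ** 0.5 * noise
        people.append((liability, phenotype, liability > cutoff))
    return people


def ascertain(people, n_cases, n_controls, seed=1):
    """Draw a case-control sample that oversamples cases relative to the population."""
    rng = random.Random(seed)
    cases = [p for p in people if p[2]]
    controls = [p for p in people if not p[2]]
    return rng.sample(cases, n_cases) + rng.sample(controls, n_controls)


def mean_phenotype(people):
    """Average clinical phenotype in a group of simulated individuals."""
    return mean(p[1] for p in people)
```

Because cases are enriched in the study sample, the average phenotype in the ascertained sample is shifted away from the population average, which is why parameter estimates that ignore ascertainment are biased.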

In our paper, we show the utility of our method using data from the Northern Finland Birth Cohort (NFBC) dataset. NFBC enrolled almost everyone born in 1966 in Finland’s two northernmost provinces. The NFBC dataset consists of 10 phenotypes and genotypes at 331,476 genetic variants measured in 5,327 individuals. We focus our study on the LDL cholesterol and triglyceride level phenotypes.

Our evaluation strategy analyzes a subset of the NFBC data and compares what we discover here to what was discovered in the full NFBC dataset—which we treat as the gold standard. We compare the performance of our novel approach to three other methods: (1) the single univariate test applied to the disease status, (2) the multivariate approach applied to the disease status and the clinical phenotype modeled as a multivariate normal distribution, and (3) the liability threshold model treating the clinical phenotype as a covariate.

Using the univariate approach, the p-values are much weaker than those observed in the full NFBC dataset. Running the multivariate approaches, which incorporate the triglyceride level phenotype, increased power (i.e., produced more significant p-values at the associated SNPs).

Our method has the highest power in all scenarios. Its advantage is greater when selection bias is substantial than when selection bias is low. Our method is even more powerful when the correlation between the clinical covariate and the disease liability is lower, because we explicitly estimate the underlying liability using all of the data.

For more information, see our paper in Genetics:

The software implementing the methods described in this paper was developed by Fernando Crespo and is available at: and

An illustration of the distribution of liability in a case-control study under selection bias. For more information, read our paper.

The full citation to our paper is:
Bilow, M., Crespo, F., Pan, Z., Eskin, E. and Eyheramendy, S., 2017. Simultaneous Modeling of Disease Status and Clinical Phenotypes to Increase Power in GWAS. Genetics, pp.genetics-116.


Reference-free comparison of microbial communities via de Bruijn graphs

Microbial communities inhabiting the human body exhibit significant variability across different individuals and tissues and likely play an important role in health and disease. Serghei Mangul and David Koslicki (Oregon State University) recently published a paper presenting a novel approach for characterizing microbial communities in metatranscriptomics studies. Koslicki developed this tool, which may help scientists explore the role microbiota play in disease development, especially when comparing microbiomes of healthy and diseased subjects.

Identifying and characterizing the relative abundance of microbiota in different tissues is essential to better understanding the role of microbial communities in human health. Current approaches use reference databases to identify, classify, and compare microbial communities present in the individual host. However, existing databases are incomplete and rely on a limited compendium of reference genomes. Current reference-based approaches are unable to accurately determine microbial compositions to the extent that could be possible given the high resolution of data produced by today’s high throughput sequencing technology.

Framework of the study. For more information, download our paper.

Ideally, comparison of microbial communities across samples could circumvent this limiting classification step. Mangul and Koslicki recently developed EMDeBruijn, a reference-free approach that uses all available non-host microbial reads, not just those classified in reference databases, to compare microbial communities.

First, EMDeBruijn translates sequencing data to a de Bruijn graph, which represents overlaps between symbols in sequences. De Bruijn graphs are commonly used in de novo assembly of short-read sequences into a genome, but have not yet been applied in a reference-free approach. EMDeBruijn then uses properties of the de Bruijn graphs to compare microbiome composition across individuals. This comparison is reduced to a single number using the Earth Mover’s Distance (EMD), a metric that measures the distance between two probability distributions over a region.
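The two ingredients described above, k-mer frequencies for a sample and distances on a de Bruijn graph, can be sketched in a few lines of Python. This is an illustrative sketch and not the EMDeBruijn code: the function names are hypothetical, the graph is treated as directed (an assumption; this post does not say which convention the published tool uses), and the EMD itself, which requires solving a minimum-cost flow problem over these distances, is not computed here.

```python
from collections import Counter, deque
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(reads, k):
    """Return the relative frequency of every possible k-mer across the reads.

    k-mers containing characters outside ACGT (for example N) are skipped.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set(ALPHABET):
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid k-mers found")
    all_kmers = ("".join(p) for p in product(ALPHABET, repeat=k))
    return {kmer: counts[kmer] / total for kmer in all_kmers}


def de_bruijn_successors(kmer):
    """In a de Bruijn graph, a k-mer points to every k-mer overlapping its last k-1 symbols."""
    return [kmer[1:] + base for base in ALPHABET]


def graph_distances(source):
    """Breadth-first search giving the number of edges from source to every k-mer."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in de_bruijn_successors(node):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist
```

A frequency vector from `kmer_frequencies` is a probability distribution over the nodes of the graph, and `graph_distances` supplies the cost of moving mass between nodes, which is the kind of ground distance an EMD computation needs.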

In their recent paper, Mangul and Koslicki applied EMDeBruijn to study the composition and abundance levels of the microbial communities present in blood samples from coronary artery calcification (CAC) patients and controls. EMDeBruijn uses candidate microbial reads to differentiate between case (CAC-affected) and control (healthy) samples, and a filtered set of non-host reads is used to determine the composition of the blood microbiome. Hierarchical clustering using the EMDeBruijn metric successfully identifies several large clusters unique to samples from either the case or the control group.

This study indicates the presence of a disease-specific microbial community structure in CAC patients, and points to the need for additional investigation of potentially causal relationships between the microbiome and CAC.

Using the same data set, Mangul and Koslicki compare the results of EMDeBruijn with those of current approaches. Existing computational methods, including MetaPhlAn and RDP’s NBC, discovered various microbial communities across the case and control samples. However, neither of these methods was able to identify any disease-specific patterns in the microbiome or to discriminate the samples into disease and healthy groups.

EMDeBruijn provides a powerful, species-independent way to assess microbial diversity across individuals. For more information, see our paper, which was published in the Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics:

Code implementing this method is available at:

Visualization of the EMDeBruijn Distance. a) Pictorial representation of 2-mer frequencies for two hypothetical samples, S1 and S2. b) The 2-mer frequencies overlaid on the de Bruijn graph B2(A). c) Representation of the flow used to compute EMD2(S1; S2); dark arrows denote mass moved from the initial node to the terminal node. d) Result of applying the flow to the 2-mer frequencies of S1.

This project was a collaboration that started at the Mathematical and Computational Approaches in High-Throughput Genomics program held in Fall 2011 at the Institute for Pure and Applied Mathematics (IPAM). Our ongoing Computational Genomics Summer Institute (CGSI; also co-organized by IPAM) was inspired by the 2011 program. Check out the 2017 CGSI website for a preview of this summer’s programs – the deadline for applications is February 1, 2017!

The full citation to our paper is:

Mangul S, Koslicki D. Reference-free comparison of microbial communities via de Bruijn graphs. In Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. 2016 Oct 2 (pp. 68-77). Association for Computing Machinery, New York.