The Genetic Basis of Host Preference and Resting Behavior in the Major African Malaria Vector, Anopheles arabiensis

We recently published the first study to report a genetic component to host choice behavior in the major malaria vector Anopheles arabiensis. In collaboration with the University of California, Davis, the University of Glasgow, and the Environmental Health and Ecological Sciences Group, Ifakara Health Institute, Ifakara, United Republic of Tanzania, we assessed the genetic basis for An. arabiensis host choice and resting behavior. We linked human-fed behavior to allelic variation between the 3Ra inversion states. This effort was led by researchers at UC Davis, including Bradley Main, Yoosook Lee, Travis Collier, Anthony Cornel, Catelyn Nieman, Allison Weakley, and Gregory Lanzaro. Eleazar Eskin and Eun Yong Kang contributed data analysis and interpretation.

Mosquitoes that feed on human blood pose an enormous public health threat by transmitting numerous pathogens, such as dengue virus, Zika virus, and malaria parasites. Together, these mosquito-borne diseases kill more than one million people per year. Human exposure to malaria is driven by variable mosquito behaviors such as (1) propensity to feed on humans relative to other animals (anthropophily) and (2) preference for living in close proximity to humans, as reflected by biting and residing inside houses (endophily).

Our project focused on the potential for An. arabiensis, the only remaining malaria vector in many parts of Africa, to adapt its behavior to avoid control measures such as insecticide-treated nets and indoor residual sprays. To investigate the genetic basis of host choice and resting behavior, we sequenced the genomes of 23 human-fed and 25 cattle-fed mosquitoes collected both indoors and outdoors in the Kilombero Valley, Tanzania. We tested for genetic associations with each of the four phenotypes: human-fed, cattle-fed, resting indoors, and resting outdoors.

With these genomes, we identified a set of 4,820,851 segregating SNPs after imposing a minor allele frequency threshold of 10%. We estimated the genetic component (or “SNP heritability”) for each phenotype. Results suggest a genetic component for host choice and no genetic component for resting behavior.
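The minor allele frequency filter mentioned above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names, the 0/1/2 genotype encoding, the handling of missing calls, and the choice of an inclusive threshold are all assumptions made for the example, and the toy genotypes are invented.

```python
def minor_allele_frequency(calls):
    """Return the minor allele frequency for one SNP.

    `calls` holds alternate-allele counts (0, 1, or 2) for diploid
    individuals; None marks a missing call (an assumed convention).
    """
    observed = [c for c in calls if c is not None]
    if not observed:
        return 0.0
    alt_freq = sum(observed) / (2 * len(observed))
    # The minor allele is whichever of the two alleles is rarer.
    return min(alt_freq, 1 - alt_freq)


def filter_snps(snps, threshold=0.10):
    """Keep SNPs whose minor allele frequency is at least `threshold`.

    Whether the study used >= or > at the 10% boundary is not stated;
    >= is an assumption here.
    """
    return {
        name: calls
        for name, calls in snps.items()
        if minor_allele_frequency(calls) >= threshold
    }


# Hypothetical toy genotypes for five mosquitoes at four SNPs.
toy = {
    "snp_a": [0, 0, 0, 0, 0],     # monomorphic reference, MAF 0.0
    "snp_b": [0, 1, 0, 1, 0],     # MAF 0.2
    "snp_c": [2, 2, 2, 2, None],  # fixed alternate allele, MAF 0.0
    "snp_d": [1, 1, 0, 2, 1],     # MAF 0.5
}
kept = sorted(filter_snps(toy))
print(kept)  # ['snp_b', 'snp_d']
```

Taking the minimum of the alternate and reference frequencies means the filter does not depend on which allele happens to be labeled as the reference.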

To test for the existence of genetic structure within our set of 48 sequenced genomes, individuals were partitioned by genetic relatedness using a Principal Component Analysis (Genome-Wide Complex Trait Analysis software, GCTA) applied to all SNPs. Using this approach, we observed three discrete genetic clusters. We used a novel population-scale inversion genotyping method to identify an association between the standard arrangement of 3Ra (3R+) and cattle-fed An. arabiensis. We highlight two intriguing candidate genes within 3Ra: the odorant binding protein Obp5 and the odorant receptor Or65. The enrichment of 3R+ among cattle-fed mosquitoes provides support for a genetic component to host choice, which is consistent with the report that zoophily can be selected for.
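One common way to test whether an inversion arrangement is enriched in one feeding group is a Fisher's exact test on a 2x2 table of allele counts. The sketch below is only an illustration of that idea: the statistical test actually used in the paper is not stated here, the function name is hypothetical, and the counts are invented rather than taken from the study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical allele counts (not data from the study).
# Rows: cattle-fed, human-fed. Columns: 3R+ alleles, 3Ra alleles.
p_assoc = fisher_exact_two_sided(8, 2, 1, 9)
p_null = fisher_exact_two_sided(5, 5, 5, 5)
print(f"enriched table p = {p_assoc:.4f}, balanced table p = {p_null:.4f}")
```

An exact test is a reasonable choice for a sketch like this because cell counts are small when only 48 mosquitoes are sequenced.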

Genetic variation explained by the 2Rb and 3Ra inversions. For more information, see our paper.

Our multiplex genotyping assays allowed us to directly estimate relationships between host choice and genotype in wild mosquitoes in a high-throughput and economical fashion. Given the importance of mosquito feeding and resting behavior to the effectiveness of malaria control and transmission, there is an urgent need to understand the underlying biological determinants of these behaviors and their short- and long-term impact on the effectiveness of current public health interventions.

For more information, see our paper, which is available for download through PLoS Genetics:

The full citation to our paper is:
Main, B.J., Lee, Y., Ferguson, H.M., Kreppel, K.S., Kihonda, A., Govella, N.J., Collier, T.C., Cornel, A.J., Eskin, E., Kang, E.Y. and Nieman, C.C., 2016. The Genetic Basis of Host Preference and Resting Behavior in the Major African Malaria Vector, Anopheles arabiensis. PLoS Genet, 12(9), p.e1006303.


Efficient and accurate multiple-phenotype regression method for high dimensional data considering population structure

Jong Wha (Joanne) Joo developed an approach to simultaneously analyze multiple phenotypes in a genome-wide association study (GWAS) dataset. She introduces this new methodology, referred to as GAMMA (Generalized Analysis of Molecular variance for Mixed model Analysis), in a paper recently published in Genetics.

GWASs have identified many genetic variants involved in traits and the development of human diseases by testing for correlation between individual genotypes and a single phenotype, one phenotype at a time. Since the initial development of the standard GWAS approach, GWAS data collection has become larger in scale and higher in resolution. Today’s large-scale datasets include expression data and often contain thousands of phenotypes per individual. Performing the standard single-phenotype analysis on these datasets is slow and potentially fails to detect unmeasured aspects of complex biological networks.

Analyzing many phenotypes simultaneously increases the power to detect more variants and capture previously unmeasured aspects of the genome. However, standard GWAS approaches capable of simultaneously testing multiple phenotypes fail to account for the distorting effects of population structure, a phenomenon present in large cohorts that inevitably contain individuals sharing common ancestry from multiple populations. As a result, standard GWAS approaches either fail to detect true effects or produce many false positive identifications.

GAMMA is an efficient, robust approach capable of simultaneously analyzing many phenotypes while correcting for population structure. GAMMA uses the principles behind existing linear mixed models to analyze many phenotypes simultaneously and a multiple regression technique to correct for population structure.

Joanne’s paper presents the results of testing GAMMA for accuracy in three scenarios: a simulated dataset containing population structure, a yeast dataset containing many trans-regulatory hotspots, and a complex gut microbiome dataset. In the simulated study using data implanted with true population structure effects, GAMMA accurately identifies these true effects without producing false positives. In the yeast dataset, GAMMA successfully corrected for the bias of technical artifacts such as batch effects and identified significant signals on most of the putative hotspots. In the third test, Joanne and her team assess GAMMA’s ability to perform a multiple-phenotype analysis with microbiome data. Here, results identified nine loci likely to have true biological mechanisms in the taxa.

In each scenario, results of GAMMA were compared to those of the standard t-test, EMMA, and MDMR. The standard t-test and EMMA failed to identify true variants because the phenotypic effects in each example are smaller than the amount these methods are powered to detect. MDMR produced no significant signals in the yeast dataset and identified many false associations in the simulated and gut microbiome datasets. Both GAMMA and MDMR have sufficient power to detect small association signals in these complex datasets, but only GAMMA successfully corrects for population structure.

This project was led by Joanne Joo and involved Eun Yong Kang and Farhad Hormozdiari. The article is available at:

GAMMA was developed by Joanne Joo, Eun Yong Kang, Elin Org, Nick Furlotte, Brian Parks, Aldons J. Lusis, and Eleazar Eskin. Visit the following page to download GAMMA:

The full citation to our paper is:

Joo, Jong Wha J; Kang, Eun Yong; Org, Elin; Furlotte, Nick; Parks, Brian; Hormozdiari, Farhad; Lusis, Aldons J; Eskin, Eleazar. Efficient and Accurate Multiple-Phenotype Regression Method for High Dimensional Data Considering Population Structure. In: Genetics, 204 (4), pp. 1379-1390, 2016, ISSN: 1943-2631.


The results of GAMMA and three standard GWAS methods applied to a simulated dataset. The x-axis shows SNP locations and the y-axis shows the log10 p-value of the association between each SNP and all the genes. Blue arrows show the location of the true trans-regulatory hotspots.

Discovering SNPs Regulating Human Gene Expression Using Allele Specific Expression from RNA-Seq data

Analyses of expression quantitative trait loci (eQTL), genomic loci that contribute to variation in gene expression levels, are essential to understanding the mechanisms of human disease. These studies identify regulators of gene expression as either cis-acting factors that regulate nearby genes, or trans-acting factors that affect unlinked genes through various functions. Traditional eQTL studies treat expression as a quantitative trait and associate it with genetic variation. This approach has identified many loci involved in the genetic regulation of common, complex diseases.

Standard eQTL methods are limited in power and accuracy by several phenomena common to genomic datasets. First, the correlation structure of genetic variation in the genome, known as linkage disequilibrium (LD), limits the ability of these methods to differentiate between the regulatory variant and neighboring variants that are in LD. Second, like other quantitative traits, the total expression of a gene is influenced by multiple genetic and environmental factors. The effect size for any given variant is therefore small, and standard methods require a large sample size to identify the effect.


ASE example and corresponding mathematical representation of three individuals (1, 2, 3). We assume that the third SNP is the causal SNP site affecting the differential gene expression level (Allele A/ Allele T).

Our forthcoming paper in Genetics presents a new method that improves the accuracy and computational power of eQTL mapping by incorporating allele specific expression (ASE) analysis. Our novel method uses genome sequencing, alongside measurements of ASE from RNA-seq data, to identify cis-acting regulatory variants.

In standard eQTL studies, the analysis of ASE is influenced by LD structure and the amount of allelic heterogeneity present in the genome. Individual effects appear weak since the effect of a variant is modest when compared to the variance of total expression. In our approach, the genotypes of each individual with ASE provide information useful for determining which variants are causal for the observed ASE. Our approach leverages the relationship between LD and variant identification to map the variants affecting expression. Thus, analysis of ASE is advantageous over analysis of total expression levels, the standard approach to eQTL mapping.

We demonstrate the utility of our method by analyzing RNA-seq data from 77 unrelated northern and western European individuals (CEU). To map each gene, we simultaneously compare ASE measurements across a set of sequenced individuals. We then identify genetic variants that are in proximity to those genes and capable of explaining observed patterns of ASE. Here, we characterize the efficacy of this method with a quantity termed the “reduction rate,” which is based on the ratio of the number of candidate regulatory SNPs to the total number of SNPs in the proximal region of the gene.
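The idea of finding variants "capable of explaining observed patterns of ASE" can be sketched with a simple consistency rule. The rule used here is an assumption for illustration, not a description of the published algorithm: a cis-acting SNP is treated as consistent with an individual if it is heterozygous when that individual shows ASE and homozygous when it does not. The function name, the boolean encoding, and the toy data are all hypothetical.

```python
def candidate_snps(heterozygous, shows_ase, max_errors=0):
    """Return indices of SNPs consistent with the observed ASE pattern.

    heterozygous[i][j] is True if individual i is heterozygous at SNP j.
    shows_ase[i] is True if individual i shows ASE for the gene.
    A SNP is kept if at most `max_errors` individuals contradict the
    assumed rule "heterozygous if and only if ASE is observed".
    """
    n_snps = len(heterozygous[0])
    kept = []
    for j in range(n_snps):
        mismatches = sum(
            1
            for i, has_ase in enumerate(shows_ase)
            if heterozygous[i][j] != has_ase
        )
        if mismatches <= max_errors:
            kept.append(j)
    return kept


# Hypothetical toy data: three individuals, four SNPs near one gene.
shows_ase = [True, False, True]
heterozygous = [
    [True, False, True, True],   # individual 1
    [True, False, False, True],  # individual 2
    [False, False, True, True],  # individual 3
]

strict = candidate_snps(heterozygous, shows_ase)
relaxed = candidate_snps(heterozygous, shows_ase, max_errors=1)
print(strict)   # [2]
print(relaxed)  # [2, 3]
```

Raising `max_errors` tolerates noisy ASE calls or genotypes at the cost of a larger candidate set.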

When applied to the CEU dataset, our method reduced the set of candidate SNPs from ten to two (a reduction rate of 80%). Allowing for one error increases the number of candidate SNPs to five and decreases the reduction rate to 50%. We also observe that the relationship between LD and variant identification differs in ASE mapping compared to eQTL studies, and that ASE mapping produces different types of information useful to eQTL mapping studies.
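The figures quoted above are consistent with reading the reduction rate as the fraction of proximal SNPs that are eliminated, that is, one minus the ratio of candidate SNPs to total SNPs. That reading is an assumption inferred from the quoted numbers rather than a formula taken from the paper, and the function name below is hypothetical.

```python
def reduction_rate(n_candidates, n_total):
    """Fraction of proximal SNPs ruled out as candidates.

    Assumed definition: 1 - (candidates / total), chosen because it
    reproduces the 80% and 50% figures quoted in the text.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0 <= n_candidates <= n_total:
        raise ValueError("n_candidates must be between 0 and n_total")
    return 1 - n_candidates / n_total


strict_rate = reduction_rate(2, 10)   # ten SNPs narrowed to two
relaxed_rate = reduction_rate(5, 10)  # one error allowed, five remain
print(f"{strict_rate:.0%}, {relaxed_rate:.0%}")  # 80%, 50%
```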

ASE studies are a powerful approach to identifying associations between genetic variation and gene expression. Accurate measurement of ASE can identify cis-acting regulatory variants associated with common diseases. Our novel method for ASE mapping is based on a robust and computationally efficient non-parametric approach, and we hope it advances our understanding of functional risk alleles and facilitates development of new hypotheses for the causes and treatment of common diseases.

This project used software developed by Jennifer Zou, which is available for download at:

This project was led by Eun Yong Kang and involved Serghei Mangul, Buhm Han, and Sagiv Shifman. The article is available at:

The full citation to our paper is:

Kang, Eun Yong; Martin, Lisa; Mangul, Serghei; Isvilanonda, Warin; Zou, Jennifer; Ben-David, Eyal; Han, Buhm; Lusis, Aldons J; Shifman, Sagiv; Eskin, Eleazar. Discovering SNPs Regulating Human Gene Expression Using Allele Specific Expression from RNA-Seq Data. In: Genetics, 2016, ISSN: 1943-2631.
