Causal Variants Identification in Associated Regions

Figure 1 (A and B) Simulated data for two regions with different LD patterns that contain 35 SNPs. A and B are obtained by considering the 100 kbp upstream and downstream of rs10962894 and rs4740698, respectively, from the Wellcome Trust Case–Control Consortium study for coronary artery disease (CAD). (C and D) The rank of the causal SNP in additional simulations for the regions in A and B, respectively. We obtain these histograms from simulation data by randomly generating GWAS statistics using multivariate normal distribution. We apply the simulation 1000 times.

Figure 1 (A and B) Simulated data for two regions with different LD patterns that contain 35 SNPs. A and B are obtained by considering the 100 kbp upstream and downstream of rs10962894 and rs4740698, respectively, from the Wellcome Trust Case–Control Consortium study for coronary artery disease (CAD).

Our group in collaboration with our UCLA colleague Bogdan Pasaniuc’s group recently published two papers focusing on “statistical fine mapping”. We published a paper on a method called CAVIAR in the journal Genetics and Bogdan’s lab published a method called PAINTOR in PLoS Genetics. The software is available at http://genetics.cs.ucla.edu/caviar/ and http://bogdan.bioinformatics.ucla.edu/software/PAINTOR/.

Although genome-wide association studies have successfully identified thousands of regions of the genome which contain genetic variation involved in disease, only a handful of the biologically causal variants, responsible for these associations, have been successfully identified. Because of the correlation structure of genetic variants, in each region, there are many variants that are associated with disease. The process of predicting which subset of the genetic variants are actually responsible for the association is referred to as statistical mapping.

Current statistical methods for identifying causal variants at risk loci either use the strength of association signal in an iterative conditioning framework, or estimate probabilities for variants to be causal. A main drawback of existing methods is that they rely on the simplifying assumption of a single causal variant at each risk locus which is typically invalid at many risk loci. In our papers, we propose a new statistical framework that allows for the possibility of an arbitrary number of causal variants when estimating the posterior probability of a variant being causal. A direct benefit of our approach is that we predict a set of variants for each locus that under reasonable assumptions will contain all of the true causal variants with a high confidence level (e.g. 95%) even when the locus contains multiple causal variants. We use simulations to show that our approach provides 20-50% improvement in our ability to identify the causal variants compared to the existing methods at loci harboring multiple causal variants.

Figure 2 Simulated association with two causal SNPs. (A) The 100-kbp region around the rs10962894 SNP and simulated statistics at each SNP generated assuming two SNPs are causal. In this example SNP25 and SNP29 are considered as the causal SNPs. However, the most significant SNP is the SNP27. (B) The causal set selected by CAVIAR (our method) and the top k SNPs method. We ranked the selected SNPs based on the association statistics. The gray bars indicate the selected SNPs by both methods, the green bars indicate the selected SNPs by the top k SNPs method only, and the blue bars indicate the selected SNPs by CAVIAR only. The CAVIAR set consists of SNP17, SNP20, SNP21, SNP25, SNP26, SNP28, and SNP29. For the top k SNPs method to capture the two causal SNPs we have to set k to 11, as one of the causal SNPs is ranked 11th based on its significant score. Unfortunately, knowing the value of k beforehand is not possible. Even if the value of k is known, the causal set selected by our method excludes SNP30–SNP35 from the follow-up studies and reduces the cost of follow-up studies by 30% compared to the top k method.

Figure 2 Simulated association with two causal SNPs. (A) The 100-kbp region around the rs10962894 SNP and simulated statistics at each SNP generated assuming two SNPs are causal. In this example SNP25 and SNP29 are considered as the causal SNPs. However, the most significant SNP is the SNP27. (B) The causal set selected by CAVIAR (our method) and the top k SNPs method. We ranked the selected SNPs based on the association statistics. The gray bars indicate the selected SNPs by both methods, the green bars indicate the selected SNPs by the top k SNPs method only, and the blue bars indicate the selected SNPs by CAVIAR only. The CAVIAR set consists of SNP17, SNP20, SNP21, SNP25, SNP26, SNP28, and SNP29. For the top k SNPs method to capture the two causal SNPs we have to set k to 11, as one of the causal SNPs is ranked 11th based on its significant score. Unfortunately, knowing the value of k beforehand is not possible. Even if the value of k is known, the causal set selected by our method excludes SNP30–SNP35 from the follow-up studies and reduces the cost of follow-up studies by 30% compared to the top k method.

From the CAVIAR paper:
Overview of statistical fine mapping

Our approach, CAVIAR, takes as input the association statistics for all of the SNPs (variants) at the locus together with the correlation structure between the variants obtained from a reference data set such as the HapMap (Gibbs et al. 2003; Frazer et al. 2007) or 1000 Genomes project (Abecasis et al. 2010) data. Using this information, our method predicts a subset of the variants that has the property that all the causal SNPs are contained in this set with the probability r (we term this set the “r causal set”). In practice we set r to values close to 100%, typically $95%, and let CAVIAR find the set with the fewest number of SNPs that contains the causal SNPs with probability at least r. The causal set can be viewed as a confidence interval. We use the causal set in the follow-up studies by validating only the SNPs that are present in the set. While in this article we discuss SNPs for simplicity, our approach can be applied to any type of genetic variants, including structural variants.

We used simulations to show the effect of LD on the resolution of fine mapping. We selected two risk loci (with large and small LD) to showcase the effect of LD on fine mapping (see Figure 1, A and B). The first region is obtained by considering 100 kbp upstream and downstream of the rs10962894 SNP from the coronary artery disease (CAD) case–control study. As shown in the Figure 1A, the correlation between the significant SNP and the neighboring SNPs is high. We simulated GWAS statistics for this region by taking advantage that the statistics follow a multivariate normal dis- tribution, as shown in Han et al. (2009) and Zaitlen et al. (2010) (see Materials and Methods). CAVIAR selects the true causal SNP, which is SNP8, together with six additional variants (Figure 1A). Thus, when following up this locus, we have only to consider these SNPs to identify the true causal SNPs. The second region showcases loci with lower LD (see Figure 1B). In this region only the true causal SNP is selected by CAVIAR (SNP18). As expected, the size of the r causal set is a function of the LD pattern in the locus and the value of r, with higher values of r resulting in larger sets (see Table S1 and Table S2).

We also showcase the scenario of multiple causal variants (see Figure 2). We simulated data as before and considered SNP25 and SNP29 as the causal SNPs. Interestingly, the most significant SNP (SNP27, see Figure 2) tags the true causal variants but it is not itself causal, making the selection based on strength of association alone under the assumption of a single causal or iterative conditioning highly suboptimal. To capture both causal SNPs at least 11 SNPs must be selected in ranking based on P-values or probabilities estimated under a single causal variant assumption. As opposed to existing approaches, CAVIAR selects both SNPs in the 95% causal set together with five additional variants. The gain in accuracy of our approach comes from accurately disregarding SNP30–SNP35 from consideration since their effects can be captured by other SNPs.

PAINTOR extended the CAVIAR model to also take into account the function of the genetic variation.

The full citations for the two papers are:

Sorry, no publications matched your criteria.

Accurate viral population assembly from ultra-deep sequencing data

Overview of high-fidelity sequencing protocol. (A) DNA material from a viral population is cleaved into sequence fragments using any suitable restriction enzyme. (B) Individual barcode sequences are attached to the fragments. Each tagged fragment is amplified by the polymerase chain reaction (PCR). (C) Amplified fragments are then sequenced. (D) Reads are grouped according to the fragment of origin based on their individual barcode sequence. An error-correction protocol is applied for every read group, correcting the sequencing errors inside the group and producing corrected consensus reads. (E) Error-corrected reads are mapped to the population consensus. (F) SNVs are detected and assembled into individual viral genomes. The ordinary protocol lacks steps (B) and (D)

Overview of high-fidelity sequencing protocol. (A) DNA material from a viral population is cleaved into sequence fragments using any suitable restriction enzyme. (B) Individual barcode sequences are attached to the fragments. Each tagged fragment is amplified by the polymerase chain reaction (PCR). (C) Amplified fragments are then sequenced. (D) Reads are grouped according to the fragment of origin based on their individual barcode sequence. An error-correction protocol is applied for every read group, correcting the sequencing errors inside the group and producing corrected consensus reads. (E) Error-corrected reads are mapped to the population consensus. (F) SNVs are detected and assembled into individual viral genomes. The ordinary protocol lacks steps (B) and (D)

Viral populations change rapidly throughout the course of an infection. Due to this, drugs that initially control an infection can rapidly become ineffective as the viral drug target mutates. For better drug design, however, we first must develop techniques to be able to detect and quantify the presence of various rare viral variants from a given sample.

Currently, next-generation sequencing technologies are employed to better understand and quantify viral population diversity. The existing technologies, however, have difficulty if distinguishing between rare viral variants and sequencing errors even when sequencing with high coverage. To overcome this problem, our lab has proposed a two-step solution in a recent paper by Serghei Mangul.

In his paper, Serghei suggested we first use a high-fidelity protocol known as Safe-SeqS with high coverage. This method employs the use of small individual barcodes that are attached to sequencing fragments before undergoing amplification by polymerase chain reaction (PCR) and being sequenced. By comparing and taking a consensus of amplicons from the same initial sequence fragment, we can easily eliminate some sequencing errors from our data.

These consensus reads then are assembled using an accurate viral assembly method Serghei developed known as the Viral Genome Assembler (VGA). This software uses read overlapping, SNV detection, and a conflict graph to distinguish and reconstruct genome variants in the population. Finally, an expectation-maximization algorithm is used to estimate abundances of assembled viral variants.

In the paper, this approach was applied to both simulated and real data and found to outperform current state-of-the-art methods. Additionally, this viral assembly method is the first of its kind to scale to millions of sequencing reads.

The Viral Genome Assembler tool is freely available here: http://genetics.cs.ucla.edu/vga/

From the paper:
Advances in NGS and the ability to generate deep coverage data in the form of millions of reads provide exceptional resolution for studying the underlying genetic diversity of complex viral populations. However, errors produced by most sequencing protocols complicate distinguishing between true biological mutations and technical artifacts that confound detection of rare mutations and rare individual genome variants. A common approach is to use post-sequencing error correction techniques able to partially correct the sequencing errors. In contrast to clonal samples, the post-sequencing error correction methods are not well suited for mixed viral samples and may lead to filtering out true biological mutations. For this reason, current viral assembly methods are able to detect only highly abundant SNV, thus limiting the discovery of rare viral genomes.

Additional difficulty arises from the genomic architectures of viruses. Long common regions shared across viral population (known as conserved regions) introduce ambiguity in the assembly process. Conserved regions may be due low-diversity population or due to recombination with multiple cross-overs. In contrast to repeats in genome assembly, conserved regions may be phased based on relative abundances of viral variants. Low-diversity viral populations in which all pairs of individual genomes within a viral population have a small genetic distance from each other may represent additional challenges for the assembly procedure.

We apply a high-fidelity sequencing protocol to study viral population structure (Fig. 1). This protocol is able to eliminate errors from sequencing data by attaching individual barcodes during the library preparation step. After the fragments are sequenced, the barcodes identify clusters of reads that originated from the same fragment, thus facilitating error correction. Given that many reads are required to sequence each fragment, we are trading off an increase in sequence coverage for a reduction in error rate. Prior to assembly, we utilize the de novo consensus reconstruction tool, Vicuna (Yang et al., 2012), to produce a linear consensus directly from the sequence data. This approach offers more flexibility for samples that do not have ‘close’ reference sequences available. Traditional assembly methods (Gnerre et al., 2011; Luo et al., 2012; Zerbino and Birney, 2008) aim to reconstruct a linear consensus sequence and are not well-suited for assembling a large number of highly similar but distinct viral genomes. We instead take our ideas from haplotype assembly methods (Bansal and Bafna, 2008; Yang et al., 2013), which aim to reconstruct two closely related haplotypes. However, these methods are not applicable for assembly of a large (a priori unknown) number of individual genomes. Many existing viral assemblers estimate local population diversity and are not well suited for assembling full-length quasi-species variants spanning the entire viral genome. Available genome-wide assemblers able to reconstruct full-length quasi-species variants are originally designed for low throughput and are impractical for high throughput technologies containing millions of sequencing reads.

Overview of VGA. (A) The algorithm takes as input paired-end reads that have been mapped to the population consensus. (B) The first step in the assembly is to determine pairs of conflicting reads that share different SNVs in the overlapping region. Pairs of conflicting reads are connected in the ‘conflict graph’. Each read has a node in the graph, and an edge is placed between each pair of conflicting reads. (C) The graph is colored into a minimal set of colors to distinguish between genome variants in the population. Colors of the graph correspond to independent sets of non-conflicting reads that are assembled into genome variants. In this example, the conflict graph can be minimally colored with four colors (red, green, violet and turquoise), each representing individual viral genomes. (D) Reads of the same color are then assembled into individual viral genomes. Only fully covered viral genomes are reported. (E) Reads are assigned to assembled viral genomes. Read may be shared across two or more viral genomes. VGA infers relative abundances of viral genomes using the expectation–maximization algorithm. (F) Long conserved regions are detected and phased based on expression profiles. In this example red and green viral genome share a long conserved region (colored in black). There is no direct evidence how the viral sub-genomes across the conserved region should be connected. In this example four possible phasing are valid. VGA use the expression information of every sub-genome to resolve ambiguous phasing.

Overview of VGA. (A) The algorithm takes as input paired-end reads that have been mapped to the population consensus. (B) The first step in the assembly is to determine pairs of conflicting reads that share different SNVs in the overlapping region. Pairs of conflicting reads are connected in the ‘conflict graph’. Each read has a node in the graph, and an edge is placed between each pair of conflicting reads. (C) The graph is colored into a minimal set of colors to distinguish between genome variants in the population. Colors of the graph correspond to independent sets of non-conflicting reads that are assembled into genome variants. In this example, the conflict graph can be minimally colored with four colors (red, green, violet and turquoise), each representing individual viral genomes. (D) Reads of the same color are then assembled into individual viral genomes. Only fully covered viral genomes are reported. (E) Reads are assigned to assembled viral genomes. Read may be shared across two or more viral genomes. VGA infers relative abundances of viral genomes using the expectation–maximization algorithm. (F) Long conserved regions are detected and phased based on expression profiles. In this example red and green viral genome share a long conserved region (colored in black). There is no direct evidence how the viral sub-genomes across the conserved region should be connected. In this example four possible phasing are valid. VGA use the expression information of every sub-genome to resolve ambiguous phasing.

We introduce a viral population assembly method (Fig. 2) working on highly accurate sequencing data able to detect rare variants and tolerate conserved regions shared across the population. Our method is coupled with post-assembly procedures able to detect and resolve ambiguity raised from long conserved regions using expression profiles (Fig. 2F). After a consensus has been reconstructed directly from the sequence data, our method detects SNVs from the aligned sequencing reads. Read overlapping is used to link individual SNVs and distinguish between genome variants in the population. The viral population is condensed in a conflict graph built from aligned sequencing data. Two reads are originated from different viral genomes if they share different SNVs in the overlapping region. Viral variants are identified from the graph as independent sets of non-conflicting reads. Non-continuous coverage of rare viral variants may limit assembly capacities, indicating that increase in coverage is required to increase the assembly accuracy. Frequencies of identified variants are then estimated using an expectation–maximization algorithm. Compared with existing approaches, we are able to detect rare population variants while achieving high assembly accuracy.

The full citation of our paper is:

Sorry, no publications matched your criteria.

Genes, Environments and Meta-Analysis

Figure 1. Application of Meta-GxE to Apoa2 locus. The forest plot (A) shows heterogeneity in the effect sizes across different studies. The PM- plot (B) predicts that 7 studies have an effect at this locus, even though only 1 study (HMDP-chow(M)) is genome-wide significant with P-value. doi:10.1371/journal.pgen.1004022.g001

It is well known that both genetic factors and environmental factors contribute to traits and specifically disease risk. In addition, an area of great interest in the research community is the interaction between genetic factors and environmental factors and their contribution to disease risk and other traits. Genetic variants that are involved in gene by environment interactions (denoted GxE) have a different effect on the trait spending on the environment. For example, some variants can have an effect on cholesterol levels only in the presence of a high fat diet. Discovering variants involved in GxE has been tremendously difficult and even though thousands of variants have been implicated in disease related traits using genome wide association studies, very few variants have been implicated in GxEs. Part of the difficulty in detecting GxEs is that the traditional approach requires analyzing studies which contain individuals with multiple environments.

We have recently published a paper with the A. Jake Lusis group in PLoS Genetics which presents a novel approach to discovering GxEs. In our approach, many different studies, each which was performed in different environments, are combined to identify GxEs. The key idea is that if variants have a different genetic effect in different environments, then these variants are candidates for being involved in GxEs. Combining studies together is a statistical technique called meta-analysis which has been a major focus of our lab the past few years. We show in the paper, the mathematically, searching for GxEs using the traditional approach and a type of meta-analysis framework called the random effects model(21565292) are very closely related.

We applied our approach to identify GxEs affected mouse HDL cholesterol by combining 17 mouse studies collected by A. Jake Lusis’ group containing almost 5,000 animals. Our approach discovered 26 loci involved in HDL, many of which appear to be involved in GxE. Virtually all of these loci were not previously discovered in any of the individual studies, but many of them map to genes known to affect HDL. Our approach also includes a visualization framework called a PM-plot which helps interpret the associated loci to help identify GxE interactions(22396665).

From the paper:

Discovering environmentally-specific loci using meta-analysis
The Meta-GxE strategy uses a meta-analytic approach to identify gene-by-environment inter- actions by combining studies that collect the same phenotype under different conditions. Our method consists of four steps. First, we apply a random effects model meta-analysis (RE) to identify loci associated with a trait considering all of the studies together. The RE method explicitly models the fact that loci may have different effects in different studies due to gene-by- environment interactions. Second, we apply a heterogeneity test to identify loci with significant gene-by-environment interactions. Third, we compute the m-value of each study to identify in which studies a given variant has an effect and in which it does not. Forth, we visualize the result through a forest plot and PM-plot to understand the underlying nature of gene-by-environment interactions.
We illustrate our methodology by examining a well-known region on mouse chromosome 1 harboring the Apoa2 gene, which is known to be strongly associated with HDL cholesterol (8332912). Figure 1 shows the results of applying our method to this locus. We first compute the effect size and its standard deviation for each of the 17 studies. These results are shown as a forest plot in Figure 1 (a). Second we compute the P-value for each individual study also shown in Figure 1 (a). If we were to follow traditional methodology and evaluate each study separately, we would declare an effect present in a study if the P-value exceeds a predefined genome-wide significance threshold (P < 1.0×10−6). In this case, we would only identify the locus as associated in a single study, HMDP-chow(M) (P = 6.84×10−9). On the other hand, in our approach, we combine all studies to compute a single P-value for each locus taking into account heterogeneity between studies. This approach leads to increased power over the simple approach considering each study separately. The combined meta P-value for the Apoa2 locus is very significant (4.41 × 10−22), which is consistent with the fact that the largest individual study only has 749 animals compared to 4,965 in our combined study.
We visualize the results through a PM-plot, in which P-values are simultaneously visualized with the m-values, which estimates the posterior probability of an effect being present in a study given the observations from all other studies, at each tested locus. These plots allow us to identify in which studies a given variant has an effect and in which it does not. M-values for a given variant have the following interpretation: a study with a small m-value(≤ 0.1) is predicted not to be affected by the variant, while a study with a large m-value(≥ 0.9) is predicted to be affected by the variant.
The PM-plot for the Apoa2 locus is shown in Figure 1 (b). If we only look at the separate study P-values (y-axis), we can conclude that this locus only has an effect in HMDP-chow(M). However, if we look at m-value (x-axis), then we find 8 studies (HMDPxB-ath(M), HMDPxB- ath(F), HMDP-chow(M), HMDP-fat(M), HMDP-fat(F), BxD-db-5(M), BxH-apoe(M), BxH- apoe(F)), where we predict that the variation has an effect, while in 3 studies (BxD-db-12(F), BxD-db-5(F), BxH-wt(M)) we predict there is no effect. The predictions for the remaining 6 studies are ambiguous.
From Figure 1, we observe that differences in effect sizes among the studies are remarkably consistent when considering the environmental factors of each study as described in Table 1. For example, when comparing study 1 – 4, the effect size of the locus decreases in both the male and female HMDPxB studies in the chow diet (chow study) relative to the fat diet (ath study). Thus we can see that when the mice have Leiden/CETP transgene, which cause high total cholesterol level and high LDL cholesterol level, effect size of this locus on HDL cholesterol level in blood is affected by the fat level of diet. Similarly, when comparing study 12 – 15, the knockout of the Apoe gene affects the effect sizes for both male and female BxH crosses. However, in the BxD cross (study 8 – 11), where each animal is homozygous for a mutation causing a deficiency of the leptin receptor, the effect of the locus is very strong in the young male animals, while as animals get older and become fatter, the effect becomes weaker. However in the case of female mice, the effect of the locus is nearly absent at both 5 and 12 weeks of age. Thus we can see that sex plays an important role in affecting HDL when the leptin receptor activity is deficient .

The full citation of our paper is:

Sorry, no publications matched your criteria.

Bibliography