HapIso: An Accurate Method for the Haplotype-Specific Isoforms Reconstruction from Long Single-Molecule Reads

Recent advances in RNA sequencing technology can generate deep coverage data containing millions of reads. RNA-Seq data are used to identify genetic variants and alternatively spliced isoforms that may play a role in heritable traits and diseases; alternative splicing is a common mechanism for generating diversity from a single gene. Using this type of data, connections can be drawn between gene expression and one of the two parental haplotypes of a diploid organism. In other words, we can potentially identify the parent from which an individual inherited a group of genes.

Long single-molecule reads are multi-kilobase in length, longer than most transcripts, and enable sequencing of complete haplotype isoforms. New computational methods are required for efficient analysis of this highly complex data. In a recent paper, we present HapIso (Haplotype-specific Isoform Reconstruction), a comprehensive method that can accurately reconstruct the haplotype-specific isoforms of a diploid cell. HapIso is the first method capable of reconstructing the haplotype-specific isoforms from long single-molecule reads.

HapIso uses splice mapping of long single-molecule reads to partition reads into two parental haplotypes. The single molecule reads entirely span the RNA transcripts and bridge the single nucleotide variation (SNV) loci across a single gene. To overcome gapped coverage and splicing structures of the gene, the haplotype reconstruction procedure is applied independently to regions of contiguous coverage that have been defined as transcribed segments. Restricted reads from the transcribed regions are partitioned into two local clusters using the 2-mean clustering. Using the linkage provided by the long single-molecule reads, we connect the local clusters into two global clusters. An error-correction protocol is then applied for the reads from the same cluster.
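The local clustering step can be illustrated with a toy example. The following is a minimal sketch, not the HapIso implementation: it assumes each read within one transcribed segment has already been reduced to its alleles at the segment's SNV loci (0 for reference, 1 for alternate, `None` where the read does not cover the locus). The names `mismatch_rate`, `update_center`, and `two_means`, the seeding rule, and the distance measure are all illustrative assumptions.

```python
from typing import List, Optional, Tuple

# One read = its allele at each SNV locus of a segment:
# 0 = reference, 1 = alternate, None = locus not covered by the read.
Read = List[Optional[int]]


def mismatch_rate(read: Read, center: List[float]) -> float:
    """Mean absolute difference between a read and a center over covered loci."""
    covered = [(a, c) for a, c in zip(read, center) if a is not None]
    if not covered:
        return 0.0
    return sum(abs(a - c) for a, c in covered) / len(covered)


def update_center(reads: List[Read], n_loci: int, fallback: List[float]) -> List[float]:
    """Per-locus mean allele over the reads that cover the locus."""
    center = []
    for j in range(n_loci):
        alleles = [r[j] for r in reads if r[j] is not None]
        center.append(sum(alleles) / len(alleles) if alleles else fallback[j])
    return center


def two_means(reads: List[Read], max_iter: int = 50) -> Tuple[List[int], List[List[int]]]:
    """Split reads into two clusters and return (labels, consensus haplotypes)."""
    n_loci = len(reads[0])
    # Deterministic seeding (an assumption): the first read and the read farthest from it.
    seed_a = [0.5 if a is None else float(a) for a in reads[0]]
    far = max(reads, key=lambda r: mismatch_rate(r, seed_a))
    seed_b = [0.5 if a is None else float(a) for a in far]
    centers = [seed_a, seed_b]
    labels = [-1] * len(reads)
    for _ in range(max_iter):
        # Assignment step: each read goes to the closer center.
        new_labels = [
            0 if mismatch_rate(r, centers[0]) <= mismatch_rate(r, centers[1]) else 1
            for r in reads
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: recompute each center from its assigned reads.
        centers = [
            update_center([r for r, lab in zip(reads, labels) if lab == k], n_loci, centers[k])
            for k in (0, 1)
        ]
    # Majority vote per locus gives a consensus haplotype for each cluster.
    haplotypes = [[1 if c > 0.5 else 0 for c in center] for center in centers]
    return labels, haplotypes


# Toy segment with five SNV loci; two reads carry a single sequencing error.
reads = [
    [0, 1, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [0, 1, None, 1, 0],
    [1, 1, 0, None, 1],
    [0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1],
]
labels, haplotypes = two_means(reads)
# Loci where the two consensus haplotypes carry different alleles.
het_loci = [j for j, (a, b) in enumerate(zip(*haplotypes)) if a != b]
print(labels, haplotypes, het_loci)
```

In this toy input the reads alternate between the two clusters, and the consensus haplotypes differ at three of the five loci. Replacing each read's alleles with its cluster consensus is one simple way to picture the error-correction step; connecting local clusters across segments is not shown here.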

Discriminating the long reads into parental haplotypes allows HapIso to accurately calculate allele-specific gene expression and identify imprinted genes. Additionally, it has the potential to improve detection of the effect of cis- and trans-regulatory changes on gene expression regulation. Long reads allow access to genetic variation in regions previously unreachable by short read protocols and potentially lead to new insights in disease heritability.

We applied HapIso to publicly available single-molecule RNA-Seq data from the GM12878 cell line and circular-consensus (CCS) single-molecule reads generated by the Pacific Biosciences platform. Our method discovered novel SNVs in regions that were previously unreachable by standard short read protocols, 53% of which follow Mendelian inheritance. HapIso detected 921 genes with both haplotypes expressed among 9,000 expressed genes. We observed 4,140 heterozygous loci corresponding to positions with non-identical alleles among inferred haplotypes. Additionally, we can theoretically identify recombinations in the transmitted haplotypes by checking the number of recombinations in the inferred haplotypes.

The open source Python implementation of HapIso was developed by Serghei Mangul and Harry (Taegyun) Yang, and the software package is freely available for download at https://github.com/smangul1/HapIso/.

This paper appears in Proceedings of the International Symposium on Bioinformatics Research and Applications (ISBRA-2016), which can be downloaded here: http://link.springer.com/chapter/10.1007%2F978-3-319-38782-6_7

Serghei Mangul and Harry Yang led this project, which involved Farhad Hormozdiari. The full citation to our paper is:

Mangul, Serghei; Yang, Harry; Hormozdiari, Farhad; Tseng, Elizabeth; Zelikovsky, Alex; Eskin, Eleazar (2016): HapIso: An Accurate Method for the Haplotype-Specific Isoforms Reconstruction from Long Single-Molecule Reads. In: Bioinformatics Research and Applications, pp. 80-92, Springer International Publishing, 2016.


Overview of HapIso.

A general framework for meta-analyzing dependent studies with overlapping subjects in association mapping

Meta-analyses of genome-wide association studies (GWASs) have become essential to identifying new loci associated with human diseases. We recently developed a novel framework that improves the accuracy and power of meta-analyses, which we describe in our recent Human Molecular Genetics paper. This framework can be applied to the fixed effects (FE) model, which assumes that effect sizes of genetic variants are constant across studies, and the random effects (RE) model, which assumes that effect sizes can be different among studies.

Almost all GWAS publications today employ meta-analysis methodologies, the majority of which assume that component studies are independent and that individuals among studies are unrelated. Yet many studies today use shared controls to reduce genotyping or sequencing cost. These “shared control” individuals can inadvertently overlap between multiple studies and, if not accounted for in the methodology, induce false associations in GWAS results. Most meta-analysis tools, including the RE model, cannot account for these overlapping subjects.
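To give a sense of how strongly shared controls can couple two studies, here is a minimal sketch that computes the correlation between the association statistics of two case-control studies from their sample counts. The formula is an approximation commonly used in the meta-analysis literature; this post does not state it, so treat both the formula and the function name `overlap_correlation` as assumptions made for illustration.

```python
import math


def overlap_correlation(cases_k: int, controls_k: int,
                        cases_l: int, controls_l: int,
                        shared_cases: int = 0, shared_controls: int = 0) -> float:
    """Approximate correlation between the statistics of studies k and l.

    Assumed formula (not given in the post):
      r = [ n_shared0 * sqrt(n_k1 * n_l1 / (n_k0 * n_l0))
          + n_shared1 * sqrt(n_k0 * n_l0 / (n_k1 * n_l1)) ] / sqrt(n_k * n_l)
    where 1 denotes cases, 0 denotes controls, and n_k, n_l are total sizes.
    """
    n_k = cases_k + controls_k
    n_l = cases_l + controls_l
    control_term = shared_controls * math.sqrt((cases_k * cases_l) / (controls_k * controls_l))
    case_term = shared_cases * math.sqrt((controls_k * controls_l) / (cases_k * cases_l))
    return (control_term + case_term) / math.sqrt(n_k * n_l)


# Two studies with 1,000 cases each that reuse the same 1,000 controls.
r_shared = overlap_correlation(1000, 1000, 1000, 1000, shared_controls=1000)

# The same two studies with no overlapping subjects.
r_none = overlap_correlation(1000, 1000, 1000, 1000)

# Sanity check: a study compared with itself (all subjects shared).
r_self = overlap_correlation(1000, 2000, 1000, 2000, shared_cases=1000, shared_controls=2000)

print(r_shared, r_none, r_self)
```

Under this approximation, two balanced studies that reuse one control group have statistics correlated at about 0.5, which is far from the independence that standard meta-analysis tools assume.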

In our paper, we propose a general framework for adjusting association statistics to account for overlapping subjects within a meta-analysis. The key idea of our method is to transform the covariance structure of the data so it can be used in methods that strictly assume independence between studies. Specifically, our method decouples dependent studies into independent studies and adjusts association statistics to account for uncertainties in dependent studies. As a result, our approach enables general meta-analysis methods, including the FE and RE models, to account for overlapping subjects. Existing pipelines implementing these models can be reused for dependent studies if our framework is applied at the front end of the analysis procedure.


A simple example of our decoupling approach. Ω and ΩDecoupled are the covariance matrices of the statistics of three studies A, B and C before and after decoupling, respectively. The thickness of the edges denotes the amount of correlation between the studies. After decoupling, the size of the nodes reflects the information that the studies contain in terms of the inverse variance.
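The decoupling idea can be sketched numerically for the FE model. This is a minimal sketch under stated assumptions, not the published implementation: it assumes the decoupled covariance is the diagonal matrix whose inverse holds the row sums of the inverse of Ω, and that the statistics are rescaled to match. The post does not give these formulas, and all function names below are hypothetical. The point of the example is that a standard independent-study FE analysis of the decoupled values reproduces the FE estimate that uses the full covariance matrix.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def invert(m: Matrix) -> Matrix:
    """Gauss-Jordan inverse with partial pivoting (fine for a handful of studies)."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fe_correlated(betas: List[float], omega: Matrix) -> Tuple[float, float]:
    """FE estimate and variance using the full covariance matrix omega."""
    inv = invert(omega)
    weights = [sum(row) for row in inv]  # row sums of the inverse covariance
    var = 1.0 / sum(weights)
    return var * sum(w * b for w, b in zip(weights, betas)), var


def decouple(betas: List[float], omega: Matrix) -> Tuple[List[float], List[float]]:
    """Return decoupled effect estimates and their (diagonal) variances."""
    inv = invert(omega)
    weights = [sum(row) for row in inv]
    if any(w <= 0 for w in weights):
        raise ValueError("this sketch needs positive row sums of the inverse covariance")
    variances = [1.0 / w for w in weights]
    new_betas = [
        v * sum(inv[i][j] * betas[j] for j in range(len(betas)))
        for i, v in enumerate(variances)
    ]
    return new_betas, variances


def fe_independent(betas: List[float], variances: List[float]) -> Tuple[float, float]:
    """Standard inverse-variance FE meta-analysis that assumes independent studies."""
    weights = [1.0 / v for v in variances]
    var = 1.0 / sum(weights)
    return var * sum(w * b for w, b in zip(weights, betas)), var


# Three studies A, B, C with standard error 0.05; A and B are correlated at 0.5.
betas = [0.20, 0.10, 0.15]
omega = [
    [0.0025, 0.00125, 0.0],
    [0.00125, 0.0025, 0.0],
    [0.0, 0.0, 0.0025],
]
full_estimate, full_var = fe_correlated(betas, omega)
dec_betas, dec_vars = decouple(betas, omega)
dec_estimate, dec_var = fe_independent(dec_betas, dec_vars)
print(full_estimate, dec_estimate, dec_vars)
```

In this example the decoupled variances of A and B come out larger than their original variances, while C is unchanged. That matches the picture in the figure: studies that share information shrink in terms of inverse variance once the correlation is removed.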

We tested our framework for accuracy and power with five simulated datasets, each containing 1000 to 5000 individuals and 10,000 shared controls. A standard approach produced an inflated number of false positives. Our decoupling method, which systematically accounts for overlapping individuals in meta-analysis, and a standard splitting method, which splits controls into individual studies, both correctly controlled for type 1 errors. The advantage of our framework is apparent when assessing power; in one scenario, we gained 25% power in accounting for overlapping subjects with the decoupling method when compared to the splitting method.

Next, we assessed the potential of our framework in identifying causal loci shared by multiple diseases and leveraging information from multiple tissues to increase power for eQTL identification. The decoupling and splitting methods controlled false-positive rates and produced significant p-values at several previously identified candidate shared loci among the three autoimmune conditions present in the Wellcome Trust Case Control Consortium (WTCCC) data. In comparison to the splitting method, our decoupling framework increased the significance of p-values in the shared loci test and increased the number of discovered eQTLs by 19%.

Our approach is flexible and allows many meta-analysis methods, such as the RE model, to account for dependency between studies and overlapping subjects. We developed this approach to complement standard software packages in the meta-analysis of GWAS. This project was led by Buhm Han and involved Dat Duong and Jae Hoon Sul. The article is available at:

The full citation to our paper is:

Han, Buhm; Duong, Dat; Sul, Jae Hoon; de Bakker, Paul; Eskin, Eleazar; Raychaudhuri, Soumya (2016): A general framework for meta-analyzing dependent studies with overlapping subjects in association mapping. In: Hum Mol Genet, 2016, ISSN: 1460-2083.

Writing Tips: Results Subsections

The purpose of a Results section is to present, without interpretation, the key results of your research. Your paper does not need to include every result you obtained during your experiments. Results are “key” when they are relevant to addressing the research questions or hypotheses presented at the beginning of your paper.

We use the Results subsections to show the reader what types of outcomes they can expect when using the methodology that we present. In our papers, we write a “Methods Overview” as the first subsection of the Results section. (We discuss writing the “Methods Overview” subsection in a previous writing tips post.) Remaining subsections in your paper’s Results section present your findings in the form of text, figures, and tables.

Each Results subsection should make a specific point, and the subsection heading should be a succinct description of this message. Effective subsection headings make a statement that communicates to the reader what the method is capable of doing or what types of data the method can be applied to. For example, in a recent paper published by our group, the heading of a subsection that demonstrates how a new GWAS approach controls for false positive results is: “Phenotype Imputation Controls Type I Error.”

Here, a two-paragraph Results subsection has a heading that tells the reader which specific type of analysis is discussed, since the paper presents a method that can be applied toward numerous different analytical tasks.

Cell type composition and diversity


We hypothesized that differences in microbial diversity may be linked to whole blood cell type composition. Since the actual cell counts were not available for these individuals, we used cell-proportion estimates derived from available DNA methylation data to test this hypothesis (Houseman et al. 2012; Aryee et al. 2014; Horvath and Levine 2015).


We assessed methylation data from 65 controls from our replication sample, and compared methylation-derived blood cell proportions to alpha diversity after adjusting for age, gender, RIN, and all technical parameters. We tested whether alpha diversity levels are associated with cell type abundance estimates. Our analysis shows one cell type, CD8+ CD28- CD45RA- cells, to be significantly negatively correlated with alpha diversity after correction for all other cell-count estimates (correlation = -0.41, P = 7.3e-4, Figure S6, Table S6). These cells are T cells that lack CD8+ naïve cell markers CD28 and CD45RA and are thought to represent a subpopulation of differentiated CD8+ T cells (Koch et al. 2008; Horvath and Levine 2015). We observed that low alpha diversity correlates with high abundance of this population of T cells.


Total RNA Sequencing reveals microbial communities in human blood and disease specific effects

Mangul, Serghei; Loohuis, Loes Olde; Ori, Anil; Jospin, Guillaume; Koslicki, David; Yang, Harry Taegyun; Wu, Timothy; Boks, Marco; Lomen-Hoerth, Catherine; Wiedau-Pazos, Martina; Cantor, Rita; de Vos, Willem; Kahn, Rene; Eskin, Eleazar; Ophoff, Roel (2016): Total RNA Sequencing reveals microbial communities in human blood and disease specific effects. In: BioRxiv, (057570), 2016.

For each subsection, we include one figure that illustrates the heading’s message. The figure’s legend (also referred to as a “caption”) can simply be the subsection heading with additional information explaining the methods and data involved in the visual output. It may be helpful to select a figure and write a legend before composing text for the subsection.

At this point, you could probably write an entire paper on each figure! In general, we limit the text in each Results subsection to one to two paragraphs. Here, we use the minimum amount of text that is necessary to walk our reader through the figure. Think about what the reader needs to know in order to start using the method for their own analysis. Relevant information includes the type of data used, analytical steps and parameters, and a summary of conclusions. In many cases, the subsection text and figure legend will be repetitive.

This one-paragraph section provides relevant results in terms of statistical parameters, numerical output, and a supplemental figure. This subsection gives the reader a good idea of what to expect if they want to incorporate this new approach in their own project.

Phenotype Imputation Controls Type I Error


We simulated datasets for multiple phenotypes under the null model where the variant we are testing has no effect (effect size of zero) toward the target phenotype. We computed the type I error under five different significance thresholds: 0.05, 0.01, 0.005, 5 × 10⁻⁶, and 5 × 10⁻⁸. We generated 100,000,000 simulated datasets that consist of 1,000 individuals. The type I error rates for our imputation method were 0.049, 0.0099, 0.00489, 4.90 × 10⁻⁶, and 4.89 × 10⁻⁸ for the significance thresholds of 0.05, 0.01, 0.005, 5 × 10⁻⁶, and 5 × 10⁻⁸, respectively. This indicates that the type I error is correctly controlled in our imputation method. The Northern Finland Birth Cohort dataset [13] was used to show that the type I error is controlled (see Figure S1). We plot the Q-Q plot of the Z-score for the imputed triglyceride (TG) phenotype from the Finland dataset. There is no inflation in the Q-Q plot as shown in Figure S1.


Imputing Phenotypes for Genome-wide Association Studies

Hormozdiari, Farhad; Kang, Eun Yong; Bilow, Michael; Ben-David, Eyal; Vulpe, Chris; McLachlan, Stela; Lusis, Aldons; Han, Buhm; Eskin, Eleazar (2016): Imputing Phenotypes for Genome-wide Association Studies. In: Am J Hum Genet, 99 (1), pp. 89-103, 2016, ISSN: 1537-6605.

Bonus challenge: After you finish writing your paper, try to remove the sentence highlighting the result’s importance from the Figure caption.

The order in which you present your results can be organized in many different ways. Typically, ordering of subsections is not important for initial manuscripts. One simple approach is to order Results subsections sequentially to support the argument that you are building in your paper.

Here, we present another example of a Results subsection, including the description of a relevant figure. The subsection heading makes it clear to the reader that this part of the paper discusses applying ForestPMPlot, a visualization tool for analyzing meta-analysis studies, to eQTL data.

Application to multi-tissue eQTL analysis


One powerful application of our proposed framework is in multi-tissue eQTL analysis in the Genotype-Tissue Expression (GTEx) project. The GTEx project studies human gene expression and genetic regulation in multiple tissues, providing valuable insights into the mechanisms of gene regulation, which can lead to the new discovery of disease-related perturbations. In this project, genetic variation between individuals will be examined for correlation with differences in gene expression level to identify regions of the genome that influence whether, and by how much, a gene is expressed. In particular, examining multiple tissues can give us valuable insights into the genetic architecture of the regulatory mechanism, because many regulatory regions are known to act in a tissue specific manner (Ernst et al. 2011; Encode Project Consortium 2012). Hence, understanding the role of regulatory variants, and the tissues in which they act, is essential for the functional interpretation of GWAS loci and insights into disease etiology.


Figure 2 is an example of the output of ForestPMPlot for a multi-tissue eQTL study for the SEMA3B gene (GTEx Consortium 2015). Examining both the forest plot and the PM-Plot allows us to obtain an insight into the tissue-specific genetic effects in eQTL analysis, which leads to the identification of three significant eQTL tissues (heart left ventricle, stomach, and thyroid). This example clearly shows that examining both the forest plot and the PM-Plot allows us to easily hypothesize that there is a specific group of studies showing tissue differences in eQTL analysis.


ForestPMPlot: A Flexible Tool for Visualizing Heterogeneity Between Studies in Meta-analysis

Kang, Eun Yong; Park, Yurang; Li, Xiao; Segrè, Ayellet; Han, Buhm; Eskin, Eleazar (2016): ForestPMPlot: A Flexible Tool for Visualizing Heterogeneity between Studies in Meta-analysis. In: G3 (Bethesda), 6 (7), pp. 1793-8, 2016, ISSN: 2160-1836.

Below, we provide examples of several different types of figures that can illustrate the point of a Results subsection.

Example of a figure and figure caption that clearly illustrate and explain significance of results in a Results subsection (Hormozdiari et al. 2016).



Example of a more complex figure and figure caption in a Results subsection, which aim to explain the advantages of a new visualization tool (Kang et al. 2016).



Example of a general schematic “Methods Overview” subsection figure in the Results section (Mangul et al. 2016).
