Our group has recently published several papers on sequencing using DNA pools. These include two methods for obtaining genotypes from pools (10.1186/1471-2105-12-S6-S2), (10.1109/ACSSC.2012.6489173), a method for correcting for errors introduced when mixing the DNA into pools (10.1007/978-3-642-37195-0_4), and a method for performing association testing for rare variants when the sequence data are collected using pools (10.1534/genetics.113.150169).
High-throughput sequencing (HTS) technology has tremendously decreased the cost of sequencing one individual in the past few years. However, to perform genome-wide association studies (GWAS), we need to collect large cohorts of individuals who have the disease (called cases) and individuals who do not (called controls). Unfortunately, performing whole-genome sequencing for large cohorts is still very expensive.
The actual cost of sequencing a sample consists of two parts. The first part is the cost of preparing a DNA sample for sequencing, which is referred to as the library preparation cost. Library preparation is also the most labor-intensive part of a sequencing study. The second part is the cost of the actual sequencing, which is proportional to the amount of sequence collected and which we refer to as the per-base sequencing cost. Technological advances are rapidly reducing the per-base cost of sequencing, while library preparation costs are more stable (Figure 1).
Erlich et al. (10.1101/gr.092957.109) introduced the concept of DNA pooling. The basic idea behind this approach is that DNA from multiple individuals is pooled together into a single DNA mixture, which is then prepared as a single library and sequenced. In this approach, the library preparation cost is reduced because one library is prepared per pool instead of one library per sample.
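The cost argument above can be made concrete with a small sketch. It assumes the simplest possible model: each individual is placed in exactly one pool, every library costs the same to prepare, and the total amount of sequence collected per individual is unchanged by pooling. The function name `study_cost` and all of the numbers are hypothetical and chosen only for illustration.

```python
import math


def study_cost(n_individuals, pool_size, library_prep_cost,
               per_base_cost, bases_per_individual):
    """Total cost of a study under a simple two-part cost model.

    Assumptions (not from the papers): one library per pool, each
    individual in exactly one pool, and the same amount of sequence
    collected per individual regardless of pooling.
    """
    # One library is prepared per pool; pool_size=1 means no pooling.
    n_libraries = math.ceil(n_individuals / pool_size)
    prep_cost = n_libraries * library_prep_cost
    # The sequencing cost is proportional to the amount of sequence collected.
    sequencing_cost = n_individuals * bases_per_individual * per_base_cost
    return prep_cost + sequencing_cost


# Hypothetical numbers: 1000 individuals, 100 cost units per library,
# 2 cost units per base, 10 bases per individual.
unpooled = study_cost(1000, 1, 100, 2, 10)
pooled = study_cost(1000, 10, 100, 2, 10)
print("unpooled:", unpooled, "pooled:", pooled, "saved:", unpooled - pooled)
```

Under these assumptions, the entire saving comes from the library preparation term, since the sequencing term is identical in both designs.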
Pooling methods can be split into two categories. The first category puts each individual in only one pool, and each pool consists of a fixed number of individuals. These methods are referred to as non-overlapping pool methods. The second category puts each individual in multiple pools and uses this information to recover each individual’s genotype. These methods are referred to as overlapping pool methods.
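The two categories can be illustrated with a toy sketch. The overlapping design below is a simple row/column grid, chosen here only because it is easy to read; it is an assumption for illustration and not the design used in any of the cited papers. The function names are hypothetical. The decoding step assumes a noise-free setting in which a pool is "positive" exactly when it contains a carrier of the rare variant.

```python
def non_overlapping_pools(n_individuals, pool_size):
    """Each individual appears in exactly one pool of (at most) pool_size."""
    return [list(range(start, min(start + pool_size, n_individuals)))
            for start in range(0, n_individuals, pool_size)]


def grid_overlapping_pools(n_rows, n_cols):
    """Toy overlapping design: individuals sit on a grid, and each one is
    placed in its row pool and its column pool (two pools per individual)."""
    row_pools = [[r * n_cols + c for c in range(n_cols)] for r in range(n_rows)]
    col_pools = [[r * n_cols + c for r in range(n_rows)] for c in range(n_cols)]
    return row_pools + col_pools


def candidate_carriers(pools, positive_pools):
    """Individuals all of whose pools tested positive for the variant."""
    membership = {}
    for pool_index, pool in enumerate(pools):
        for individual in pool:
            membership.setdefault(individual, set()).add(pool_index)
    positive = set(positive_pools)
    return sorted(ind for ind, in_pools in membership.items()
                  if in_pools <= positive)


# 12 individuals, individual 6 carries a rare variant.
overlapping = grid_overlapping_pools(3, 4)   # 3 row pools + 4 column pools
positive = [i for i, pool in enumerate(overlapping) if 6 in pool]
print(candidate_carriers(overlapping, positive))      # one candidate remains

disjoint = non_overlapping_pools(12, 4)      # 3 pools of 4
positive = [i for i, pool in enumerate(disjoint) if 6 in pool]
print(candidate_carriers(disjoint, positive))         # the whole pool remains
```

With a single carrier, the overlapping design narrows the candidates to one individual, while the non-overlapping design can only say which pool contains the carrier.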
Many studies (10.1101/gr.088559.108), (10.1093/nar/gkq675), (10.1186/1471-2105-12-S6-S2) have shown that, using overlapping pools, we can recover rare SNPs with high accuracy. In our work, we developed two methods to detect the genotypes of both rare and common variants from pooled sequencing (10.1109/ACSSC.2012.6489173). The idea is to take advantage of genotypes on a subset of the variants, which are often available for these cohorts. Both methods tend to have better accuracy than imputation, which is the standard approach for predicting the genotypes of variants that were not collected.
Pooling has been successful at detecting rare variants, which is the main reason many GWAS have used pooling to detect rare causal SNPs ((10.1101/gr.094680.109), (10.1038/ng.952)). However, all these methods assume that all individuals have the same abundance level in the pool. The abundance level of an individual is the fraction of the reads in a pool that originated from that individual. We show in our paper (10.1007/978-3-642-37195-0_4) that this simple assumption does not hold, and that ignoring differences in abundance levels between individuals can lead to spurious associations. In our paper, we describe a probabilistic model that can estimate the abundance levels of individuals when genotype data on a subset of the variants are available. Furthermore, we extend the model to the case where the genotype of one of the individuals is missing. We also show that leveraging the linkage disequilibrium (LD) pattern decreases the error rate.
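To see why unequal abundance levels matter, consider a minimal sketch. It is not the probabilistic model from our paper; it only computes the expected fraction of reads carrying the alternate allele at one variant, assuming diploid genotypes coded as 0, 1, or 2 copies of the alternate allele and no sequencing error. The function names and the read counts are hypothetical.

```python
def abundance_levels(read_counts):
    """Fraction of the reads in a pool that originated from each individual."""
    total = sum(read_counts)
    return [count / total for count in read_counts]


def expected_alt_fraction(genotypes, abundances):
    """Expected fraction of reads carrying the alternate allele at one variant.

    genotypes: copies of the alternate allele per individual (0, 1, or 2).
    abundances: abundance level per individual; assumed to sum to 1.
    Assumes no sequencing error (an illustrative simplification).
    """
    return sum(a * g / 2 for a, g in zip(abundances, genotypes))


# Four individuals in a pool; only the first is homozygous for the variant.
genotypes = [2, 0, 0, 0]

# Under the equal-abundance assumption each individual contributes 1/4.
equal = [1 / len(genotypes)] * len(genotypes)

# Hypothetical read counts in which the first individual is over-represented.
unequal = abundance_levels([400, 200, 200, 200])

print("assumed equal abundance:", expected_alt_fraction(genotypes, equal))
print("actual unequal abundance:", expected_alt_fraction(genotypes, unequal))
```

In this toy case the observed allele fraction is noticeably higher than what the equal-abundance assumption predicts. If such differences are ignored and happen to differ between case pools and control pools, the inflated or deflated frequency estimates could be mistaken for a real association.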
Finally, in another recent paper (10.1534/genetics.113.150169), we extend methods for implicating rare variants in disease to data collected using DNA sequencing pools.
The full citations of our four papers are below.
Rare Variant Association Testing Under Low-Coverage Sequencing. Journal Article
In: Genetics, 2013, ISSN: 1943-2631.
Research in Computational Molecular Biology, Tel-Aviv University Springer Berlin Heidelberg, 2013.
2012 Conference Record of the Forty Sixth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), IEEE, 2012, ISBN: 978-1-4673-5051-8.
In: BMC Bioinformatics, 12 Suppl 6, pp. S2, 2011, ISSN: 1471-2105.