Illustration by Steven H. Lee. Thanks also to Leslie Gaffney, Broad Institute.
The past two decades have seen an exponential increase in sequencing capabilities, outstripping advances in computing power. Extracting new insights from the data sets currently being generated will require not only faster computers but also smarter algorithms. However, most genomes currently sequenced are highly similar to ones already collected; thus, the amount of novel sequence information is growing much more slowly than the raw volume of sequence data.
We show that this redundancy can be exploited by compressing data in a way that allows direct computation on the compressed data. This approach reduces the computational task of operating on many highly similar genomes to only slightly more than that of operating on just one. We demonstrate this compressive architecture by implementing accelerated versions of both BLAST and BLAT, and emphasize how compressive genomics, more generally, will enable biologists to keep pace with current data.
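As a rough illustration of computing directly on compressed data, the sketch below stores each genome only as its substitutions against a shared reference, searches the reference once, and then re-examines only the windows that touch a difference. This is a toy exact-match model written for this page, not the CaBLAST or CaBLAT algorithm; the function names (`find_all`, `compress`, `search_compressed`) and the substitution-only encoding are assumptions made for illustration.

```python
from typing import Dict, List


def find_all(text: str, pattern: str) -> List[int]:
    """Return every start index at which pattern occurs in text (overlaps allowed)."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def compress(reference: str, genome: str) -> Dict[int, str]:
    """Encode a same-length genome as {position: base} substitutions (toy assumption)."""
    if len(genome) != len(reference):
        raise ValueError("this toy model handles substitutions only")
    return {i: b for i, (a, b) in enumerate(zip(reference, genome)) if a != b}


def search_compressed(
    reference: str, diffs_by_genome: Dict[str, Dict[int, str]], pattern: str
) -> Dict[str, List[int]]:
    """Find exact matches in every genome without fully decompressing any of them."""
    m = len(pattern)
    ref_hits = find_all(reference, pattern)  # the expensive scan, done once
    results = {}
    for name, diffs in diffs_by_genome.items():
        # Reference hits whose window contains no substitution carry over unchanged.
        hits = {h for h in ref_hits if not any(h <= p < h + m for p in diffs)}
        # Only windows overlapping a substitution need to be re-checked.
        for p in diffs:
            lo, hi = max(0, p - m + 1), min(len(reference), p + m)
            local = "".join(diffs.get(i, reference[i]) for i in range(lo, hi))
            hits.update(lo + h for h in find_all(local, pattern))
        results[name] = sorted(hits)
    return results


if __name__ == "__main__":
    reference = "ACGTACGTAC"
    genomes = {"g1": "ACGTACGTAC", "g2": "ACGTTCGTAC"}
    diffs = {name: compress(reference, g) for name, g in genomes.items()}
    print(search_compressed(reference, diffs, "ACG"))
```

In this sketch the per-genome cost depends on the number of differences rather than on genome length, which is the sense in which searching many near-identical genomes costs only slightly more than searching one.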
We have implemented two prototype algorithms that demonstrate the compressive genomics paradigm: Compression-accelerated BLAST (CaBLAST) and Compression-accelerated BLAT (CaBLAT). These algorithms serve as a proof of concept that computationally aware compression not only reduces storage space but also accelerates analysis (in this case, sequence search).
Our source code can be downloaded here for academic and non-profit use:
- cablast-1.1.tar.gz is the release version of CaBLAST. It does not rely on the NCBI toolkit and is also available on GitHub.
- cast_v0.9.tar.gz is the publication version of CaBLAST and CaBLAT. It relies on the NCBI toolkit.
For a detailed description of the algorithms and discussion of relevant implementation trade-offs, please see the Supplementary Methods of our article "Compressive genomics" in Nature Biotechnology, July 2012.
If you use CaBLAST or CaBLAT, please reference the following:
- Loh P-R, Baym M, Berger B. Compressive genomics. Nature Biotechnology, Volume 30 Number 7, July 2012.