Compression-accelerated BLAST and BLAT

Illustration by Steven H. Lee. Thanks also to Leslie Gaffney, Broad Institute.

The past two decades have seen an exponential increase in sequencing capabilities, outstripping advances in computing power. Extracting new insights from the data sets currently being generated will require not only faster computers but also smarter algorithms. However, most genomes currently sequenced are highly similar to ones already collected; thus, the amount of novel sequence information is growing much more slowly than the total volume of sequence data.

We show that this redundancy can be exploited by compressing data in a way that allows direct computation on the compressed data. This approach reduces the computational task of operating on many highly similar genomes to only slightly more than that of operating on just one. We demonstrate this compressive architecture by implementing accelerated versions of both BLAST and BLAT, and emphasize how compressive genomics, more generally, will enable biologists to keep pace with the growth of sequence data.
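To make the idea of computing directly on compressed data concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not the CaBLAST or CaBLAT implementation: it handles only exact matches, splits sequences into fixed-size overlapping blocks, and all names (`compress`, `decode`, `search`, the `block` and `k` parameters) are hypothetical. Each distinct block is stored once; a coarse step scans only the unique blocks for a seed, and a fine step verifies each candidate in the genomes that share that block.

```python
from collections import defaultdict


def compress(genomes, block=12, k=4):
    """Store each distinct block once and record every place it occurs."""
    step = block - k + 1              # blocks overlap by k-1 so no k-mer straddles a boundary
    ids = {}                          # block text -> block id
    occurrences = defaultdict(list)   # block id -> [(genome index, start offset)]
    layouts = []                      # per genome: ordered block ids, used for decoding
    for g, seq in enumerate(genomes):
        layout = []
        for start in range(0, len(seq), step):
            bid = ids.setdefault(seq[start:start + block], len(ids))
            occurrences[bid].append((g, start))
            layout.append(bid)
        layouts.append(layout)
    return {"blocks": list(ids), "occurrences": occurrences,
            "layouts": layouts, "step": step, "k": k}


def decode(db, g):
    """Rebuild genome g from its block ids (assumes a non-empty sequence)."""
    blocks, step = db["blocks"], db["step"]
    layout = db["layouts"][g]
    return "".join(blocks[b][:step] for b in layout[:-1]) + blocks[layout[-1]]


def search(db, query):
    """Return sorted (genome index, position) pairs where query occurs exactly."""
    k = db["k"]
    if len(query) < k:
        raise ValueError("query must be at least k characters long")
    seed, hits = query[:k], set()
    for bid, text in enumerate(db["blocks"]):        # coarse step: unique blocks only
        p = text.find(seed)
        while p != -1:
            for g, start in db["occurrences"][bid]:  # expand the hit to every genome sharing the block
                pos = start + p
                if decode(db, g).startswith(query, pos):  # fine step: verify the full query
                    hits.add((g, pos))
            p = text.find(seed, p + 1)
    return sorted(hits)


# Hypothetical example data: three near-identical sequences.
genomes = [
    "ACGTTAGCCGATAGGCTTACGATCGATTGCA",
    "ACGTTAGCCGATAGGCTTACGATCGATTGCA",   # identical to the first
    "ACGTTAGCCGATAGGCTTACGTTCGATTGCA",   # one substitution
]
db = compress(genomes)
print(len(db["blocks"]), "unique blocks stored")
print(search(db, "GATAGGCT"))
```

Because the coarse scan touches each distinct block only once, its cost tracks the amount of novel sequence rather than the total number of genomes. For brevity the fine step decodes a whole genome per candidate; a practical design would decode only the candidate region.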

Source Code

We have implemented two prototype algorithms that demonstrate the compressive genomics paradigm: Compression-accelerated BLAST (CaBLAST) and Compression-accelerated BLAT (CaBLAT). These algorithms serve as a proof of concept that computationally aware compression not only reduces storage space but also accelerates analysis (in this case, sequence search).

Our source code can be downloaded here for academic and non-profit use:

Note that our current implementations are prototypes; we anticipate that to achieve optimal performance, the code will need to be tailored to match the particular engineering trade-offs that arise in real-world applications. The code in its current form is intended primarily as a resource for developers interested in adapting it for specific applications (or working with us to build it into "industrial-strength" software for general use by practitioners).

For a detailed description of the algorithms and discussion of relevant implementation trade-offs, please see the Supplementary Methods of our article "Compressive genomics" in Nature Biotechnology, July 2012.

Contact

We welcome feedback, questions and suggestions. Contact information is available at the authors' websites: Po-Ru Loh, Michael Baym, Bonnie Berger.

Referencing CaBLAST/CaBLAT

If you use CaBLAST or CaBLAT, please reference the following: