HMMER, reloaded

December 14th, 2008

A joy it will be one day,
to remember even this.
Through so many hard straits, so many
twists and turns,
our course holds firm for Latium.
There fate holds out
a homeland, calm, at peace.
There the gods decree
the kingdom of Troy will rise again.
Bear up.

- Virgil, The Aeneid

Four years ago today, I started from a clean slate on a new version of HMMER, HMMER 3.0. It’s been gestating longer than I’d planned, but at long last it’s almost ready. After the Christmas holidays, the first alpha test versions will be released.

It’s far from actually done, mind you. But it’s about ready to be tested.

The idea that drove me to spend the last four years rewriting HMMER from scratch (again) is pretty simple. BLAST has been the workhorse of computational molecular biology for almost twenty years. The theoretical foundation of biological sequence analysis has been greatly improved since BLAST was written, because of the advent of probabilistic modeling methods such as hidden Markov models. We even wrote a book about these methods; Biological Sequence Analysis is now a decade old. But people still use BLAST; I still use BLAST. Why is that? Theoretical advances in a computational science are all well and good, but someone has to write a practical implementation if these advances are going to be widely used. BLAST is a damn fine implementation: fast, robust, and beautifully supported by NCBI. There are reasonable implementations of HMM methods, including HMMER2 and the UCSC SAM package, but mostly because they’re so much slower than BLAST, they have stayed in a different niche, both in how they’re used (for profile searches in protein domain databases such as Pfam and SMART) and in terms of mind share. HMMs are still seen as a black box, not as an improved statistical foundation for all of sequence analysis. David Lipman once commented that the only thing that made HMMs interesting was their name: there’s something hidden, and a Russian is involved.

So: all this gorgeous theory was languishing, and it was starting to piss me off.

So: HMMER3’s ever-so-modest goal is to compete with BLAST.

The most immediately visible change in HMMER3 is that HMMER is now about as fast as BLAST. We’ve got a spiffy new acceleration algorithm, a little less heuristic than BLAST’s and a little more suited to parallelization on modern hardware. We will be able to wring even more speed out of it as our initial implementation improves, and I’m expecting we should be able to make it even faster than BLAST by the time we’re done.

The next most visible change in HMMER3 is that it doesn’t calculate optimal alignment scores; instead it calculates theoretically more powerful log-likelihood ratios that sum over the uncertainty of any particular alignment. I’ll expand on this more in future posts (and papers), but the main idea is that optimal alignment scores are the wrong score to use; they’re an approximation that works OK when alignments are pretty certain, but the approximation breaks down and optimal alignment scores lose resolving power when we’re comparing remote homologs. HMMER3 is using full probabilistic inference; it calculates the entire ensemble of possible alignments and reports confidence values (posterior probabilities) on every aligned residue. None of this is new! Theoretically speaking, we’ve known this is the Right Thing To Do since the 1990s. The UCSC SAM implementation has always run the Forward algorithm, and that’s why it’s even slower than HMMER2. Now in H3, we’re doing the correct probabilistic inference calculations in the context of a fast, practical codebase. Our acceleration algorithms are sufficiently good that we can finally afford to deploy the expensive heavy artillery of probabilistic inference, in a way that you won’t notice the cost; your results will snap back about as fast as BLAST ever did.
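To make the difference between an optimal alignment score and a score summed over all alignments concrete, here is a minimal sketch on a toy two-state HMM. It is not HMMER3’s code and not a profile HMM; the model, its probabilities, the uniform null model, and the names `score`, `logsumexp`, and `log_table` are all assumptions made up for illustration. The only point is that the same dynamic programming recursion gives the single best path score (Viterbi) when it takes a max, and the score over the whole ensemble of paths (Forward) when it takes a sum.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def score(obs, log_init, log_trans, log_emit, combine):
    """Generic HMM recursion in log space.

    combine=max        -> Viterbi: log probability of the single best path.
    combine=logsumexp  -> Forward: log probability summed over all paths.
    """
    n_states = len(log_init)
    prev = [log_init[k] + log_emit[k][obs[0]] for k in range(n_states)]
    for x in obs[1:]:
        prev = [
            log_emit[k][x]
            + combine([prev[j] + log_trans[j][k] for j in range(n_states)])
            for k in range(n_states)
        ]
    return combine(prev)


def log_table(rows):
    """Convert a table of probabilities to natural-log probabilities."""
    return [[math.log(p) for p in row] for row in rows]


# Hypothetical toy model: two hidden states, two observable symbols (0 and 1).
LOG_INIT = [math.log(0.5), math.log(0.5)]
LOG_TRANS = log_table([[0.9, 0.1], [0.2, 0.8]])
LOG_EMIT = log_table([[0.8, 0.2], [0.3, 0.7]])

obs = [0, 1, 1, 0, 1]
viterbi = score(obs, LOG_INIT, LOG_TRANS, LOG_EMIT, max)
forward = score(obs, LOG_INIT, LOG_TRANS, LOG_EMIT, logsumexp)

# Assumed null model: each symbol drawn independently with probability 0.5.
null = len(obs) * math.log(0.5)

print("Viterbi log-odds (best path only):", viterbi - null)
print("Forward log-odds (all paths):     ", forward - null)
```

The Forward score is never lower than the Viterbi score, because the best path is just one term in the sum. When one path dominates, the two are close; when many paths are about equally plausible, as with remote homologs, the gap widens, and that gap is the signal an optimal alignment score throws away.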

The result is sequence database searching that’s much more powerful than BLAST, or even HMMER2, but at BLAST’s speed. At least in our internal benchmarks so far, the new power in HMMER3 is dramatic.

I’ll talk more about what’s new in HMMER3 in the coming weeks, as we prepare to roll out the alpha test versions, and start laying a road map for public release.

9 responses

  1. Blogging Professors: Big Boffins with Blogs « O’Really? pings back:

    [...] Title to be announced [...]

  2. Linkfest - Dec 15, 2008 : business|bytes|genes|molecules pings back:

    [...] Let’s start with the news that Sean Eddy has started blogging. Cryptogenomicon is a suitably geeky name for a blog I am looking forward to reading (Sean’s obviously a Neal Stephenson fan). He brings the blog to life by posting about the program he is best known for, HMMER. Apparently HMMER3 has been written from the ground up, and has a clear goal in mind; to compete with BLAST. Read all about what to expect in HMMER3. [...]

  3. Stinus Lindgreen comments:

    This sounds very cool! Can’t wait to try it out when the alpha is released.

    Best,
    Stinus

  4. Kyle Ellrott comments:

    Keep up the good work. I’m looking forward to seeing that alpha as well.

  5. Khader Shameer comments:

    Dear Eddy,

    Happy to see your blog.
    Looking forward for the Reloaded-HMMER.
    Hope my web apps that use HMMER-programs will run faster than ever before :) Great !

    K. Shameer

  6. Andreas Wallberg comments:

    Looking forward to give it a spin. Will it be possible to produce alignments or sets of alignments that describe alignment uncertainty from unaligned (nucleotide or aa ) sequences?

  7. Sean Eddy comments:

    It’s going to show alignment uncertainty on all the alignments it produces. That’s deeply intrinsic to how it’s doing the inference now. Implementation-wise, the first useful applications will only be for protein (DNA is going to require more work; hopefully DNA applications will appear by late 2009), either for profile alignments or pairwise alignments. True de novo multiple alignment of initially unaligned sequences (as opposed to aligning a bunch of sequences to a given profile) may or may not come; we will be studying the excellent multiple alignment programs that already exist, to see if we can outperform them, but HMMER has generally taken the view that de novo multiple alignment is a separable problem.

  8. Ryan Richt comments:

    Woohoo! Obviously alluded to in the last super-awesome Eddy paper about forward scores, but the insane performance optimizations are quite a surprise!

    Will HMMER3 use multi-core parallelization, OpenCL/CUDA parallelization, the PVM network parallelization or some combination?

    For next-gen sequencing applications we are still at the point where even WU-BLAST (now AB-BLAST http://www.advbiocomp.com/) is orders of magnitude too slow for aligning the 12 giga-bases of sequence from a single paired-end Illumina or SOLiD run.

    Are there any applications that would benefit from aligning each read separately using HMMER3 instead of some consensus version? Maybe ChIP-seq where binding motifs are what we are searching for? Or allele specific expression of RNA-seq where a consensus masks important information in diploid and polyploid genomes? Could HMMER ever be fast enough to align 200 million 100bp reads (one Illumina PE run a few months from now) in a reasonable amount of time on a modest number of processors to Pfam for “meta-genomic community as a bag-of-genes” type work? HMMER4?

  9. Martin Gollery comments:

    Will the input and output formats be identical to v2.3.2, or will we have to rewrite the parsers?
