HMMER 3 alpha test 2 release (3.0a2)

March 13th, 2009

hmmer-154x184

A new version of the HMMER3 alpha test code is available as a tarball available for download on our FTP site. Besides the source code, a binaries/ subdirectory contains precompiled binaries for Intel/Linux systems, both 64-bit (in binaries/intel-linux-x86_64) and 32-bit (in binaries/intel-linux-ia32), and you can just copy these into some directory in your PATH if you like. (That’s all a “make install” is going to do at the moment anyway.)

The release notes for 3.0a2:

HMMER 3.0a2 release notes (HMMER3 alpha test, release 2)
http://hmmer.org/
SRE, Thu Mar 12 13:38:45 2009
________________________________________________________________

The HMMER3 alpha test period continues with this second alpha test release, 3.0a2.

The alpha1 test code proved to be reasonably stable. alpha2 introduces a couple of major necessary features that did not make it into alpha1 due to schedule slippage. Provided the new stuff works and no horrific bugs have been introduced, alpha2 will likely be the final alpha release. We plan to upgrade to beta testing stage in May 2009.

3.0a2 includes the following large changes:

  • because of a bug fix (#h38; below) the format of hmmpress’ed HMM databases has changed. Any databases you have must be hmmpress’ed again.
  • The jackhmmer program (for iterative profile search) is now working, documented, and supported. We have not yet done any systematic internal benchmarking of its performance, but from anecdotal testing, we believe it to be performing reasonably well.
  • All the search programs (hmmsearch, hmmscan, phmmer, and jackhmmer) can now be called on multiple queries, rather than only single queries. For example, you can give hmmscan a FASTA file containing several protein sequences, and it will run searches on each one in turn.
  • All programs that use stochastic simulations now seed their random number generator(s) in a reproducible fashion by default, with sufficient granularity that a “work unit” (one calibrated HMM, or one comparison of an HMM to a sequence) will always gives reproducible results across runs, even in different programs (hmmscan vs. hmmsearch) or in different query or target files. The code refers to this as “suppression of run-to-run variability”. Each program has a “–seed ” option to set a different RNG seed, where “–seed 0″ is a special case that randomizes the seed and does a ‘proper’ stochastic simulation where run-to-run variability is expected.

Other changes include:

  • All search programs have “inclusion thresholds” in addition to “reporting thresholds”. Inclusion thresholds are controlled by options –incE, –incT, –incdomE, –incdomT, –inc_ga, –inc_nc, –inc_tc. Inclusion thresholds determine what matches are considered to be significant enough to warrant automatically calling the match “true” and homologous, as opposed to reporting thresholds which determine which matches are shown in output (typically reaching down into the top of the noise of false, nonhomologous hits). Inclusion thresholds control what goes into an output alignment (new -A option of hmmsearch, phmmer, jackhmmer), and most importantly, control what goes into each subsequent iteration of a jackhmmer run.
  • hmmsearch, phmmer, jackhmmer now have -A option to output an alignment of all significant hits (above inclusion thresholds).
  • hmmbuild now has a -O option to resave the alignment that it actually built the model from, as opposed to the input alignment; the alignment gets named if it didn’t already have one, sequences are weighted, a #=GC RF line is assigned to mark consensus/nonconsensus columns, and some residues may be slightly shifted to accommodate Plan7 architecture constraints that prohibit delete-insert and insert-delete transitions.
  • All four search programs are refactored, and are now based on a common internal implementation, controlled by a shared/consistent set of options. This means that some options have been changed or renamed; doublecheck -h for each program if you have a question about legal options.
  • the format of the statistics output at the end of the four search programs has changed slightly; doublecheck any parsers you may have written.
  • Automated test coverage has increased.
  • in Easel, the esl_stretchexp unit test is no longer subject to “normal” (stochastic) failures.
  • the configuration and Makefiles have been improved, so a “make” anywhere in HMMER’s subdirectory tree should rebuild all dependencies appropriately.

Numbered bugs fixed:

  • #h35 jackhmmer crashed because code wasn’t done.
  • #h36 hmmscan gave error message of (null).
  • #h37 hmmfetch failed on hmmpress’ed HMM files.
  • #h38 hmmscan failed to report some hits (rarely).
  • #h39 hmmsearch: esl_sqfile_Position() error when db is an -MSA
  • #h40 different runs of same program showed different results
  • #h41 hmmbuild shows duplicate lines in -h section
  • #e1 esl-alistat couldn’t read Pfam seeds in SELEX/mul format
  • #h45 search programs crashed on zero length sequences

Numbered bugs still open:

  • #e2 esl-sfetch shouldn’t always require SSI index of seqfile
  • #h42 PF00031 Rubredoxin fails to identify 4 seqs in its seed
  • #h43 all programs are reported to show memory free() warnings on Intel/Mac OS/X platforms. We have not yet been able to reproduce this.
  • #h44 “make” with icc on Mac OS/X shows compiler warnings including “option -xK not supported”. These are harmless, but we should improve the autoconf code that chooses ‘optimal’ compiler optimization flags.

Other bugs fixed (unnumbered because I found them, not you):

  • a number of memory issues detected by valgrind have been fixed.
  • hmmsearch, hmmscan didn’t work at all in HMMER compiled for MPI support.
  • Easel EMBL sequence parser wasn’t getting description line

15 responses

  1. Hiroshi comments:

    I still do not know how to hmmscan with –cut_ga option. It seems to me that the GA threshold value is set 0 when –cut_ga option is taken even though it’s set in the pfam profiles. Am I doing something wrong?

  2. Hiroshi comments:

    Now I know threshold stuffs are working. Sorry for the confusion. However, I had to modify the hmmscan.c to enable the inclusion thresholds because -A option has not been implemented in the hmmscan.

  3. Sean Eddy comments:

    Hi Hiroshi,
    it took me a bit to see what you’re referring to, but now I see it.

    Inclusion thresholds and the -A option don’t make any sense in hmmscan. Inclusion thresholds only make sense when there’s going to be a new alignment created of all “significant” hits (either in an iteration of jackhmmer, or because the user asked (with -A) to get an alignment output for all hits to a profile search. Since hmmscan searches a sequence query against a profile database, there’s no meaningful sense in which all the different profile hits can be aligned to the query and represented as a sequence alignment. So there’s no -A option in hmmscan, and won’t be.

    The problem you found is that despite what I just wrote, hmmscan -h shows you that hmmscan accepts –inc_nc, –inc_tc, and –inc_ga options; moreover, all these options are configured to require a nonexistent -A option to be selected, so if you try to use them, you get a cryptic internal error from the options parser. This is just a dumb mistake on my part; I copied the options configuration over from phmmer/hmmsearch. I’ve just fixed this in the dev code. In the future, hmmscan will not have –inc_nc, –inc_tc, and –inc_ga options.

    Another note, while I’m at it: it’s unlikely that you ought to be using GA, NC, and TC cutoffs, unless you’re a real pro. These are curated score thresholds that the Pfam curators use. They are highly dependent on the version of HMMER that the curators were using when they set the thresholds. All current versions of Pfam are curated for HMMER2′s scoring system. It won’t make sense to use Pfam GA, NC, or TC thresholds with HMMER3 models. To use these thresholds, you’d need to be curating your own HMMER3-specific values. Pfam will soon switch to HMMER3, but resetting all these thresholds is one of the big jobs they have to do before they’ll make the switch.

  4. Sean Eddy comments:

    Hi Hiroshi,

    You emailed me a bug report – let me answer here on the blog so other people see the answer.

    The 3.0a2 release has a serious bug in handling sequences containing degenerate residues (such as X). 3.0a1 did not have this bug. The bug is in the parameterization of the “bias filter”, a step in the acceleration pipeline that immediately follows the “MSV filter” and precedes the “Viterbi filter”. Target sequences containing degenerate residue characters will likely be filtered out of any search (hmmscan, hmmsearch, jackhmmer, phmmer) by the bias filter. I’ve now fixed the bug in the development code, and the fix will be part of the next release.

    To fix it for yourself in the meantime, go to easel/esl_hmm.c line 112:
    use_fq = (fq == NULL) ? uniform : fq[x];
    and delete it; then go to line 121:
    denom += use_fq;
    and replace that line with:
    denom += (fq == NULL) ? uniform : fq[y];
    and recompile.

    Thanks for finding and reporting this!

  5. Sean Eddy comments:

    While I’m at it…

    The other bug reported against 3.0a2 so far is that using the -A option to jackhmmer will cause a segfault. This has also been fixed in the dev code.

    To fix it for yourself, go to src/jackhmmer.c line 421:
    ESL_MSA *msa = NULL;
    and delete it; then recompile.

    (If only more bugs could be fixed by deleting a line of code…)

  6. Ying Huang comments:

    I used hmmsearch to scan my protein fasta file, and use -o and -A to output result. Because my fasta file is too big, therefore I split it into several small files and run hmmsearch. But I got segmentation fault problem for several files. I thought it maybe caused by memory assignment, but when I reduce the file size, the problem still exist. While I search the big file, no problem was reported, therefore it’s not related to file size. When I only used -o option, everything is fine. I checked my fasta file, I found that if I have proteins hit, then -o -A can work. If there are not protein hits, -A option will fail. I think the bug may caused by analysis of hmmsearch result to generate alignment.

  7. Alex Ochoa comments:

    A small HMM text format request:

    I’ve noticed your HMM files (even since Hmmer 2) have space for secondary structure annotation per match state. I think it would make sense to add an extra space for a number (a float to be general, though I’m sure an int could do too) that indicates the confidence of this SS annotation.

    I know using secondary structure for scoring is not a design goal of Hmmer 3, but other HMM packages use it in prediction (I can only think of HHsearch, although the feacute has been on PRC’s to do list for a while). This package uses *predicted* secondary structure, hence the need for a confidence score per annotated match state. (In HHsearch, the confidence scores come from PsiPred). A related profile, but non-HMM method, PROCAIN (http://nar.oxfordjournals.org/cgi/content/abstract/gkp212v1), also uses SS prediction. I’m sure more programs will use it in the future. I think having your format be compatible with that capability would make it a much better format, with a broader use and longer life.

  8. Sean Eddy comments:

    Explain more why you think this would be useful. I’m not seeing it. HMMER3 doesn’t do anything with secondary structure annotation, so what’s the utility of adding such annotation to H3′s model files?

  9. Alex Ochoa comments:

    The short answer is: the feature would be useful for *other HMM software* that may read your HMM H3 format files.

    I realize I’ll be repeating myself, but I’ll try to be clear about what’s on my mind. H2′s model files can be used by other HMM software (like HHsearch, PRC), so I expect the new H3 format to be interpretable in the future by software other than Hmmer3 as well. I think the “secondary structure prediction confidence” it’s a natural *optional* feature to add to a format that’s likely to be useful beyond Hmmer3, although *I agree Hmmer3 has no use for it*. I’m not saying that you have to make a format that is fully compatible with other groups’ packages, since agreeing on even the most general features might be impossible. Nevertheless, the secondary structure confidence is becoming a standard feature in profile-profile comparisons, and it’s very trivial to add (just an extra optional column next to the already optional SS annotation column). I understand that the other software packages also have their own native formats and they incorporate the features they need in them, like HHsearch does for the one I propose here for H3, but it’s annoying having HMMs in so many formats if you play with all these packages (like I do), so I thought this one feature in H3 would be particularly useful towards a more loseless interconversion of HMM formats. But I suppose the multiple HMM format problem won’t go away even if this feature is incorporated into H3′s format, so I wouldn’t give it a very high priority. It’s really just a thought I wanted to share, to see if it resonated with others.

  10. Sean Eddy comments:

    OK, thanks, I understand your point now.

    If we reach the point where H3′s format is a de facto standard, I’ll be happy to take this into account and coordinate the format with other developers who want to use it for other purposes. Certainly this kind of standards-driving centrality is a part of our master plan for world domination! But I think we’re far from that point right now. I have my hands full with issues relevant to H3, and am hesitant to start taking on too many downstream issues while H3 itself is still in flux.

    It isn’t entirely trivial to add, actually; for example, we’d want to define where the number is coming from, what it means, and how it’d be used. For example, just in the “where’s it coming from” category, H3 would be presumably propagate the number it obtains from an annotated MSA. Stockholm format currently includes markup for consensus secondary structure, but not for the reliability of that annotation; that would have to be added to Stockholm format, by consensus with Erik Sonnhammer and other gatekeepers of Stockholm format. Otherwise, H3 doesn’t even have any access to the numbers you’re talking about.

    Lossless interconversion of HMM formats isn’t likely to happen, by the way. The details of profile HMM architecture matter, and there’s no reason to expect different profile HMM software packages to use exactly the same model architecture.

  11. Peter L. Williams comments:

    Fatal exception (source file p7_hmmfile.c, line 619):
    ftello() failed
    Command failed: hmmsearch -o /scratch/out –domE 0.1 -E 0.1 allNFGX_0.30_hmms.all nps100.faa

    This is using: hmmer-3.0a2

  12. Sean Eddy comments:

    That’d usually indicate a system failure rather than a failure in HMMER; ftello() is trying to determine where the disk is as it’s reading the FASTA file, and if the disk glitches (even temporarily), this could happen. If it’s not reproducible, that’s probably what happened. If it’s HMMER, most likely it would happen every time. Is it reproducible? If so, email me a reproducible test case and I’ll look into it.

  13. Alex Ochoa comments:

    A bug report and a comment/suggestion on hmmalign

    The bug. hmmalign works well in this particular example without the –trim option, but with –trim this happens. Let me know if you need the input files to reproduce it (I haven’t tried to reproduce it with different inputs, but it’s reproducible with at least these two, even “0x09a5d1b0″ is always the same number).
    $ ./hmmalign –trim AP2.hmm PLAF7.fa
    *** glibc detected *** malloc(): memory corruption: 0x09a5d1b0 ***
    Aborted

    The comment/suggestion. Is there ever going to be an hmmalign option that automatically generates a Pfam-style full alignment? I’m thinking specifically of the case of feeding a whole genome into hmmalign (not just “family members”, trivial given the next point, –cut_ga), passing a threshold (e.g. –cut_ga) so only hits that pass it are displayed (instead of the best hit for every single protein, even if it’s terribly low scoring), and lastly, to get all “domains” to be aligned (as opposed to only the best domain per full protein). The last part is probably the hardest functionality to add, but I think it would be awesome to have it. I was thinking of generating these Pfam-style “full alignments” for phylogenetic analysis of domains, and getting these alignments with a single command without additional scripting is probably practical for more researchers than just me.

  14. Sean Eddy comments:

    Thanks. The hmmalign –trim memory corruption is now fixed in dev code and the fix will appear shortly when the beta1 release comes out in a few days. (It was trying to write posterior probability annotation to residues that were trimmed by the –trim option.)

    As for getting Full alignments in one step, you can do that now — the -A option to phmmer, hmmsearch, and jackhmmer will save the alignment of all the domains that pass “inclusion” thresholds. To get the inclusion thresholds to use the built-in Pfam GA thresholds, use –inc_ga. For example:

    % hmmsearch –inc_ga -A mydomain-full.sto mydomain.hmm uniprot.fa

  15. Kostas Tsirigos comments:

    Hello,
    I wanted to ask if there is a change in the way we used to search Pfam locally, by using the –cut_tc option in hmmpfam. I saw that there is also this option in the new hmmscan program, but, when I based my run on the TCs of the models I had, I got hits that didn’t actually had scores bigger than the TCs of the models.

    The command I ran was:
    hmmscan –cut_tc DB.hmm test.fasta

    and the top of my output file is:

    # hmmscan :: search sequence(s) against a profile HMM database
    # HMMER 3.0b2 (June 2009); http://hmmer.org/
    # Copyright (C) 2009 Howard Hughes Medical Institute.
    # Freely distributed under the GNU General Public License (GPLv3).
    # – - – - – - – - – - – - – - – - – - – - – - – - – - – - – - – - – - – -
    # query sequence file: test.fasta
    # target HMM database: DB.hmm
    # set reporting thresholds to: TC cutoffs
    # – - – - – - – - – - – - – - – - – - – - – - – - – - – - – - – - – - – -

    and for example, while for the model MOSP_C, the TCs in the model are: TC 25.00 26.90;

    I got a hit like the following:

    Query: A0B431_BURCH [L=212]
    Accession: A0B431
    Description: SubName: Full=Transcriptional regulator, TetR family;
    Scores for complete sequence (score includes all domains):
    — full sequence — — best 1 domain — -#dom-
    E-value score bias E-value score bias exp N Model Description
    ——- —— —– ——- —— —– —- — ——– ———–
    0.00031 13.0 0.0 0.00076 11.7 0.0 1.6 1 MOSP_C Major Outer Sheath Protein

    How is it possible?

    thank you.

Leave a comment