Join Rob’s HMMER team

November 19th, 2012

Rob Finn’s HMMER web services team is expanding. We’re looking for people to apply to two new positions to help Rob and Jody push forward on some important ideas for our services. We’re moving toward using more phylogenetic information (species trees) as we compute database homology searches and deliver the results: organizing everything on trees, rather than treating the protein database as a bag of unrelated sequences, as we (the community) have tended to do in the past. We’ll need help in three areas. On the data visualization side, we want to navigate search results organized on the tree of life. On the computing back end, we want to accelerate our searches by searching representative subsets of complete proteomes rather than “all” sequences, which will allow us to deliver fully interactive search times, measured in milliseconds. And we need help on collaborative efforts with the primary protein sequence and genome data resources, as we (the community) get our data ecosystem organized around complete annotated genomes, not individual protein sequences. The positions, written in HR-speak, are advertised on HHMI’s web site here and here.

Congratulations Dr. Eddy

November 15th, 2012

I’ve always been jealous of the Eisen brothers. Finally, some parity. Congratulations to my brother Nicholas who passed his PhD defense in synthetic organic chemistry at University of Connecticut yesterday! A momentous occasion. Now for some interdisciplinary synthetic organic computational genomics.

Biological sequence analysis and probabilistic models: 24-27 March 2013

September 24th, 2012

Registration is open for a conference on “Biological sequence analysis and probabilistic models”, 24-27 March 2013, here at Janelia Farm. Katie Pollard (UCSF), Adam Siepel (Cornell), and I are the co-organizers. Janelia Farm conferences are small (~50 people), a nice size for conversation and thought. We’re likely to select about 15 more participants from open registration. For more information, including a current list of the invited speakers and a link to registration, see the Janelia conferences web page.

St. Louis

September 17th, 2012

A friend just sent this wonderful video of my favorite city, to make me wistful. We still get our coffee shipped from Kaldi’s, our old neighborhood coffee shop. The saying in the lab was “St. Louis: better than you’d think.” Go Cards!

Here is St. Louis from Anastasis Films on Vimeo.

ENCODE says what?

September 8th, 2012

So I read in the newspaper this week that the ENCODE project has disproven the idea of junk DNA. I sure wish I’d gotten the memo, because this week a collaboration of labs led by Arian Smit, Jerzy Jurka, and me just released a new data resource that annotates nearly 50% of the human genome as transposable element-derived, and transposon-derived repetitive sequence is the poster child for what we colloquially call “junk DNA”.

The newspapers went on to say that ENCODE has revolutionized our understanding of noncoding DNA by showing that far from being junk, noncoding DNA contains lots of genetic regulatory switches. Well, that’s also odd, because another part of my lab is (like a lot of other labs in biology these days) studying the regulation of genes in a model animal’s brain (the fruit fly Drosophila). We and everyone else in biology have known for fifty years that genes are controlled by regulatory elements in noncoding DNA. (Well, I’ve only known for thirty years, not fifty, I admit — only since Mrs. Dell’Antonio kicked me out of high school biology class and gave me a molecular genetics textbook to read by myself.)

Now, with all respect to my journalist friends, I’ve learned not to believe everything I read in the newspapers. I figured I’d better read the actual ENCODE papers. This is going to take a while. I’ve only read the main Nature paper carefully so far (there are 30+ of them, apparently, across multiple journals). But it’s already clear that at least the main ENCODE paper doesn’t say anything like what the newspapers say.

The ENCODE project and our existing knowledge of genomes are both vastly more substantial than the discussion the ENCODE authors are provoking in the press right now.
Read more »

Dfam: annotation of transposable elements with profile HMMs

September 3rd, 2012

We’re happy to announce the release of Dfam 1.0, a set of profile HMMs for genomic DNA annotation of transposable elements. This essentially constitutes an upgrade of repeat element annotation from using searches with single sequence consensuses to using searches with profile HMMs, now that the HMMER3 project has made DNA/DNA profile HMM searches sufficiently fast for whole genomes. Dfam is a collaboration between Jerzy Jurka and his Repbase resources (Genetic Information Research Institute), Arian Smit and his RepeatMasker software (Institute for Systems Biology, Seattle), the HMMER3 development team at Janelia Farm (particularly Travis Wheeler, leading nhmmer development), and the Xfam database consortium (particularly Rob Finn, here at Janelia). Among other effects of this work, we expect the widely used RepeatMasker software to include nhmmer, Dfam models, and profile HMM searches in the near future. A preprint of the first Dfam paper is available now on our preprint server, and the database itself is available for use at dfam.janelia.org.
Read more »

Infernal 1.1: RNA alignment and database search, 10,000x faster

June 30th, 2012

One of our lab’s goals is to make it possible to systematically search for homologs of RNAs in genomes, not just by looking for sequence conservation but also by looking for RNA secondary structure conservation. A powerful model framework for RNA structure/sequence comparison, called profile stochastic context-free grammars (profile SCFGs), was introduced in the mid-1990s both by Yasu Sakakibara and by us. But profile SCFG methods are among the most computationally intensive algorithms used in genome sequence analysis, requiring (in their textbook description, anyway) O(N^4) time and O(N^3) memory for an RNA of N residues. Profile SCFG implementations like our Infernal software have required immense computational power to get even the most basic sort of searches done.

We are happy to announce a new landmark in our work on these methods, with a new version of Infernal that is about 100x faster than the previous (1.0) version, and 10,000x faster than when Eric Nawrocki started working on making Infernal fast enough for routine use. Over at infernal.janelia.org, Eric has made available the first release candidate of Infernal 1.1, 1.1rc1, including source code and binaries for Linux and MacOS/X. A typical RNA homology search of a vertebrate genome that used to require a cpu-year can now be done in about an hour on a single CPU, or a few seconds on a cluster.
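As a rough sanity check on these figures (a back-of-the-envelope sketch, not a benchmark), the arithmetic below converts "a cpu-year" into hours and compares it with the quoted "about an hour" search. The variable names are illustrative only, and the one-hour figure is taken at face value from the text above.

```python
# Back-of-the-envelope check of the speedup quoted above.
HOURS_PER_YEAR = 365 * 24  # 8760 hours in a non-leap year

old_search_hours = 1 * HOURS_PER_YEAR  # "a cpu-year" on one CPU
new_search_hours = 1.0                 # "about an hour on a single CPU"

speedup = old_search_hours / new_search_hours
print(f"approximate speedup: {speedup:,.0f}x")
```

This prints roughly 8,760x, which is the same order of magnitude as the 10,000x figure in the title.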

So really for the first time, Infernal has become practical for systematic RNA sequence analysis of whole genomes. Roughly speaking, Infernal 1.1 is running at a speed comparable to what HMMER2 ran at — we’ve brought the RNA search problem down from the utterly ridiculous to the merely difficult.

The next version of the Rfam RNA sequence family database will be the first to be computed entirely natively with Infernal RNA structure comparison, instead of using BLASTN as a prefilter. An all-vs-all comparison of all 2000 Rfam models against the entire EMBL DNA database (170 Gb) would take 30,000 cpu-years using Infernal 0.55; now with Infernal 1.1, that enormous Rfam compute is only going to take us about a day on Janelia’s cluster.

Like Infernal 1.0, 1.1 is achieving its speed by using profile HMMs as heuristic prefilters. Whereas 1.0 used HMMER2-like prefilters, 1.1 has now switched to using HMMER3’s vector engine, sharing code with Travis Wheeler’s soon-to-be-announced nhmmer program for DNA/DNA comparison.
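The general idea of a heuristic prefilter can be sketched in a few lines. This is a toy illustration only, not Infernal's code: the function names, thresholds, and both scoring functions are hypothetical stand-ins (a G/C fraction in place of a profile HMM score, and a count of complementary end-to-end pairs in place of a profile SCFG score). The point is just that the expensive scorer runs only on windows that survive the cheap one.

```python
from typing import Callable, Iterable, List, Tuple

# Watson-Crick pairs used by the toy "structure" score below.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def filtered_search(
    windows: Iterable[str],
    cheap_score: Callable[[str], float],
    expensive_score: Callable[[str], float],
    filter_threshold: float,
    report_threshold: float,
) -> Tuple[List[Tuple[str, float]], int]:
    """Run expensive_score only on windows that pass the cheap filter.

    Returns the reported hits and the number of expensive calls made.
    """
    hits: List[Tuple[str, float]] = []
    expensive_calls = 0
    for window in windows:
        if cheap_score(window) < filter_threshold:
            continue  # rejected by the fast filter; never scored in full
        expensive_calls += 1
        score = expensive_score(window)
        if score >= report_threshold:
            hits.append((window, score))
    return hits, expensive_calls


def toy_cheap_score(window: str) -> float:
    """Fraction of G/C residues: a crude, fast, sequence-only score (hypothetical)."""
    return sum(c in "GC" for c in window) / len(window)


def toy_expensive_score(window: str) -> float:
    """Count complementary end-to-end pairs: a stand-in for structure scoring (hypothetical)."""
    half = len(window) // 2
    return float(sum((window[i], window[-1 - i]) in PAIRS for i in range(half)))


windows = ["GGGCAAAAGCCC", "AAAAAAAAAAAA", "GCGCGCAUAUAU", "AUAUAAAAAUAU"]
hits, calls = filtered_search(
    windows, toy_cheap_score, toy_expensive_score,
    filter_threshold=0.5, report_threshold=4.0,
)
print(hits, calls)
```

In this toy run only two of the four windows reach the expensive scorer. The last window would have scored well on the expensive stage but is rejected by the cheap one, which is the trade-off a heuristic filter makes: speed in exchange for some risk of missed hits.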

Happy RNA hunting — and don’t let anyone tell you that O(N^4) algorithms aren’t tractable!

More domains and motifs

June 20th, 2012

In the latest version of the HMMER website we have focused on enhancing the recognition and display of domains and motifs found in query sequences. To achieve this we added two new features to the site: additional HMM databases and simple motif detection.
Read more »

Interactive, iterative searches using jackhmmer

April 16th, 2012

It has been a couple of weeks since we released jackhmmer on the HMMER website, and so far (touch wood) it seems to be performing as we had hoped. Here on ‘the farm’ we are getting very excited about the results we are observing.
Read more »

The well-behaved journal

December 4th, 2011

Science is running a poll titled “The Well-Behaved Scientist” this week that asks “how should we promote publication of data that can be replicated and reproduced?” Of the ideas on their list (more funding from funding agencies, more rewards from institutions), conspicuous by its absence is the rather fundamental idea that the purpose of scientific journals, including Science, is to publish reproducible research.
Read more »