Probabilistic models of protein and RNA sequences

Much of our work involves algorithm development for computational biology. Many of the tools we develop are based on probabilistic models of biological sequence and structure. We think about these models in the context of the Chomsky hierarchy of formal grammars. Hidden Markov models (stochastic regular grammars) are useful for primary structure analysis of proteins and DNA. Stochastic context-free grammars are well suited to analysis of RNA secondary structure, because base pairing creates nested, long-range correlations between positions that a regular grammar cannot capture.
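As a rough illustration of why RNA secondary structure calls for a context-free rather than a regular grammar, the sketch below counts the maximum number of nested base pairs in a sequence with a Nussinov-style dynamic program. This is not code from any of our tools, and it is not stochastic: it scores every allowed pair as 1 instead of using probabilities. Its recursion mirrors the productions S -> S x (last base unpaired) and S -> S x S y (x paired with y around a nested substructure). The function name, the set of allowed pairs (Watson-Crick plus G-U), and the minimum hairpin loop length of 3 are assumptions made for the example.

```python
def max_nested_pairs(seq, min_loop=3):
    """Return the maximum number of nested base pairs in an RNA sequence.

    Illustrative Nussinov-style recursion. Assumptions: Watson-Crick and
    G-U wobble pairs are allowed, and at least `min_loop` unpaired bases
    must separate the two partners of a pair.
    """
    allowed = {("A", "U"), ("U", "A"), ("G", "C"),
               ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0

    # best[i][j] = maximum number of pairs within seq[i..j], inclusive.
    # Subsequences too short to hold a hairpin keep their initial score of 0.
    best = [[0] * n for _ in range(n)]

    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: base j is unpaired (production S -> S x).
            score = best[i][j - 1]
            # Case 2: base j pairs with some base k (production S -> S x S y).
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in allowed:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1]
                    score = max(score, left + 1 + inner)
            best[i][j] = score

    return best[0][n - 1]


if __name__ == "__main__":
    # A small hairpin: three stacked pairs closing an AAA loop.
    print(max_nested_pairs("GGGAAAUCC"))
```

The table is filled over all pairs of positions (i, j), with an inner loop over the pairing partner k, so the cost grows cubically with sequence length. A hidden Markov model, by contrast, needs only a single left-to-right pass over the sequence, which is part of why the two model classes suit different problems.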

Tools that resulted from this work include the HMMER profile HMM search software and the Pfam protein domain database; the Infernal structural profile SCFG search software for RNAs and the Rfam RNA domain database; and the QRNA gene-finding program for ncRNA genes.

Genome analysis

We apply probabilistic modeling and other computational algorithms to identify interesting genetic features in large-scale DNA sequence data. The group collaborates closely with genome sequencing groups at the Genome Sequencing Center at Washington University and the Sanger Institute in Cambridge, England. We were one of the groups on the genome analysis team for the Human Genome Project. We have continued to be active in a number of collaborative genome analysis projects, including rice, the nematode C. elegans, the archaeon Pyrococcus furiosus, and the ciliate Oxytricha trifallax.

The modern RNA world

One of our primary intellectual interests right now is identifying novel structural and catalytic RNAs. The "ancient RNA world" hypothesis asserts that an ecosphere of RNA-based life preceded protein/DNA-based life. It is widely argued that many of the RNA genes (tRNA, rRNA, catalytic introns) that we see today are ancient relics of the RNA world. If this is true, we hope that we might be able to learn something about the origins of life by identifying new RNA genes and studying their evolutionary history. Screening for new RNA genes is an interesting challenge: classical genetics can identify new genes based on their functional phenotype, but not based on what material their product is made of. We are therefore identifying new noncoding RNA genes by looking for them directly in genome sequence data, using computational genetics and algorithmic screens. What we seem to be finding is that the ancient RNA world model is too pessimistic: far from being a few scattered relics, RNAs are in widespread use in modern organisms in a variety of roles. We have argued for a "modern RNA world" hypothesis: many of the RNAs we see today are modern inventions, highly adapted to regulatory roles in complex organisms.