ENCODE says what?

September 8th, 2012

So I read in the newspaper this week that the ENCODE project has disproven the idea of junk DNA. I sure wish I’d gotten the memo, because this week a collaboration of labs led by myself, Arian Smit, and Jerzy Jurka just released a new data resource that annotates nearly 50% of the human genome as transposable element-derived, and transposon-derived repetitive sequence is the poster child for what we colloquially call “junk DNA”.

The newspapers went on to say that ENCODE has revolutionized our understanding of noncoding DNA by showing that far from being junk, noncoding DNA contains lots of genetic regulatory switches. Well, that’s also odd, because another part of my lab is (like a lot of other labs in biology these days) studying the regulation of genes in a model animal’s brain (the fruit fly Drosophila). We and everyone else in biology have known for fifty years that genes are controlled by regulatory elements in noncoding DNA. (Well, I’ve only known for thirty years, not fifty, I admit — only since Mrs. Dell’Antonio kicked me out of high school biology class and gave me a molecular genetics textbook to read by myself.)

Now, with all respect to my journalist friends, I’ve learned not to believe everything I read in the newspapers. I figured I’d better read the actual ENCODE papers. This is going to take a while. I’ve only read the main Nature paper carefully so far (there’s 30+ of them, apparently, across multiple journals). But it’s already clear that at least the main ENCODE paper doesn’t say anything like what the newspapers say.

The ENCODE project and our existing knowledge of genomes are both vastly more substantial than the discussion the ENCODE authors are provoking in the press right now.

The human genome has a lot of junk DNA

Genome size varies a lot. You might think that apparently more complex organisms like humans would have more DNA than simpler organisms like single-celled amoebae, but that turns out not to be true. Salamanders have 10-fold more DNA than us; lungfish, about 30-fold more.

So maybe we don’t really know how to define or measure “complexity”; maybe we’re just being anthropocentric when we think of ourselves as complex. Who’s to say that amoebae are less complex than humans? Ever looked at an amoeba? (They’re pretty awesome.) Still. The key observation isn’t just that very different creatures have very different genome sizes; it’s that similar species can have very different genome sizes. This fact, surprising at the time, begged a good explanation. If two species are similar, yet their genomes are 10x different in size, what’s all that extra DNA doing?

This observation about genome sizes (called the “C-value” paradox, for technical reasons) raised the idea that maybe genomes could expand (and shrink) rapidly (on an evolutionary timescale) as a result of some neutral (non-adaptive) processes — that maybe organisms could tolerate DNA that didn’t have a direct functional effect on the organism itself, but was instead being created and maintained by neutral or even parasitic mechanisms of evolution. Somebody (it’s a good bet that T. Ryan Gregory knows who) dubbed this “junk” DNA, and that was probably an unfortunate term, because it’s incited people’s anger from the day it was coined. It’s not polite to tell someone their beautiful house is full of junk. Even if it is.

A key discovery that satisfactorily explained the C-value paradox was that genomes, especially animal and plant genomes, contain large numbers of transposable (mobile) elements that replicate all by themselves, often at the (usually slight) expense of their host genome. For instance, about 10% of the human genome is composed of about a million copies of a small mobile element called Alu. Another big fraction of the genome is composed of a mobile element called L1. Transposons are related to viruses, and we think that for the most part they are parasitic in nature. They infect a genome, replicating, spreading, and multiplying; eventually they die, mutate, and decay away, leaving their DNA sequences. Sometimes when an Alu replicates and hops into a new place in our genome, it breaks something. Usually (partly because the genome is mostly nonfunctional) a new Alu just hops somewhere else in the junk, and has no appreciable effect on us.

So it turns out that when we look at all these different genome sizes, almost all of the puzzling size variation is explained by genomes having different “loads” of transposable elements. Some creatures, like pufferfish, have only low loads of transposons. Some creatures, like salamanders, lungfish, amoebae, corn, and lilies, are loaded with massive numbers of transposons. As it happens, the human genome is annotated as about 50% transposon-derived sequence — right at that 50/50 borderline where someone can say “the human genome is mostly junk” and someone else can say “the human genome is mostly not junk”.

In 1980, two key papers — by Orgel and Crick, and by Sapienza and Doolittle — nicely laid out the argument that genomes contain “selfish” or “junk” DNA, largely transposon-derived, sometimes quite large amounts of it. These papers are quite beautiful and scholarly. They are careful to say, for example, that it would be surprising if evolution did not sometimes co-opt useful functions from this great amount of extra DNA sequence slopping around. Indeed, we are now finding many interesting examples of transposon-derived stuff being co-opted for organismal function (but these are the exception, not the rule). Without trying to be snide or pedantically academic, I’ll note that the main ENCODE paper cites neither Orgel/Crick nor Sapienza/Doolittle; what this means is, regardless of what we read in the newspapers, ENCODE is not actually trying to interpret their data in light of the current thinking about junk DNA, at least in the actual paper.

Transposon-derived sequences are the poster child for “junk DNA” because we can positively identify transposon-derived sequences by computational analysis, and reconstruct the evolutionary history of transposon invasions of genomes. There’s likely to be other nonfunctional DNA “junk” too, in the DNA that we can’t currently put any annotation at all on, but the key point is that the dead bones of many transposons are something we can affirmatively identify.

Noncoding DNA is part junk, part regulatory, part unknown

It is crucial to understand that “noncoding” DNA is not synonymous with “junk” DNA. The current view of the human genome, which ENCODE has now systematically and comprehensively confirmed and extended, is that it is about 1% protein-coding, in perhaps about 20,000 “genes” averaging about 1500 coding bases each (where the concept of a “gene” is amorphous, but useful; we know one when we see one). Genes are turned on and off by regulatory DNA regions, such as promoters and enhancers — as has been worked out over fifty years, starting with how bacterial viruses work. In animals like humans, most people (ok, I) would guess that there are maybe 10-20 regulatory regions per gene, each maybe 100-300 bases long; so, very roughly, maybe on the order of about 1000-6000 bases of noncoding regulatory information per 1500 coding bases in a gene. I’m only giving hand-wavy back of the envelope notions here because it’s actually quite difficult to pin these numbers down exactly; our current knowledge of regulatory DNA sequences in detail is distressingly incomplete. That’s something that ENCODE’s trying to help figure out, in systematic fashion, and where a lot of ENCODE’s substantive value is. The point is, we already knew there was likely at least as much regulatory DNA as coding DNA, and probably more; we just don’t have a very satisfying handle on it all yet, and we thought we needed an ENCODE project to survey things more comprehensively.
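The back-of-the-envelope numbers above can be turned into a quick calculation. This is only a sketch: the gene count, coding length, and regulatory ranges are the rough guesses quoted in the text, and the 3.1 Gb genome size is an assumed round figure, not a number taken from the ENCODE papers.

```python
# Back-of-the-envelope genome fractions from the rough guesses in the text.
# GENOME_SIZE is an assumed round figure (about 3.1 billion bases).
GENOME_SIZE = 3_100_000_000
N_GENES = 20_000
CODING_BASES_PER_GENE = 1_500


def fraction_of_genome(bases_per_gene, n_genes=N_GENES, genome_size=GENOME_SIZE):
    """Fraction of the genome covered by `bases_per_gene` bases in each gene."""
    return bases_per_gene * n_genes / genome_size


# Regulatory guess: 10-20 regions per gene, each 100-300 bases long.
reg_low = 10 * 100    # 1000 bases per gene
reg_high = 20 * 300   # 6000 bases per gene

coding = fraction_of_genome(CODING_BASES_PER_GENE)
regulatory_low = fraction_of_genome(reg_low)
regulatory_high = fraction_of_genome(reg_high)

print(f"coding:     {coding:.1%}")
print(f"regulatory: {regulatory_low:.1%} to {regulatory_high:.1%}")
```

Under these assumptions coding sequence comes to roughly 1% of the genome, and regulatory sequence to somewhere between about 0.6% and 3.9%: comparable to the coding fraction at the low end, a few times larger at the high end.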

So when you read a Mike Eisen saying “those damn ENCODE people, we already knew noncoding DNA was functional”, and a Larry Moran saying “those damn ENCODE people, there is too a lot of junk DNA”, they aren’t contradicting each other. They’re talking about different (sometimes overlapping) fractions of human DNA. About 1% of it is coding. Something like 1-4% is currently expected to be regulatory noncoding DNA given what we know (and our knowledge about regulatory sites is especially incomplete). About 40-50% of it is derived from transposable elements, and thus affirmatively already annotated as “junk” in the colloquial sense that transposons have their own purpose (and their own biochemical functions and replicative mechanisms), like the spam in your email. And there’s some overlap: some mobile-element DNA has been co-opted as coding or regulatory DNA, for example.

Now that still leaves a lot of the genome. What’s all that doing? Transposon-derived sequence decays rapidly, by mutation, so it’s certain that there’s some fraction of transposon-derived sequence we just aren’t recognizing with current computational methods, so the 40-50% number must be an underestimate. So most reasonable people (ok, I) would say at this point that the human genome is mostly junk (“mostly” as in, somewhere north of 50%).

At the same time, we still have only a tenuous grasp on the details of gene regulation, even though we think we understand the broad strokes now. Nobody should bet against finding more and more regulatory noncoding DNA, either. The human genome surely contains a lot of unannotated functional DNA. The purpose of the ENCODE project was to help us sort this out. Its data sets, and others like them, will be fundamental in giving us a comprehensive view of the functional elements of the human genome.

ENCODE’s definition of “functional” includes junk

ENCODE has assigned a “biochemical function” to 80% of the genome. The newspapers add, “therefore it’s not junk”, but that’s a critically incorrect logical leap. It presumes that junk DNA doesn’t have a “biochemical function” in the sense that ENCODE chose to operationally define “function”. So in what sense did ENCODE define the slippery concept of biological function, to allow them to assign a human genome fraction (to two significant digits, ahem)?

ENCODE calls a piece of DNA “functional” if it reproducibly binds to a DNA-binding protein, is reproducibly marked by a specific chromatin modification, or if it is transcribed. OK. That’s a fine, measurable operational definition. (One might wonder, why not just call “DNA replication” a function too, and define 100% of the genome as biochemically functional, but of course, as Ewan Birney (the ENCODE czar) would tell you, I would never be that petty. No sir.) I am quite impressed by the care that the ENCODE team has taken to define “reproducibility”, and to process their datasets systematically.

But as far as questions of “junk DNA” are concerned, ENCODE’s definition isn’t relevant at all. The “junk DNA” question is about how much DNA has essentially no direct impact on the organism’s phenotype – roughly, what DNA could I remove (if I had the technology) and still get the same organism. Are transposable elements transcribed as RNA? Do they bind to DNA-binding proteins? Is their chromatin marked? Yes, yes, and yes, of course they are – because at least at one point in their history, transposons are “alive” for themselves (they have genes, they replicate), and even when they die, they’ve still landed in and around genes that are transcribed and regulated, and the transcription system runs right through them.

Thought experiment: if you made a piece of junk for yourself — a completely random DNA sequence! — and dropped it into the middle of a human gene, what would happen to it? It would be transcribed, because the transcription apparatus for that gene would rip right through your junk DNA. ENCODE would call the RNA transcript of your random DNA junk “functional”, by their technical definition. And even if it weren’t transcribed, that would be because it acted as a different kind of functional element (your random DNA could accidentally create a transcriptional terminator).

The random genome project

So a-ha, there’s the real question. The experiment that I’d like to see is the Random Genome Project. Synthesize a hundred million base chromosome of entirely random DNA, and do an ENCODE project on that DNA. Place your bets: will it be transcribed? bound by DNA-binding proteins? chromatin marked?

Of course it will.

The Random Genome Project is the null hypothesis, an essential piece of understanding that would be lovely to have before we all fight about the interpretation of ENCODE data on genomes. For random DNA (not transposon-derived DNA, not coding, not regulatory), what’s our null expectation for all these “functional” ENCODE features, by chance alone, in random DNA?
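One very small piece of such a null expectation can be sketched in a few lines: how often a short sequence motif turns up in random DNA purely by chance. This is an illustration only, not the Random Genome Project itself. The motif `TGACTC`, the uniform base composition, and the one-megabase length are all assumptions made for the example; real protein binding depends on far more than an exact six-base match.

```python
import random


def random_dna(length, rng):
    """Uniform random DNA: each base drawn independently with probability 1/4."""
    return "".join(rng.choices("ACGT", k=length))


def count_matches(sequence, motif):
    """Count exact (possibly overlapping) occurrences of motif on one strand."""
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        start = sequence.find(motif, start + 1)
    return count


def expected_matches(length, motif_length):
    """Expected chance matches: probability 4**-k at each possible start site."""
    return (length - motif_length + 1) / 4 ** motif_length


rng = random.Random(2012)   # fixed seed so the run is repeatable
motif = "TGACTC"            # hypothetical six-base binding motif (an assumption)
sequence = random_dna(1_000_000, rng)

observed = count_matches(sequence, motif)
expected = expected_matches(len(sequence), len(motif))
print(f"observed {observed} chance matches; expected about {expected:.0f}")
```

Scaled up to a hundred-million-base random chromosome, the same formula gives about 24,000 chance matches per strand for any one six-base motif, before a single base has been selected for anything.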

(Hat tip to The Finch and Pea blog, a great blog that I hadn’t seen before the last few days, where you’ll find essentially the same idea.)

Evolution works on junk

Even if you did the Random Genome Project and found that a goodly fraction of a totally random DNA sequence was “functional”, transcribed and bound and chromatin-marked, would this somehow diminish your view of the human genome?

Personally, I don’t think we can understand genomes unless we try to recognize all the different noisy, neutral evolutionary processes at work in them. Without “noise” — without a background of specific but nonfunctional transcription, binding, and marking — evolution would have less traction, less de novo material to grab hold of and refine and select, to make it more and more useful. Genomes are made of repurposed sequence, borrowed from whatever happened to be there, including the “junk DNA” of invading transposons.

As Sydney Brenner once said, there’s a difference between junk and garbage; garbage is stuff you throw out, junk is stuff you keep because it just might be useful someday.

Conflict of interest/full disclosure: I was a member of the national advisory council to the NIH National Human Genome Research Institute at the time ENCODE was conceived and planned – so I’m not quite as innocent and disinterested in policy questions of NIH NHGRI big science projects and media engagement strategy as this post may have made it sound.

55 responses

  1. ENCODE waves hit the shore…and scientists respond | The OpenHelix Blog pings back:

    [...] Sean Eddy: ENCODE says what? [...]

  2. Ian Holmes comments:

    Mike Eisen called for a neutral model of genome function, which I think is close to your random genome. (Of course, actually physically realizing the random genome would require some improvements in nucleus injection, not to mention cheaper DNA synthesis. Unless perhaps you allowed a big random insertion in an existing viable genome…)

    There are also numerous models of transposon activity, from which one could imagine building a neutral model of genome *architecture* (c.f. the Michael Lynch book you linked to), at least in theory. Those models are rather parameter-rich, though (as would be Eisen’s neutral model of function). There is currently nothing as simple as Kimura’s neutral model for evolution, or Hubbell’s neutral model of biodiversity (both rather beautiful null hypotheses)

  3. Sean Eddy comments:

    We have a design for the “random genome project” on the drawing board, cheap and do-able today, with only one leeetle teensy experimental wrinkle that might be problematic. But as they say in science fiction, you’re allowed one miracle in any good story — just no more than one. If someone wanted to come spend some time in the lab, say from horrible cold Berkeley to beautiful Washington DC…

  4. ENCODE Coverage Round Up: Press, Blogs, and Tweets pings back:

    [...] ENCODE says what? [...]

  5. Gary Karpen comments:

    I can’t resist ratcheting this up yet another level. About to put the kids to bed so it will be shorter than optimal. Bottom line: y’all only talk about ‘junk’ in the ASSEMBLED human genome, which in fact is only ~70% of the full human genome….the rest is the real poster child for junk….simple satellite repeats in centromeric and pericentromeric regions. Not of course protein coding, but clearly required for genome propagation and function (centromeres for sure, probably nuclear architecture) and behave in amazing ways during evolution (highest mutability, homogenized by molecular drive mechanisms, whatever those are).
    I too found the whole discussion about the ENCODE surprises ridiculous. The above describes one reason ….duh, those of us who have worked on centromeres and heterochromatin and TEs and satellites have for years thought about function differently from those focused on protein coding genes…..but in addition, modENCODE dealt with all of this in papers over a year ago…. of course that was flies and worms and they just don’t matter compared to human tissue culture cells :-)

  6. Tim Meehan comments:

    One thing in particular caught my eye reading this: ‘The “junk DNA” question is about how much DNA has essentially no direct impact on the organism’s phenotype – roughly, what DNA could I remove (if I had the technology) and still get the same organism.’

    I’m an outsider to the field, so I ask these questions earnestly. Couldn’t junk DNA have non-obvious effects? That is, perhaps in terms of its biochemical activity it is for all intents and purposes neutral, but might it provide critical structure to chromosomes? Or in some way affect the dynamics of gene transcription? Though you say transcription would ‘rip right through’, would the time spent transcribing certainly be trivial? It would be surprising to me if you could remove that much mass/length/structure from the genome and not change some aspect of the developmental process. Maybe the changes would be trivial. I suppose if they weren’t, the sense of functional would have to be expanded so as to encompass non-biochemical effects.

    Just a thought. Again, I’m an outsider. Maybe others have already dealt with these questions!

  7. sparc comments:

    It may not be necessary to introduce Mb of random DNA into mammalian cells. To the best of my knowledge there is at least one D.Melanogaster species that captured a complete Wolbachia genome recently (in geological terms). For a start it may be sufficient to build a complete E.coli genome into a mini gene and to introduce it into some mammalian cell line. My prediction is that by ENCODE’s definition it will be as functional as the cell’s own DNA.

  8. Los “virus informáticos” del ADN « Francis (th)E mule Science's News pings back:

    [...] in our email. I recommend reading Sean Eddy, ”ENCODE says what?,” Cryptogenomicon, September 8th, 2012. On the errors that I myself on this blog, and much of the media, have made when it comes to [...]

  9. Sean Eddy comments:

    Gary: yeah, the assembly’s up to 2.9 Gb now (about 90% of the genome) but you’re right. I noticed that ENCODE gives the genome size as 2.9G (the assembly), not 3.1-3.2G (which I think is what it’s really supposed to be), which does probably show something about mindset. But that’s been a common error throughout genomics, starting from the days of “we assembled the whole human genome (cough cough, we mean, the part of the genome that we could assemble)”.

    Sparc: shush. shush. you’re absolutely correct. It’s D. ananassae. You’re giving away lab secrets. And there are plenty of examples of people introducing artificial constructs into cells and seeing the vector backbone “function”. I’d still rather do it with completely random DNA someday, because with the kind of arguments that go on in this field, someone is sure to claim that Wolbachia and E. coli retain a homeopathic memory of their ancestry with animal genomes, so they’re recognized by the Drosophila or human transcriptional apparatus.

    Tim: uh, you don’t sound like an outsider. Yes, “junk” could have all sorts of indirect effects. For example, if you introduce a 10x load of extra DNA that soaks up a bunch of DNA-binding proteins, the cell is going to have to compensate by making more of those proteins; if you then suddenly removed the extra DNA, you can expect to see a big perturbation. And like Gary said above you, some very functional DNA, like telomeres and centromeres, is highly repetitive (Drosophila telomeres are made out of transposons, one of the great examples of co-option). And the timing of transcription does make a big difference, just as you intuit: in Drosophila embryos (and probably elsewhere) there are examples of genes that use their junked-up length as a regulatory mechanism: in early embryos, cells divide so fast that the gene never finishes transcription, but when cell division slows down, the gene has time to finish, and complete mRNA gets expressed. Personally I think it’s all sort of gorgeous, how many ways evolution hacks the system together.

  10. Junk is not same with garage | Secret Lab of a Mad Scientist pings back:

    [...] Enco… by Sean Eddy hyung, who boasts ‘Chin Jung-kwon’-class verbal firepower in the genome/informatics world [...]

  11. David Botstein comments:

    Sydney Brenner had an amusing alternative hypothesis for the C-value paradox: that the “junk” DNA might be required to maintain the viscosity of the nucleus. Like many of his jokes, this had a serious thought behind it. He also admonished us to remember that what you discard is called “garbage” but “junk” is what you keep. In this case, the transposing sequences keep themselves, so to speak. Over the years I have found these Brenner sayings useful.

    On a more serious note: I think all this should be sent to Nature as rebuttal. I also think it is an object lesson that hyping one’s own work in the hope of impressing Congress is very likely to backfire. We scientists need to strive for objectivity as best we can lest we make ourselves no longer credible.

  12. Junk No More? ENCODE and the Human Genome | The Beast, the Bard and the Bot pings back:

    [...] An oversight of how the media dealt with the ENCODE news and often misinterpreted it can be found here. A post that, much more eloquently than I can, elucidates what the project actually says is here. [...]

  13. Shigehiro Kuraku comments:

    I read that Susumu Ohno used the term ‘junk DNA’ for the first time.

  14. Sean Eddy comments:

    I think that’s right, Shigehiro.

  15. John Little comments:

    Fantastic post. Sean, your response to Tim’s comment and comments by Michael Eisen regarding DNA binding proteins and transcription factors interacting randomly, or at least non-functionally, throughout the genome gave me pause…perhaps some of this ‘junk’ DNA, derived over time from transposable elements, etc., has indeed acquired a novel but less obvious function as a giant parking garage, or repository, for DNA binding proteins to stably settle between jobs or prior to function, instead of regenerating de novo these proteins every time they are required by their respective cellular processes. Of course this would require an entire new level of regulation – shuttling proteins to and from the job site. Given this hypothesis, it seems that removing large amounts of this ‘junk’ DNA would have deleterious effects on the cell. So my question is, is there any lab out there designing the experiments or technology to remove significant amounts of this putative ‘junk’, to ultimately test the function of all this extra DNA?

  16. On The Neutral Sequence Fallacy « I wish you'd made me angry earlier pings back:

    [...] of function and evolution (an example of such a null model is embodied in Sean Eddy’s Random Genome Project). If not, I fear this conflation of concepts, like Birney’s semantic switch, will lead to [...]

  17. Mike White comments:

    As DNA synthesis costs come down, the Random Genome Project should become feasible…

    As it is, synthesis is cheap enough that I was able to recently synthesize 84 kb of random DNA to serve as my baseline, control distribution in a high-throughput enhancer assay. The assay is plasmid-based, and so admittedly done in an artificial context, but the results are striking – 1) it’s easy to see activity from random DNA, and 2) many classes of genomic sites that look like they should be functional don’t behave differently from random DNA.

  18. Mike Klymkowsky comments:

    Excellent post and great discussion.

  19. Paul Gardner comments:

    Regarding the comments on the unassembled portion of the genome. Do we have the ribosomal RNA genes assembled yet? This paper by Stults et al. suggests we don’t. So, just for interest I ran a BLAT against the human genome using the UCSC genome browser for our 5S rRNA, SSU rRNA and LSU rRNA (including 5.8S rRNA) genes. The top ranked regions for each were chr1:228,766,136-228,766,255, chrUn_gl000220:109078-110946 and chrUn_gl000220:112024-118417.

    What I find interesting is that if we believe UCSC’s conservation track then one of the most conserved sets of genes on Earth is not. If we believe the expression data, then one of the most abundant transcripts in any cell is not. As for the rest of the data, I think we can make a pretty good case that the ribosome is “biochemically functional” and therefore we should be including much of chromosomes 1q42, 13p12, 14p12, 15p12, 21p12, and 22p12 in these calculations.

    I realise the rRNAs are an enormous pain to deal with. I ran Rfam for four years and the ribosomal RNA families broke just about every bioinformatic pipeline I ever wrote. However, I think the ribosome deserves better.

  20. Sean Eddy comments:

    Yup, I agree, the big rRNA tandem arrays are not in the assembly yet, last we (Tom Jones or I) checked.

  21. Tim Meehan comments:

    @John Little: I think you asked the critical question I was trying to get at. And you gave an interesting possibility for a ‘function’ to junk DNA that’s of a different nature than that of non-junk DNA. I think if the null were to be rejected in the experiment you proposed that would force a reconceptualization of what functional can mean with respect to regions of the genome.

    @Sean: Thanks for the response. I agree, it’s incredibly gorgeous the way it’s cobbled together. And I am an outsider relative to the level of expertise here. My home base is in neuroscience. This discussion has definitely grabbed my interest though :)

  22. D. Allan Drummond comments:

    Terrific. I too cringe at the “functional” definition. A better definition is “a change whose fitness effect is detectable by natural selection”. Set aside for the moment our inability to measure fitness effects; this is true, unfortunate, and irrelevant to the question of whether the definition is correct. The crucial point is that many biochemically measurable changes will fail this challenge in a way that is meaningful: such changes are effectively neutral and are indistinguishable, in evolutionary terms, from no change at all. (This is not a novel thought, and I am basically channeling Lynch here.)

    One can envision a research program which attempts to ascertain which measurable changes have evolutionary consequences. As far as I know, little is being done on this front. The Random Genome Project is an extreme but highly informative take on a quasi-inverse point: absent selection, what measurable features arise? The point is that the RGP, as a kind of null hypothesis, is specifically a null in the case of zero selection — not the case of zero function. Creating a genome with zero function might require effort!

    Selectability is not the only, nor the best, definition. A change might wreak havoc on an organism but, by virtue of a tiny effective population size, be undetectable by selection, something quite unsatisfying to a biochemist. Still, it provides a way to cull the many measurable-but-“irrelevant” changes by providing a meaningful framework for evaluating relevance. Some such framework is clearly needed. What other framework is there?

  23. Feng comments:

    The quote from Sydney above that “the “junk” DNA might be required to maintain the viscosity of the nucleus” is really interesting. In a similar vein, we have done some studies, which show that some DNA sequences in the genome may just be “filler” sequences to keep adjacent functional ones from doing too much.

  24. Sean Eddy comments:

    Allan: you’ve already foreseen phase II of the Random Genome Project. The random genome would produce transcripts (“genes”), and my bet is that if we applied standard experimental techniques to them — mutate them, knock them out by RNAi, overproduce them on constructs, look for perturbations of other (real) gene expression levels — we would see reproducible and significant phenotypes (albeit marginal ones, of the sort that we see all the time in reverse genetics studies). I completely agree, I don’t think a random chromosome would be “functionless” in the system. I think this is part of the RGP’s value as a null hypothesis.

    Feng: I agree, and I think that’s part of the slipperiness of the term “function”, and why the term “junk” is only a colloquialism. The junk on my desk is junk, but if you suddenly removed it, my coffee cup would fall over and spill into my laptop; the junk has become part of the system.

  25. A brief history of rubbish » Polypompholyx pings back:

    [...] think this is a headline-baiting and flawed analysis (and I’m certainly … not … alone), but the argument is much more interesting than what one trumped-up [...]

  26. Random Genome, Naked Genome | The Finch and Pea pings back:

    [...] Saturday, my former Center for Genome Sciences colleague Sean Eddy brought up the idea of a Random Genome Project: let’s create a random genome to serve as a null model of genome function. With this random [...]

  27. Arian Smit comments:

    Great post Sean, I had the same reaction and I couldn’t have written it better myself.
    Susumu Ohno’s office was next to my room when I wrote my dissertation. By that time (early 90s) he wasn’t as sharp anymore, but I seem to remember him explaining the difference between junk and garbage, which now seems to be credited to Brenner.

  28. ENCODE; a beachcomber’s guide to the genome | Eagle Genomics pings back:

    [...] (noun) in a biological sense (implying some wider purpose). For a scientific context see Sean Eddy's excellent post on the subject. This distinction between 'functional (v)' and 'functional (n)' [...]

  29. ENCODE, junk DNA and creationists | Wonderful Life pings back:

    [...] the death of junk DNA (The ENCODE media hype machine).  Over at Cryptogenomicon, Sean Eddy has ENCODE says what? outlining in considerable detail exactly what’s wrong with the claims that 80%  (or even [...]

  30. An ode to junk | The Finch and Pea pings back:

    [...] the undiscovered country. The complete human genome sequence is incomplete. Around 10% of our genome stubbornly refuses to be assembled, and it’s because it is highly repetitive sequences, like all that other [...]

  31. nr comments:

    To the people discussing a structural role for junk DNA: in that context it bears mentioning that in terms of the amount of hereditary information passed on, “spacer” DNA would have to be pretty small potatoes in comparison to sequence-dependent DNA. The latter represents on the order of 2*n bits for an n-base sequence, the former on the order of log_2(n) bits. It takes 4096 bits to specify a specific sequence of 2048 bases, but just specifying a length of 2048 requires only 12 bits. (Well, you might say you could need more since other sequences might run much longer, but it still only takes 32 bits to specify the length of any sequence shorter than the genome itself.)
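The arithmetic in this comment can be checked with a short sketch. The function names are made up for illustration, and the 3.2 billion base genome size is an assumed round figure.

```python
def sequence_bits(n_bases):
    """Bits to specify an exact sequence: 2 bits per base (4 possible bases)."""
    return 2 * n_bases


def length_bits(max_length):
    """Bits to write any length from 0 to max_length as a binary integer."""
    return max_length.bit_length()


print(sequence_bits(2048))          # 4096 bits for a specific 2048-base sequence
print(length_bits(2048))            # 12 bits to record the length 2048
print(length_bits(3_200_000_000))   # 32 bits covers any length up to the genome
```

The sequence cost grows linearly with length while the spacer cost grows only logarithmically, which is the gap the comment is pointing at.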

  32. Video Tip of the Week: ENCODE enables smaller science | The OpenHelix Blog pings back:

    [...] When the ENCODE consortium publications were released last week, a media blitzkrieg ensued. Soon after, there was a backlash by scientists based on some of the claims that they were seeing made. Some of the issues were due to flawed representations in the press that were legitimate targets of the scientists. Some of the attacks on the science writers were unfair. Some folks had issues with the publication process. Some pushback on the “big science” structure and funding arose. Another thread of discussion was about some of the global claims by the ENCODE team—largely about the parsing of the term “functional”. But this parsing discussion was actually quite informative and useful—the good kind of “inside baseball” that goes on among scientists. Although to people outside the field it may be misunderstood, that’s the way we challenge each other and it’s not personal—it’s about the data. It was like watching a huge world-wide lab meeting take place over a few days via twitter and blogs, and it was really pretty cool. (My favorite take on that drama so far was Sean Eddy’s piece: ENCODE says what?) [...]

  33. Emo Zhao comments:

    I don’t think a random genome is a good null model for ENCODE: a random genome will tell you something about functional sequences, but not in the right context. ENCODE is trying to figure out which sites in the human genome are functional, not which sequences are functional. For example, the consensus sequence of a transcription factor binding site could be functional near the transcription start site, but non-functional in the middle of a gene desert. In other words, the function of a DNA element is dependent on its genomic context, which will be completely destroyed in a random genome.

    My biggest problem with ENCODE is that they attempt to find regulatory elements by doing experiments under only one condition. Since regulation is only necessary in varying environments, it seems like they would be more likely to find functional regulatory elements if they performed their measurements in multiple conditions.

  34. Links 9/12/12 | Mike the Mad Biologist pings back:

    [...] ENCODE says what? Why I published a paper on my blog instead of a journal Tetramorium pulcherrimum Limiting citations is unscholarly journal practice Why citations shouldn’t be limited by journals [...]

  35. To do: read ENCODE papers | EuroEPINOMICS pings back:

    [...] not changed that view although describing that 80% are biologically active. Sean Eddy has nicely summarized this and describes a thought experiment of performing ENCODE on a random genome: Undoubtably, much [...]

  36. Joe Felsenstein comments:

    Sean, great discussion here. I am pleased to see so many heavy-hitters agreeing with you here. The ENCODE publicity has now persuaded the public that Scientists Have Shown That There Is No Junk DNA. Anyone doubting that this is what the science press is saying should consider this useful list of links to their stories assembled by Ryan Gregory. It’s going to take a long time to unconvince the public.

  37. Phil Green comments:

    Hi Sean. Great job! A modest suggestion: I think we need to update Sydney’s (or Susumu’s) ‘junk / garbage’ terminology, which seems a bit outmoded in our sustainability-oriented era. Here in green Seattle, we have a multiplicity of terms for household waste (also for rain, as you might imagine), and the different types have different fates. There is ‘garbage’, which goes to a landfill. There are ‘recyclables’, which get recycled. And there is ‘yard waste’, which includes all sorts of compostable, biodegradable stuff (including food & leaves) and which goes somewhere to molder and slowly decay. I submit that the ENCODE folks are in fact correct: the genome is not mostly ‘junk’, it is instead mostly ‘yard waste’ (perhaps with a few recyclables thrown in).
    One other comment: although often the press deserves much of the blame for miscommunicating scientific findings, in this particular case it should all go to the ENCODE project scientists who talked to them. I won’t name names, since many of them are my friends, but in these news articles there are some appalling quotes from people who should know better. My perhaps overly cynical suspicion is that they have an undeclared financial interest in claiming most of the genome has (unknown) functions, since that means more research money to figure it all out. But one consequence could be to make the public think genome scientists are clueless. We don’t exactly look competent when we confidently say first ‘junk’, then ‘oops, no junk’, particularly when we were right the first time.

  38. The Sorry State of Science Writing pings back:

    [...] pulling on those threads, they lead me to computational biologist Sean Eddy’s “ENCODE says what?” post at Cryptogenomicon, the 4,900-word, “My own thoughts” post that Ewan Birney, the lead scientist [...]

  39. Diego comments:

    First of all, I would like to express my appreciation for having access to this type of discussion, where informed opinions take precedence over politics. I am not so well versed in the subject, so I would like to ask a question, and I hope not to sound too naïve with it.
    What are the implications of recognizing the active role of this part of DNA in the whole metabolism of cells for the current concept of genetic design (GMOs)? Could the outcome of this research be interpreted as an indication that manipulating and redesigning DNA sequences (with the current state of knowledge) could trigger malfunction in a very sophisticated “switchboard”?

  40. Claudiu Bandea comments:

    Five reasons why my theory on the function of ‘junk DNA’ is better than theirs

    I intend to submit the paper below for publication in a peer-reviewed journal. Before submitting it, and having it reviewed by a handful (if that) of peers, I decided to post it here on the Blogosphere Preprint Server, which is rapidly becoming the front-line platform for transparent and comprehensive evaluation of scientific contributions.

    The ENCODE project has produced high quality and valuable data. There is no question about that. And, the micro-interpretation of the data has been of equal status. The problem is with the macro-interpretation of the results, which some consider to be the most important part of the scientific process. Apparently, the leaders of the ENCODE project agreed with this criterion, as they came out with one of the most startling biological paradigms since, well, since the Human Genome Project showed that the DNA sequences coding for proteins and functional RNA, including those having well defined regulatory functions (e.g. promoters, enhancers), comprise less than 2% of the human genome.

    According to ENCODE’s ‘big science’ conclusion, at least 80% of the human genome is functional. This includes much of the DNA that has been previously classified as ‘junk DNA’ (jDNA). As metaphorically presented in both scientific and lay media, ENCODE’s results mean the death of jDNA.

    However, the eulogy of jDNA (all of it) was written more than two decades ago, when I proposed (and conceptually proved) that ‘jDNA’ functions as a sink for the integration of proviruses, transposons and other inserting elements, thereby protecting functional DNA (fDNA) from inactivation or alteration of its expression (see a copy of my paper posted here: http://sandwalk.blogspot.com/2012/06/tributre-to-stephen-jay-gould.html; also, see a recent comment in Science, that I posted at Sandwalk: http://sandwalk.blogspot.com/2012/09/science-writes-eulogy-for-junk-dna.html ).

    So, how does ENCODE theory stack ‘mano-a-mano’ with my theory? Here are five reasons why mine is superior:

    #5. In order to label 80% of the human genome functional, ENCODE changed the definition of ‘functional’; apparently, 80% of the human genome is ‘biochemically’ functional, which from a biological perspective might be meaningless. My model on the function of jDNA is founded on the fact that DNA can serve not only as an information molecule, a function that is based on its sequence, but also as a ‘structural’ molecule, a function that is not (necessarily) based on its sequence, but on its bare or bulk presence in the genome.

    #4. Surprisingly, ENCODE theory is not explicitly immersed in one of the fundamental tenets of modern biology: Nothing in biology makes sense except in the light of evolution. Indeed, there is no talk about how jDNA (which contains approximately 50% transposon and viral sequences) originated and survived evolutionarily. On the contrary, my model is totally embedded and built on evolutionary principles.

    #3. One of the major objectives of the ENCODE project was to help connect the human genome with health and diseases. Labeling 80% of these sequences ‘biochemically functional’ might create the aura that these sequences contain genetic elements that have not yet been mapped out by the myriad of genome wide studies; well, that remains to be seen. In the context of my model, the protective function of jDNA, particularly in somatic cells, is vital for preventing neoplastic transformations, or cancer; therefore, a better understanding of this function might have significant biomedical applications. Interestingly, this major tenet of my model can be experimentally addressed: e.g. transgenic mice carrying DNA sequences homologous to infectious retro-viruses, such as murine leukemia viruses (MuLV), might be more resistant to cancer induced by experimental MuLV infections as compared to controls.

    #2. The ENCODE theory is the culmination of a 250-million-US-dollar project. Mine, zilch; well, that’s not true, my model is based on decades of remarkable scientific work by thousands and thousands of scientists who paved the road for it.

    #1. The ENCODE theory has not yet passed the famous Onion Test ( http://www.genomicron.evolverzone.com/2007/04/onion-test/), which asks: why do onions have a genome much larger than ours? Do we live in an undercover onion world? The Onion Test is so formidable and inconvenient that, to my knowledge, it has yet to make it through peer review into the conventional scientific literature or textbooks. So, does my model pass the Onion Test? I think it does, but for a while, I’m going to let you try to figure out how! And, maybe, when I submit my paper for publication, I’ll use your ideas, if the reviewers ever ask me for an answer. Isn’t that smart?

  41. Astrogator’s Logs » Blog Archive » Junk DNA, Junky PR pings back:

    [...] S.  Encode says what? (Cryptogenomicon, Sept. 8, [...]

  42. Los “virus informáticos” del ADN | Actualidad informática pings back:

    [...] does in our e-mail. I recommend reading Sean Eddy, “ENCODE says what?,” Cryptogenomicon, September 8th, 2012. On the mistakes that I myself on this blog, and a large part of the media, have made when it comes to [...]

  43. “The Designer’s Detritus” – my latest Nature Education post on ENCODE, junk DNA and intelligent design pings back:

    [...] due to the noise and imprecision inherent in various biological processes like transcription. Some critics have even gone so far as to propose the Random Genome Project, a null hypothesis test for [...]

  44. ENCODE husmea en mi armario - En menos de mil palabras pings back:

    [...] articles by the great Ed Yong and by Sean Eddy on the hype generated by the publication of the ENCODE project’s results have [...]

  45. Max Libbrecht comments:

    Thanks for writing this! I’ve been directing people that ask me about ENCODE’s 80% number to this post, and it’s great to have such a clear reference available.

  46. Sean Eddy comments:

    Thanks, Max. There’s now a version of this post coming out in Current Biology in a few weeks. A preprint is on the lab’s preprint server.

  47. Thursday linkage | Wonderful Life pings back:

    [...] project had to redefine ‘function’ to get the 80% figure. It’s worth reading ENCODE says what? at the Cryptogenomicon blog – written by labs who really know what they are talking [...]

  48. Claudiu Bandea comments:

    In my parodic comment above, “Five reasons why my theory on the function of ‘junk DNA’ is better than theirs”, I brought forward an old model (1) on the genome evolution and on the origin and function of the genomic sequences labeled ‘junk DNA’ (jDNA), which in some species represents up to 99% of the genome.

    Since then, I posted in Science five mini-essays outlining some of the key tenets associated with this model, which might solve the C-value and jDNA enigmas ( http://comments.sciencemag.org/content/10.1126/science.337.6099.1159).

    As discussed in the original paper (1) and these mini-essays, the so called jDNA serves as a defense mechanism against insertional mutagenesis, which in humans and many other multicellular species can lead to cancer.

    As expected for an adaptive defense mechanism, the amount of protective DNA varies from one species to another based on the insertional mutagenesis activity and the evolutionary constraints on genome size.

    1. Bandea CI. A protective function for noncoding, or secondary DNA. Med. Hypoth., 31:33-4. 1990.

  49. David Konerding comments:

    Sean, I read the pre-print at:

    Please be aware that my criticism originates in a genuine desire to understand the genome; that is my life’s work. I believe that we have a long way to go to understand the genome, and the technology we have today to interrogate the genome is far from capable of explaining the “apparently” paradoxical genome size question.

    Your preprint’s primary use is in clearly stating the assumptions and reasoning you made. And that’s what’s important here: being explicit about the assumptions and reasoning you made makes it a lot easier to argue about the underlying facts. For example you make the assumption “Selfish DNA elements function for themselves, rather than having an adaptive function for their host.” (although you allude to a far more subtle interplay between organismal transposition and its abuse by self-replicating entities).

    The problem is that all you or ENCODE are doing when debating the 80% figure is arguing about definitions, which isn’t particularly interesting. What would be more interesting is to look at the ENCODE data in detail, and understand the WHY of what they observe. If we can explain it with null hypotheses, rather than overreaching conclusions, that’s a good thing. If instead the new data helps us understand something vexing, that’s a *great* thing.

    Ultimately, I think we can resolve the entire debate by having ENCODE:
    1) admit that their function definition is very loose
    2) admit some of their claims are overreaching
    3) spend a lot more time coming up with significantly more sensitive and accurate methods to determine actual “functionalism” in genomic DNA.

    and having ENCODE’s detractors:
    1) spend a lot more time looking at the ENCODE data
    2) try to disprove some of their own beliefs and assumptions. I have a hunch that ENCODE’s data is telling us something,

  50. Jack comments:

    I read in a book that 97% of our DNA is junk. So this is kind of stunning. I might actually write about this on my blog…

  51. CC comments:

    My undergrad molecular genetics is really rusty, my ecological knowledge less so. Even for someone such as myself this genetic stuff is quite complex, and I imagine for the layperson, somewhat unfathomable. I remember from my studies that junk DNA existed, but in those days new things were being discovered. I have been hoping by now some resolution of debates may have occurred, but if anything it seems to have got worse! From my vantage point, natural selection mainly (although not always) produces things that are functional.
    Thus I rather like the idea of junk DNA being a pillow, a sort of protective entity as described by Claudiu Bandea.
    However there is a possibility that it is relic DNA “left over” from ancestral selection processes. You would then expect an onion to have more of it than us, as we would hopefully be more refined! It may be left over and not harmful, but it may even be preferentially maintained if not conserved due to providing such a “pillow” effect.
    From my point of view, what I really need you guys to be all doing is focussing on regulatory genes and other control processes. I am not interested in what has basic biochemical function such as junk DNA, but what has possible translation to the outside world i.e. can influence ecological responses. Can we please stop the squabble and concentrate on what Darwinian selection actually provides at the level of the genome and proteins?

  52. I have a headache reading about ENCODE: moving into the realm of “big science” | Science, I Choose You! pings back:

    [...] ENCODE says what? by Sean Eddy [...]

  53. matt comments:

    @Sean Eddy: I don’t fully understand, but like other posters I admit this is not my field, so I’m still trying to sort out how this plays out. You say transposons are the poster child for junk DNA, but then mention that Drosophila telomeres are made out of transposons. Are those telomere transposons junk or functional? You seem happy for them to be both, which seems at odds with the definitions of junk and functional. Not ENCODE’s definition of function, but as I understand Comings/Ohno evolutionary fitness terms–surely selection clearly prefers a telomere?

    If they are functional, and not junk, then don’t you need more numbers to justify why you lumped in the entire amount of transposable elements as junk, say, numbers to suggest which fraction of transposable elements have not been co-opted for evolutionary fitness and which have? The OP says “Indeed, we are now finding many interesting examples of transposon-derived stuff being co-opted for organismal function (but these are the exception, not the rule).” The last clause makes a definitive statement (upper bound on prevalence), while the opening clause suggests the number is increasing: is there data to suggest an upper bound? Are you saying “these are currently the exception, not the rule, and it’s my belief this won’t change”?

    The OP says “The ‘junk DNA’ question is about how much DNA has essentially no direct impact on the organism’s phenotype – roughly, what DNA could I remove (if I had the technology) and still get the same organism.” That, too, is imprecise: is a human with HER2 mutation the same phenotype as one without? If you removed introns completely, would RNA splicing work? What if it worked only half the time, is that a different phenotype? What if the intron required a certain minimum number of base pairs for splicing to work, which part of the intron is functional and which is not, and could the standard measures of selective pressure identify function? If you removed everything in a 5′ UTR except the exact regulatory sequence, would that work, or is the “regulatory sequence” a sequence-specific part plus a non-sequence-specific spacing to allow a 3D molecule to bind a certain sequence and still have room to accommodate the rest of its structure? Which is functional, and can you measure the selective pressure for that? Is it about REMOVING that DNA, or is it in practice about which stretches of DNA repress point mutations, and thus are non-redundant and sequence-specific?

    Wandering further off, if you eliminated the introns and the spliceosome, such that the same transcript got generated, is that the same phenotype and organism? Or have you created a new, intron-and-spliceosome-free organism? If that organism proved horribly more susceptible to attack/disabling by transposable elements, such that it died out, would that constitute proof that those parts of DNA were not junk but selectively functional? What if it didn’t die out, but became obviously different? If there were multiple splicings, and in your non-splicing organism you duplicated each as a separate gene, does that extra DNA constitute proof that the non-splicing organism doesn’t need those additional base pairs (it’s a bunch of extra DNA in virtually identical organisms)?

    You wrote “I agree, and I think that’s part of the slipperiness of the term ‘function’, and why the term ‘junk’ is only a colloquialism. The junk on my desk is junk, but if you suddenly removed it, my coffee cup would fall over and spill into my laptop; the junk has become part of the system.” Wait…are you saying that spacer DNA has a secondary function, but not a selective one? Or are you acknowledging it can have a selective function, but would still consider it “junk”? If the spacer DNA has a selective function–let’s say complete removal of the spacer ultimately results in death of the fetus–is that identifiable and measurable right now? How is it classified according to the numbers you’ve given? It might even have a transposable element in the middle.

    D. Allan Drummond writes, “A better definition is ‘a change whose fitness effect is detectable by natural selection’. Set aside for the moment our inability to measure fitness effects; this is true, unfortunate, and irrelevant to the question of whether the definition is correct.” This is on a post which is essentially ranting about ENCODE’s numbers for functioning DNA, so if you concede you can’t actually measure your correctly defined “functional” entities, why are you quibbling about somebody else’s numbers? If your hugely conservative estimator function for your “correct definition” is known to underpredict, and they’ve chosen an optimistic “incorrect definition” that acts as a liberal estimator for your definition, why are you quibbling? They can point out DNA which your estimator wrongly excludes, and you can point out DNA their definition wrongly includes.

    nr comments about how little information it would take to encode spacing rather than sequence. If I understand his point, the density of information encoded by a base pair isn’t relevant when deciding whether it is under selective pressure or not. True, in information-theoretic terms it is wasteful to use say 100 base pairs to convey a binary message “yes, transcribe this DNA” or “no, do not transcribe”. But it isn’t _coding_ for a spacer, it _is_ the spacer. It acts in sort of an epistatic way.

    Lastly, in a more philosophic vein, I think the whole wasted time around “junk DNA” springs from using inappropriate terminology. When you use words in a substantially different manner than society at large, it should come as no surprise there will be much confusion and you will have to explain your precise meaning over and over. And you really don’t have much of a leg to stand on if some of your colleagues decide to use the word in a manner more closely fitting general usage.

    In particular, Brenner (and Ohno too?) differentiate between junk you keep, and garbage you throw away. Here are relevant meanings from a dictionary description:

    n.
    1. Discarded material, such as glass, rags, paper, or metal, some of which may be reused in some form.
    2. Informal
    a. Articles that are worn-out or fit to be discarded: broken furniture and other junk in the attic.
    b. Cheap or shoddy material.
    c. Something meaningless, fatuous, or unbelievable: nothing but junk in the annual report.
    tr.v. junked, junk·ing, junks
    To discard as useless or sell to be reused as parts; scrap.
    adj.
    1. Cheap, shoddy, or worthless: junk jewelry.

    In particular, I would highlight the repeated theme that junk should be thrown out, has been thrown out, or is being thrown out. And yet, “junk DNA” is widely acknowledged to have all sorts of benefits and uses, it just is not currently under selective pressure, and is presumed to be separate from “garbage DNA” which more closely corresponds to the given definition for junk.
    It’s no wonder there’s confusion and scorn when conveyed to the general English-speaking public, and as with the musician Prince/artist-formerly-known-as, explaining the special meaning for the characters doesn’t completely fix things.

  54. Sean Eddy comments:

    Matt, I’ve started to notice that people who think that this is a debate about the semantics of the words “junk” and “function” tend not to talk about the data and observations that led to Ohno’s concept of junk DNA, and instead tend to argue from their intuition about how they think genomes should work. It is indeed complicated, for some of the reasons you discuss. You mentioned it’s not your field; if you’re interested in why someone might think the way I do, there are good books on the subject. I recommend Michael Lynch’s book The Origins of Genome Architecture.

    But what you really ought to do is read about transposons, which are super cool; and once you see how transposons work, I think you’ll see why we don’t have to imagine that they all have advantageous functions for us, any more than we imagine that the cold or flu viruses are advantageous to us. For the most part we have them because we can’t get rid of them; they’re ‘alive’ for themselves, not for us.

  55. Clifton comments:

    There is always a rebuttal to every argument isn’t there? People believe what they want to believe. What this article is really doing is providing a response to project results that threaten one’s belief in evolution and support intelligent design.
