Tuesday, June 9, 2009


Howzabout that “supragenome”?

Polymorphic gene content

Genome sequencing efforts have captured substantial variation in gene content among closely related bacteria. For example, this nice study by Hogg et al. 2007 compared the genome sequences of thirteen Haemophilus influenzae isolates: they found a “core genome” of ~1500 genes shared by all, plus an accessory (or contingency) genome of ~1300 genes. Any given isolate carried only a subset of the accessory genome, typically a few hundred extra genes beyond the core.
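To make the core/accessory distinction concrete, here’s a minimal sketch of how you might partition genes given a gene presence/absence table. (The gene and isolate names here are just toy examples, not data from the paper.)

```python
# Toy sketch: classify genes as "core" (present in every isolate) or
# "accessory" (present in only some), from a presence/absence table.
# Gene names and the three-isolate set are made up for illustration.

presence = {
    "recA":  {"Rd", "86-028NP", "PittEE"},  # in all three isolates
    "hmw1A": {"86-028NP"},                  # in one isolate
    "lic2B": {"Rd", "PittEE"},              # in two isolates
}
isolates = {"Rd", "86-028NP", "PittEE"}

core = {gene for gene, hits in presence.items() if hits == isolates}
accessory = set(presence) - core

print(sorted(core))       # ['recA']
print(sorted(accessory))  # ['hmw1A', 'lic2B']
```

With real data, `presence` would come from an ortholog-clustering step across all thirteen genomes, but the set logic is the same.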

So any two isolates have substantial amounts of unshared DNA. For example, Hogg et al. report that the Kw20/Rd and 86-028NP isolates differ by nearly 400,000 bp distributed across only ~250 indels (Table 5; mean size ~1.5 kb, median size ~300 bp). The genomes are < 2 megabases long, so roughly 20% of each chromosome is non-homologous.
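The back-of-the-envelope arithmetic behind that figure (using ~1.83 Mb for Rd KW20; the exact sizes don’t change the conclusion much):

```python
# Rough arithmetic behind the ~20% figure; genome size is approximate.
unshared_bp = 400_000    # total indel bases between Rd and 86-028NP
genome_bp = 1_830_000    # Rd KW20 is ~1.83 Mb
print(round(100 * unshared_bp / genome_bp))  # -> 22
```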

(As an aside, the Hogg et al. paper’s methods section introduced me to MUMmer, which I previously discussed. The indels and other rearrangements are essentially defined by breaks in the alignment produced by the nucmer utility. More on this in the future...)

There are several possible arguments relating uptake specificity to the “supragenome”:

Uptake specificity for variation?

Large variation in gene content among close relatives suggests to Hogg et al. and others a “distributed genome hypothesis”, in which natural transformation shuffles (or re-assorts, or segregates) the accessory genes between isolates. This would presumably allow rapid acquisition and loss of different genes (diversification) within a given genetic background, and thus perhaps rapid adaptation to environmental changes (or shifting host defenses). The “distributed genome hypothesis” is thus implicitly related to the “sex hypothesis” for the maintenance of natural competence.

Uptake specificity for conservation?

On the other hand, natural transformation could also maintain the “core genome”: if conspecific DNA uptake is common, a cell that has lost a bit of the core genome could have it restored by recombination with DNA taken up from an intact neighbor. This nice study by Treangen et al. 2008, using several neisserial genome sequences, showed that DNA uptake sequences (DUS, the neisserial equivalent of USS) occur at a higher density in “core” regions of the genome than in the substantial alignment gaps between isolates (which contain the indel polymorphisms).
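A toy version of the kind of density comparison Treangen et al. make: count DUS occurrences per kb in “core” versus “gap” sequence. The 10-mer GCCGTCTGAA is the classic neisserial DUS; the sequences below are fabricated strings, not real genomic data.

```python
# Count DUS motif occurrences per kb in two toy sequences.
DUS = "GCCGTCTGAA"  # the classic neisserial DNA uptake sequence

def dus_per_kb(seq):
    """DUS occurrences per 1000 bases, by naive sliding-window scan."""
    count = sum(1 for i in range(len(seq) - len(DUS) + 1)
                if seq[i:i + len(DUS)] == DUS)
    return 1000 * count / len(seq)

core_seq = ("ATGC" * 50 + DUS) * 5  # toy "core" region, seeded with DUS
gap_seq = "ATGC" * 300              # toy "gap" region, no DUS

print(dus_per_kb(core_seq) > dus_per_kb(gap_seq))  # True
```

A real analysis would scan both strands and tolerate near-matches, but the per-kb density comparison is the essential idea.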

Again, this invokes the “sex hypothesis” for the maintenance of natural competence, but working in the opposite direction: maintaining the core rather than shuffling the accessory. I think the argument goes like this: (1) The core genome likely defines the more essential portions of the genome, since by definition accessory genes are not required for viability. (2) DUS could have been selected for within this partition of the genome, since they would help maintain the more essential gene functions within a population. (3) Therefore, the high density of DUS could be a product of natural selection to maintain the integrity of the “core genome”.

Uptake specificity for no reason in particular?

However, there’s another possibility the authors partially explore that does not involve selection for DUS distributed throughout the genome, but represents almost the opposite model. Instead of selection, perhaps DUS accumulate due to happenstance intrinsic biases in the uptake and/or recombination machinery, via a neutral molecular drive: sequence variants that arise with a higher chance of being taken up are more likely to spread through populations than variants with a lower chance of uptake. Thus the “core genome” could be conserved across isolates not exclusively because of essentiality or usefulness, but also by virtue of containing lots of DUS. Rather than DUS being selected for in order to maintain the core genome, segments of DNA containing DUS are simply more easily replaced in lineages that lose them.
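A toy simulation can illustrate how such a drive might work. The sketch below (all parameters invented for illustration) follows a variant’s frequency under genetic drift plus transformation, where variant-bearing DNA is preferentially taken up; its fixation probability ends up well above the neutral expectation of the starting frequency.

```python
import random

def fixation_fraction(uptake_bias, pop=50, p0=0.1, t_rate=0.2,
                      gens=1000, trials=200, seed=1):
    """Fraction of simulated populations in which the variant fixes."""
    rng = random.Random(seed)
    fixed = 0
    for _ in range(trials):
        p = p0
        for _ in range(gens):
            # genetic drift: binomial resampling of the variant frequency
            p = sum(rng.random() < p for _ in range(pop)) / pop
            if p in (0.0, 1.0):
                break
            # transformation: a fraction t_rate of cells replace their
            # allele with a donor allele; variant-bearing molecules are
            # uptake_bias times more likely to be the ones taken up
            donor_p = uptake_bias * p / (uptake_bias * p + (1 - p))
            p = (1 - t_rate) * p + t_rate * donor_p
        fixed += (p == 1.0)
    return fixed / trials

print(fixation_fraction(1.0))  # neutral: fixes roughly p0 of the time
print(fixation_fraction(5.0))  # biased uptake: fixes far more often
```

The point is only qualitative: even with no fitness difference, a bias in which molecules get taken up is enough to push a variant through a population.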

An affiliated idea: if some accessory genes came from distant relatives and arrived by horizontal transfer via some mechanism other than natural competence, these sequences would not yet have had time to accumulate uptake sequences. Thus the paucity of DUS in the accessory genome might partly reflect the more recent arrival of those sequences in the genome, where the effects of drive have not yet become evident, rather than specific selection to maintain DUS in the more important segments of the genome.

(The Treangen et al. paper introduced me to another genome alignment tool called M-GCAT. I’ve played with it a bit and managed to produce some figures, like the one above, effectively the same as those in their supplementary data, along with alignment files resembling multi-FASTA format. But I have the unfortunate problem of being unable to re-load analyses I’ve performed, due to some kind of Python error. More on this later as well...)

How to analyze the core and accessory genomes myself?

I’ve clearly got a lot more thinking to do regarding these core and accessory genomes... Especially in light of the horizontal gene transfer issue.

But first I’d better figure out simply how to define the core and accessory genomes more specifically.

I’ve begun this by examining the gaps in the .rdiff and .qdiff output of dnadiff (a pairwise comparison of two genomes) to try some basic analysis myself. In a future post, I’ll report on my progress, but for now I’ll just mention that most of the gaps are not strictly insertions or deletions, but are rather “insertional deletions”: the alignment gaps include both reference and query bases. But I still need to understand how dnadiff produces its .report output before I can get much further...
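A sketch of how that gap classification might look in Python. I’m assuming GAP lines in the .rdiff/.qdiff files are whitespace-delimited as sequence id, feature type, start, end, gap length in this sequence, gap length in the other sequence, and their difference — check that layout against your dnadiff version’s documentation before trusting the counts.

```python
# Classify GAP features from a dnadiff .rdiff or .qdiff file.
# ASSUMED column layout (verify against your dnadiff version):
#   seq-id  GAP  start  end  len-this  len-other  len-diff

def classify_gaps(path):
    counts = {"insertion": 0, "deletion": 0, "both": 0}
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) < 6 or fields[1] != "GAP":
                continue  # skip BRK/JMP/INV/etc. and short lines
            len_this, len_other = int(fields[4]), int(fields[5])
            if len_this > 0 and len_other > 0:
                counts["both"] += 1       # an "insertional deletion"
            elif len_this > 0:
                counts["insertion"] += 1  # unaligned bases on this side only
            elif len_other > 0:
                counts["deletion"] += 1   # unaligned bases on the other side
    return counts
```

On this layout, a gap with positive length on both sides is one of those “insertional deletions” — both genomes contribute unaligned bases at the same break.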

1 comment:

  1. If a deletion or rearrangement creates a novel joint that is close to an uptake sequence, would that change be more likely to be maintained?

    Next week we really need to start producing both data and text for our various proposals. By then my backlog should be cleared...