Tuesday, June 16, 2009

They're eating our DNA!!!

I am in the midst of proposal-writing, which makes blogging tougher, but when I hit writer's block at the end of the day, I decided to dally with a random thing I'd been meaning to figure out.

Several friends of mine, when I tell them about my plans with the naturally competent bacteria, have said, "Dude, you should feed them human DNA!"

Sound silly? Maybe it is, but I went ahead and used TAGSCAN to search the human genome for the 10-bp core of the uptake signal sequence (USS) motif, AAAGTGCGGT, and found 762 instances. BLAST and BLAT didn't want to deal with me because the query motif is so short.
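For anyone wondering what "searching for the motif" amounts to, here's a minimal Python sketch of a naive scan that counts a motif and its reverse complement in a sequence string. This is not how TAGSCAN works internally; the function names and the toy sequence are made up for illustration, and the input is assumed to be uppercase A/C/G/T.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def count_both_strands(genome, motif):
    """Count windows of `genome` that match `motif` on either strand."""
    # A set means a palindromic motif is not counted twice per window.
    targets = {motif, reverse_complement(motif)}
    k = len(motif)
    return sum(1 for i in range(len(genome) - k + 1) if genome[i:i + k] in targets)


USS = "AAAGTGCGGT"
# Hypothetical toy sequence: one forward USS and one reverse-complement USS.
toy = "CC" + USS + "TT" + reverse_complement(USS) + "GG"
print(count_both_strands(toy, USS))
```

Whether a hit count like 762 covers one strand or both depends on how the search tool was run, which matters for any comparison against an expected number.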

That means Haemophilus influenzae might find nearly 800 bits of our genomes tasty! I remember from my reading that there are nearly 200 micrograms of DNA per milliliter of our lung mucus, which seems like a heck of a lot. Much of this DNA is probably human...

(For context, Haemophilus has on average more than one USS every 2 kilobases, while 762 hits in roughly 3 billion base pairs works out to about one USS per 4 megabases in humans. So while humans have half as many USSs, the density is at least a couple thousand times lower.)

It isn't surprising that the human genome contains at least some "USS". A random 10mer string would have a 1/(4^10), or a little less than 1 / 1,000,000, chance of being the USS motif. But the human genome is more than 3 billion base pairs. So randomly, we might expect to find roughly 6,000 USS (3 billion / 1 million x 2). (The 2 is to also count the reverse complement.)
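The back-of-the-envelope arithmetic above can be written out as a few lines of Python. This is only a sketch of the crude uniform model (every base equally likely, positions independent); the 3-billion genome size is a round-number assumption, and the variable names are just for illustration.

```python
GENOME_SIZE = 3_000_000_000  # assumed round figure for the human genome, in bp
MOTIF_LEN = 10               # length of the USS core

# Under a uniform model each base has probability 1/4 at every position.
p_match = 1 / 4 ** MOTIF_LEN

# Ignore edge effects: treat every position as a possible start site.
expected_one_strand = GENOME_SIZE * p_match

# Double it to count the reverse complement as well.
expected_both_strands = 2 * expected_one_strand

print(f"P(match) = {p_match:.3e}")
print(f"expected on one strand:   {expected_one_strand:.0f}")
print(f"expected on both strands: {expected_both_strands:.0f}")
```

With exactly 3 billion bp this lands a bit under 6,000; a slightly larger genome figure nudges it up toward 6,000.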

If I did that right, then the observed number of USS motifs in the human genome is rather less than expected. That's interesting...

However, that was a crude estimate of the expected number of USS motifs. The USS core sequence has 5 GC bases and 5 AT bases, giving a GC content of 50%, whereas the human genome has only 41% GC content. There's also the issue of dinucleotide frequencies. For example, the CpG dinucleotide is underrepresented in the human genome, but happens to appear in the USS. I've been trying to figure out a rational way to incorporate this type of information into my expected count, but so far have failed to do so properly. Regardless, my expected count is certainly too high. The question is: by how much?
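One simple way to fold base composition into the estimate is sketched below in Python. It assumes independent bases with G and C each at half the GC fraction (and A and T each at half the remainder), uses the 41% GC figure from above and a round 3-billion-bp genome, and does nothing at all about dinucleotide effects such as the CpG shortage. The function name is made up for illustration.

```python
def motif_probability(motif, gc_fraction):
    """Probability of `motif` at one position, assuming independent bases."""
    p_gc = gc_fraction / 2        # probability of G, and of C
    p_at = (1 - gc_fraction) / 2  # probability of A, and of T
    prob = 1.0
    for base in motif:
        prob *= p_gc if base in "GC" else p_at
    return prob


USS = "AAAGTGCGGT"
GENOME_SIZE = 3_000_000_000  # assumed round figure, in bp
HUMAN_GC = 0.41              # GC content quoted in the post

p_uniform = motif_probability(USS, 0.50)
p_adjusted = motif_probability(USS, HUMAN_GC)

# Factor of 2 counts the reverse complement, which has the same GC content.
expected_uniform = 2 * GENOME_SIZE * p_uniform
expected_adjusted = 2 * GENOME_SIZE * p_adjusted

print(f"uniform model:     {expected_uniform:.0f}")
print(f"41% GC, no CpG fix: {expected_adjusted:.0f}")
```

Under these assumptions the composition-adjusted figure comes out near 4,850: lower than the uniform guess, but still far above 762, so base composition alone doesn't look like enough to explain the gap.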

To produce control numbers, maybe tomorrow I'll run a few other TAGSCANs for arbitrary 10mers with the same GC content and a CpG and see how many come up. I haven't really looked at the distribution of USS, except that the number of USS per chromosome is highly correlated with chromosome size (R^2 = 0.91). I might also predict that USS will fall into more GC-rich regions of the genome.
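Control 10mers of this kind are easy to generate by shuffling the letters of the USS and keeping only the shuffles that still contain a CG. The Python below is a minimal sketch of that idea, not a record of how any particular control set was made; the function name, the seed, and the choice of six controls are all arbitrary.

```python
import random


def shuffled_controls(motif, n, required="CG", seed=0):
    """Return n distinct shuffles of `motif` that contain `required`.

    Shuffling keeps the exact base composition; the filter keeps the CpG.
    Assumes n is small relative to the number of valid shuffles.
    """
    rng = random.Random(seed)  # fixed seed so the list is reproducible
    controls = set()
    while len(controls) < n:
        letters = list(motif)
        rng.shuffle(letters)
        candidate = "".join(letters)
        if required in candidate and candidate != motif:
            controls.add(candidate)
    return sorted(controls)


USS = "AAAGTGCGGT"
for control in shuffled_controls(USS, n=6):
    print(control)
```

Each control could then be run through the same search as the real motif. Dropping the `required` filter (or inverting it) would give the CpG-free comparison set.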

But for fun, assuming that the result held, what might it mean if the USS motif is significantly underrepresented in the human genome? I can hardly imagine that Haemophilus could be responsible ("it ate them!!!"), but maybe it could work in the other direction? Perhaps the USS motif is only coincidentally somewhat rare in humans, and as such makes a good sequence to use for preferring conspecific DNA uptake? If the USS motif were extremely abundant in the human genome, then naturally competent Haemophilus might not take up conspecific DNA as easily? Hmmm...

Regardless, think about that when you fall asleep, folks... There's bacteria inside you, and THEY'RE EATING YOUR DNA...

UPDATE:

Putting my proposal together continues, but in a bout of procrastination I did go ahead and run a few random 10mers with the same base composition (and a CG dinucleotide) through TAGSCAN, and it looks like the observed number of USS motifs is about what you'd expect for 10mers of similar composition.

The USS motif:
AAAGTGCGGT 762

Randomized USS motifs:
GTACGTAAGG 414
CGAGAGAGTT 761
GAATACGTGG 1112
AGATAGCGTG 714
GTAACGAGTG 458
AGACGTTAGG 713
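To put a number on "about expected", here's a quick Python summary of the counts listed above. It only restates the table; with six controls it's a sanity check, not a statistical test.

```python
from statistics import mean, median

uss_count = 762  # hits for the real USS core, AAAGTGCGGT

# Hit counts for the shuffled control 10mers, copied from the list above.
control_counts = {
    "GTACGTAAGG": 414,
    "CGAGAGAGTT": 761,
    "GAATACGTGG": 1112,
    "AGATAGCGTG": 714,
    "GTAACGAGTG": 458,
    "AGACGTTAGG": 713,
}

values = sorted(control_counts.values())
controls_below_uss = sum(v < uss_count for v in values)

print(f"control range:  {values[0]} to {values[-1]}")
print(f"control mean:   {mean(values):.1f}")
print(f"control median: {median(values)}")
print(f"controls below the USS count: {controls_below_uss} of {len(values)}")
```

The USS count sits inside the control range, a little above the control mean and median.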

Nothing to see here... Move along...

1 comment:

  1. But.... isn't the 762 still much lower than the calculated 'expected' number? Your new tests indicate that the discrepancy arises only because this calculation is wrong, so it would be good to find the error(s). I'm now having doubts about that factor of 2, and maybe you should try taking base composition into account. What if you test some sequences that don't have a CpG?
