No DNA Control: human USS

Thursday, June 18, 2009

A taste of DNA

I gave Rosie a draft of my grant application, so decided to dither with human USS motifs again.

First, for fun, I looked at where several of them were located... to see which genes taste best... and ran across USS in all sorts of random genes. (For example: a kinesin, an adductin, a phosphatidic acid phosphatase, a phosphatidic acid kinase(!), a cadherin-associated protein, a few hypothetical genes and transcription factors, etc. etc.) There were also several located outside genes and in gene-poor regions. It would be funny to do a GO annotation analysis, but most certainly a waste of time...

But to address Rosie’s comment: I found 762 10mer USS motifs in the human genome using a Tagscan search. This looks like it meets random expectations, since using Tagscan to count arbitrary 10mer motifs having similar base composition (and a CpG) gave similar numbers.

However, my analytical calculation of expected seemed way off. Where did I go wrong? (One reason why I might actually care about this is so that I could do a more precise analytical calculation of the number of USS motifs I’d expect in the Haemophilus genome.)

I had calculated the expected number of USS motifs as G * (25%)^10, for a 25% chance of drawing the correct base at each of the 10 positions (where G is genome size = 3.16e9 bases):

So G * (25%)^10 = G * 1 / 1,048,576 = G * 9.54e-7 = 3,014 instances.

But since the human genome has only 41% GC content, I might adjust this estimate to be (20.5%)^5 * (29.5%)^5, for a 20.5% chance of drawing the correct base at GC positions and a 29.5% chance of drawing the correct base at AT positions:

So G * (20.5%)^5 * (29.5%)^5 = G * 8.09e-7 = 2,556 instances.

These are for only a single strand of DNA. Since there’s also the reverse complement, we do have to multiply these values by 2. I can’t think why this wouldn’t be. So...

50% GC : 6,027
41% GC : 5,112

So even accounting for %GC failed to bring this calculation down to what Tagscan found.
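For anyone who wants to redo this arithmetic, here is a minimal Python sketch of the composition-only estimate. The function name and its parameters are my own invention for illustration; it assumes every position is drawn independently, which is exactly the assumption that turns out to be too simple.

```python
def expected_count(motif, genome_size, gc_fraction, strands=2):
    """Expected number of exact motif matches if bases are drawn independently.

    Assumes G and C are equally likely (gc_fraction / 2 each), and likewise
    A and T. strands=2 also counts the reverse complement.
    """
    p_gc = gc_fraction / 2          # chance of one specific G or C
    p_at = (1 - gc_fraction) / 2    # chance of one specific A or T
    p = 1.0
    for base in motif:
        p *= p_gc if base in "GC" else p_at
    return strands * genome_size * p


USS = "AAAGTGCGGT"
G = 3.16e9  # genome size used in the post

for gc in (0.50, 0.41):
    print(f"{gc:.0%} GC: {expected_count(USS, G, gc):,.0f}")
```

Doubling for the second strand is only right when the motif is not its own reverse complement, which holds for the USS core.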

What about dinucleotide composition?

It’s well known that mammalian genomes have a dearth of CpG dinucleotides, since these are used as sites of gene regulation by cytosine DNA methylation. Methylated cytosines tend to deaminate into thymines, so there is a mutational pressure on CpG to go to TpG dinucleotides.

Anyways, I managed to find a nice table in this paper reporting dinucleotide composition in several genomes. The source the table’s authors cite for the human numbers dates from 1962, so these are probably not particularly precise numbers, but they are sufficient for my purposes. It looks like the CpG dinucleotide is ~4-fold rarer than would be expected for a random genome with base composition like humans’.

This paucity of CpG dinucleotides in the human genome could account for the discrepancy. So just for giggles, I ran a few additional Tagscans for 10mers with the correct base composition but lacking a CpG dinucleotide, giving the following numbers: 9143, 6698, and 4343 motifs. Those numbers are a lot closer to my analytical estimate of ~5,100 (not terribly precise, but more accurate).

But how can I actually use the known distribution of dinucleotides in the human genome to arrive at an even more accurate estimate of how many USS motifs I’d expect to see? I thought up some crummy ways to sorta-kinda account for the CpG deficit, but would ideally like to be able to use arbitrary dinucleotide frequencies and arbitrary %GC to produce an expected value. I have a terrible feeling I need to use Markov or Ising models and just don’t have the heart for it right now...
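One way this could go, sketched in Python: a first-order Markov estimate, where the chance of a motif is the product of its dinucleotide frequencies divided by the frequencies of the interior bases. Everything named here is hypothetical, and the frequency table is a toy one built from %GC plus a single CpG depletion factor, not measured data. Swapping in a real dinucleotide table would be the obvious next step.

```python
def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def toy_frequencies(gc_fraction, cpg_ratio):
    """Toy tables (an assumption, not measured data): independent bases,
    except that CpG is scaled by cpg_ratio (observed / expected).
    The table is not renormalized, so it sums to slightly less than 1."""
    mono = {"A": (1 - gc_fraction) / 2, "T": (1 - gc_fraction) / 2,
            "G": gc_fraction / 2, "C": gc_fraction / 2}
    di = {a + b: mono[a] * mono[b] for a in mono for b in mono}
    di["CG"] *= cpg_ratio
    return mono, di


def markov_probability(motif, mono, di):
    """P(motif) = product of dinucleotide freqs / product of interior base freqs."""
    p = 1.0
    for a, b in zip(motif, motif[1:]):
        p *= di[a + b]
    for base in motif[1:-1]:
        p /= mono[base]
    return p


def expected_count(motif, genome_size, gc_fraction, cpg_ratio):
    """Expected matches on both strands under the toy first-order model."""
    mono, di = toy_frequencies(gc_fraction, cpg_ratio)
    both = (markov_probability(motif, mono, di)
            + markov_probability(reverse_complement(motif), mono, di))
    return genome_size * both


print(expected_count("AAAGTGCGGT", 3.16e9, 0.41, 0.25))
```

With this toy table, a 4-fold CpG deficit simply cuts the composition-only estimate to a quarter, because the USS core and its reverse complement each contain exactly one CpG. A measured table would also shift the other nine dinucleotide steps.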
----
As an aside, UNIX is pretty awesome. To figure out the number of motifs Tagscan was finding (their output is a list of matches), I simply typed:

> wc -l filename

And that gave me the number of lines in the file. The first row was the header, and the last row was blank, so subtracting 2 from the number gave me the total number of motifs.

If I had Tagscan (and the human genome) on my computer, I could fairly easily set up a script to iterate through a whole bunch of 10mers with specified parameters and draw up a distribution. If I was really cool, I'd exclusively use UNIX commands to do this. This would then allow me to ask how significant the observed number of USSs is. (Obviously p > 0.05, but that would be one way to do a real statistical test, even if I never figured out how to work out the expected value by an analytical method.)
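The first half of such a script, generating the control 10mers, might look something like this Python sketch. The function and its parameters are made up for illustration, and actually counting each control in the genome is left to Tagscan or whatever search tool is at hand.

```python
import random


def shuffled_controls(motif, n, require="CG", seed=0):
    """Return n distinct shuffles of motif that still contain `require`.

    Shuffling keeps the base composition identical to the motif.
    n must be small compared with the number of distinct permutations,
    otherwise the loop will never finish.
    """
    rng = random.Random(seed)   # fixed seed so the control set is repeatable
    bases = list(motif)
    controls = set()
    while len(controls) < n:
        rng.shuffle(bases)
        candidate = "".join(bases)
        if candidate != motif and require in candidate:
            controls.add(candidate)
    return sorted(controls)


for kmer in shuffled_controls("AAAGTGCGGT", 5):
    print(kmer)
```

Passing require="" would drop the CpG constraint, since the empty string is found in every candidate.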

Tuesday, June 16, 2009

They're eating our DNA!!!

I am in the midst of proposal-writing, which makes blogging tougher, but when I hit a writing-block at the end of the day, I decided to dally with a random thing I'd been meaning to figure out.

Several friends of mine, when I tell them about my plans with the naturally competent bacteria, have said, "Dude, you should feed them human DNA!"

Sound silly? Maybe it is, but I went ahead and used TAGSCAN to search the human genome for the 10-bp core USS motif: AAAGTGCGGT and found 762 instances of the USS core. BLAST and BLAT didn't want to deal with me due to how short the query USS motif is.
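For a sense of what an exact-match search like this boils down to, here is a naive Python sketch that counts a motif on both strands of a sequence. This is not how TAGSCAN works internally (I have no idea about that), the function names are my own, and a plain loop like this would be far too slow for a whole genome.

```python
def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_both_strands(sequence, motif):
    """Count exact matches to motif or its reverse complement,
    including overlapping ones."""
    # A set, so a palindromic motif is not counted twice per site.
    targets = {motif, reverse_complement(motif)}
    k = len(motif)
    return sum(sequence[i:i + k] in targets
               for i in range(len(sequence) - k + 1))


USS = "AAAGTGCGGT"
toy_sequence = "TT" + USS + "CC" + reverse_complement(USS) + "A"
print(count_both_strands(toy_sequence, USS))
```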

That means Haemophilus influenzae might find nearly 800 bits of our genomes tasty! I remember from my reading that there's nearly 200 micrograms of DNA per milliliter of our lung mucus, which seems like a heck of a lot. Much of this DNA is probably human...

(For context, Haemophilus has on average one USS less than every 2 kilobases, whereas 762 sites in roughly 3 billion bases puts the human average at about one per 4 megabases. So while humans have about half as many USSs, their density is at least 2,000 times lower.)

It isn't surprising the human genome contains at least some "USS". A random 10mer string would have a 1/(4^10), or a little less than 1 / 1,000,000, chance of being the USS motif. But the human genome is more than 3 billion base pairs. So randomly, we might expect to find more than 6000 USS (3 billion / 1 million X 2). (The 2 is to also count the reverse complement.)

If I did that right, then the observed number of USS motifs in the human genome is rather less than expected. That's interesting...

However, that was a crude estimate of the expected number of USS motifs. The USS core sequence has 5 GC bases and 5 AT bases, giving a GC base composition of 50%, whereas the human genome has only 41% GC content. There's also the issue of dinucleotide frequencies. For example, the CpG dinucleotide is underrepresented in the human genome, but happens to appear in the USS. I've been trying to figure out a rational way to incorporate this type of information into my estimate of expected, but so far have failed to do so properly. Regardless, my estimate of expected is certainly too high. The question is: how much so?

To produce control numbers, maybe tomorrow I'll run a few other TAGSCANs for arbitrary 10mers with the same GC content and a CpG and see how many come up. I haven't really looked at the distribution of USS, except that the number of USS per chromosome is highly correlated with chromosome size (R^2 = 0.91). I might also predict that USS will fall into more GC-rich regions of the genome.

But for fun, assuming that the result held, what might it mean if the USS motif is significantly underrepresented in the human genome? I can hardly imagine that Haemophilus could be responsible ("it ate them!!!"), but maybe it could work in the other direction? Perhaps the USS motif is only coincidentally somewhat rare in humans, and as such makes a good sequence to use for preferring conspecific DNA uptake? If the USS motif was extremely abundant in the human genome, then naturally competent Haemophilus might not take up conspecific DNA as easily? Hmmm...

Regardless, think about that when you fall asleep, folks... There's bacteria inside you, and THEY'RE EATING YOUR DNA...

UPDATE:

Putting my proposal together continues, but in a bout of procrastination I did go ahead and run a few random 10mers with the same base composition (and a CG dinucleotide) through TAGSCAN, and it looks like the observed number of USS motifs is about expected for 10mers of similar composition.

The USS motif:
AAAGTGCGGT 762

Randomized USS motifs:
GTACGTAAGG 414
CGAGAGAGTT 761
GAATACGTGG 1112
AGATAGCGTG 714
GTAACGAGTG 458
AGACGTTAGG 713

Nothing to see here... Move along...
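To put a rough number on "nothing to see here", the counts above can be run through a quick empirical comparison. This is only a sketch in Python with names of my own choosing, and six controls is far too few for a real test.

```python
from statistics import mean, stdev

observed = 762  # AAAGTGCGGT

# Randomized 10mers and their TAGSCAN counts, copied from the list above.
controls = {
    "GTACGTAAGG": 414,
    "CGAGAGAGTT": 761,
    "GAATACGTGG": 1112,
    "AGATAGCGTG": 714,
    "GTAACGAGTG": 458,
    "AGACGTTAGG": 713,
}
counts = list(controls.values())


def empirical_p(observed, counts, tail="lower"):
    """Fraction of values (controls plus the observation itself)
    at least as extreme as the observation, in the chosen tail."""
    if tail == "lower":
        hits = sum(c <= observed for c in counts)
    else:
        hits = sum(c >= observed for c in counts)
    return (hits + 1) / (len(counts) + 1)


print(f"control mean {mean(counts):.0f}, sd {stdev(counts):.0f}")
print(f"p (as low or lower):   {empirical_p(observed, counts, 'lower'):.2f}")
print(f"p (as high or higher): {empirical_p(observed, counts, 'upper'):.2f}")
```

With only six controls the smallest p-value this can ever report is 1/7, so no result could reach p < 0.05. A proper version would need many more shuffled 10mers.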
