Thursday, June 18, 2009

A taste of DNA

I gave Rosie a draft of my grant application, so I decided to dither with human USS motifs again.

First, for fun, I looked at where several of them were located... to see which genes taste best... and ran across USS in all sorts of random genes. (For example: a kinesin, an adductin, a phosphatidic acid phosphatase, a phosphatidic acid kinase(!), a cadherin-associated protein, a few hypothetical genes and transcription factors, etc. etc.) There were also several located outside genes and in gene-poor regions. It would be funny to do a GO annotation analysis, but most certainly a waste of time...

But to address Rosie’s comment: I found 762 10mer USS motifs in the human genome using a Tagscan search. This looks like it meets random expectations, since using Tagscan to count arbitrary 10mer motifs having similar base composition (and a CpG) gave similar numbers.

However, my analytical calculation of the expected number seemed way off. Where did I go wrong? (One reason why I might actually care about this is so that I could do a more precise analytical calculation of the number of USS motifs I’d expect in the Haemophilus genome.)

I had calculated the expected number of USS motifs as G * (25%)^10, where (25%)^10 is the chance that a random 10mer matches the motif, given a 25% chance of drawing the correct base at each position, and G is genome size = 3.16e9 bases:

So G * (25%)^10 = G * 1 / 1,048,576 = G * 9.54e-7 = 3,014 instances.

But since the human genome has only 41% GC content, I should adjust this calculation to use (20.5%)^5 * (29.5%)^5, for a 20.5% chance of drawing the correct base at GC positions and a 29.5% chance of drawing the correct base at AT positions:

So G * (20.5%)^5 * (29.5%)^5 = G * 8.09e-7 = 2,556 instances.

These are for only a single strand of DNA. Since the motif can also occur as its reverse complement, these values have to be multiplied by 2. I can’t think why this wouldn’t be so. So...

50% GC : 6027
41% GC : 5112

So even accounting for %GC failed to bring this calculation down to what Tagscan found.
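The arithmetic above can be written as a short Python sketch. The function name and parameters are my own, and it assumes (as the exponents in the formula imply) a 10mer with five G/C positions and five A/T positions, with every position drawn independently.

```python
def expected_motif_count(genome_size, gc_content, n_gc, n_at, strands=2):
    """Expected occurrences of one fixed motif under independent base draws.

    Assumes G and C are equally frequent, as are A and T, so each G/C
    position matches with probability gc_content / 2 and each A/T position
    with probability (1 - gc_content) / 2.
    """
    p_gc = gc_content / 2
    p_at = (1 - gc_content) / 2
    p_motif = (p_gc ** n_gc) * (p_at ** n_at)
    # strands=2 also counts the reverse complement of the motif.
    return strands * genome_size * p_motif


if __name__ == "__main__":
    G = 3.16e9  # genome size used in the post
    for gc in (0.50, 0.41):
        print(f"{gc:.0%} GC: {expected_motif_count(G, gc, 5, 5):.0f}")
```

This ignores the small edge effect at chromosome ends (there are G - 9 windows, not G), which is negligible at this scale.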

What about dinucleotide composition?

It’s well known that mammalian genomes have a dearth of CpG dinucleotides, since these are used as sites of gene regulation by cytosine DNA methylation. Methylated cytosines tend to deaminate into thymines, so there is a mutational pressure on CpG to go to TpG dinucleotides.

Anyways, I managed to find a nice table in this paper reporting dinucleotide composition in several genomes. The paper the table’s authors used for humans dates from 1962, so these are probably not particularly precise numbers, but they are sufficient for my purposes. It looks like the CpG dinucleotide is ~4-fold less frequent than would be expected for a random genome with base composition like humans’.

This paucity of CpG dinucleotides in the human genome could account for the discrepancy. So just for giggles, I ran a few additional Tagscans for 10mers with the correct base composition but lacking a CpG dinucleotide, giving the following numbers: 9143, 6698, and 4343 motifs. Those numbers look a lot more on-target (not terribly precise, but more accurate).

But how can I actually use the known distribution of dinucleotides in the human genome to arrive at an even more accurate estimate of how many USS motifs I’d expect to see? I thought up some crummy ways to sorta-kinda account for the CpG deficit, but would ideally like to be able to use arbitrary dinucleotide frequencies and arbitrary %GC to produce an expected value. I have a terrible feeling I need to use Markov or Ising models and just don’t have the heart for it right now...
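One way to fold dinucleotide frequencies into the expectation is a first-order Markov chain: the probability of a motif is the probability of its first base times the conditional probability of each following base given the one before it. The sketch below is only an illustration of that idea. All function names are hypothetical, the example 10mer is an arbitrary placeholder rather than the USS, and the dinucleotide table is a crude stand-in (independent base draws with CpG scaled down 4-fold and renormalized), not measured human frequencies.

```python
from itertools import product

BASES = "ACGT"


def independent_dinucs(gc_content):
    """Dinucleotide frequencies implied by independent base draws."""
    base = {"A": (1 - gc_content) / 2, "T": (1 - gc_content) / 2,
            "C": gc_content / 2, "G": gc_content / 2}
    return {a + b: base[a] * base[b] for a, b in product(BASES, repeat=2)}


def deplete(dinucs, pair="CG", factor=0.25):
    """Scale one dinucleotide down and renormalize (illustrative only)."""
    scaled = dict(dinucs)
    scaled[pair] *= factor
    total = sum(scaled.values())
    return {k: v / total for k, v in scaled.items()}


def markov_probability(motif, dinucs):
    """P(motif) under a first-order Markov chain built from dinucs."""
    # Frequency of each base as the first member of a dinucleotide.
    first = {a: sum(dinucs[a + b] for b in BASES) for a in BASES}
    p = first[motif[0]]
    for a, b in zip(motif, motif[1:]):
        p *= dinucs[a + b] / first[a]  # P(next base is b | current is a)
    return p


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def expected_count(motif, dinucs, genome_size):
    """Expected hits on both strands: the motif plus its reverse complement."""
    return genome_size * (markov_probability(motif, dinucs)
                          + markov_probability(reverse_complement(motif), dinucs))


if __name__ == "__main__":
    table = deplete(independent_dinucs(0.41))  # stand-in for a measured table
    placeholder = "ACGTACGTAC"                 # arbitrary 10mer, NOT the USS
    print(expected_count(placeholder, table, 3.16e9))
```

With a measured dinucleotide table swapped in for the stand-in, the same functions would give an expectation for any motif and any %GC. With the undepleted table, the result reduces to the simple product of base frequencies.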

As an aside, UNIX is pretty awesome. To figure out the number of motifs Tagscan was finding (its output is a list of matches), I simply typed:
> wc -l filename

And that gave me the number of lines in the file. The first row was the header, and the last row was blank, so subtracting 2 from the number gave me the total number of motifs.
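The same count can be done without the manual subtraction. This is a small Python sketch that assumes the output layout described above (one header row, one match per line, possibly a blank line at the end); the function name is made up for illustration.

```python
def count_matches(lines):
    """Count match rows: skip the header row and ignore blank lines."""
    rows = list(lines)[1:]  # drop the header
    return sum(1 for row in rows if row.strip())


if __name__ == "__main__":
    import sys
    # Usage: python count_matches.py filename
    with open(sys.argv[1]) as handle:
        print(count_matches(handle))
```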

If I had Tagscan (and the human genome) on my computer, I could fairly easily set up a script to iterate through a whole bunch of 10mers with specified parameters and draw up a distribution. If I was really cool, I'd exclusively use UNIX commands to do this. This would then allow me to ask what the significance of the number of USSs would be. (Obviously p > 0.05, but that would be one way to do a real statistical test, even if I never figured out how to work out the expected value by an analytical method.)
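A rough sketch of what such a script might look like in Python, leaving out the Tagscan step itself (the genome counts would have to come from running each generated 10mer through the search). The names are hypothetical. Control motifs are made by shuffling a template 10mer, which preserves base composition, and keeping only shuffles that contain a CpG. The p-value is the usual one-sided empirical one: the fraction of control counts at least as large as the observed count, with a +1 correction.

```python
import random


def shuffled_motifs(template, n, require="CG", seed=0, max_tries=100_000):
    """Return up to n distinct shuffles of template that contain `require`.

    Shuffling keeps the base composition fixed. Pass require=None to skip
    the CpG filter. Gives up after max_tries shuffles.
    """
    rng = random.Random(seed)
    bases = list(template)
    found = set()
    for _ in range(max_tries):
        if len(found) == n:
            break
        rng.shuffle(bases)
        candidate = "".join(bases)
        if candidate != template and (require is None or require in candidate):
            found.add(candidate)
    return sorted(found)


def empirical_p(observed, null_counts):
    """One-sided empirical p-value: P(control count >= observed)."""
    at_least = sum(1 for count in null_counts if count >= observed)
    return (1 + at_least) / (1 + len(null_counts))


if __name__ == "__main__":
    # Placeholder template, not the real USS 10mer.
    for motif in shuffled_motifs("AACCGGTTAC", 5):
        print(motif)
```

Each generated motif would then be searched against the genome, and the resulting list of counts passed to empirical_p along with the observed USS count.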
