Friday, September 25, 2009

More simulated uptake

Thanks to Rosie eliminating the int function from her Perl model, I got to take a look at some more simulated uptake data. Last time, there were several issues, which now seem solved. This time to model uptake, she used the real genomic USS position weight matrix to stochastically select 500 bp fragments from the first 50 kb of the Haemophilus genome. I got 200 from the forward strand and 200 from the reverse complement strand, along with a set of random fragments. This is 4X spanning coverage of the 50 kb…

Below is the way the data looked in the UCSC genome browser, added as custom tracks (click on the figures to enlarge).
From the top, the tracks are:
(1) Chromosome position
(2) 400 random fragments (shades of brown).
(3) 400 selected fragments (shades of blue).
(4) Positions of “perfect core” USS motifs (5’AAGTGCGGT-3’) on either strand
(5) RefSeq gene annotations.

Here’s a bit of the 50 kb zoomed in:
That looks pretty good for such low coverage! (In our real experiment, we expect to get several hundred times more data.) It’s starting to look like a real model of how uptake might look! The random fragments look roughly randomly distributed, and the selected fragments clearly show a punctate distribution around the “perfect core” USS sites.

Here’s a histogram of the scores of the best site on a given fragment for the random and selected datasets:
Indeed, the distributions are quite distinct, though notably the distribution of random fragments looks bimodal. This may simply be a feature of the genome, since there are so many USS sites… Worth thinking about though.

There are other details obscured in the browser figures: the shading indicates the relative score of the best site on the fragment (on a log scale), and each fragment also has orientation shown as a small arrowhead within the box. I’ve also associated each fragment with its score. So in the browser, I can easily check things out more carefully:
In this zoom, it’s clear that there is an excellent site to the left (just under 18,000), a weaker site to its right (~18,400; fragments are overlapping with the left-most site; no perfect core), and a pair of sites on different strands with different scores to the right (~19250). I can also retrieve the sequence associated with a given fragment to see if I can spot the USS site within it. And if I really zoom in, the DNA sequence is listed at the top.

It’s not really so easy to see what’s going on with all these overlapping fragments, so my next task will be to convert this data from fragment positions to spanning coverage per chromosomal position (though I’ll probably bin positions, perhaps every 100 bp to keep things reasonably small for now). I will take a stab at doing this properly (with a script) but may wimp out and do it in Excel. If I can then muscle this data into a WIG formatted file, then I’ll be able to plot the data in a way where “good” sites will look like peaks in coverage…

No comments:

Post a Comment