Wednesday, September 23, 2009

Fake Periplasmic Data

The UCSC Microbial Genomes Database was a nice find for me, since it hosts the Haemophilus influenzae KW20 genome. For the moment, it has pretty much made me forget about my own plans to make a custom browser. That will change, though: once we have our own data, we’ll absolutely need some offline way to browse our datasets, since they will be so large…

As a first fake experiment to explore how our periplasmic DNA pools might look, Rosie sent me two sets of 200 sequences. One set was 200 randomly chosen 100mers from the first 10 kb of the Haemophilus genome, and the other set was 200 100mers stochastically selected for the presence of a USS using her Perl scripts. All I had to do was turn her data into a BED-formatted file, which only took a few minutes. As usual, I made the BED file using Microsoft Office rather than a more savvy command-line approach, which would probably have used grep or something.
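
For anyone who would rather script the conversion, here is a minimal sketch in Python. It assumes the fragments arrive as a list of 0-based start positions, which may not match how Rosie's files were actually laid out. The function name `fragments_to_bed` and the chromosome label `chr` are placeholders of mine; the label has to match whatever name the browser uses for the KW20 chromosome.

```python
def fragments_to_bed(starts, chrom="chr", length=100, track_name="RANDOM"):
    """Build the text of a BED custom track from fragment start positions.

    Assumptions (not from the original post): `starts` are 0-based
    coordinates, every fragment is `length` bases long, and `chrom` is a
    placeholder for the chromosome name the browser expects.
    """
    # The track line sets the label and asks for the "squish" display mode.
    lines = [f'track name="{track_name}" visibility=squish']
    for i, start in enumerate(starts, 1):
        # BED columns: chrom, chromStart (0-based), chromEnd (exclusive), name
        lines.append(f"{chrom}\t{start}\t{start + length}\t{track_name}_{i}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two made-up start positions, purely for illustration.
    print(fragments_to_bed([0, 250]), end="")
```

The returned text can be saved to a file and uploaded as a custom track, one call per pool.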

Here’s what her data looks like plotted as a custom track (squished) in the UCSC genome browser:

[Figure: UCSC genome browser custom tracks, labeled RANDOM and SELECTED]

It looks like it sort of worked! There’s a prominent peak containing nearly half the sequences in the selected pool, while the random fragments look just like they ought to.

One issue here is that we know there are two other perfect matches to the core USS motif in the first 10 kb, and these weren’t captured by the selection algorithm. It’s slightly unclear why that is, but it might have something to do with the USS position-weight matrix that was used. (Actually, there are six USS in the interval, but we were only searching one strand this time...)
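
The one-strand versus two-strand point can be illustrated with a short sketch that looks for exact matches to a motif and to its reverse complement. This is only an exact-match search, not the position-weight-matrix scoring that Rosie's Perl scripts use. The function names are mine, and the 9-bp core `AAGTGCGGT` in the example is the commonly cited H. influenzae USS core, assumed here rather than taken from her scripts.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_motif_both_strands(seq, motif):
    """List exact motif matches as (start, end, strand), 0-based, end-exclusive.

    A "-" hit means the reverse complement of the motif occurs at that
    position on the given (forward) sequence.
    """
    seq, motif = seq.upper(), motif.upper()
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        start = seq.find(pattern)
        while start != -1:
            hits.append((start, start + len(pattern), strand))
            # Step forward one base so overlapping matches are not missed.
            start = seq.find(pattern, start + 1)
    return sorted(hits)


if __name__ == "__main__":
    # Toy sequence with one copy of the assumed core on each strand.
    toy = "CCAAGTGCGGTCCACCGCACTTGG"
    for hit in find_motif_both_strands(toy, "AAGTGCGGT"):
        print(hit)
```

Searching only the "+" pattern would report just one of the two sites in the toy sequence, which is the same kind of undercount described above.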

A beginning!
