Tuesday, August 10, 2010

The anticipation is palpable...

We’re resubmitting a grant soon, part of which proposes to measure the specificity of DNA uptake in H. influenzae for the “USS motif”, a ~28 bp sequence motif that is sufficient to drive efficient DNA uptake. One of the things we want to know is exactly how the motif produces efficient uptake by competent cells. In previous work, the lab has shown that point mutations at positions in the motif with very high consensus sometimes affect uptake efficiency, but other times do not.

Just before leaving for vacation, I managed to submit exciting samples to my friend for sequencing. These DNA samples are being sequenced AS WE SPEAK!!!!!!! RIGHT NOW!!!


The two samples were: (1) a complex mix of 200 bp fragments containing a degenerate USS, with a 24% chance of a mismatch at each base of a consensus USS 32-mer; and (2) an enriched fraction of these fragments that competent cells had taken up into the periplasm. My previous post on the pilot experiment describes it in more detail, with additional links to other posts on the degenerate USS.

In brief, I am trying to compare a complex input pool of DNA fragments with what cells actually take up, so that I can build a new uptake motif from the uptake process itself, rather than inferring it from genome sequence analysis.
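To get a feel for what that input pool looks like, here's a quick back-of-envelope sketch. Two assumptions of mine that aren't stated above: the 24% degeneracy is split evenly among the three non-consensus bases, and positions are independent. The punchline is that only about 1 in 6,500 fragments is a perfect consensus USS, and a typical fragment carries ~8 mismatches.

```python
# Back-of-envelope composition of the degenerate USS-24D input pool.
# Assumes the 24% per-position mismatch chance is split evenly among the
# three non-consensus bases and that positions are independent.
from math import comb

N_POSITIONS = 32     # degenerate positions in the construct
P_CONSENSUS = 0.76   # per-position chance of the consensus base

# Chance a fragment is perfect consensus at all 32 positions (~1.5e-4)
print(f"perfect consensus: {P_CONSENSUS ** N_POSITIONS:.2e}")

# Mean mismatches per fragment (~7.7)
print(f"mean mismatches: {N_POSITIONS * (1 - P_CONSENSUS):.2f}")

# Binomial distribution of mismatch counts per fragment
for k in range(13):
    p = comb(N_POSITIONS, k) * (1 - P_CONSENSUS) ** k * P_CONSENSUS ** (N_POSITIONS - k)
    print(f"{k:2d} mismatches: {p:.3f}")
```

With ~2e7 reads, even that 1.5e-4 perfect-consensus fraction still means a few thousand perfect fragments in the input, plus a huge range of everything in between.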

This is the uptake saturation experiment I did with the consensus USS (USS-C) and the degenerate USS (USS-24D):
As I've seen before, USS-24D is not taken up as well as USS-C, since the degenerate pool contains many suboptimal sequences that are taken up inefficiently. At very high DNA concentrations, similar amounts of USS-C and USS-24D are taken up, presumably because at those concentrations there are enough optimal fragments in the degenerate pool to saturate the uptake machinery.
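(If I wanted to put numbers on curves like this one, one option would be to fit a simple hyperbolic saturation model, uptake = Vmax*[DNA]/(K + [DNA]). Below is only a sketch of how that might look; the data points are made-up placeholders, not my real measurements.)

```python
# Sketch: fit a hyperbolic saturation model to an uptake curve.
# Data points are illustrative placeholders, not real measurements.
import numpy as np
from scipy.optimize import curve_fit

def saturation(conc, vmax, k):
    """Uptake approaches vmax as conc >> k (half-maximal at conc = k)."""
    return vmax * conc / (k + conc)

conc = np.array([10.0, 70.0, 200.0, 508.0])   # DNA added, ng/ml (placeholder)
uptake = np.array([0.8, 3.5, 5.2, 6.0])       # DNA taken up, arbitrary units

(vmax, k), _ = curve_fit(saturation, conc, uptake, p0=[uptake.max(), 50.0])
print(f"Vmax = {vmax:.2f}, K = {k:.0f} ng/ml")
```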

I purified the periplasmic DNA from three of these USS-24D uptake samples: SUB (10 ng/ml), MID (70 ng/ml), and SAT (508 ng/ml). Because the yields were quite low, I used PCR to produce more material from these periplasmic DNA preps. It took a little effort to optimize this PCR. Perhaps I'll get back to the weird artifacts I discovered at some later point, but suffice it to say, I made sure I had a clean PCR and ran only a few cycles, so that I wouldn't alter the complexity of the library too much.
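(How much could the sequencing step itself distort the library? One rough way to think about it is the expected duplicate rate when sampling reads from a pool of starting templates. The numbers below are placeholders, not my actual yields, and the calculation optimistically assumes uniform amplification.)

```python
# Expected duplicate fraction when sampling reads from an amplified library.
# Assumes uniform amplification (no PCR bias); template count is a placeholder.
import math

N_TEMPLATES = 1e9  # starting periplasmic molecules (assumed, not measured)
N_READS = 2e7      # roughly the read count expected from the lane

# Chance a given template is read at least once: 1 - (1 - 1/N)^R ~ 1 - exp(-R/N)
expected_unique = N_TEMPLATES * (1 - math.exp(-N_READS / N_TEMPLATES))
print(f"duplicate fraction: {1 - expected_unique / N_READS:.1%}")  # ~1% here
```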

I then took this periplasmic DNA (which should be enriched for optimal sequences) and re-incubated it with fresh competent cells. If the experiment worked (so the DNA samples are worth sequencing), then the periplasmic USS-24D prep should be taken up better than the original input. Indeed this was the case:
Note that this experiment didn't take the saturation curve as far out as the original, due to limits on the number of samples I could process in a single experiment and on the material available before I left for vacation.

Unfortunately, I didn't see an appreciable difference between the three periplasmic preps (SUB, MID, and SAT), as I'd hoped to. I had originally thought that periplasmic DNA recovered from unsaturated uptake experiments would include more suboptimal sequences, while saturated experiments would yield only optimal sequences. I hoped that this contrast would let me investigate competition between different sequences. Oh well.

So I decided to send only two samples for sequencing: the input and MID. These are the ones BEING SEQUENCED RIGHT NOW!!! WOO!!!

It's still possible the different purifications would behave differently at the higher end of the saturation curve (i.e. perhaps SAT would saturate sooner than SUB), so it's probably worth running some more saturation curves to see if it's worth sequencing the other samples. It would be really nice to have an experimental condition that actually changes uptake, to get the most accurate depiction of uptake specificity.

SO, what do I need to do next?

(1) Prepare for the coming data deluge: I think I know how to start the data analysis once I've processed the raw read data into a comprehensible table of ~2e7 rows x 32 columns (though probably in a computationally slow fashion... meh). What I'm less prepared for is the initial processing of the data. I designed the experiment to sequence 4 non-degenerate bases upstream and 6 non-degenerate bases downstream of the USS, so I will probably apply a simple, crude quality filter that demands the first 4 and last 6 bases be exactly correct (sketched below). This will tend to eliminate poor reads and guarantee that the 32 degenerate bases in the middle are already aligned. It will also exclude reads where errors in oligo synthesis or fragment construction introduced indels into the USS, which will simplify things initially. At this point, I will also need to apply a base quality filter so that I ignore base calls with low confidence. Sequencing error is a problem for this analysis, so it's important I use stringent filters. Even if I only ended up with 10% of the raw reads, I'd have an enormous amount of data for sequence motif analysis.
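Here's a minimal sketch of that filter. The flank sequences are placeholders (I'd substitute the real ones from the construct design), and the quality offset depends on which pipeline version produced the FASTQ (33 for Sanger-style encoding, 64 for the older Illumina encoding).

```python
# Crude first-pass filter: keep a read only if the 4 fixed upstream bases and
# 6 fixed downstream bases match exactly, and every base of the 32-mer in
# between has a Phred quality >= MIN_QUAL.
UPSTREAM = "ACGT"      # hypothetical fixed 5' flank (substitute the real one)
DOWNSTREAM = "ACGTAC"  # hypothetical fixed 3' flank (substitute the real one)
MIN_QUAL = 20          # Phred cutoff; stringency to be tuned
QUAL_OFFSET = 33       # 33 for Sanger-style FASTQ, 64 for older Illumina

def parse_fastq(path):
    """Yield (sequence, quality string) pairs from a FASTQ file."""
    with open(path) as fh:
        while True:
            header = fh.readline()
            if not header:
                break
            seq = fh.readline().strip()
            fh.readline()  # '+' separator line
            yield seq, fh.readline().strip()

def passes_filter(seq, qual):
    start, end = len(UPSTREAM), len(UPSTREAM) + 32
    if not seq.startswith(UPSTREAM):
        return False
    if seq[end:end + len(DOWNSTREAM)] != DOWNSTREAM:
        return False
    return all(ord(q) - QUAL_OFFSET >= MIN_QUAL for q in qual[start:end])

kept = [seq[len(UPSTREAM):len(UPSTREAM) + 32]
        for seq, qual in parse_fastq("reads.fastq")
        if passes_filter(seq, qual)]
print(f"{len(kept)} reads kept; each is one row of the ~2e7 x 32 table")
```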

(2) Prepare for abject failure. It's possible I screwed up the design of the constructs, or that there's some unanticipated challenge in sequencing through those 32 bases or in using this approach at all. I'll know in a few days, but I need to think about what the next step would be if I really did misplace some bases in my reverse-engineering scheme.

(3) Prepare for a seeming failure that isn't really one. I may have screwed nothing up, but still get back data from Illumina's pipeline saying that something's terribly wrong. I don't fully understand the details, but the base-calling step in Illumina's pipeline (which reads the raw image files from each sequencing cycle) may go screwball because of the extremely skewed base composition my constructs will have at each cycle. E.g., the first base for every single cluster should be "A", whereas 76% of clusters should have "A" at the fifth base (see the sketch below). Apparently this can create some weird artifacts in the data-processing step, which I need to be prepared for (and not despair when I'm told "it didn't work"). Mostly this will involve working with my friend and his sequencing facility to re-run the base-calling with an alternate pipeline.
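Just to make the skew concrete, here's a toy calculation of the expected base composition at each cycle. The flank and consensus strings are placeholders; I've only made them consistent with the two facts above (every cluster starts with "A", and 76% of clusters have "A" at cycle five), and I'm again assuming the 24% degeneracy splits evenly among the three non-consensus bases.

```python
# Expected base composition at each sequencing cycle: 4 fixed bases, then 32
# degenerate bases at 76% consensus / 8% each other base, then 6 fixed bases.
# All sequences below are placeholders, not the real construct.
UPSTREAM = "ACGT"       # hypothetical fixed 5' flank (starts with A, as stated)
CONSENSUS = "ACGT" * 8  # dummy 32-mer standing in for the real USS consensus
DOWNSTREAM = "ACGTAC"   # hypothetical fixed 3' flank

def cycle_composition():
    """Expected base frequencies at each cycle, as a list of dicts."""
    comp = [{b: 1.0} for b in UPSTREAM]        # fixed base: one channel lit
    for base in CONSENSUS:
        freqs = {b: 0.08 for b in "ACGT"}      # 24% spread over 3 bases
        freqs[base] = 0.76                     # consensus base dominates
        comp.append(freqs)
    comp += [{b: 1.0} for b in DOWNSTREAM]
    return comp

for i, freqs in enumerate(cycle_composition(), start=1):
    print(f"cycle {i:2d}  " + "  ".join(f"{b}:{freqs.get(b, 0):.2f}" for b in "ACGT"))
```

As I understand it, the base caller estimates its calibration parameters assuming roughly balanced composition in each cycle, so cycles where one channel dominates (or is entirely dark) are exactly where it can go wrong.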

(4) Work on updating the grant for resubmission. Several paragraphs of our grant application need to be modified to show our progress. To some extent this will depend on the soon-to-arrive data, but I can begin by identifying the parts that will need changes and including some different figures.

(5) Work out the molecular biology. I've done a bunch of uptake experiments in a bunch of settings, and I should now have enough to put together some kind of little story, even without the sequence data. With a mind towards a paper, I need to work out just what remains to be done. Do I need another sequencing experiment with a different construct (more or less degenerate) or under different conditions? Or is a single periplasmic enrichment enough? If the latter, I will certainly want to show a bunch of other experimental data, but which experiments? And which need to be repeated? My first step in this direction is to go through my notebook and figure out what I have...

(6) Work out a large-scale periplasmic prep. I circumvented this for the degenerate USS experiment by doing PCR. Since I know my exact construct, I could use PCR to amplify it, so I needed neither a good yield nor a particularly pure prep (genomic DNA won't amplify with the construct's primers). However, if I want to look at uptake across a chromosome, I will need to fractionate periplasmic DNA away from chromosomal DNA both to a high level of purity and with a high yield. I've accomplished each of these individually, but so far haven't managed a large-scale periplasmic prep that leaves me with enough pure DNA to reliably make sequencing libraries. I refuse to use whole-genome amplification for this experiment.
