Saturday, August 21, 2010

It's more delicious because it's more nutritious

I think I got somewhere with my degenerate sequencing data... possibly even discovered something unexpected. The first thing I had to do was more or less start over my analysis, and temporarily forget everything I’ve read about “information”, “bits”, “logos”, etc. and just go back to thinking about the experiment from the get-go...

When it comes to DNA, H. influenzae are picky eaters. They seem to find some DNA fragments tastier than others. When I came to the lab, Rosie provided me with a DNA sequence: AAAGTGCGGTTAATTTTTACAGTATTTTTGG. She told me that DNA fragments containing this sequence (the “consensus USS”) are especially delicious to cells, and she wanted to know why.

Well, before I can aspire to helping with that, it'd be helpful to know which parts (or combinations of parts) of this special sequence actually make it delicious. Are all positions equally important? Do different positions make independent contributions to deliciousness, or do they work together to make an especially tasty morsel? Perhaps such insight into H.flu’s palate will get me closer to the bigger project of understanding why H.flu is so picky.

I did the following experiment last month to this end:

(1) I provided a massive smorgasbord of DNA to a whole bunch of cells... hundreds of culinary choices per cell to a billion cells. This tremendous buffet also had ridiculous variety. Every DNA snack was based on Rosie’s extra-tasty sequence, but there were differences in every bite (due to incorporating 24% degeneracy at each base during DNA synthesis). Indeed, there were hundreds of billions of different taste sensations in the input DNA pool.

(2) I harvested the undigested meals from the cells (facilitated by a mutation that kept the food from getting digested). The cells probably had terrible heartburn at that point, but each on average managed to stuff down dozens of DNA molecules.

(3) I sent two DNA samples to my friend to get sequenced: a bit of the original DNA buffet, and the DNA that I purified from the cells’ guts. By comparing the relative abundance of different DNA molecules, we can learn which positions are important to making the sequence tasty and, hopefully, how the different positions work together to impart deliciousness.

(4) Now I have two giant lists of >10 million sequences each from my friend. So far, I’m working with the first 100,000 of each dataset so I can figure out how to process the data. For the below analysis, this is more than sufficient to make accurate frequency measures.

First, I asked how the uptake sample (the H.flu guts) compares to the input sample, treating each position in the sequence independently. To keep things extra simple, I only considered whether a given position on a particular sequence read is matched or mismatched from the original sequence.

So, for each position, I counted how many of the 100,000 reads in each set were mismatched. To calculate the relative enrichment of the mismatched sequence in the uptake sample compared to the input sample, I simply took the ratio of uptake:input, then took the log of that ratio. This normalization makes it so that positive numbers indicate enrichment between uptake and input, while negative numbers indicate depletion. Here’s what I get:
Lovely. It looks like some parts of the original sequence are more important than other parts. For example, sequences that didn’t have a C at position 7 were depleted more than 64-fold in the uptake sample compared with the input sample. This must be an especially important flavor. On the other hand, some of the other positions look like they contribute less to tastiness, and a few positions don’t seem to contribute at all.

Great! We could now go and compare this to some other information we have, but for now, I’m going to ignore all of that and move on to a pairwise mismatch analysis. This will tell us how pairwise combinations of mismatches affect tastiness.

First, here’s that first barplot again, with the bars re-colored by the amount of depletion. Mismatches at red positions make the DNA taste bad, while mismatches at blue positions don’t.
That’s the legend for this next image, which shows the log2(uptake/input) for each pair of positions. To make the dataset, I counted how many of the 100,000 reads in each set had mismatches at a particular pair of positions. As with the above, I did not condition on the total number of mismatches in the read, just that at least those two were. On the diagonal is shown the consensus sequence and values corresponding to the above barplot. The plot is symmetric about this diagonal.
The big cross of red is like what we saw in the 1D barplot above... Those same important positions are still important. The cross tells us that, for example, not having a C at position 7 means you taste terrible, independent of where else you have a mismatch.

BUT, Hot damn! I think there’s a discovery here beyond that delectable flavor packet of GCG at positions 6-8 (and possibly a few flanking bases).

Note the 3x3 grid of regularly-spaced "mismatch interactions. Note the spacing between these 3 interacting parts of the sequence are just about a helical turn of DNA apart (10-12 base pairs).

I think this picture has some implications for our molecular models...

brain whirring...

(Notes: After getting the number-crunching steps nice and tidy, I spent an inordinate amount of time making these plots. I started using heatmap(), graduated to heatmap.2(), but ultimately took it to the next level and went fully customized with the image() function. It also took a ridiculously long time to figure out how to have fine-control over the color-scaling. This was necessary to be able to make the legend. I also made a more normal-looking legend (below), which was quite tricky to do right... If I wasn't so distracted by helical turns at the moment, I might explain how I made these in more detail.)

Tuesday, August 17, 2010

Direct uptake specificity measurements

So I got a bit distracted from our plans with the Genome BC dough, since I got some more sequencing data! Woo! I'll return to the design of our 92 clones worth of sequencing later. For now, I am trying to work through some more grant writing, incorporating our findings from the new sequence data.

Below, I'll (try to) quickly summarize what I've found with my (very) preliminary data analysis and then try to describe the thing I'm having a hard time writing clearly about...

First off: It worked! My scheming panned out quite nicely. I got ~15 million single-end sequence reads from each of two DNA pools. One was from the input 24% degenerate USS-containing fragment. The other was from the periplasm-purified DNA fragments. The experiment and its outcome were described here and here.

Filtering the data:

The first thing I had to do was somehow parse the raw data file. For my first pass, I used a quick and dirty, but probably quite stringent, quality filter:
  1. I excluded any read where the four first and six last bases were not exactly what they should be. (These sequenced positions were supposed to be non-degenerate, so should have zero mismatches from the expected input sequence.) This excludes reads with known "bad" bases and forces the alignment to be correct. The unfiltered data contained a bunch of reads that clearly had insertions or deletions within the USS, probably due to the limitations of the original oligo synthesis.
  2. I excluded all reads that contained any Ns. This makes the analysis a bit simpler, especially when calculating the number of mismatches in the degenerate USS from consensus.
This left me with ~10 million reads in each dataset. That's a lot of loss, and I certainly will try and return to those sequence reads, but for now, leaving them out is a decent filter that will simplify the analysis (mainly by forcing a "true" alignment).


I piped a few greps together to pull all the sequence reads matching my criteria (the first four bases ATGC, the last six bases GGTCGA, and no Ns). From a UNIX terminal:
grep ^ATGC sequence.txt | grep GGTCGA$ | grep -v N > newfile.txt
That's it! (Someday, I should probably do something with all the quality scores, but for now, I'm ignoring them. This operation also lost all the read IDs.)

I soon found that when analyzing these new files that my poor computer-foo is making very slow-running scripts, so I took the first 100,000 reads from each filtered dataset onwards with a simple:
head -n 100000 newfile.txt
Making a Position Weight matrix:

I next calculated the frequency of each base at each position for the two datasets (generating two position weight matrices, or PWMs). Luckily, I have previously worked out how to use the input PWM to background-correct the periplasm-purified PWM here and here.

Details: On to some R...

I read in the data with scan. I've usually used thing like read.table or read.csv, but scan looks like the way to go with a big giant vector:
sequences <- scan("newfile.txt", what="character" )
Next, I calculated the per position base frequencies:
basesread <- 42
newmatrix <- matrix(, basesread, 4)

for (i in 1:basesread) { newmatrix[i, ] <-
table(substr(sequences, i, i))

Hmmm... Would've been nice to avoid that loop...

From the two matrices, I got a background-corrected periplasmic uptake specificity motif that looks like this:
This excludes the fixed bases and one other anomalous base at the beginning of the read.
(Sorry I can't use Weblogo or other things that make nicer logos, as none accomodate position-specific base composition.)

Hey, look at that C at position 7! It just sticks way out! That must be a particularly important residue for uptake. A previous postdoc and graduate student both had measured uptake efficiency for fragments with point mutations in a consensus USS. This data offers a great independent validation of this basic result. Here's what that data looks like:
It looks like the two approaches are giving comparable results. Fabulous!

The next step:

So what else can we learn from this much larger deep sequencing dataset? I'm sure all sorts of things. For now, the only other thing I've done is calculate the number of mismatches between each of the 100,000 filtered sequence reads and the consensus USS the degenerate construct was based on.

Here's what the distribution of mismatch number looks like for the two datasets:

Awesome! Shifted, just as previously simulated!


At first I tried a bunch of extremely laborious and slow ways of doing this. Then I found a package called "cba" with a function called "sdists", which can do edit distances with vectors of character strings. (The sdists step is how I murdered the computers when I tried using the full dataset.)

consensus <- "ATGCCAAAGTGCGGTTAATTTTTACAGTATTTTTGGGTTCGA" distances <- as.vector(sdists(consensus, sequences, weight=c(1,0,1) ) )

I guess I could just do it in chunks for the whole dataset to save on RAM...

Anyways, this vector of distances is going to serve as an important index to the sequence reads for the next set of analyses (which will look at co-variation between positions and how specific mismatches affect the motif).


Okay, now to try and describe the thing I'm having trouble saying in a succinct way. These information content measurements, as in the above motif, are in bits. This is "information content", but can also be thought of as "surprise". This is why the USS motif derived from the genome has so much higher information content than the new experimentally determined uptake motif. (Here's the genomic USS:)
Because most of the bases at a given position in the degenerate USS pool were already the preferred consensus base, little "surprise" was elicited when the preferred base was enriched in the periplasm-purified pool. As measure in bits. That's why the scale is so different between the two motifs.

Anyways, the issue arises when thinking about some other possible experiments. For example, if uptake specificity were partially modified (as we think we can accomplish using engineered strains), the positions with a new specificity would become extremely informative relative to the other when using the normal degenerate fragments. But if a new construct were designed based on the altered specificity, these positions would no longer have such elevated information content relative to the other bases...

This isn't actually a bad thing--in fact we can exploit it--it's just something hard to describe clearly...

Wednesday, August 11, 2010

How to spend some dough?

We have some money to spend (kindly provided by Genome BC) for identifying recombination tracts in individual transformants, where donor genomic DNA from the clinical isolate NP is used to transform Rd competent cells (NP-into-Rd transformations). This will help us answer some very basic questions about natural transformation: What are the numbers and sizes of recombination tracts in individual transformants? Are different parts of the genome equally transformable? Do mismatch repair and other degradative mechanisms limit transformation?

We already have a few clones worth of data (which is now mostly analyzed), but we need more clone sequences to do any worthwhile statistics. Actually, I've made nicer pictures and done some re-analysis for the poster I took to the Evolution meeting. I should post some of that sometime...

We’d originally had a cost-saving plan of sequencing overlapping pools of clones to obtain data from 64 clones. We even came up with a slicker plan (since sequence yields keep increasing), in which we'd get 128 clones worth of data, and 64 of these would be "de-convoluted":
But now we’ve discovered that our local genome center provides inexpensive indexing of DNA samples (i.e. barcoding). This means we can obtain 92 clones worth of data for only ~$15K. Wow!
So now, I need to collect the clones for sequencing and extract their DNA. The question is what clones? I will be sequencing 4 sets of 23 clones (plus a control), so might think of the data collection in that manner. (This number of indexed clones per pool will be sufficient to obtain ~50-100 fold median coverage per clone, given current estimated yields.) Three are two things to consider: (a) technical concerns, and (b) using some clones to look at different kinds of transformations.

This post will cover the basic technical concerns, and I’ll use the next post to discuss some alternative things we might do with some of this sequencing capacity.

To deal with potential criticism later, I should collect the NP-into-Rd transformants from three separate transformation experiments using three separate competent cell preps. This will allow me to make statements about how reproducible the distribution of recombination tracts between transformations is. Because each replicate will only have a couple dozen clones, this will likely not be sufficient for detecting subtle differences between the independent replicates, but will be sufficient in determining overall consistency of different transformations

Selection for transformants:
Since only a fraction of cells in a competent culture appear to be competent (as measured by co-transformation of unlinked marker loci), I will again select for either the NovR or NalR alleles of gyrB and gyA (respectively) that I’ve already added to the donor NP strain. This selection step ensures that every sequenced clone has at least one recombination tract, thereby selecting against clones derived from non-competent cells.

With indexing, it shouldn’t matter how I organize the clones submitted for sequencing, but this data can also be analyzed without paying attention to the index (for which I have my reasons), so it would make sense to organize a plate like this:

We’d originally decided 64 additional clones (atop the 4 we’ve got) would be sufficient for a basic picture. Now we can do 92 for much less money. Should we then do 92 NP->Rd selected transformants instead of 64, or should we use some of this new sequencing space to investigate new things?

Doing more clones would give us greater statistical power, but the more we do the more get will get diminishing returns. Since we estimate that ~60-70 clones will be sufficient for a nice first look, maybe we don’t gain much by doing 92 of the “same” thing and would be served better by using some of the extra space for other endeavours.

For example, if we reserved one set (23 clones) for one or more other DNA samples, we’d still get 69 clones worth of recombination tracts. What could we do with this extra space (keeping in mind that these should be transformations that can be done immediately)?

The most important factor to consider with each of these is whether the sample size (23 clones) would be sufficient to obtain useful data, despite having given up some statistical power for the primary set of 69 clones. (In my mind this means what we get should publishable as is alone or with the primary dataset, but could also mean several small pilot experiments for “preliminary data”.)

There are a whole lot of things we might do with this space… (stay tuned)

Tuesday, August 10, 2010

The anticipation is palpable...

We’re resubmitting a grant soon, in part of which we propose to measure the specificity of DNA uptake in H. influenzae for the “USS motif”, a ~28 bp sequence motif that is sufficient to mobilize efficient DNA uptake. One of the things we want to know is just how the motif lends itself to efficient uptake by competent cells. In previous work, the lab has shown that point mutations at positions in the motif with very high consensus sometimes affect uptake efficiency, but other times do not.

Just before leaving for vacation, I managed to submit exciting samples to my friend for sequencing. These DNA samples are being sequenced AS WE SPEAK!!!!!!! RIGHT NOW!!!

The two samples were: (1) a complex mix of 200 bp fragments containing a degenerate USS with a 24% chance of a mismatch at each base of a consensus USS 32mer; and (2) an enriched fraction of these fragments taken up by competent cells into the periplasm. My previous post on my pilot experiment describes the experiment in more detail with additional links to other posts on the degenerate USS.

In brief, I am trying to compare a complex input pool of DNA fragments with what is actually taken up by cells to make a new uptake motif, based on the uptake process itself, rather than inferring it from genome sequence analysis.

This is the uptake saturation experiment I did with the consensus USS (USS-C) and the degenerate USS (USS-24D):
As I've seen before, USS-24D is not taken up as well as USS-C, since there are many suboptimal sequences that are inefficiently taken up. At very high DNA concentrations, similar amounts of USS-C and USS-24D are taken up. This is presumably because there are enough optimal fragments to saturate the uptake machinery at very high USS-24D concentrations.

I purified the periplasmic DNA from three of these USS-24D uptake samples: SUB (10 ng/ml), MID (70 ng/ml), and SAT (508 ng/ml). Because the yields were quite low, I used PCR to produce more material from these periplasmic DNA preps. It took a little effort to optimize this PCR. Perhaps I will get back to the weird artifacts I discovered at some later point, but suffice it to say, I ensured that I had a clean PCR and did few rounds of PCR, so that I wouldn't alter the complexity of the library too much.

I then took this periplasmic DNA (which should be enriched for optimal sequences) and re-incubated it with fresh competent cells. If the experiment worked (so the DNA samples are worth sequencing), then the periplasmic USS-24D prep should be taken up better than the original input. Indeed this was the case:
Note, this experiment did not take the saturation curve as far out as the original, due to limits on the number of samples I could process in a single experiment and limits on available material before leaving for vacation.

Unfortunately, I didn't see an appreciable difference between the three different periplasmic preps (SUB, MID, and SAT), which I'd hoped to. I had originally thought that periplasmic DNA recovered for unsaturated uptake experiments would include more suboptimal sequences, while saturated experiments would yield only optimal sequences. I hoped that this contrast would allow me to investigate competition between different sequences. Oh well.

So I decided to send only two samples for sequencing: the input and MID. These are the ones BEING SEQUENCED RIGHT NOW!!! WOO!!!

It's still possible the different purifications would behave differently at the higher end of the saturation curve (i.e. perhaps SAT would saturate sooner than SUB). So it's probably worth running some more saturation curves to see if its worth sequencing the other samples. It'd be really nice to have an actual experimental condition changing to get the most accurate depiction of uptake specificity.

SO, what do I need to do next?

(1) Prepare for the coming data deluge: I think I know how to start the data analysis once I've processed the raw read data into a comprehensible ~2e7 rows X 32 columns (though probably in a computationally slow fashion... meh). What I'm less prepared for is the initial processing of the data. I designed the experiment to sequence 4 non-degenerate bases upstream and 6 non-degenerate bases downstream of the USS, so I will probably do a simple crude quality filter, where I demand the first 4 and last 6 bases are exactly correct. This will tend to eliminate poor reads and force the alignment of the 32 degenerate bases in the middle to already be correct. This filter will exclude erroneous oligo synthesis or fragment construction that introduced indels into the USS, which will simplify things initially. At this point, I will also need to apply a base quality filter to ensure I ignore base calls that have low confidence. Sequencing error is a problem for this analysis, so its important I use stringent filters. Even if I only ended up with 10% of the raw reads, I'd have an enormous amount of data for sequence motif analysis.

(2) Prepare for abject failure. It's possible I screwed up the design of the constructs or that there is some unanticipated challenge sequencing through those 32 bases or using this approach. I'll know in a few days, but need to think about what the next step would be, if indeed I misplaced some bases in my reverse engineering scheme.

(3) Prepare for a seeming failure that isn't really. I may have screwed nothing up but get back data from Illumina's pipeline that says something's terribly wrong. I don't fully understand the details of this, but the base-calling step in Illumina's pipeline (which is reading the raw image files from each sequencing step) may be screwball, because of the extremely skewed base composition my constructs will have during each cycle. E.g. the first base for every single cluster should be "A", whereas 76% of clusters should have "A" at the fifth base. Apparently, this can create some weird artifacts in the data processing step, which I need to be prepared for (and not despair when I find out "it didn't work". Mostly this will involve working with my friend and his sequencing facility to re-run the base-calling with an alternate pipeline.

(4) Work on updating the grant for re-submission. There are several paragraphs of our grant application that need to be modified to show our progress. To some extent, this will involve the soon-to-come data, but I can begin by identifying the parts that will need changes an including some different figures.

(5) Work out the molecular biology. I've done a bunch of uptake experiments in a bunch of settings, and I should now have enough to put together some kind of little story, even without the sequence data. With a mind towards a paper, I need to work out just what I need to do. Do I need to do another sequencing experiment with a different construct (more or less degenerate) or under different conditions? Or is a single periplasmic enrichment enough? If the latter, i will certainly want to show a bunch of other experimental data, but which experiments? And which need to be repeated. My first step in this direction is to go through my notebook and figure out what I have...

(6) Work out a large-scale periplasmic prep. I circumvented this for the degenerate USS experiment by doing PCR. SInce I know my exact construct, I could use PCR to amplify it and didn't need to have a good yield, nor particularly pure of a prep (since genomic DNA won't amplify using the construct's primers). However, if I want to look at uptake across a chromosome, I will need to fractionate periplasmic DNA away from chromosomal DNA both to a high level of purity and with a high yield. I've accomplished each of these individually, but so far have not managed a large-scale periplasmic prep that leaves me with enough pure DNA to make sequencing libraries reliably. I refuse to use whole-genome amplification for this experiment.

Hi blog!

So I’ve returned to work after a 3.5 week vacation to Bali! Woo! Time to catch up on some long non-blogging. But before that...

Here’s where my crazy Uncle Jimmy took me and the missus for a few days… Gili Gede (meaning “Little Island Big”) off the coast of Lombok (which appears to be experiencing a gold-rush, so go now before its all built up!).
The ferry between Bali and Lombok took us across the legendary "Wallace Line", though my total ignorance of systematics means I couldn't really tell the difference in flora and fauna between the islands. (Though we did see a HUGE monitor lizard near a brick-making operation.)

I stayed out at the end of this pier on the lower left at the “Secret Island Resort”. (Note the vessel that took us snorkeling to the immediate South... the "Scorpio".
And here's an actual shot of the infamous "Rocky Docky" with the lovely Heather in the foreground:
The reef was right over the edge of the pier, so snorkeling to see all the corals and fishies was optional. You could just look over the edge! Excellent!

Anyways, I've been getting my head back into things and am writing a few blog posts about what’s going on in my world-of-science. So expect a deluge of catch-up posts during the next day or two…