Saturday, August 21, 2010

It's more delicious because it's more nutritious

I think I got somewhere with my degenerate sequencing data... possibly even discovered something unexpected. The first thing I had to do was more or less start over my analysis, and temporarily forget everything I’ve read about “information”, “bits”, “logos”, etc. and just go back to thinking about the experiment from the get-go...

When it comes to DNA, H. influenzae are picky eaters. They seem to find some DNA fragments tastier than others. When I came to the lab, Rosie provided me with a DNA sequence: AAAGTGCGGTTAATTTTTACAGTATTTTTGG. She told me that DNA fragments containing this sequence (the “consensus USS”) are especially delicious to cells, and she wanted to know why.

Well, before I can aspire to helping with that, it'd be helpful to know which parts (or combinations of parts) of this special sequence actually make it delicious. Are all positions equally important? Do different positions make independent contributions to deliciousness, or do they work together to make an especially tasty morsel? Perhaps such insight into H.flu’s palate will get me closer to the bigger project of understanding why H.flu is so picky.

I did the following experiment last month to this end:

(1) I provided a massive smorgasbord of DNA to a whole bunch of cells... hundreds of culinary choices per cell to a billion cells. This tremendous buffet also had ridiculous variety. Every DNA snack was based on Rosie’s extra-tasty sequence, but there were differences in every bite (due to incorporating 24% degeneracy at each base during DNA synthesis). Indeed, there were hundreds of billions of different taste sensations in the input DNA pool.

(2) I harvested the undigested meals from the cells (facilitated by a mutation that kept the food from getting digested). The cells probably had terrible heartburn at that point, but each on average managed to stuff down dozens of DNA molecules.

(3) I sent two DNA samples to my friend to get sequenced: a bit of the original DNA buffet, and the DNA that I purified from the cells’ guts. By comparing the relative abundance of different DNA molecules, we can learn which positions are important to making the sequence tasty and, hopefully, how the different positions work together to impart deliciousness.

(4) Now I have two giant lists of >10 million sequences each from my friend. So far, I’m working with the first 100,000 of each dataset so I can figure out how to process the data. For the below analysis, this is more than sufficient to make accurate frequency measures.

First, I asked how the uptake sample (the H.flu guts) compares to the input sample, treating each position in the sequence independently. To keep things extra simple, I only considered whether a given position on a particular sequence read is matched or mismatched from the original sequence.

So, for each position, I counted how many of the 100,000 reads in each set were mismatched. To calculate the relative enrichment of the mismatched sequence in the uptake sample compared to the input sample, I simply took the ratio of uptake:input, then took the log of that ratio. This normalization makes it so that positive numbers indicate enrichment between uptake and input, while negative numbers indicate depletion. Here’s what I get:
Lovely. It looks like some parts of the original sequence are more important than other parts. For example, sequences that didn’t have a C at position 7 were depleted more than 64-fold in the uptake sample compared with the input sample. This must be an especially important flavor. On the other hand, some of the other positions look like they contribute less to tastiness, and a few positions don’t seem to contribute at all.

Great! We could now go and compare this to some other information we have, but for now, I’m going to ignore all of that and move on to a pairwise mismatch analysis. This will tell us how pairwise combinations of mismatches affect tastiness.

First, here’s that first barplot again, with the bars re-colored by the amount of depletion. Mismatches at red positions make the DNA taste bad, while mismatches at blue positions don’t.
That’s the legend for this next image, which shows the log2(uptake/input) for each pair of positions. To make the dataset, I counted how many of the 100,000 reads in each set had mismatches at a particular pair of positions. As with the above, I did not condition on the total number of mismatches in the read, just that at least those two were. On the diagonal is shown the consensus sequence and values corresponding to the above barplot. The plot is symmetric about this diagonal.
The big cross of red is like what we saw in the 1D barplot above... Those same important positions are still important. The cross tells us that, for example, not having a C at position 7 means you taste terrible, independent of where else you have a mismatch.

BUT, Hot damn! I think there’s a discovery here beyond that delectable flavor packet of GCG at positions 6-8 (and possibly a few flanking bases).

Note the 3x3 grid of regularly-spaced "mismatch interactions. Note the spacing between these 3 interacting parts of the sequence are just about a helical turn of DNA apart (10-12 base pairs).

I think this picture has some implications for our molecular models...

brain whirring...

(Notes: After getting the number-crunching steps nice and tidy, I spent an inordinate amount of time making these plots. I started using heatmap(), graduated to heatmap.2(), but ultimately took it to the next level and went fully customized with the image() function. It also took a ridiculously long time to figure out how to have fine-control over the color-scaling. This was necessary to be able to make the legend. I also made a more normal-looking legend (below), which was quite tricky to do right... If I wasn't so distracted by helical turns at the moment, I might explain how I made these in more detail.)


  1. Palette? Palate?

    Is this really a matrix of interactions, or just of the products of importances? I think those are quite different things. If you made a matrix of the products of the numbers of the one-dimensional graph, would you get the two-dimensional graph?

    Are the inportances in the 2D plot normalized to the importance of each base considered singly in the 1D plot? If yes, then the colours in the 2D plot tell us that there is an interaction effect between these positions. But if they're not, then this is just the product of the importances.

    To get the full interaction effects, I think you need to consider more information. Specifically, when considering the cases where two specific bases don't match the consensus, I think you need to consider WHICH non-consensus bases are at those positions, and whether there are any correlations between them. For example, in the 2D graph, cases where positions 4 and 26 are mismatched might occur a bit more often than one might expect (a bit more often than the product of the individual expectations). To say that this is because of an interaction effect, you'd want to get the frequency of each base combination, and then see if any non-consensus combinations are overrepresented.

  2. So, I think the covariation analysis might best be done this way:

    Take the pairwise variation data (the 2D graph) and normalize it by the expected pairwise variation (the product of the two single-position variation values).

    Plot that. Pairs with scores >1 and scores <1 are candidates for interaction effects.

    For the candidate pairs, compare the frequencies of each combination to the expected frequencies given the single-position values.

    Plot a 4x4 array showing these. For positions where the overall normalized analysis showed interaction score <1, we expect to see that the consensus pair is underrepresented and some combination of bases other than the consensus is overrepresented. For positions where the score was >1, we expect to see that the consensus pair is overrepresented and all non-consensus pairs are underrepresented.

    Or maybe vice versa, if I've gotten this confused.

  3. Right. Palate.

    It is appropriate to think of the 2D plot above as an "importance" plot, rather than an interaction plot. But it's not simply made by multiplying single frequencies together... That would be the "expected matrix" for the analysis you suggest above (I think).

    I started with this view, because it can be directly related to the simpler barplot above it, so had some intuitive appeal for me.

    The reason this is not truly an "interaction plot" is because I am only reporting the observed enrichment/depletion of pairwise mismatches. As I alluded to above, if I normalized these values to an expected matrix as you describe above to produce "interaction plot", the cross should mostly disappear... Having a mismatch at position 7 does not help you predict the chance of a mismatch anywhere else.

    I suspect, based on staring at this thing, that the AT-tracts and position 4 will show an interaction.

    And then of course I have to examine what the mismatches actually are, but baby steps over here... baby steps. When I try to jump straight in, I get confused...

  4. Right. If you normalize to an 'expected' matrix then the cross should mostly disappear. What remains will be the evidence for true interactions, which then need to be investigated individually.