
I think I got somewhere with my degenerate sequencing data... possibly even discovered something unexpected. The first thing I had to do was more or less start over my analysis, and temporarily forget everything I’ve read about “information”, “bits”, “logos”, etc. and just go back to thinking about the experiment from the get-go...
When it comes to DNA, H. influenzae are picky eaters. They seem to find some DNA fragments tastier than others. When I came to the lab, Rosie provided me with a DNA sequence: AAAGTGCGGTTAATTTTTACAGTATTTTTGG. She told me that DNA fragments containing this sequence (the “consensus USS”) are especially delicious to cells, and she wanted to know why.
Well, before I can aspire to helping with that, it'd be helpful to know which parts (or combinations of parts) of this special sequence actually make it delicious. Are all positions equally important? Do different positions make independent contributions to deliciousness, or do they work together to make an especially tasty morsel? Perhaps such insight into H.flu’s palate will get me closer to the bigger project of understanding why H.flu is so picky.
I did the following experiment last month to this end:
(1) I provided a massive smorgasbord of DNA to a whole bunch of cells... hundreds of culinary choices per cell to a billion cells. This tremendous buffet also had ridiculous variety. Every DNA snack was based on Rosie’s extra-tasty sequence, but there were differences in every bite (due to incorporating 24% degeneracy at each base during DNA synthesis). Indeed, there were hundreds of billions of different taste sensations in the input DNA pool.
(2) I harvested the undigested meals from the cells (facilitated by a mutation that kept the food from getting digested). The cells probably had terrible heartburn at that point, but each on average managed to stuff down dozens of DNA molecules.
(3) I sent two DNA samples to my friend to get sequenced: a bit of the original DNA buffet, and the DNA that I purified from the cells’ guts. By comparing the relative abundance of different DNA molecules, we can learn which positions are important to making the sequence tasty and, hopefully, how the different positions work together to impart deliciousness.
(4) Now I have two giant lists of >10 million sequences each from my friend. So far, I’m working with the first 100,000 of each dataset so I can figure out how to process the data. For the below analysis, this is more than sufficient to make accurate frequency measures.
First, I asked how the uptake sample (the H.flu guts) compares to the input sample, treating each position in the sequence independently. To keep things extra simple, I only considered whether a given position on a particular sequence read is matched or mismatched from the original sequence.
So, for each position, I counted how many of the 100,000 reads in each set were mismatched. To calculate the relative enrichment of the mismatched sequence in the uptake sample compared to the input sample, I simply took the ratio of uptake:input, then took the log of that ratio. This normalization makes it so that positive numbers indicate enrichment between uptake and input, while negative numbers indicate depletion. Here’s what I get:

Great! We could now go and compare this to some other information we have, but for now, I’m going to ignore all of that and move on to a pairwise mismatch analysis. This will tell us how pairwise combinations of mismatches affect tastiness.
First, here’s that first barplot again, with the bars re-colored by the amount of depletion. Mismatches at red positions make the DNA taste bad, while mismatches at blue positions don’t.


BUT, Hot damn! I think there’s a discovery here beyond that delectable flavor packet of GCG at positions 6-8 (and possibly a few flanking bases).
Note the 3x3 grid of regularly-spaced "mismatch interactions. Note the spacing between these 3 interacting parts of the sequence are just about a helical turn of DNA apart (10-12 base pairs).
I think this picture has some implications for our molecular models...
brain whirring...
(Notes: After getting the number-crunching steps nice and tidy, I spent an inordinate amount of

Palette? Palate?
ReplyDeleteIs this really a matrix of interactions, or just of the products of importances? I think those are quite different things. If you made a matrix of the products of the numbers of the one-dimensional graph, would you get the two-dimensional graph?
Are the inportances in the 2D plot normalized to the importance of each base considered singly in the 1D plot? If yes, then the colours in the 2D plot tell us that there is an interaction effect between these positions. But if they're not, then this is just the product of the importances.
To get the full interaction effects, I think you need to consider more information. Specifically, when considering the cases where two specific bases don't match the consensus, I think you need to consider WHICH non-consensus bases are at those positions, and whether there are any correlations between them. For example, in the 2D graph, cases where positions 4 and 26 are mismatched might occur a bit more often than one might expect (a bit more often than the product of the individual expectations). To say that this is because of an interaction effect, you'd want to get the frequency of each base combination, and then see if any non-consensus combinations are overrepresented.
So, I think the covariation analysis might best be done this way:
ReplyDeleteTake the pairwise variation data (the 2D graph) and normalize it by the expected pairwise variation (the product of the two single-position variation values).
Plot that. Pairs with scores >1 and scores <1 are candidates for interaction effects.
For the candidate pairs, compare the frequencies of each combination to the expected frequencies given the single-position values.
Plot a 4x4 array showing these. For positions where the overall normalized analysis showed interaction score <1, we expect to see that the consensus pair is underrepresented and some combination of bases other than the consensus is overrepresented. For positions where the score was >1, we expect to see that the consensus pair is overrepresented and all non-consensus pairs are underrepresented.
Or maybe vice versa, if I've gotten this confused.
Right. Palate.
ReplyDeleteIt is appropriate to think of the 2D plot above as an "importance" plot, rather than an interaction plot. But it's not simply made by multiplying single frequencies together... That would be the "expected matrix" for the analysis you suggest above (I think).
I started with this view, because it can be directly related to the simpler barplot above it, so had some intuitive appeal for me.
The reason this is not truly an "interaction plot" is because I am only reporting the observed enrichment/depletion of pairwise mismatches. As I alluded to above, if I normalized these values to an expected matrix as you describe above to produce "interaction plot", the cross should mostly disappear... Having a mismatch at position 7 does not help you predict the chance of a mismatch anywhere else.
I suspect, based on staring at this thing, that the AT-tracts and position 4 will show an interaction.
And then of course I have to examine what the mismatches actually are, but baby steps over here... baby steps. When I try to jump straight in, I get confused...
Right. If you normalize to an 'expected' matrix then the cross should mostly disappear. What remains will be the evidence for true interactions, which then need to be investigated individually.
ReplyDelete