
I think I got somewhere with my degenerate sequencing data... possibly even discovered something unexpected. The first thing I had to do was more or less start over my analysis, and temporarily forget everything I’ve read about “information”, “bits”, “logos”, etc. and just go back to thinking about the experiment from the get-go...
When it comes to DNA, H. influenzae are picky eaters. They seem to find some DNA fragments tastier than others. When I came to the lab, Rosie provided me with a DNA sequence: AAAGTGCGGTTAATTTTTACAGTATTTTTGG. She told me that DNA fragments containing this sequence (the “consensus USS”) are especially delicious to cells, and she wanted to know why.
Well, before I can aspire to helping with that, it'd be helpful to know which parts (or combinations of parts) of this special sequence actually make it delicious. Are all positions equally important? Do different positions make independent contributions to deliciousness, or do they work together to make an especially tasty morsel? Perhaps such insight into H.flu’s palate will get me closer to the bigger project of understanding why H.flu is so picky.
I did the following experiment last month to this end:
(1) I provided a massive smorgasbord of DNA to a whole bunch of cells... hundreds of culinary choices per cell to a billion cells. This tremendous buffet also had ridiculous variety. Every DNA snack was based on Rosie’s extra-tasty sequence, but there were differences in every bite (due to incorporating 24% degeneracy at each base during DNA synthesis). Indeed, there were hundreds of billions of different taste sensations in the input DNA pool.
(2) I harvested the undigested meals from the cells (facilitated by a mutation that kept the food from getting digested). The cells probably had terrible heartburn at that point, but each on average managed to stuff down dozens of DNA molecules.
(3) I sent two DNA samples to my friend to get sequenced: a bit of the original DNA buffet, and the DNA that I purified from the cells’ guts. By comparing the relative abundance of different DNA molecules, we can learn which positions are important to making the sequence tasty and, hopefully, how the different positions work together to impart deliciousness.
(4) Now I have two giant lists of >10 million sequences each from my friend. So far, I’m working with the first 100,000 of each dataset so I can figure out how to process the data. For the below analysis, this is more than sufficient to make accurate frequency measures.
First, I asked how the uptake sample (the H.flu guts) compares to the input sample, treating each position in the sequence independently. To keep things extra simple, I only considered whether a given position on a particular sequence read is matched or mismatched from the original sequence.
So, for each position, I counted how many of the 100,000 reads in each set were mismatched. To calculate the relative enrichment of the mismatched sequence in the uptake sample compared to the input sample, I simply took the ratio of uptake:input, then took the log of that ratio. This normalization makes it so that positive numbers indicate enrichment between uptake and input, while negative numbers indicate depletion. Here’s what I get:
Lovely. It looks like some parts of the original sequence are more important than other parts. For example, sequences that didn’t have a C at position 7 were depleted more than 64-fold in the uptake sample compared with the input sample. This must be an especially important flavor. On the other hand, some of the other positions look like they contribute less to tastiness, and a few positions don’t seem to contribute at all.Great! We could now go and compare this to some other information we have, but for now, I’m going to ignore all of that and move on to a pairwise mismatch analysis. This will tell us how pairwise combinations of mismatches affect tastiness.
First, here’s that first barplot again, with the bars re-colored by the amount of depletion. Mismatches at red positions make the DNA taste bad, while mismatches at blue positions don’t.
That’s the legend for this next image, which shows the log2(uptake/input) for each pair of positions. To make the dataset, I counted how many of the 100,000 reads in each set had mismatches at a particular pair of positions. As with the above, I did not condition on the total number of mismatches in the read, just that at least those two were. On the diagonal is shown the consensus sequence and values corresponding to the above barplot. The plot is symmetric about this diagonal.
The big cross of red is like what we saw in the 1D barplot above... Those same important positions are still important. The cross tells us that, for example, not having a C at position 7 means you taste terrible, independent of where else you have a mismatch.BUT, Hot damn! I think there’s a discovery here beyond that delectable flavor packet of GCG at positions 6-8 (and possibly a few flanking bases).
Note the 3x3 grid of regularly-spaced "mismatch interactions. Note the spacing between these 3 interacting parts of the sequence are just about a helical turn of DNA apart (10-12 base pairs).
I think this picture has some implications for our molecular models...
brain whirring...
(Notes: After getting the number-crunching steps nice and tidy, I spent an inordinate amount of
time making these plots. I started using heatmap(), graduated to heatmap.2(), but ultimately took it to the next level and went fully customized with the image() function. It also took a ridiculously long time to figure out how to have fine-control over the color-scaling. This was necessary to be able to make the legend. I also made a more normal-looking legend (below), which was quite tricky to do right... If I wasn't so distracted by helical turns at the moment, I might explain how I made these in more detail.)
(continued...)
















