Wednesday, September 30, 2009

Struggling with the Background

Previously, I had shown some preliminary analysis of Rosie’s simulated uptake data of chromosomal DNA fragments. Rosie also sent me simulated uptake data for degenerate USS sequences (using 12% degeneracy per position of this USS consensus sequence:

5’-AAAGTGCGGTCAAATTTCAGTCAATTTTT-3’).

So what can I do with this dataset? Well, first, since Rosie also provided me with the “scores” for each sequence, I could plot a histogram of the scores for the 100 selected and 100 unselected sequences, showing that the uptake algorithm seems to work pretty well.
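(For anyone who wants to redo this kind of plot, here’s a minimal Python sketch, assuming the scores have already been parsed into two plain lists of numbers; the function and variable names are mine, not anything from Rosie’s files.)

```python
# Minimal sketch: overlay score histograms for the two sets of 100 sequences.
# "selected_scores" and "unselected_scores" are assumed to be lists of floats.
import matplotlib.pyplot as plt

def plot_score_histograms(selected_scores, unselected_scores, bins=20):
    plt.hist(unselected_scores, bins=bins, alpha=0.5,
             label="unselected (n=%d)" % len(unselected_scores))
    plt.hist(selected_scores, bins=bins, alpha=0.5,
             label="selected (n=%d)" % len(selected_scores))
    plt.xlabel("uptake score")
    plt.ylabel("number of sequences")
    plt.legend()
    plt.show()
```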

Here's the histogram:
Notably, even the unselected sequences have rather high scores, when compared to the same analysis of genomic DNA fragments. This is unsurprising, since the sequences under selection in this degenerate USS simulation are all rather close to the consensus USS.

Here’s the histogram from the genomic uptake simulation again (just to compare):
(I think the reason for the difference in the “selected” distributions is due to a different level of stringency when Rosie produced the two simulated datasets.)

That’s all fine and good, but now what? With the genomic dataset, I could use the UCSC genome browser to plot the locations of all the fragments I was sent, but this dataset just consists of two alignment blocks of sequences that all look markedly similar.

The obvious thing to do was to make Weblogos of the two different datasets… The unselected set should have very little information in it, while the selected set should contain information. In doing this, I discovered a rather important issue… The on-line version of Weblogo does NOT, I repeat, does NOT account for the background distribution.

This is a problem. It means that whenever you make a Weblogo (on the webserver) from your alignment block, it assumes that each base is equally likely to occur at a random position. This is why the y-axis in Weblogo plots always has a maximum of 2 bits for DNA sequence. Why is this a problem? First of all, if one is using an AT-rich genome, as we are, then the information content of any G or C is underestimated and that of any A or T is overestimated.

So how does Weblogo calculate the information content of each position in an alignment block? From the Weblogo paper (link found at the Weblogo website):

Rseq = Smax − Sobs = log2(N) − ( −Σn pn log2(pn) )

where:
Rseq = the information at a particular position in the alignment
Smax = the maximum possible entropy
Sobs = the entropy of the observed distribution of symbols (bases)
N = the number of distinct symbols (4 for DNA)
pn = the observed frequency of symbol n

The log2 is there to put everything in terms of bits.
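In code, that calculation for a single alignment column is only a few lines; here’s a quick Python sketch of my own (ignoring the small-sample correction the Weblogo paper also describes):

```python
import math
from collections import Counter

def column_information(column, alphabet="ACGT"):
    """Rseq for one alignment column, assuming a uniform background:
    Rseq = Smax - Sobs = log2(N) - ( -sum_n pn*log2(pn) )."""
    counts = Counter(column)
    total = sum(counts[base] for base in alphabet)
    s_obs = 0.0
    for base in alphabet:
        p = counts[base] / total
        if p > 0:
            s_obs -= p * math.log2(p)
    return math.log2(len(alphabet)) - s_obs

print(column_information("AAAAAAAAAA"))    # all A's: the full 2 bits
print(column_information("ACGTACGTACGT"))  # an even mix of bases: 0 bits
```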

So for DNA (4 bases), the maximum entropy at a position is 2 bits. Makes perfect sense: 2 bits per base. However, this only makes sense if each base is equally probable in a randomly drawn sequence. For purposes of gaining an intuition for different motifs, this isn’t really a big deal, although it does complicate comparing motifs between genomes.

When this isn’t the case (probably much of the time), a different measure has been used, namely the “relative entropy” of a position. This is based on the log-odds ratio of the observed frequency to the background frequency of each base. Apparently, the off-line version of Weblogo can account for non-uniform base composition, but I haven’t tried installing it yet, nor have I tried any of the other software out there that handles variation in GC content.
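As far as I understand it, the single-background version of the calculation would look something like this (a Python sketch; the AT-rich background frequencies are illustrative numbers I picked for a genome of roughly 38% G+C, not measured values):

```python
import math
from collections import Counter

def column_relative_entropy(column, background, alphabet="ACGT"):
    # Relative entropy of one column against a background distribution:
    # sum_n pn * log2(pn / qn), where qn is the background frequency of base n.
    counts = Counter(column)
    total = sum(counts[base] for base in alphabet)
    rel_ent = 0.0
    for base in alphabet:
        p = counts[base] / total
        if p > 0:
            rel_ent += p * math.log2(p / background[base])
    return rel_ent

# Illustrative AT-rich background (about 38% G+C):
at_rich = {"A": 0.31, "T": 0.31, "G": 0.19, "C": 0.19}
# Against this background a column of all G's is worth more than 2 bits,
# and a column of all A's is worth less:
print(column_relative_entropy("GGGGGGGGGG", at_rich))  # ~2.4 bits
print(column_relative_entropy("AAAAAAAAAA", at_rich))  # ~1.7 bits
```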

Why haven’t I bothered? Because what we need for our degenerate sequences is a different background distribution at each position! So, the first position in the core is 88% A, but the third position is 88% G!
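Here’s a sketch of that position-specific background, assuming the 12% degeneracy is split evenly among the three non-consensus bases (4% each); that even split is my assumption about how the degenerate pool was designed:

```python
# Position-specific background implied by 12% degeneracy per position:
# the consensus base gets 0.88 and each of the other three bases gets 0.04.
USS_CONSENSUS = "AAAGTGCGGTCAAATTTCAGTCAATTTTT"
DEGENERACY = 0.12

def positional_background(consensus=USS_CONSENSUS, degeneracy=DEGENERACY):
    """Return one {base: frequency} dict per position of the consensus."""
    backgrounds = []
    for consensus_base in consensus:
        freqs = {base: degeneracy / 3 for base in "ACGT"}
        freqs[consensus_base] = 1.0 - degeneracy
        backgrounds.append(freqs)
    return backgrounds

bg = positional_background()
print(bg[0])  # first position -> A: 0.88, each other base: ~0.04
```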

To illustrate the problem, here is a Weblogo of the selected set:
Here’s the unselected set:
Looking closely, it is clear that there are differences in the amount of “information” at each position. So in the strong consensus positions of the USS, the selected set has higher “information” than the unselected set, while at weak consensus positions, that’s less true.

But the scaling of each base here is completely wrong. There isn’t nearly a bit of information at the first position in the unselected set. We expected 88% A. The fact that there are mostly As in the alignment block at the first position is NOT informative. In fact, if all was well, we’d get zero bits at all the unselected positions!

What to do? I tried to make my own logo, using the known true background distribution at each position. I won’t belabor the details too much at the moment, except to say that I had to figure out what a “pseudocount” was and how to incorporate it into the weight matrix, so as never to take the logarithm of zero.
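For the record, the heights I’m computing are roughly as sketched below: observed frequencies smoothed with a pseudocount, then scored as p·log2(p/q) against the position-specific background q. The pseudocount of 1 per base and the exact scaling are choices I’m making for this sketch, not necessarily identical to what produced the figures.

```python
import math
from collections import Counter

def logo_matrix(sequences, backgrounds, pseudocount=1.0, alphabet="ACGT"):
    """For each position, give each base a height of p*log2(p/q), where p is the
    pseudocount-smoothed observed frequency and q is the position-specific
    background frequency. The pseudocount keeps p away from zero, so the
    logarithm is always defined; heights can be negative when a base shows up
    less often than the background predicts."""
    matrix = []
    for i in range(len(sequences[0])):
        counts = Counter(seq[i] for seq in sequences)
        total = sum(counts[base] for base in alphabet) + pseudocount * len(alphabet)
        column = {}
        for base in alphabet:
            p = (counts[base] + pseudocount) / total
            column[base] = p * math.log2(p / backgrounds[i][base])
        matrix.append(column)
    return matrix

# Toy example: two 2-base "sequences" scored against a flat background.
flat = [{base: 0.25 for base in "ACGT"}] * 2
print(logo_matrix(["AA", "AC"], flat))
```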

Here’s the selected set of 100 sequences:
Here’s the unselected set of 100 sequences:
(Note that I somehow lost the first position when I did this, so the motif starts at the first position of the core.)

This actually looks quite a bit better, or at least more sensible. A few things worth noting:
  • I think if we did thousands of unselected sequences, we’d pretty much get zero information from that alignment, which is what we would want, since that’s just the background distribution.
  • Some values are negative. This is expected. Since these are scaled to log-odds ratios, when the frequency of seeing a certain base is less than the expected background frequency, a negative number emerges.
  • The scale is extremely reduced. Every position is worth less than 0.3 bits. This is also expected. One description I’ve seen of information content is how “surprised” one should be when making an observation (there’s even a unit of measure called a “surprisal”!). Since we are drawing from an extremely non-uniform distribution that already favors the bases cells are expected to take up best, we are basically squashing our surprise way down. That is, getting an A at the first position of the core is highly favored by uptake, but it’s also the most likely base to be there anyway, even in the absence of selection (see the small calculation after this list).
  • The unimportant bases in the USS have the most information content in the selected set. At first this bothered me, but then I realized it was utterly expected for the same reason as above. For example, at position #18 above (sorry, it’s position 19 in the Weblogos), the selection algorithm doesn’t really care what base is there. That means that the selected set will let mutations at that position (from A to something else) come through, which will be surprising, when compared to the background distribution!
(ADDED LATER: Actually, this last point is wrong. The reason for so much information at the weak positions is related to the matrix that was used to select the sequences, not from surprise. I'll try and get a proper dataset later and redo this analysis. To some extent, the positions will still have some information, as partially explained in my erroneous explanation above, but not nearly so much.)
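To put a rough number on the “surprise” point from the list above: surprisal is just −log2 of the probability of what you observed, so seeing an A at a position whose background is already 88% A carries very little surprise compared with seeing it against a uniform background. A two-line check:

```python
import math

def surprisal_bits(p):
    """Surprisal, in bits, of an observation that had probability p."""
    return -math.log2(p)

print(surprisal_bits(0.88))  # ~0.18 bits: an A where the background is 88% A
print(surprisal_bits(0.25))  # 2.0 bits: an A against a uniform background
```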

Whew! I’ve gotta quit now. There’s a lot more to think about here.

1 comment:

  1. I'm glad you repeated the "does NOT", otherwise I may have gotten confused!
