Friday, November 6, 2009

Weaning myself off of Excel

In some sense, the computing I did today isn’t really useful, since I already worked out these things using Microsoft Excel. But I’ve been ordered by my bioinformatics consultants to stop with the Excel already. So as practice, I worked out some of the expected features of degenerate oligos again, but this time using R.

The main motivation for doing this, besides practice, is that I am fairly sure we should be ordering degenerate oligos with more degeneracy than we have previously considered. I won't make that argument here, but will just repeat some analytical graphs I'd previously made.

It took a while (since I’m learning), but was still much more straightforward than doing it in a spreadsheet. The exercise was extremely useful, as I learned a bunch of stuff (especially about plots in R) while doing the following:

Problem #1: Given a percentage of degeneracy per base, d, in an n-length oligo, what is the proportion of oligos with k mismatches?
Answer #1: Use the binomial distribution. For a 32mer with different levels of degeneracy (shown in legend):
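Roughly, the calculation can be sketched in R as below. The 32mer and the 0.12 level come from this post; the other degeneracy levels in `d` are placeholder values chosen only for illustration, and the plotting choices are just one way to do it.

```r
n <- 32                          # oligo length (from the post)
d <- c(0.03, 0.06, 0.12, 0.24)   # degeneracy per base; only 0.12 is from the post, the rest are assumed
k <- 0:n                         # possible numbers of mismatches

# One column per degeneracy level, one row per k: P(exactly k mismatches)
p <- sapply(d, function(di) dbinom(k, size = n, prob = di))
colnames(p) <- paste0("d=", d)

# One line per degeneracy level, with the levels shown in the legend
matplot(k, p, type = "b", pch = 1, lty = 1, col = seq_along(d),
        xlab = "Mismatches (k)", ylab = "Proportion of oligos")
legend("topright", legend = d, col = seq_along(d), lty = 1, pch = 1,
       title = "Degeneracy per base")
```

Each column of `p` sums to 1, since every oligo has some number of mismatches between 0 and n.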
Problem #2: Given a million instances of such an oligo, how often would each possible oligo with k mismatches be observed?
Answer #2: Simply adjust each of the above values by dividing by the number of classes within each group of k mismatches (i.e. choose(n, k)):
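A minimal sketch of that adjustment, assuming 0.12 degeneracy and treating a "class" as one particular set of k mismatched positions (it does not distinguish which of the three alternative bases sits at each mismatch). The variable names are made up for this example.

```r
n <- 32      # oligo length
d <- 0.12    # degeneracy per base (one of the levels discussed in the post)
N <- 1e6     # number of oligo instances
k <- 0:n

# Expected count of oligos with exactly k mismatches, spread evenly over
# the choose(n, k) ways of placing those mismatches
expected_per_class <- N * dbinom(k, size = n, prob = d) / choose(n, k)

# The values span many orders of magnitude, so a log Y axis helps
plot(k, expected_per_class, log = "y", type = "b",
     xlab = "Mismatches (k)", ylab = "Expected count per class")
```

Dividing by choose(n, k) cancels the binomial coefficient, so each value is just N * d^k * (1 - d)^(n - k).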
Problem #3: If some number of bases, m, in the n-length oligo are “important”, what proportion of oligos with k mismatches will have x “hits”?
Answer #3: Use the hypergeometric distribution. The plot below is the same as for Problem #1 at 0.12 degeneracy, but with the number of hits broken down for each k:
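Here is a sketch of one way to build that stacked bar plot, assuming m = 8 important positions (the value given in the comments below). The matrix name `joint` and the colour choice are mine, not necessarily what the original graph used.

```r
n <- 32     # oligo length
m <- 8      # number of "important" positions (value from the comments)
d <- 0.12   # degeneracy per base
k <- 0:n    # total mismatches
x <- 0:m    # mismatches that land on important positions ("hits")

# joint[x, k] = P(k mismatches) * P(x hits | k mismatches)
# dhyper(x, m, n - m, k): draw k positions, m are "important", n - m are not
joint <- outer(x, k, function(xi, ki) {
  dhyper(xi, m, n - m, ki) * dbinom(ki, size = n, prob = d)
})

# Matrix input gives one bar per column (k), stacked by row (hits)
barplot(joint, names.arg = k, col = gray.colors(length(x)),
        xlab = "Mismatches (k)", ylab = "Proportion of oligos",
        legend.text = paste(x, "hits"))
```

The total height of each bar is the binomial proportion from Problem #1, and `1 - sum(joint[1, ])` gives the proportion of oligos with at least one hit.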
I didn't try super-hard to make the perfect graphs, but it did take some effort to make a stacked bar plot...


  1. In #1 and #2, is it possible to have R draw the Y axis going through zero? That would make values easier to estimate.

    And in #3, what's the value of m? Am I right in thinking that the graph shows that 98% of oligos will have at least one mismatch to the consensus but only about 65% will have at least one of these in an important position?

  2. Yeah. I figured out how to add lines using the "abline" function. For #2, I should probably be focused on only a part of the displayed graph too.

    As for #3, m=8, so 1/4 of the 32 positions are presumed to be important. And your approximations are about right...