Tuesday, May 5, 2009

Degenerate Oligos

Rosie and I have been working our way through editing my unfunded NIH postdoc proposal. There's a lot of work to do on it, but I think we'll make it excellent by the time we're done.

This is our first aim (so far):
Determine the selectivity of the outer membrane uptake machinery for USS-like sequences using degenerate USS constructs.

This is a meaty Aim and will require some time to fully flesh out. I'm not even fully convinced it belongs in this grant. It may be better for the NIH R01 submission exclusively.

The idea is to feed Haemophilus synthetic DNA containing a USS with a few random mistakes in each independent DNA molecule. We'd then purify the DNA that was preferentially taken up by the bacteria into the periplasm and sequence it to high coverage using Illumina's GA2 platform. This experiment has a lot of potential, but also a number of issues we need to work out first before we can seriously discuss the possible analyses.

To start with, I want to try and outline the basic properties of the degenerate USS donor construct we’re planning on having made. Later, I’ll try and tackle issues such as sequencing error and competition between different USS-like sequences that we expect with the actual experiment, as well as the analysis and significance of the data we hope to obtain.

As donor DNA, we want to make a long oligo (~200mer), such that the central 32 bp correspond to a known and previously studied USS, except that there will be a chance at each base in the USS for a mistake to be introduced (i.e. one of the three alternative bases could be added, instead of the correct one).

This will entail doping each bottle of dNTPs in the oligo synthesizing machine (A, T, G, or C) with a few percent of each of the other bases. For example, for a 9% degenerate oligo, the bottle that would normally be filled with 100% dATP would instead be 91% dATP, 3% dTTP, 3% dGTP, and 3% dCTP.

What do I expect the input degenerate oligo pools to look like in a sequencer?

I’ll start somewhat simply. Assuming perfect degenerate oligo synthesis and no sequencing error, I can use binomial probabilities to calculate how many oligos per million would be expected to have a certain number of incorrect bases. And I can further calculate the coverage of a given oligo species per million sequence reads. (A million sequence reads is used as a convenient frame of reference for thinking about deep sequencing. Normalizing to parts per million is also something I've seen others do.)


n = length of the oligo to be considered (n=32 hereafter)
p = probability that the correct base is added at a particular position
q = 1-p = probability that an incorrect base is added at a particular position
k = the number of incorrect bases in a particular oligo


( n k ) = total number of possible oligos with k incorrect bases
= “n choose k”
= n! / [ k! * (n-k)! ]

nPk = binomial probability of an oligo having k incorrect bases
= ( n k ) * p^(n-k) * q^k

ppm = expected number of sequence reads per million with k incorrect bases
= nPk * 10^6

seq = number of sequences with k mistakes
= ( n k ) * 3^k

coverage = expected reads per million of a given oligo species with k incorrect bases
= ppm / seq

n = 32
p = 0.88
q = 0.12

This means that if we produce a degenerate 32mer oligo, in which each position has a 12% chance of incorporating an incorrect base (any one of the alternate 3 bases with equal probability), we would expect 21% of the molecules to have three mistakes.

But there are 4,960 different ways in which a 32mer could have 3 mistakes ( n k ). Furthermore, there are 3^3 = 81 different ways that there could be a 3 base mistake in an oligo, yielding 133,920 distinct oligo species that have 3 mistakes (seq). So out of a million sequence reads, we would only expect to see a given species of oligo with three mistakes 1.57 times per million sequence reads (i.e. 1.57X coverage of each possible 3 mistake oligo).

Here are some graphs to illustrate what this looks like for different levels of degeneracy. These should probably be histograms, but for now, I’m keeping them as they are.

Graph #1: The expected number of sequence reads (per million) with k incorrect bases in the 32mer oligo are shown for different levels of degeneracy.

Graph #2: The expected fold-coverage (per million reads) of individual oligo species with k incorrect bases are shown for different levels of degeneracy.

The take-away message: With moderate levels of degeneracy, we can obtain large amounts of sequence data for oligos with a few incorrect bases, but we will have a substantially more difficult time in getting high coverage of individual oligo species for any more than 2 or 3 incorrect bases per oligo. This means we’ll have to carefully consider how we’ll do the analysis.

Much more on this in the future...


  1. Oh no, I wrote a long comment with lots of ideas but lost it because my cable won't stay plugged in and my laptop refuses to use airport!

    Quick summary:

    1. Consider properties of libraries enriched by uptake, as this may be the real limiting factor. The above analysis is for the input library, and here we can probably just use sequencing to test the null hypothesis that the library and sequencing are not introducing bias.

    2. Reduce 32 to about 25, as not all bases matter.

    3. Can we do any preliminary experiments?

    4. Tell NIH we only expect to get data for, say, up to 5 mismatches to the consensus.

    5. I think there was one more point....

  2. Re: #1

    Before we can decide how well a particular oligo species is taken up in our enrichments, we need to know its abundance in the input pool. I’ll start by making the grossly over-simplifying assumptions of (a) no competition between different oligo species, (b) perfect periplasmic enrichments, and (c) no sequencing error. My first naïve idea on how to calculate an uptake efficiency of a given USS-like sequence is to measure the ratio of “uptake coverage” / “input coverage”. I’m sure there are more sophisticated ways of doing this, but this is the only way that’s somewhat clear in my head so far.

    Okay. So referring to the table above for 12% degenerate oligos:

    We could then use the uptake / input ratio calculated from the error-free oligo as our baseline USS uptake efficiency. “Input coverage” of this species is high (16,728 ppm), so we would expect a fairly precise number. For the sake of argument, I’ll imagine that “uptake coverage” / “input coverage” = 1, i.e. the relative amount we put in equals the relative amount we get out.

    For a given 1 mismatch oligo, we’d expect to have good numbers too. Our denominator (“input coverage”) in this case would be 760 ppm. So if we got only 76 ppm for “uptake coverage”, we’d say the base substitution in question caused a 10-fold decrease in uptake, relative to the non-degenerate oligo. Whereas if “uptake coverage” = 7600 ppm, we’d say it stimulated uptake 10-fold.

    On the other hand, for a given 3 mismatch oligo, we need to a sufficiently large denominator to make an accurate measurement. Our analytical calculation says we should see the 3 mismatch oligo in the input in only 1.57 ppm. This means that we would only be able to see 3 mismatch oligos that dramatically stimulated uptake (i.e. if we saw 157 ppm), but we would have a terrible time seeing unchanged rates (only 1 or 2 ppm) and almost no ability to detect 3 mismatch oligos that hurt uptake (i.e. if we see <1 ppm, we wouldn’t have much statistical power to detect this as decreased uptake efficiency). So even with only 3 mismatches, we’ve already kind of lost our ability to measure most things.

    (a) It's not quite as bad as I say, since we'd get a lot more than a million reads per sequencing lane, but the principle still holds.

    (b) Things improve significantly if we don’t actually care what the mismatches are. That is, if we only ask about whether the base is correct or incorrect, we only have to deal with ( n k ) classes, rather than ( n k ) * 3^K classes. "Coverage" improves by a factor of 3^k with a corresponding loss of meaning.

    (c) Things are also better if we only condition on single base changes, independent of how many substitutions are in a particular oligo. So we expect any given position to be a particular substitution 4% of the time, so across all oligos, we can ask how bad it is, say for position x to go from A -> T. If uptake was unaffected, we’d expect to see 4% of our uptake oligos to have this change. This limits the analysis, but is probably the easiest thing to do.

    (d) Things get significantly worse, if we consider competition. Things get really bad if we have to worry about greater than first order kinetics or any kind of concentration dependence with our competition. I think that using two different degeneracy rates will help address this, even if they’re somewhat close (i.e. 9% and 12%). This is because the concentration of individual species in the pools would be different by a predictable amount, so comparing uptake efficiencies for the same oligo species in the different oligo pools would help us tease this out.

  3. I’ll back up and simplify the problem. In the last comment, I failed to deal with the fact that the measurements are always relative to something (in this illustration a million sequence reads). Let’s imagine an input pool of three sequences (A, B, C) in equal abundance. Now imagine A is taken up 10X better than B, and B is taken up 10X better than C. So,

    INPUT: [A] = [B] = [C]

    For a million reads of input, we’d see:
    A: 333,333 ppm
    B: 333,333 ppm
    C: 333,333 ppm

    OUTPUT: [A] = 10 [B] = 100 [C]

    For a million reads of output, we’d see:
    A: 900,901 ppm
    B: 90,090 ppm
    C: 9,009 ppm

    So, for the ratio of input to output:
    A: 2.7
    B: 0.27
    C: 0.027

    And as expected, they have a factor of 10 between each.

    If they went in with unequal abundances, where:

    INPUT: [A] = 10 [B] = 100 [C]

    For a million reads of input, we’d see:
    A: 900,901 ppm
    B: 90,090 ppm
    C: 9,009 ppm

    OUTPUT: [A] = 100 [B] = 1000 [C]

    For a million reads output, we’d see:
    A: 990,001
    B: 9,900
    C: 99

    And the ratios:
    A: 1.10
    B: 0.11
    C: 0.01

    And again, a factor of 10 between each of them.

    The point is, the ratios don’t really mean anything, except relative to other species. The value of the ratio is to normalize away differences in the abundance of a given oligo species in the input to still arrive at a relative efficiency between species.

    This comparison demands that there is no concentration-dependent competitive effects.