Thursday, May 21, 2009

Sequencing, Ho!

In response to Rosie’s last post, I wanted to outline my take on two of our planned experiments that involve deep sequencing:
  1. Transform a sequenced isolate’s genomic DNA (sheared) into our standard Rd strain, collect the transformed chromosomes, and sequence the daylights out of them to measure the transformation potential across the whole genome.
  2. Feed a sequenced isolate’s genomic DNA to competent Rd, purify the DNA that gets taken up into the periplasmic space, and sequence the daylights out of this periplasm-enriched pool to measure the uptake potential across the whole genome.
As Rosie stated, Illumina’s GA2 sequencer is pretty incredible:

The GA2 can generate massive amounts of sequence data for a low cost. The technology is pretty involved, but in short, it uses a version of “sequencing-by-synthesis” from “polonies” (or clusters) amplified from single DNA molecules that have settled down on a 2D surface. (This reminds me tremendously of my first real job after college at Lynx, Inc., which developed MPSS). The original instrument could get ~32 bases from one end of an individual DNA fragment in a polony; the new GA2 instrument can get 30-75 bases from each end of an individual DNA fragment in a polony. The number of clusters in a given run is quite large (though apparently pretty variable), so in aggregate, several gigabases of sequence can be read in a single sequencing “run”, costing about $7000.

For reasonable estimates of coverage, we can use the following conservative parameters:
• 60 million paired-end reads per flow cell (7.5 million / lane)
• 50 bases per read (that’s a very conservative 25 bases per end)
That’s ~3 gigabases! For our ~2 megabase genome, we’d conservatively get 1500X sequence coverage for a full run, costing ~$7000.

Apparently, if we’re doing things just right, this number would more than double:
• 80 million paired-ends per flow cell (10 million / lane)
• 2 X 50 bases per paired-end read
That’s 8 Gb, or 4000X coverage of a 2 Mb genome! (Or 500X coverage in a single lane)
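The coverage arithmetic in the two scenarios above is simple enough to sketch in a few lines. The numbers below are the post's conservative and optimal GA2 estimates, not measured values:

```python
GENOME_BP = 2_000_000  # ~2 Mb genome

def sequence_coverage(read_pairs, bases_per_pair, genome_bp=GENOME_BP):
    """Fold sequence coverage = total bases sequenced / genome size."""
    return read_pairs * bases_per_pair / genome_bp

# Conservative: 60 M read pairs x 50 bp (25 bp per end) per flow cell
conservative = sequence_coverage(60_000_000, 50)   # 1500X

# Optimal: 80 M read pairs x 100 bp (2 x 50 bp) per flow cell
optimal = sequence_coverage(80_000_000, 100)       # 4000X

# A single lane at the optimal settings: 10 M read pairs
per_lane = sequence_coverage(10_000_000, 100)      # 500X

print(conservative, optimal, per_lane)
```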

(1) TRANSFORMATION: Unfortunately, a single run, as described above, is probably not quite enough to get decent estimates of the transformation potential of every single-nucleotide polymorphism between our donor sequences and recipient chromosomes. While some markers may transform at a rate of 1/100, we probably want the sensitivity to make decent measurements down to 1/1000 transformation rates. For that, I think we’ll still need several GA2 runs. However, even a few lanes would give us very nice measurements for indel and rearrangement polymorphisms, using spanning coverage (see below).

But, in a single run, we could do some good co-transformation experiments by barcoding several independent transformants, pooling, and sequencing. So if we asked for 100X coverage, we could pool 5 independent transformants into each of 8 lanes (40 individuals total). This wouldn’t give us transformation rates per polymorphic site, but since we’d know which individual each DNA fragment came from, we’d be able to piece together quite a bit of information about co-transformation and mosaicism.
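The pooling arithmetic above can be sketched as follows, assuming the optimal single-lane figure of 500X from earlier and a 100X-per-individual target:

```python
LANE_COVERAGE = 500          # optimal single-lane sequence coverage
TARGET_PER_INDIVIDUAL = 100  # desired coverage per barcoded transformant

individuals_per_lane = LANE_COVERAGE // TARGET_PER_INDIVIDUAL  # 5 per lane
total_individuals = individuals_per_lane * 8                   # 40 per flow cell

print(individuals_per_lane, total_individuals)
```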

(2) PERIPLASMIC UPTAKE: On the other hand, even one run is extreme overkill for measuring periplasmic uptake efficiency across the genome, because in this instance, we don’t really need sequence coverage, but only “spanning coverage” (aka physical coverage). Since we can do paired-ends, any given molecule in our periplasmic DNA purification only needs to get a sequence tag from each end of the molecule. Then we’ll know the identity of the uptake DNA fragment, based on those ends (since we have a reference genome sequence). Sequence tags that map to multiple locations in the reference will create some difficulties, but the bulk should be uniquely mappable.

So, if we started with 500 bp input DNA libraries, a single lane (rather than a full run of eight) would give us a staggering 2500X spanning coverage! (500 bp spans X 10 million reads / 2 Mb genome = 2500.) To restate: for <$1000, each nucleotide along the genome of our input would be spanned 2500 times. That is, we’d paired-end tag 2500 different DNA fragments that all contained any particular nucleotide in the genome.

If we graphed our input with chromosome position on the x-axis and spanning coverage on the y-axis, in principle we’d simply get a flat line crossing the y-axis at 2500. In reality, we’ll get a noisy line around 2500, since some sequences may end up over- or under-represented by our library construction method or by sequencing bias. In our periplasmic DNA purification, however, we expect to see all kinds of exciting stuff: regions that are taken up very efficiently would have vastly more than 2500X coverage, while regions that are taken up poorly would have substantially less. This resolution is most certainly higher than we need, but heck, I’ll take it.

I would almost be tempted to use a barcoding strategy and compete several genomes against each other for the periplasmic uptake experiments. If we could accept, say, a mere 250X spanning coverage for our uptake experiments, we could pool 10 different genomes together and do it that way. We could even skip the barcodes, with some loss of resolution, if the genomes were sufficiently divergent that most paired-end reads would contain identifying polymorphisms...

The periplasmic uptake experiment is our best one so far, assuming we can do a clean purification of periplasmic DNA. The cytosolic translocation experiment has some bugs, but should be similarly cheap. The transformation experiment will, no matter how much we tweak our methods, require a lot of sequencing. But it's still no mouse.
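The spanning-coverage calculation differs from sequence coverage only in that each read pair counts for the full insert it spans, not just the sequenced ends. A minimal sketch, using the 500 bp insert size and one-lane read count from above:

```python
def spanning_coverage(read_pairs, insert_bp, genome_bp=2_000_000):
    """Physical (spanning) coverage: each pair covers its whole insert."""
    return read_pairs * insert_bp / genome_bp

one_lane = spanning_coverage(10_000_000, 500)  # 2500X from a single lane

# Pooling 10 barcoded genomes in that lane splits the coverage ~10 ways
pooled = one_lane / 10                         # 250X per genome

print(one_lane, pooled)
```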


  1. (I'm falling sadly behind in even reading this blog, so the comment below only applies to the first few paragraphs...)

    George Weinstock said on Thursday that with Illumina they are presently getting 2x75 bp reads and 20Gb per run. He anticipates soon getting 2x125bp and 90Gb runs.

  2. Aww... I guess I DO need to be more succinct.

    From speaking to others who run the GA2, these are optimal numbers and not reflective of an average run.

    Also I'm usually the last person to catch this sort of stuff, but "presently" ≠ "currently". The former refers to the immediate future, whereas the latter refers to the present day!