Wednesday, September 30, 2009

Struggling with the Background

Previously, I had shown some preliminary analysis of Rosie’s simulated uptake data of chromosomal DNA fragments. Rosie also sent me simulated uptake data of degenerate USS sequences (using 12% degeneracy per position in this USS consensus sequence):


So what can I do with this dataset? Well, first, since Rosie also provided me with the “scores” for each sequence, I could plot a histogram of the scores for the 100 selected and 100 unselected sequences, showing that the uptake algorithm seems to work pretty well.

Here's the histogram:
Notably, even the unselected sequences have rather high scores, when compared to the same analysis of genomic DNA fragments. This is unsurprising, since the sequences under selection in this degenerate USS simulation are all rather close to the consensus USS.

Here’s the histogram from the genomic uptake simulation again (just to compare):
(I think the reason for the difference in the “selected” distributions is due to a different level of stringency when Rosie produced the two simulated datasets.)

That’s all well and good, but now what? With the genomic dataset, I could use the UCSC genome browser to plot the locations of all the fragments I was sent, but this dataset just consists of two alignment blocks of sequences that all look markedly similar to each other.

The obvious thing to do was to make Weblogos of the two different datasets… The unselected set should have very little information in it, while the selected set should contain information. In doing this, I discovered a rather important issue… The on-line version of Weblogo does NOT, I repeat, does NOT account for the background distribution.

This is a problem. It means that whenever you make a Weblogo (on the webserver) from your alignment block, it assumes that each base is equally likely to occur at a random position. This is why the y-axis in all Weblogo plots always has a maximum of 2 bits when using DNA sequence. Why is this a problem? First of all, if one is using an AT-rich genome, as we are, then the information content of any G or C is underestimated and that of any A or T is overestimated.

So how does Weblogo calculate the information content of each position in an alignment block? From the Weblogo paper (link found at the Weblogo website):

Rseq = Smax - Sobs = log2(N) - ( -Σn pn log2 pn ), where:

Rseq = the information at a particular position in the alignment
Smax = the maximum possible entropy
Sobs = the entropy of the observed distribution of symbols (bases)
N = the number of distinct symbols (4 for DNA)
pn = the observed frequency of symbol n

The log2 is there to put everything in terms of bits.

So for DNA (4 bases), the maximum entropy at a position is 2 bits. Makes perfect sense: 2 bits a base. However, this only makes sense if each base is equally probable for a randomly drawn sequence. Now for purposes of gaining an intuition for different motifs, this isn’t really a big deal, although it does complicate comparing motifs between genomes.
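To make the entropy arithmetic concrete, here is the calculation in a few lines of Python (the AT-rich composition used below is just illustrative, not a measured value for our genome):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits, skipping zero-probability symbols."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform base composition: the maximum entropy for DNA is 2 bits.
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # 2.0

# An AT-rich composition (numbers only illustrative) has less entropy,
# so 2 bits is no longer the right ceiling for "information".
print(entropy_bits([0.31, 0.19, 0.19, 0.31]))
```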

When this isn’t the case (probably much of the time), a different measure has been used, namely the “relative entropy” of a position, which weights each base by the log-odds of its observed probability versus its background probability. Apparently, the off-line version of Weblogo can account for non-uniform base composition, but I haven’t tried installing it yet, nor any of the other software out there that handles variation in GC content.

Why? Because what we need for our degenerate sequences is a different background distribution at each position! So, the first position in the core is 88% A, but the third position is 88% G!

To illustrate the problem, here is a Weblogo of the selected set:
Here’s the unselected set:
Looking closely, it is clear that there are differences in the amount of “information” at each position. So in the strong consensus positions of the USS, the selected set has higher “information” than the unselected set, while at weak consensus positions, that’s less true.

But the scaling of each base here is completely wrong. There isn’t nearly a bit of information at the first position in the unselected set. We expected 88% A. The fact that there are mostly As in the alignment block at the first position is NOT informative. In fact, if all was well, we’d get zero bits at all the unselected positions!

What to do? I tried to make my own logo, using the known true background distribution at each position. I won’t belabor the details too much at the moment, except to say that I had to figure out what a “pseudocount” was and how to incorporate it into the weight matrix, so as to not ever take the logarithm of zero.
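For the curious, here is roughly what the per-position calculation looks like in Python (a sketch, not my actual code; the column counts and pseudocount value below are made up for illustration):

```python
import math

def position_relative_entropy(counts, background, pseudocount=1.0):
    """
    Per-base log-odds contributions and total relative entropy (bits)
    for one alignment column, against a position-specific background.
    counts: observed base counts in the column, e.g. {'A': 90, ...}
    background: expected base frequencies at this position.
    The pseudocount keeps observed frequencies away from zero, so we
    never take the logarithm of zero.
    """
    total = sum(counts.values()) + 4 * pseudocount
    contributions = {}
    for base in 'ACGT':
        p_obs = (counts.get(base, 0) + pseudocount) / total
        contributions[base] = p_obs * math.log2(p_obs / background[base])
    return sum(contributions.values()), contributions

# Hypothetical first core position: the background is 88% A (12%
# degeneracy split among the other bases), and an unselected column
# that is mostly A -- so it should carry essentially zero information.
bg = {'A': 0.88, 'C': 0.04, 'G': 0.04, 'T': 0.04}
column = {'A': 90, 'C': 3, 'G': 4, 'T': 3}
bits, per_base = position_relative_entropy(column, bg)
```

The per-base contributions are what get stacked in the logo; note they can be negative when a base is rarer than its background frequency, exactly as seen in the figures above.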

Here’s the selected set of 100 sequences:
Here’s the unselected set of 100 sequences:
(Note that I somehow lost the first position when I did this, so the motif starts at the first position of the core.)

This actually looks quite a bit better, or at least more sensible. A few things worth noting:
  • I think if we did thousands of unselected sequences, we’d pretty much get zero information from that alignment, which is what we would want, since that’s just the background distribution.
  • Some values are negative. This is expected. Since these are scaled to log-odds ratios, when the frequency of seeing a certain base is less than the expected background frequency, a negative number emerges.
  • The scale is extremely reduced. Every position is worth less than 0.3 bits. This is also expected. One description I’ve seen of information content is how “surprised” one should be when making an observation (there’s even a unit of measure called a “surprisal”!). Since we are drawing from an extremely non-uniform distribution that already favors the bases expected to be taken up best, we are basically squashing our surprise way down. That is, getting an A at the first position of the core is highly favored, but it’s the most likely base to get anyways, even in the absence of selection.
  • The unimportant bases in the USS have the most information content in the selected set. At first this bothered me, but then I realized it was utterly expected for the same reason as above. For example, at position #18 above (sorry, it’s position 19 in the Weblogos), the selection algorithm doesn’t really care what base is there. That means that the selected set will let mutations at that position (from A to something else) come through, which will be surprising, when compared to the background distribution!
(ADDED LATER: Actually, this last point is wrong. The reason for so much information at the weak positions is related to the matrix that was used to select the sequences, not from surprise. I'll try and get a proper dataset later and redo this analysis. To some extent, the positions will still have some information, as partially explained in my erroneous explanation above, but not nearly so much.)

Whew! I’ve gotta quit now. There’s a lot more to think about here.

Monday, September 28, 2009

Mismatch repair versus Segregation

Things have gone swimmingly with my strain construction plans, and indeed today I am extracting DNA that will presumably be sequenced. To recap, I made a couple of clinical isolates (86-028NP and PittGG) resistant to novobiocin (NovR) by transforming them with a bit of left-over NovR allele from the former postdoc. I then isolated the new strains’ DNA and used it to transform the standard KW20 Rd strain. By selecting for NovR, we can be certain that the clones I pick took up DNA and recombined it into their genomes.

One technical issue arose, however, which required a little bit of thought: Should I have streaked for single colonies? I.e. once I had my transformants, it might be a good idea to streak out individual colonies to make sure I purified them away from any background or broke apart any doublet colonies. No big deal, but after talking it out with Rosie, we decided to skip it. Why? So that we might get lucky and distinguish recombination followed by mismatch repair versus recombination followed by segregation. In the following figures, I illustrate what I mean by this…

In this first one, the donor DNA is shown in red, and the recipient chromosome is shown in two colors, blue and green, to distinguish the strands. The lowercase letters indicate polymorphic sites in the donor genome. Little a is meant to be the selectable marker, in this case an allele of gyrB:
Donor DNA is incubated with competent recipient cells, and recombination of single-stranded DNA leaves patches of heteroduplex in the genome, shown as small red patches on either the blue or green strands.

After this, the cells have a chance to perform mismatch correction to fix any heteroduplex. I select for cells that have little a by plating to novobiocin plates, so only cells that end up a/a will survive and make colonies. (I am not going to show any examples of restoration repair, in which donor alleles are repaired back into recipient alleles… this will be invisible in our analysis.)

In the below example, I show the A/a and B/b heteroduplexes getting mismatch repaired into a/a and b/b, whereas the C/c and D/d heteroduplexes remain unrepaired (they escape correction). What will happen in such a case is the generation of a sectored colony, in which (in principle) half the cells would have one genotype and the other half a different genotype:
In the above example, the original transformant segregates the c and d alleles into different cells, while a and b end up in all cells. If the whole resulting colony is grown up and sequenced, the a and b alleles will be the only ones observed, while at the other two loci, there will be a mix of C and c, along with a mix of D and d. We wouldn’t be able to tell “phase”, i.e. whether c and d were on the same or different chromosomes, unless we did streak for singles and then sequenced several clones. But as a first pass, this could be a really interesting analysis. It will also serve as excellent proof-of-principle for our more intense sequencing plans.

There is a caveat, however, which means we need to get a little bit lucky to be able to distinguish these phenomena (mismatch repair versus segregation). We won’t see two different genotypes, if the A/a heteroduplex isn’t mismatch corrected:
The issue isn’t that segregation didn’t happen; the problem is that one of the segregants dies under selection for little a.

Thus, if we see a pure genotype, then either all mismatches were corrected, or our selectable marker didn’t mismatch correct.

When I pre-screen my transformants to make sure they’re not spontaneous mutants, I might be able to pick a colony where I think segregation is occurring. If I get the standard sequencing traces back and see mixed bases in the chromatograms that correspond to donor and recipient alleles, I’ll pick that kind of clone for sequencing…

One sort of sad note here, in terms of the more distant future, is that mismatch repair mutants, which should be quite useful for understanding transformation, will need to be transformed without selection if we hope to recover isolated segregants from individual transformants.

Friday, September 25, 2009

More simulated uptake

Thanks to Rosie eliminating the int function from her Perl model, I got to take a look at some more simulated uptake data. Last time, there were several issues, which now seem solved. This time to model uptake, she used the real genomic USS position weight matrix to stochastically select 500 bp fragments from the first 50 kb of the Haemophilus genome. I got 200 from the forward strand and 200 from the reverse complement strand, along with a set of random fragments. This is 4X spanning coverage of the 50 kb…

Below is the way the data looked in the UCSC genome browser, added as custom tracks (click on the figures to enlarge).
From the top, the tracks are:
(1) Chromosome position
(2) 400 random fragments (shades of brown).
(3) 400 selected fragments (shades of blue).
(4) Positions of “perfect core” USS motifs (5’AAGTGCGGT-3’) on either strand
(5) RefSeq gene annotations.

Here’s a bit of the 50 kb zoomed in:
That looks pretty good for such low coverage! (In our real experiment, we expect to get several hundred times more data.) It’s starting to look like a real model of how uptake might look! The random fragments look roughly randomly distributed, and the selected fragments clearly show a punctate distribution around the “perfect core” USS sites.

Here’s a histogram of the scores of the best site on a given fragment for the random and selected datasets:
Indeed, the distributions are quite distinct, though notably the distribution of random fragments looks bimodal. This may simply be a feature of the genome, since there are so many USS sites… Worth thinking about though.

There are other details obscured in the browser figures: the shading indicates the relative score of the best site on the fragment (on a log scale), and each fragment also has orientation shown as a small arrowhead within the box. I’ve also associated each fragment with its score. So in the browser, I can easily check things out more carefully:
In this zoom, it’s clear that there is an excellent site to the left (just under 18,000), a weaker site to its right (~18,400; fragments are overlapping with the left-most site; no perfect core), and a pair of sites on different strands with different scores to the right (~19,250). I can also retrieve the sequence associated with a given fragment to see if I can spot the USS site within it. And if I really zoom in, the DNA sequence is listed at the top.

It’s not really so easy to see what’s going on with all these overlapping fragments, so my next task will be to convert this data from fragment positions to spanning coverage per chromosomal position (though I’ll probably bin positions, perhaps every 100 bp to keep things reasonably small for now). I will take a stab at doing this properly (with a script) but may wimp out and do it in Excel. If I can then muscle this data into a WIG formatted file, then I’ll be able to plot the data in a way where “good” sites will look like peaks in coverage…
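Here is a sketch of what such a script could look like (in Python rather than Perl; it assumes 0-based, half-open fragment coordinates and a known chromosome length):

```python
def binned_coverage(fragments, chrom_len, bin_size=100):
    """
    fragments: (start, end) pairs, 0-based half-open chromosomal intervals.
    Returns mean spanning coverage per bin, computed with a difference
    array so it only takes one pass over the fragments.
    """
    diff = [0] * (chrom_len + 1)
    for start, end in fragments:
        diff[start] += 1
        diff[end] -= 1
    coverage, depth = [], 0
    for d in diff[:chrom_len]:
        depth += d
        coverage.append(depth)
    bins = []
    for i in range(0, chrom_len, bin_size):
        window = coverage[i:i + bin_size]
        bins.append(sum(window) / len(window))
    return bins

def to_wig(bins, chrom='chr', bin_size=100):
    """Fixed-step WIG lines (1-based start) ready for a custom track."""
    header = f'fixedStep chrom={chrom} start=1 step={bin_size} span={bin_size}'
    return '\n'.join([header] + [f'{v:.2f}' for v in bins])
```

With this, two overlapping 500 bp fragments show up as a coverage step from 1 to 2 in the bins where they overlap, which is exactly the peak-like view I’m after.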

Thursday, September 24, 2009

E-Z Strain Construction

As preliminary data for our genome-wide recombination analysis (outlined in this post from Rosie), we want to sequence the whole genome of a single transformed clone in the next couple of months. The idea is to transform our standard KW20 Rd strain with DNA from one of the other completely sequenced strains (probably 86-028NP, possibly PittGG), select a single transformed colony, and sequence its genome.

This will provide us with all sorts of useful preliminary results:
  1. Show that we can indeed handle the type (and amount) of data we’ll be obtaining.
  2. Estimate the total amount of donor DNA a single recipient recombines (and fixes) into its genome.
  3. Estimate the length of recombination tracts (gene conversions) / the strength of “linkage”.
  4. Estimate mosaicism of donor and recipient sequences (mismatch repair).
  5. Estimate the transformation rates for different classes of single-nucleotide differences (for example, the number of A->T transformation events observed versus the total A->T differences between the strains)

In particular, item (2) will be crucial for estimating the total amount of sequencing we would need to measure transformation rates per polymorphism across the genome. Simple transformation assays with DNA from the multi-antibiotic resistant MAP7 strain suggest that possibly 20-50kb of DNA may be replaced in a single transformant, but this type of analysis is restricted to only a few different sites in the genome and is very roughly calculated.

The analysis of a single transformed genome will still be preliminary with regards to (3)-(5), for which we will want genome sequences for several independent clones. In the future we are likely to barcode and pool independent transformants, since we expect that a single lane of Illumina sequencing will be overkill for a single Haemophilus genome of less than 2 Mb (250X sequence coverage).
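The 250X figure is simple dimensional analysis (the lane yield below is an assumed ballpark for single-end reads, not a quote from any sequencing center):

```python
# Ballpark behind the 250X figure. The lane yield is an assumption.
reads_per_lane = 10_000_000
read_length = 50           # bases
genome_size = 2_000_000    # H. influenzae is a bit under 2 Mb

fold_coverage = reads_per_lane * read_length / genome_size
print(fold_coverage)  # 250.0

# Pooling N barcoded transformants in one lane divides that coverage:
for n in (2, 5, 10):
    print(n, fold_coverage / n)
```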

Anyways, one issue with producing the material for this first sequencing experiment is that we need to make sure that the clone we select comes from a cell that was indeed competent and did indeed get transformed. Since only a fraction of cells in a competent culture are competent, we would be wasting a lot of time and money if we accidentally just re-sequenced our recipient genome.

In order for this to work, we need our donor strain to carry an antibiotic resistance marker. By selecting for recipients that become resistant, we can be sure the clone we select took up DNA that got recombined into the genome. (This may also create a bias for donor alleles near the selected site, due to “linkage”.)

To this end, I am doing the following:
  1. Made a couple strains (KW20, 86-028NP, and PittGG) resistant to novobiocin. I just did this. It worked like a charm thanks to the former postdoc having a well-organized lab notebook and a well-organized freezer box containing a tube with a NovR allele of gyrB already prepared for me. This was also my first time doing overnight transformations. I couldn’t believe how easy it was: Add a frozen aliquot of cells and some DNA to some sBHI media, let the cells grow overnight, and plate them the next day. There were plenty of resistant colonies this morning.
  2. Prepare DNA from the newly produced 86-028NP NovR and PittGG NovR strains. I’ll do this tomorrow from the overnight cultures I just inoculated.
  3. Transform KW20 with this DNA. I’ll use competent cells I already have tomorrow, after my DNA prep.
  4. Saturday, assuming I have NovR transformants, I’ll pick and grow up some transformed colonies overnight.
  5. Sunday, I can prepare this DNA, and that’ll be what we can send for sequencing!
So if all goes well, we should have our material in a few days! Then we wait. Then the real work begins…

(As a side note, 86-028NP indeed appears to already be resistant to another antibiotic, nalidixic acid. I will check to see if this resistance is transformable when I have the 86-028NP NovR DNA in hand.)

Wednesday, September 23, 2009

Fake Periplasmic Data

The UCSC Microbial Genomes Database was a nice find for me, since they host the Haemophilus influenzae KW20 genome. It has pretty much made me forget about my own plans to make a custom browser for the moment. That will change, though: once we have our own data, we’ll absolutely need some off-line way to browse our datasets, since they will be so large…

As a first fake experiment to explore how our periplasmic DNA pools might look, Rosie sent me two sets of 200 sequences. One set was 200 randomly chosen 100mers from the first 10 kb of the Haemophilus genome, and the other set were 200 sequences (100mers) stochastically selected for the presence of a USS using her Perl scripts. All I had to do was turn her data into a BED formatted file, which only took a few minutes. As usual, I made the BED file using Microsoft Office, rather than a more savvy command-line way, which would've probably used Grep or something.
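For the record, the conversion really is only a few lines of script. A sketch (the function name, track name, and example coordinates are all made up for illustration):

```python
def fragments_to_bed(fragments, chrom='chr1', track_name='uptake_fragments'):
    """
    fragments: (start, end, name, score) tuples with 0-based half-open
    coordinates and scores already scaled into BED's 0-1000 range.
    Returns the text of a custom-track BED file for the UCSC browser.
    """
    lines = [f'track name={track_name} useScore=1']
    for start, end, name, score in fragments:
        lines.append(f'{chrom}\t{start}\t{end}\t{name}\t{score}')
    return '\n'.join(lines) + '\n'
```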

Here’s what her data looks like plotted as a custom track (squished) in the UCSC genome browser:

It looks like it sort of worked! There’s a prominent peak containing nearly half the sequences in the selected pool, while the random fragments look just like they ought to.

One issue here is that we know there are two other perfect matches to the core USS motif in the first 10 kb, and these weren’t captured by the selection algorithm. It’s slightly unclear why that is, but might have something to do with the USS position-weight matrix that was used. (Actually, there are six USS in the interval, but we were only searching one strand this time...)

A beginning!

Thursday, September 17, 2009

The Last Straw

Yesterday, Rosie kindly ran her Perl script over the USS construct I designed. The final thing I was worried about was whether or not my design had any USS or USS-like sequences in it, other than the one it's supposed to have. I'd checked the construct for any core USS motifs (5'-AAGTGCGGT-3'), but since we think that the motif is more complex than this, it was important to make sure that there were no extra sequences that got high scores using the USS position-weight matrix.
Fortunately, the construct looks good, so I can go ahead and order the control oligos and have high expectations that they'll work...

Here's how every 32 base pair window over the 199mer looks when scored with the USS PWM:
There's a single prominent high-scoring site right where it should be, and all of the surrounding area scores near background. The USS in the construct has a score (~10^-8) more than 10 orders of magnitude better than the next best sites. There's a slight increase for windows immediately adjacent to the USS, presumably because the AT-tracts in the USS are still contained in those windows. The rest of the construct only has scores at background.
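For illustration, here is the window-scoring idea in Python, with a toy 3-position matrix standing in for the real 32-position USS PWM (all the probabilities below are invented):

```python
def score_windows(seq, pwm):
    """
    Score every window of `seq` against a position-weight matrix.
    pwm: one dict per motif position mapping base -> probability.
    A window's score is the product of per-position probabilities, so a
    consensus match scores orders of magnitude above background windows.
    """
    w = len(pwm)
    scores = []
    for i in range(len(seq) - w + 1):
        s = 1.0
        for j, base in enumerate(seq[i:i + w]):
            s *= pwm[j].get(base, 0.0)
        scores.append(s)
    return scores

# Toy 3-position matrix favoring "AAG" (the real USS PWM is 32 wide).
pwm = [{'A': 0.9, 'C': 0.03, 'G': 0.04, 'T': 0.03}] * 2 + \
      [{'A': 0.03, 'C': 0.03, 'G': 0.91, 'T': 0.03}]
scores = score_windows('TTAAGTT', pwm)
best = scores.index(max(scores))  # the window starting at the AAG
```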

Just to show that these other sites really do represent background levels of USS score, Rosie also ran a randomized version of the sequence:
Nothing better than 10^-18. Excellent.

Scale UP!

How much periplasmic DNA can I hope to get using my current protocols, and how much DNA will I need? As promised, here are some rough calculations regarding the oligo purchases that we want to make.

I used a molecular weight calculator available on-line to determine the size of the dsDNA I described in the last post.
I alternatively could’ve used Rosie’s Universal Constants (660 g / mol of base pair and 10^-18 g / single 1 kb DNA molecule) to make this calculation, but since I’m dealing with a known sequence, I might as well get an exact molecular weight. (I also made a minor mistake in the last post, and the molecule I describe is actually only 199 bp).

So, for our USS molecule, MW = 122,828.6 g / mol. And the oligo synthesis service we’re planning on using will be at the 1 micromole scale. That means if we took all of the two oligos, annealed, extended, and purified, we’d end up with 0.123 grams of input DNA pool! That’s really a very large amount.

My previous concerns about needing to do PCR to maintain the pool are unfounded. This scale should be sufficient for hundreds (or even thousands) of experiments...
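The dimensional analysis behind that 0.123 grams, spelled out:

```python
# Numbers from the post: a 1 micromole synthesis of the 199 bp construct.
mw = 122_828.6          # g/mol, from the on-line MW calculator
synthesis_scale = 1e-6  # mol

grams = mw * synthesis_scale
print(round(grams, 3))  # 0.123

micrograms = grams * 1e6  # ~123,000 ug of potential input pool
```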

What follows are my preliminary assumptions about yields from the periplasmic DNA prep. They are based on several different experiments, though I am erring on the side of conservative estimates, and they are guides for future experiments only. I will address the issues of scale-up at the end of this post; before that, I’ll just refer to the approximate total culture volume and amount of input DNA that I’d need to get a target amount of DNA, assuming all else works perfectly.

So, I’ve now done several experiments using a PCR fragment bearing the consensus USS, called USS-1. If I add 20 ng DNA / 1 ml competent cells, ~50% is taken up. That is, in rec-2 cells, my theoretical yield of periplasmic DNA is 10 ng. My actual yield is considerably lower; as evaluated by my radiolabeling experiments, I estimate I get ~25% of my theoretical maximum.

This means,

1 ml cells + 20 ng DNA → 2.5 ng recovered.
20 ml cells + 400 ng DNA → 50 ng.
40 ml cells + 800 ng DNA → 100 ng

But this is only for the consensus sequence. Our real experiments will be a mix of molecules, some of which will be efficiently taken up and others that won’t. For a cursory estimate, we might assume that ~50% of fragments will be “good” USS and the other half will be “bad”. This would further reduce the yield.

That means I am likely to need ~80 ml cultures and a starting input DNA amount of ~1600 ng, just to get a mere 100 ng of DNA back!

Most Illumina sequencing centers seem to want ~1 ug of DNA to make libraries, but a lot of ChIP-seq experiments seem to call for only ~100 ng. In our case, there will be no downstream library construction, so we can likely get away with small amounts of DNA, as long as it is quite pure and accurately quantified.

Regardless, this is going to take fairly large cultures, fairly large amounts of DNA, and a good scaled-up periplasmic prep.

BUT, one important thing to note is that our degenerate oligo preparation will be more than sufficient for a large number of experiments, even at this large scale. For the controls, I can merely buy minimum-scale synthesis long oligos at ~$200 a pop. Since I can safely PCR amplify these, I will be able to make a replenishable stock for use in scale-up experiments.

More on this in the future, but while I’m doing this, I might as well estimate what it will take to get a microgram of chromosomal DNA fragments out of competent cell periplasms.

My previous experiments with sonicated DNA gave pretty consistent DNA uptake measurements:

~50% of 200 ng 1-10kb DNA / 1 ml cells → 100 ng max. yield.
~10% of 200 ng 0.2-0.4kb DNA / 1 ml cells → 20 ng max. yield.

Given a 25% recovery rate from the periplasm, this means that for a microgram of DNA, I will need:

1-10kb DNA: 8 micrograms in a 40 ml culture
0.2-0.4kb DNA: 40 micrograms in a 200 ml culture (!)

This last is really asking a lot. That size of scale-up will require special thought…
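Here is the yield arithmetic above wrapped up as a little calculator (assuming yields scale linearly with culture volume, which real scale-up may well violate):

```python
def required_input(target_ng, uptake_frac, recovery_frac=0.25,
                   dna_per_ml_ng=200):
    """
    Input DNA (ng) and culture volume (ml) needed to recover `target_ng`
    of periplasmic DNA, given the uptake fraction and prep recovery.
    Assumes yields scale linearly with culture volume.
    """
    yield_per_ml = dna_per_ml_ng * uptake_frac * recovery_frac
    culture_ml = target_ng / yield_per_ml
    return culture_ml * dna_per_ml_ng, culture_ml

# 1-10 kb sonicated fragments, ~50% uptake:
print(required_input(1000, 0.50))  # (8000.0, 40.0): 8 ug in 40 ml
# 0.2-0.4 kb fragments, ~10% uptake:
print(required_input(1000, 0.10))  # (40000.0, 200.0): 40 ug in 200 ml
```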

Appendix on Scale-up Issues:
  1. Purity: I have not been adding RNase. I need to get all the RNA away, in order to accurately quantify the DNA. I am also concerned about salt. The CsCl in my DNA precipitates may not be getting washed out adequately by a single 80% ethanol wash.
  2. Cell concentration: It would help for technical reasons, if I could concentrate the cells quite a bit before doing the organic extractions. I have used a ratio of 1:1, cells : organic solvents. So a 1 ml competent cell prep (~a billion cells) gets mixed with 1 ml solvent. But I might be able to resuspend 10 ml of cells in 1 ml and then use 1 ml solvent. I just don’t know.
  3. DNA concentration: I want to make sure that I am saturating with DNA for my initial experiments, but I haven’t yet done a proper saturation curve to know what I should be using. This will decrease the total efficiency of DNA uptake, but my total yields will be higher, and I will be biasing things towards the best uptake sequences (which is a good place to start).
  4. DNase: I have not been treating cells with DNase prior to isolation. From what I can tell, this is not a problem, and the free DNA is washed away. But if I use very high DNA concentrations, I will probably want to use DNase, just to be sure I’m eliminating free DNA completely.
  5. Details, details: Scale-up is never quite as simple as just increasing the volume of everything. I will need to make sure that there are appropriate centrifuges, shakers, tubes, and everything else. Growth rates of cells and competence induction may be poor when going to larger volume cultures. I am also concerned about scaling up the organic extractions. It turns out that not all conicals are created equal; I’ve had disasters where the phenol has torn through the bottom of 50 ml conicals when doing large-scale organic extractions, depending on the brand of conical and rotor used. I’ll need to make sure in advance that things like this don’t happen before I mess up somebody else’s equipment!


Reverse engineering

Okay, so now that we’ve exposited all the brilliant experiments we’re planning to do while writing proposals, the actual reality of doing the experiments is starting to sink in. We’ve also managed to put down some fairly concrete goals for the next several months.

One of our experiments involves measuring the specificity of DNA uptake by naturally competent H. influenzae for fragments containing “the genomic USS motif”. The H. influenzae genome contains an abundant sequence motif, and fragments bearing it are taken up better than fragments that don’t. This “uptake signal sequence” was originally defined by its functional role in DNA uptake, but has since been characterized mostly by bioinformatics, with no direct uptake specificity data. The limited data from previous lab members suggests only an imperfect correspondence between the properties of the genomic motif and the specificity of DNA uptake.

The idea, then, is to feed competent cells small DNA fragments bearing a degenerate (highly mutated) version of the USS consensus sequence, recover those that are preferentially taken up, and sequence the resulting pool. USSs are ~32 bases, well within the reach of single-end Illumina reads, if they are positioned properly next to a sequencing primer.

I’ve previously discussed the expected properties of a degenerate USS pool. And though I think we need to consider this more, I will focus this post on the design of other parts of the construct that will allow us to circumvent subsequent sequencing library construction steps. Illumina sequencing uses specific sequences added to the ends of molecules to capture and sequence DNA of interest...

Properties needed for a USS-containing construct, where Illumina sequencing can be directly performed to sequence the USS:

(1) SIZE: ≥200 base pairs, dsDNA. 200 base fragments with USS are efficiently taken up by cells, and the size is sufficient for efficient cluster synthesis and sequencing using Illumina’s Genetic Analyzer.
(2) CAPTURE SEQUENCES: One end of a strand of each fragment needs to be able to anneal to one of the two “Flow Cell Primers” (FP) in the Illumina flow cell, while the other end of the same molecule needs to contain the reverse complement of the other FP.
(3) SEQUENCING PRIMER BINDING SITE: The reverse complement of Illumina’s sequencing primer needs to be immediately downstream of the reverse complement of the USS. (This could work the other way, but getting the “sense” USS directly from the sequencing reads seems optimal).
(4) TAG SEQUENCE: The first few (four) bases of each read should be in non-degenerate fixed sequence to facilitate the alignment of the degenerate USS reads.
(5) CONSTRUCTION: After consulting several oligo makers, we learned that we wouldn’t be able to get our degenerate constructs built into an oligo longer than 130 nt. This means that I will need to anneal two oligos together and extend with polymerase to generate a full-length construct.

The first trick was to actually find out what the normal Illumina adapter and primer sequences were. They were available on-line, and I think I’ve mostly reverse-engineered what the different bits do. And I think I have a reasonable design:

I’ll order two oligos, one 130 nt and the other 106 nt. (At the end of this post, I will list the exact sequences of each part and some notes.) They’ll have 36 bp of reverse complementarity at their 3’-ends, so that I can anneal them and extend to produce full-length construct.
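As a sanity check on the design, the two oligo lengths and their overlap should assemble to the intended full-length construct:

```python
# The assembly arithmetic for the two-oligo design.
oligo_1 = 130  # nt
oligo_2 = 106  # nt
overlap = 36   # bp of reverse complementarity at the 3' ends

full_length = oligo_1 + oligo_2 - overlap
print(full_length)  # 200
```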
To illustrate what all the different parts of the construct are for, here’s a color-coded version, for which I’ll schematically diagram the Illumina cluster synthesis and sequence priming.
The key features are that the flow cell primers (FP) are on opposite strands on opposite ends and the sequencing primer (SP) sits adjacent to the USS (with the 4-bp tag at the beginning). I am using plasmid sequence present in the lab’s other USS constructs for the Gaps (1 and 2).

To sequence the 200mer (either before or after recovery from competent cell periplasms), the DNA would be melted and annealed to an Illumina flow cell. Below are shown two different parts of a flow cell surface, where the two different strands of a single molecule might anneal.
DNA synthesis from FP1 or FP2 generates a covalently attached version of each strand.
The original molecule is melted off and washed out of the flow cell, and a special in situ PCR method generates clusters of single-strands covalently bound to the flow cell surface. In each cluster, the strands are oriented in both directions.
Sequencing then proceeds from SP binding sites. In this design, the SP binding site will then read the complement of the USS (with the first four fixed bases), so the actual sequence generated would be the USS contained in the construct.
There are several small details to go over to make sure that this design will work. Because the oligos are so expensive, and the degenerate oligo will be precious, I also plan to buy several non-degenerate oligos corresponding to perfect consensus, randomized, and mutant USSs. These will act as controls for the annealing/extension step that generates the uptake substrates and for measuring saturation curves to optimize the DNA uptake conditions. I will also be able to use PCR to regenerate the control constructs, while I should probably avoid amplifying the degenerate USS construct for fear of strongly biasing the representation of different sequences.

NEXT UP: Uh oh… What about yields? Dimensional analysis…


The different parts of the two oligos:

Notes on my reverse engineering:
  1. FP1 (25 nt): Composed of putative 20mer FP1 + first 5 bases of one adaptor (calling it A)
  2. SP1 (33 nt): Sequencing primer for single-end Illumina runs. Includes the 13 bases of the standard adaptor that normally form a 13 bp inverted repeat on either side of adapted DNA fragments.
  3. USS (36 nt): Includes 4-base tag (ATGC) upstream of a 32-base genomic Gibbs consensus sequence with a set level of degeneracy at each position.
  4. G1 (36 nt): Additional sequence from pGEM7f, corresponding to the portion of the spacer region where the two oligos are intended to anneal.
  5. G2 (46 nt): More sequence from pGEM7f, corresponding to the spacer region present on only one of the two oligos.
  6. FP2’ (23 nt): Composed of the complement to the 20mer FP2 + first 3 bases of the other adaptor (calling it B).
  7. Total length after annealing and extension is 200 bases, where the USS is located from position 63 (after the spacer) to position 94. In the flow cell, the use of SP1 as a sequencing primer should read the complement of the USS sequence, so the actual sequence obtained will correspond to the USS (with the first four bases always ATGC).
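For intuition about what the “set level of degeneracy at each position” in note 3 means for the pool of molecules, here is a toy simulation of a doped oligo synthesis. The 12% per-position rate matches the degeneracy used for the simulated uptake data discussed earlier; the short consensus here is a placeholder, not the real 32-base Gibbs consensus.

```python
import random

def doped_base(consensus_base, degeneracy, rng):
    """One synthesis cycle of a doped ('degenerate') oligo: keep the
    consensus base with probability 1 - degeneracy, otherwise draw one
    of the three other bases uniformly."""
    if rng.random() < degeneracy:
        return rng.choice([b for b in "ACGT" if b != consensus_base])
    return consensus_base

def doped_oligo(consensus, degeneracy=0.12, rng=random):
    """Simulate one molecule drawn from the degenerate oligo pool."""
    return "".join(doped_base(b, degeneracy, rng) for b in consensus)

rng = random.Random(1)
toy_consensus = "AAGTGCGGT"  # placeholder only, not the real consensus
print(doped_oligo(toy_consensus, 0.12, rng))
```

At 12% degeneracy over 32 positions, a typical molecule carries roughly four mismatches from consensus, which is why even “unselected” degenerate sequences score close to the consensus USS.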


Wednesday, September 9, 2009

Eating chromosomal DNA fragments

Haemophilus influenzae cells will take up closely related DNA from the environment quite efficiently when they are made naturally competent by resource limitation.

Previously, I had done some experiments using sonicated chromosomal DNA of two different size distributions. The take-away lesson was that, for a fixed DNA concentration, larger fragments were taken up better than smaller fragments. This could be due to two non-exclusive reasons:
  1. Larger fragments are more likely to contain an uptake signal sequence.
  2. The uptake machinery is saturated when I used the smaller fragments, since there are more fragments per unit mass.
I am not certain of the best way to measure the relative contributions of these two factors to the observed disparity in uptake, though I’m pretty sure a saturation curve would be the way to start things off: measuring the amount of uptake over a wide range of DNA concentrations.
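Back-of-the-envelope numbers make factor 2 concrete. Here is a sketch counting fragments per 200 ng of DNA, assuming mean fragment sizes of ~300 bp and ~5.5 kb for the two pools and ~650 g/mol per double-stranded base pair (the mean sizes are my rough midpoints, not measured values):

```python
AVOGADRO = 6.022e23
BP_MASS = 650.0  # approx. average molar mass of one ds base pair, g/mol

def fragments_in(mass_ng, mean_len_bp):
    """Number of DNA molecules in mass_ng of fragments of mean_len_bp."""
    return mass_ng * 1e-9 / (mean_len_bp * BP_MASS) * AVOGADRO

small = fragments_in(200, 300)    # 200-400 bp pool
large = fragments_in(200, 5500)   # 1-10 kb pool
print(f"small pool: {small:.1e} molecules")
print(f"large pool: {large:.1e} molecules")
print(f"ratio: {small / large:.0f}x more small fragments per 200 ng")
```

With ~0.5 billion competent cells, that works out to on the order of a thousand small fragments per cell versus under a hundred large ones, so saturation of the uptake machinery by the small-fragment pool is at least plausible.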

But first, a more practical concern for our sequencing plans...

I repeated this experiment, but also prepared total DNA and periplasmic DNA (by the slick method of Kahn et al.) to make sure that I could cleanly recover chromosomal DNA fragments trapped in the periplasm away from bulk chromosomes, as I previously showed for a small USS-containing PCR fragment.

Here are the results of that experiment (in which I provided ~0.5 billion competent cells with 200 ng of end-labeled DNA fragments of two different size distributions, either 1-10kb or 200-400 bp, for 30 minutes):
In (a), the % uptake is clearly better for the larger size distribution than for the smaller one. In (b) and (c), I show that I can purify periplasmic chromosomal fragments away from the cell’s chromosome: (b) shows the results using the larger fragments, while (c) shows the results using the smaller fragments. (I ran gels with two different agarose concentrations to optimize the separation for the two different input pools.)

One thing to note is that the size distributions of the input and the periplasmic preparation were effectively indistinguishable. I looked at traces of these lanes in the Molecular Dynamics ImageQuant software, and they looked pretty much exactly the same. This is a little confusing, given the two models discussed above and the fact that fewer small fragments were taken up compared with larger fragments. I might have expected a bias towards larger fragments in the periplasm compared to the input, but this was not the case.
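The “looked pretty much exactly the same” comparison could be quantified. One option (my suggestion, not something we have run) would be a two-sample Kolmogorov-Smirnov statistic on fragment-size samples derived from the input and periplasmic lane traces; a stdlib-only sketch:

```python
import bisect

def ks_statistic(a, b):
    """Two-sample KS statistic: the maximum vertical distance between
    the empirical CDFs of two samples (e.g. fragment sizes resampled
    from input vs. periplasmic lane traces)."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for v in a + b:
        ca = bisect.bisect_right(a, v) / len(a)
        cb = bisect.bisect_right(b, v) / len(b)
        d = max(d, abs(ca - cb))
    return d

# Identical distributions give 0, completely disjoint ones give 1:
print(ks_statistic([250, 300, 350], [250, 300, 350]))    # -> 0.0
print(ks_statistic([250, 300, 350], [5000, 7500, 9000]))  # -> 1.0
```

A small statistic on the real traces would back up the eyeball judgment that the periplasmic pool is not enriched for larger fragments.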

Another thing to note is that, unlike when I previously did this experiment with USS-containing PCR fragments, there is still evidence of periplasmic DNA in wild type after 30 minutes. I don’t think this is due to poor washing of free DNA away from the cells, but rather reflects that there had been insufficient time to translocate all of the DNA in the periplasm into the cytosol. There is also the possibility that some of the non-chromosomal DNA in the wild-type samples is indeed cytosolic, which I can’t tell without some way to distinguish ssDNA from dsDNA.