Saturday, December 11, 2010

Sequence submission, ho!


Hi blog! Long time, no update…

On Thursday, I submitted 91 genomic DNA samples to our local genome centre (center for you American readers). They’ll be sheared up, converted into indexed (barcoded) libraries, and sequenced on an Illumina GA2 over 4 “lanes”. Whew! That should have been easier.

I won’t get the data until probably around February, so with that out-of-the-way for now, I’d best get my act together. But before that, here’s what this first round of sequencing for our Genome BC grant is about:

We want to get a basic genetic picture of what happens when we do our standard transformation experiments; we lack simple rules of genetic transmission by natural transformation. How many DNA fragments do competent cells add to their chromosomes by homologous recombination? How long are the segments? How does genetic divergence affect transformation?


Haemophilus influenzae KW20 cells can be made naturally competent by starvation of log-phase cells in M-IV media. A fraction of the cells in the culture becomes competent, able to take up added DNA fragments into their cytosol (in H. influenzae, uptake is biased towards fragments containing USS, motifs that are abundant in H. influenzae chromosomes). Taken-up DNA fragments can recombine with the cell’s chromosome, if sufficient sequence identity between the fragment and part of the chromosome allows it.

We added genomic DNA from an antibiotic-resistant derivative of a divergent clinical isolate NP (86-028NP NovR NalR) to competent KW20 cultures in three separate experiments. In each case, we isolated antibiotic-resistant and -sensitive clones, grew these up in overnight cultures, saved frozen stocks, and extracted genomic DNA from the rest of the cultures for sequencing.

In these experiments, we are not getting a particularly “natural” view of natural transformation, since the donor DNA we’re using is purified chromosomal DNA from a single divergent isolate. But since we know effectively nothing about the actual DNA in the natural environment of naturally competent H. influenzae—its source, size distribution, or associated muck—this seems like a good place to start. We know the genome sequence of the donor and can use the genetic differences between donor and recipient to identify transforming DNA segments.

Based on our preliminary work, we think that these transformed clones will contain long(ish) segments from NP, replacing segments of the KW20 chromosome. We’re using multiplexed Illumina GA2 paired-end sequencing to identify recombinant segments in a large set of clones from these three experiments. We won’t get perfect genome sequences out the other end; instead we expect a median read depth of ~60 per clone, enough to probably capture the vast majority of diagnostic polymorphisms in each clone. Then we’ll look at the distribution of recombinant segments across the independent transformants. That’ll be the Science part.

Here’s what the 91 samples were:

Selected set: 72
Because we estimate that only about 10% of the cells in each culture were actually competent, we selected for either novobiocin- or nalidixic-acid-resistant clones (3x12 of each) to ensure that each clone was derived from a competent cell. To allow selection for transformed cells, the NP donor DNA was purified from a clone made NovR and NalR by PCR-mediated transformation. This means that for a given selected clone, a recombinant segment is always predicted to span the appropriate locus. At the two selected loci, we will be able to ask about “LD decay” away from the selected positions. We also expect a large number of independent recombinant segments in these clones; these will provide basic information about the distribution, size, and breakpoints of recombination tracts in natural transformants.

Unselected set: 8 pools of 2
While we estimate percent-competence by evaluating the “congression” (unexpectedly high co-transformation) of “unlinked” markers (not on the same DNA fragment), this calculation relies on an assumption about competent cultures, namely that cells come in only two flavors: either non-competent, unable to take up DNA, or competent, able to take up several long fragments of DNA. It could also be that “congression” arises from a more quantitative distribution of states, where cells are more or less competent; i.e. able to take up and recombine a variable number of fragments.
For the most part, the use of “congression” to evaluate the overall competence of a culture is probably valid, but before generalizing from the above large set of selected transformants, we’d like to know how well this assumption holds. Because we expect that most unselected clones will look just like the recipient KW20 chromosome, we pooled 16 unselected clones into pairs and included these pools as 8 of our submitted samples. Since we predict that only 10% of cells in the cultures were competent, our null hypothesis is that 1-3 clones will look like selected clones (i.e. a handful of independent recombination tracts) and the rest will look just like KW20. We may, however, find that many more clones show evidence of transformation, but with only one, or only very short, recombination tracts.
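To put a number on that null expectation: with 16 unselected clones and an assumed 10% competence (the congression-based estimate, not a measured value), the binomial math works out like this. A toy sketch in Python; my real analysis scripts are in R:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes among n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n_clones = 16    # unselected clones, pooled into 8 pairs
p_comp   = 0.10  # assumed fraction of competent cells (congression estimate)

expected = n_clones * p_comp                       # ~1.6 transformed clones
p_none   = binom_pmf(0, n_clones, p_comp)          # chance none are transformed
p_1to3   = sum(binom_pmf(k, n_clones, p_comp) for k in (1, 2, 3))
# p_none is about 0.19 and p_1to3 about 0.75, so 1-3 transformed clones
# is indeed the likely outcome under the simple two-state model.
```

So we’d expect ~1.6 transformed clones among the 16, see 1-3 of them about three-quarters of the time, and see none at all only ~19% of the time.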

Selected late-log set: 8
Similarly, we also included 8 NovR clones selected from a late-log transformation. Cultures of cells in late-log are considerably less competent than MIV cultures (~1-2 orders of magnitude), and Rosie showed me some old data in her notebook suggesting that this could roughly be accounted for by percent-competence. The suggestion is that the low transformation frequency of late-log cultures vs MIV cultures arises only because fewer cells are competent, rather than because late-log competent cells are less transformable. Sequencing a handful of transformants from a late-log culture will help sort this out, much as sequencing unselected clones from MIV cultures will.

Controls: 3
We included both the donor and recipient genomes as samples; while we’ve already obtained ridiculous coverage of these samples, including them here acts as an internal “coverage control”. We also threw in some MAP7 DNA, which is a KW20 derivative carrying a bunch of antibiotic resistances. While it’s going to look almost exactly like KW20, we use it on a practically daily basis, so it’d be nice to have its sequence.

So that’s the set of 91 clones we’re having sequenced with our first round of Genome BC cash. I’m keeping my fingers crossed that the data is good and abundant and reveals new things…

Saturday, August 21, 2010

It's more delicious because it's more nutritious

I think I got somewhere with my degenerate sequencing data... possibly even discovered something unexpected. The first thing I had to do was more or less start over my analysis, and temporarily forget everything I’ve read about “information”, “bits”, “logos”, etc. and just go back to thinking about the experiment from the get-go...

When it comes to DNA, H. influenzae are picky eaters. They seem to find some DNA fragments tastier than others. When I came to the lab, Rosie provided me with a DNA sequence: AAAGTGCGGTTAATTTTTACAGTATTTTTGG. She told me that DNA fragments containing this sequence (the “consensus USS”) are especially delicious to cells, and she wanted to know why.

Well, before I can aspire to helping with that, it'd be helpful to know which parts (or combinations of parts) of this special sequence actually make it delicious. Are all positions equally important? Do different positions make independent contributions to deliciousness, or do they work together to make an especially tasty morsel? Perhaps such insight into H.flu’s palate will get me closer to the bigger project of understanding why H.flu is so picky.

I did the following experiment last month to this end:

(1) I provided a massive smorgasbord of DNA to a whole bunch of cells... hundreds of culinary choices per cell to a billion cells. This tremendous buffet also had ridiculous variety. Every DNA snack was based on Rosie’s extra-tasty sequence, but there were differences in every bite (due to incorporating 24% degeneracy at each base during DNA synthesis). Indeed, there were hundreds of billions of different taste sensations in the input DNA pool.

(2) I harvested the undigested meals from the cells (facilitated by a mutation that kept the food from getting digested). The cells probably had terrible heartburn at that point, but each on average managed to stuff down dozens of DNA molecules.

(3) I sent two DNA samples to my friend to get sequenced: a bit of the original DNA buffet, and the DNA that I purified from the cells’ guts. By comparing the relative abundance of different DNA molecules, we can learn which positions are important to making the sequence tasty and, hopefully, how the different positions work together to impart deliciousness.

(4) Now I have two giant lists of >10 million sequences each from my friend. So far, I’m working with the first 100,000 of each dataset so I can figure out how to process the data. For the below analysis, this is more than sufficient to make accurate frequency measures.

First, I asked how the uptake sample (the H.flu guts) compares to the input sample, treating each position in the sequence independently. To keep things extra simple, I only considered whether a given position on a particular sequence read is matched or mismatched from the original sequence.

So, for each position, I counted how many of the 100,000 reads in each set were mismatched. To calculate the relative enrichment of the mismatched sequence in the uptake sample compared to the input sample, I simply took the ratio of uptake:input, then took the log of that ratio. This normalization makes it so that positive numbers indicate enrichment between uptake and input, while negative numbers indicate depletion. Here’s what I get:
Lovely. It looks like some parts of the original sequence are more important than other parts. For example, sequences that didn’t have a C at position 7 were depleted more than 64-fold in the uptake sample compared with the input sample. This must be an especially important flavor. On the other hand, some of the other positions look like they contribute less to tastiness, and a few positions don’t seem to contribute at all.
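For anyone (i.e. future me) who wants the recipe: the per-position counting and log-ratio normalization amounts to just a few lines. Here’s a toy sketch in Python (my actual scripts are in R, and the real inputs are the 100,000-read sets; with 24% degeneracy, every position has plenty of mismatches in both pools, so no count is ever zero):

```python
import math

def mismatch_counts(reads, consensus):
    """Per position, how many reads differ from the consensus base."""
    counts = [0] * len(consensus)
    for read in reads:
        for i, (r, c) in enumerate(zip(read, consensus)):
            if r != c:
                counts[i] += 1
    return counts

def log2_enrichment(uptake_reads, input_reads, consensus):
    """log2(mismatch frequency in uptake / mismatch frequency in input).
    Positive = mismatches enriched after uptake; negative = depleted."""
    up = mismatch_counts(uptake_reads, consensus)
    inp = mismatch_counts(input_reads, consensus)
    return [math.log2((u / len(uptake_reads)) / (n / len(input_reads)))
            for u, n in zip(up, inp)]

# Tiny worked example with a made-up 3-base "consensus":
enrich = log2_enrichment(
    uptake_reads=["TAA", "ATA", "AAT", "TAA"],
    input_reads=["AAA", "TAA", "ATA", "AAT"],
    consensus="AAA")
# mismatches at position 1 doubled in frequency; positions 2-3 unchanged
```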

Great! We could now go and compare this to some other information we have, but for now, I’m going to ignore all of that and move on to a pairwise mismatch analysis. This will tell us how pairwise combinations of mismatches affect tastiness.

First, here’s that first barplot again, with the bars re-colored by the amount of depletion. Mismatches at red positions make the DNA taste bad, while mismatches at blue positions don’t.
That’s the legend for this next image, which shows the log2(uptake/input) for each pair of positions. To make the dataset, I counted how many of the 100,000 reads in each set had mismatches at a particular pair of positions. As above, I did not condition on the total number of mismatches in the read, just that at least those two positions were mismatched. The diagonal shows the consensus sequence and the values corresponding to the above barplot. The plot is symmetric about this diagonal.
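The pairwise counting works the same way as the single-position version, just over pairs of positions. A toy Python sketch (again, my real analysis is in R; here, pairs never seen in the uptake sample are left as NaN rather than taking the log of zero):

```python
import math
from itertools import combinations

def pairwise_log2_enrichment(uptake_reads, input_reads, consensus):
    """Symmetric matrix of log2(uptake/input) frequencies of reads
    mismatched at BOTH positions of each pair (no conditioning on the
    total number of mismatches in the read)."""
    L = len(consensus)

    def pair_counts(reads):
        counts = {}
        for read in reads:
            mm = [i for i in range(L) if read[i] != consensus[i]]
            for pair in combinations(mm, 2):
                counts[pair] = counts.get(pair, 0) + 1
        return counts

    up, inp = pair_counts(uptake_reads), pair_counts(input_reads)
    matrix = [[float("nan")] * L for _ in range(L)]
    for (i, j), n_in in inp.items():
        n_up = up.get((i, j), 0)
        if n_up:
            val = math.log2((n_up / len(uptake_reads)) /
                            (n_in / len(input_reads)))
            matrix[i][j] = matrix[j][i] = val  # symmetric about the diagonal
    return matrix
```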
The big cross of red is like what we saw in the 1D barplot above... Those same important positions are still important. The cross tells us that, for example, not having a C at position 7 means you taste terrible, independent of where else you have a mismatch.

BUT, Hot damn! I think there’s a discovery here beyond that delectable flavor packet of GCG at positions 6-8 (and possibly a few flanking bases).

Note the 3x3 grid of regularly-spaced “mismatch interactions”. The spacing between these 3 interacting parts of the sequence is just about a helical turn of DNA (10-12 base pairs).

I think this picture has some implications for our molecular models...

brain whirring...

(Notes: After getting the number-crunching steps nice and tidy, I spent an inordinate amount of time making these plots. I started using heatmap(), graduated to heatmap.2(), but ultimately took it to the next level and went fully customized with the image() function. It also took a ridiculously long time to figure out how to get fine control over the color scaling. This was necessary to be able to make the legend. I also made a more normal-looking legend (below), which was quite tricky to do right... If I wasn't so distracted by helical turns at the moment, I might explain how I made these in more detail.)

Tuesday, August 17, 2010

Direct uptake specificity measurements

So I got a bit distracted from our plans with the Genome BC dough, since I got some more sequencing data! Woo! I'll return to the design of our 92 clones worth of sequencing later. For now, I am trying to work through some more grant writing, incorporating our findings from the new sequence data.

Below, I'll (try to) quickly summarize what I've found with my (very) preliminary data analysis and then try to describe the thing I'm having a hard time writing clearly about...

First off: It worked! My scheming panned out quite nicely. I got ~15 million single-end sequence reads from each of two DNA pools. One was from the input 24% degenerate USS-containing fragment. The other was from the periplasm-purified DNA fragments. The experiment and its outcome were described here and here.

Filtering the data:

The first thing I had to do was somehow parse the raw data file. For my first pass, I used a quick and dirty, but probably quite stringent, quality filter:
  1. I excluded any read where the first four and last six bases were not exactly what they should be. (These sequenced positions were supposed to be non-degenerate, so should have zero mismatches from the expected input sequence.) This excludes reads with known "bad" bases and forces the alignment to be correct. The unfiltered data contained a bunch of reads that clearly had insertions or deletions within the USS, probably due to the limitations of the original oligo synthesis.
  2. I excluded all reads that contained any Ns. This makes the analysis a bit simpler, especially when calculating the number of mismatches in the degenerate USS from consensus.
This left me with ~10 million reads in each dataset. That's a lot of loss, and I certainly will try and return to those sequence reads, but for now, leaving them out is a decent filter that will simplify the analysis (mainly by forcing a "true" alignment).


I piped a few greps together to pull all the sequence reads matching my criteria (the first four bases ATGC, the last six bases GGTCGA, and no Ns). From a UNIX terminal:
grep ^ATGC sequence.txt | grep GGTCGA$ | grep -v N > newfile.txt
That's it! (Someday, I should probably do something with all the quality scores, but for now, I'm ignoring them. This operation also lost all the read IDs.)

I soon found, when analyzing these new files, that my poor computer-fu was making for very slow-running scripts, so I took the first 100,000 reads from each filtered dataset onwards with a simple:
head -n 100000 newfile.txt
Making a Position Weight matrix:

I next calculated the frequency of each base at each position for the two datasets (generating two position weight matrices, or PWMs). Luckily, I have previously worked out how to use the input PWM to background-correct the periplasm-purified PWM here and here.
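The gist of that background correction (details are in those linked posts) is to score each base at each position by how enriched it is in the periplasm-purified pool relative to the input pool. A toy sketch in Python with made-up frequencies; this isn't necessarily the exact correction I used, but it captures the idea:

```python
import math

# Made-up PWMs for illustration: rows = positions, columns = A, C, G, T.
input_pwm  = [[0.76, 0.08, 0.08, 0.08],
              [0.08, 0.76, 0.08, 0.08]]
uptake_pwm = [[0.90, 0.04, 0.03, 0.03],
              [0.02, 0.95, 0.015, 0.015]]

# Score each base at each position by its enrichment over the input pool:
corrected = [[math.log2(u / i) for u, i in zip(u_row, i_row)]
             for u_row, i_row in zip(uptake_pwm, input_pwm)]
# e.g. corrected[1][1] = log2(0.95/0.76), about 0.32 bits of enrichment for C
```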

Details: On to some R...

I read in the data with scan. I've usually used things like read.table or read.csv, but scan looks like the way to go with a big giant vector:
sequences <- scan("newfile.txt", what="character" )
Next, I calculated the per position base frequencies:
basesread <- 42
newmatrix <- matrix(nrow = basesread, ncol = 4)

# Fixing the factor levels guards against positions where a base never appears
for (i in 1:basesread) {
  newmatrix[i, ] <- table(factor(substr(sequences, i, i),
                                 levels = c("A", "C", "G", "T")))
}

Hmmm... Would've been nice to avoid that loop...
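(For the record, one way to avoid the loop, sketched here in Python since that's where I know the trick; the toy `sequences` list stands in for the ~100,000 filtered 42-base reads:)

```python
from collections import Counter

# Toy stand-in for the filtered reads (42 bases each in the real data).
sequences = ["ATGCA", "ATGCC", "ATTCA", "CTGCA"]

# zip(*sequences) transposes the reads into per-position columns, so one
# Counter per column tallies the bases without looping over positions by hand.
counts = [Counter(column) for column in zip(*sequences)]
# counts[0] is the base tally at position 1, counts[1] at position 2, etc.
```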

From the two matrices, I got a background-corrected periplasmic uptake specificity motif that looks like this:
This excludes the fixed bases and one other anomalous base at the beginning of the read.
(Sorry I can't use Weblogo or other things that make nicer logos, as none accommodate position-specific base composition.)

Hey, look at that C at position 7! It just sticks way out! That must be a particularly important residue for uptake. A previous postdoc and a graduate student had both measured uptake efficiency for fragments with point mutations in a consensus USS. That data offers a great independent validation of this basic result. Here's what that data looks like:
It looks like the two approaches are giving comparable results. Fabulous!

The next step:

So what else can we learn from this much larger deep sequencing dataset? I'm sure all sorts of things. For now, the only other thing I've done is calculate the number of mismatches between each of the 100,000 filtered sequence reads and the consensus USS the degenerate construct was based on.

Here's what the distribution of mismatch number looks like for the two datasets:

Awesome! Shifted, just as previously simulated!


At first I tried a bunch of extremely laborious and slow ways of doing this. Then I found a package called "cba" with a function called "sdists", which can do edit distances with vectors of character strings. (The sdists step is how I murdered the computers when I tried using the full dataset.)

consensus <- "ATGCCAAAGTGCGGTTAATTTTTACAGTATTTTTGGGTTCGA"
distances <- as.vector(sdists(consensus, sequences, weight = c(1, 0, 1)))

I guess I could just do it in chunks for the whole dataset to save on RAM...
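Since the grep filter already forced a "true" alignment (no indels survive it), the edit distance reduces to a plain per-position mismatch count, which chunks trivially. A rough sketch of the chunking idea in Python (names and chunk size are arbitrary):

```python
def hamming(a, b):
    """Mismatch count between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def distances_in_chunks(consensus, reads, chunk_size=100_000):
    """Score reads against the consensus one chunk at a time to cap RAM use."""
    distances = []
    for start in range(0, len(reads), chunk_size):
        chunk = reads[start:start + chunk_size]
        distances.extend(hamming(consensus, read) for read in chunk)
    return distances
```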

Anyways, this vector of distances is going to serve as an important index to the sequence reads for the next set of analyses (which will look at co-variation between positions and how specific mismatches affect the motif).


Okay, now to try and describe the thing I'm having trouble saying in a succinct way. These information content measurements, as in the above motif, are in bits. This is "information content", but can also be thought of as "surprise". This is why the USS motif derived from the genome has so much higher information content than the new experimentally determined uptake motif. (Here's the genomic USS:)
Because most of the bases at a given position in the degenerate USS pool were already the preferred consensus base, little "surprise" (as measured in bits) was elicited when the preferred base was enriched in the periplasm-purified pool. That's why the scale is so different between the two motifs.
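A toy calculation makes that scale difference concrete. With made-up (but representative) frequencies, the same strong preference for one base carries ~1.76 bits against a uniform genomic background but only ~0.25 bits against a 76%-consensus degenerate background:

```python
import math

def position_bits(foreground, background):
    """Relative entropy (bits) of one motif position vs. its background."""
    return sum(f * math.log2(f / b)
               for f, b in zip(foreground, background) if f > 0)

# The same strong preference (97% one base) scored two ways (made-up numbers):
genomic    = position_bits([0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25])
degenerate = position_bits([0.97, 0.01, 0.01, 0.01], [0.76, 0.08, 0.08, 0.08])
# genomic is about 1.76 bits; degenerate about 0.25 bits:
# same preference, much less "surprise"
```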

Anyways, the issue arises when thinking about some other possible experiments. For example, if uptake specificity were partially modified (as we think we can accomplish using engineered strains), the positions with a new specificity would become extremely informative relative to the others when using the normal degenerate fragments. But if a new construct were designed based on the altered specificity, these positions would no longer have such elevated information content relative to the other bases...

This isn't actually a bad thing--in fact we can exploit it--it's just something hard to describe clearly...

Wednesday, August 11, 2010

How to spend some dough?

We have some money to spend (kindly provided by Genome BC) for identifying recombination tracts in individual transformants, where donor genomic DNA from the clinical isolate NP is used to transform Rd competent cells (NP-into-Rd transformations). This will help us answer some very basic questions about natural transformation: What are the numbers and sizes of recombination tracts in individual transformants? Are different parts of the genome equally transformable? Do mismatch repair and other degradative mechanisms limit transformation?

We already have a few clones worth of data (which is now mostly analyzed), but we need more clone sequences to do any worthwhile statistics. Actually, I've made nicer pictures and done some re-analysis for the poster I took to the Evolution meeting. I should post some of that sometime...

We’d originally had a cost-saving plan of sequencing overlapping pools of clones to obtain data from 64 clones. We even came up with a slicker plan (since sequence yields keep increasing), in which we'd get 128 clones worth of data, and 64 of these would be "de-convoluted":
But now we’ve discovered that our local genome center provides inexpensive indexing of DNA samples (i.e. barcoding). This means we can obtain 92 clones worth of data for only ~$15K. Wow!
So now, I need to collect the clones for sequencing and extract their DNA. The question is: which clones? I will be sequencing 4 sets of 23 clones (plus a control), so might think of the data collection in that manner. (This number of indexed clones per pool will be sufficient to obtain ~50-100 fold median coverage per clone, given current estimated yields.) There are two things to consider: (a) technical concerns, and (b) using some clones to look at different kinds of transformations.
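That coverage estimate is just arithmetic. Here's the back-of-envelope version (the lane yield is a rough assumption on my part, not a quoted spec):

```python
# All numbers here are rough assumptions, not quoted specs.
lane_yield_bp   = 3.0e9     # assumed usable bases per GA2 lane
clones_per_lane = 23        # one indexed set of clones shares a lane
genome_size_bp  = 1.83e6    # H. influenzae Rd KW20 genome, ~1.83 Mb

coverage_per_clone = lane_yield_bp / (clones_per_lane * genome_size_bp)
print(f"~{coverage_per_clone:.0f}x mean coverage per clone")
```

At these numbers that's ~71x mean coverage per clone, which lands inside the estimated 50-100 fold range.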

This post will cover the basic technical concerns, and I’ll use the next post to discuss some alternative things we might do with some of this sequencing capacity.

To deal with potential criticism later, I should collect the NP-into-Rd transformants from three separate transformation experiments using three separate competent cell preps. This will allow me to make statements about how reproducible the distribution of recombination tracts is between transformations. Because each replicate will only have a couple dozen clones, this will likely not be sufficient for detecting subtle differences between the independent replicates, but it will be sufficient for determining the overall consistency of different transformations.

Selection for transformants:
Since only a fraction of cells in a competent culture appear to be competent (as measured by co-transformation of unlinked marker loci), I will again select for either the NovR or NalR alleles of gyrB and gyrA (respectively) that I’ve already added to the donor NP strain. This selection step ensures that every sequenced clone has at least one recombination tract, thereby selecting against clones derived from non-competent cells.

With indexing, it shouldn’t matter how I organize the clones submitted for sequencing, but this data can also be analyzed without paying attention to the index (for which I have my reasons), so it would make sense to organize a plate like this:

We’d originally decided 64 additional clones (atop the 4 we’ve got) would be sufficient for a basic picture. Now we can do 92 for much less money. Should we then do 92 NP->Rd selected transformants instead of 64, or should we use some of this new sequencing space to investigate new things?

Doing more clones would give us greater statistical power, but the more we do, the more the returns will diminish. Since we estimate that ~60-70 clones will be sufficient for a nice first look, maybe we don’t gain much by doing 92 of the “same” thing and would be better served by using some of the extra space for other endeavours.

For example, if we reserved one set (23 clones) for one or more other DNA samples, we’d still get 69 clones worth of recombination tracts. What could we do with this extra space (keeping in mind that these should be transformations that can be done immediately)?

The most important factor to consider with each of these is whether the sample size (23 clones) would be sufficient to obtain useful data, despite having given up some statistical power for the primary set of 69 clones. (In my mind, this means what we get should be publishable, alone or with the primary dataset, but it could also mean several small pilot experiments for “preliminary data”.)

There are a whole lot of things we might do with this space… (stay tuned)

Tuesday, August 10, 2010

The anticipation is palpable...

We’re resubmitting a grant soon, in part of which we propose to measure the specificity of DNA uptake in H. influenzae for the “USS motif”, a ~28 bp sequence motif that is sufficient to mobilize efficient DNA uptake. One of the things we want to know is just how the motif lends itself to efficient uptake by competent cells. In previous work, the lab has shown that point mutations at positions in the motif with very high consensus sometimes affect uptake efficiency, but other times do not.

Just before leaving for vacation, I managed to submit exciting samples to my friend for sequencing. These DNA samples are being sequenced AS WE SPEAK!!!!!!! RIGHT NOW!!!

The two samples were: (1) a complex mix of 200 bp fragments containing a degenerate USS with a 24% chance of a mismatch at each base of a consensus USS 32mer; and (2) an enriched fraction of these fragments taken up by competent cells into the periplasm. My previous post on my pilot experiment describes the experiment in more detail with additional links to other posts on the degenerate USS.

In brief, I am trying to compare a complex input pool of DNA fragments with what is actually taken up by cells to make a new uptake motif, based on the uptake process itself, rather than inferring it from genome sequence analysis.

This is the uptake saturation experiment I did with the consensus USS (USS-C) and the degenerate USS (USS-24D):
As I've seen before, USS-24D is not taken up as well as USS-C, since there are many suboptimal sequences that are inefficiently taken up. At very high DNA concentrations, similar amounts of USS-C and USS-24D are taken up. This is presumably because there are enough optimal fragments to saturate the uptake machinery at very high USS-24D concentrations.

I purified the periplasmic DNA from three of these USS-24D uptake samples: SUB (10 ng/ml), MID (70 ng/ml), and SAT (508 ng/ml). Because the yields were quite low, I used PCR to produce more material from these periplasmic DNA preps. It took a little effort to optimize this PCR. Perhaps I will get back to the weird artifacts I discovered at some later point, but suffice it to say, I ensured that I had a clean PCR and did only a few rounds of amplification, so that I wouldn't alter the complexity of the library too much.

I then took this periplasmic DNA (which should be enriched for optimal sequences) and re-incubated it with fresh competent cells. If the experiment worked (so the DNA samples are worth sequencing), then the periplasmic USS-24D prep should be taken up better than the original input. Indeed this was the case:
Note, this experiment did not take the saturation curve as far out as the original, due to limits on the number of samples I could process in a single experiment and limits on available material before leaving for vacation.

Unfortunately, I didn't see an appreciable difference between the three different periplasmic preps (SUB, MID, and SAT), which I'd hoped to see. I had originally thought that periplasmic DNA recovered from unsaturated uptake experiments would include more suboptimal sequences, while saturated experiments would yield only optimal sequences. I hoped that this contrast would allow me to investigate competition between different sequences. Oh well.

So I decided to send only two samples for sequencing: the input and MID. These are the ones BEING SEQUENCED RIGHT NOW!!! WOO!!!

It's still possible the different purifications would behave differently at the higher end of the saturation curve (i.e. perhaps SAT would saturate sooner than SUB). So it's probably worth running some more saturation curves to see if it's worth sequencing the other samples. It'd be really nice to have an actual experimental condition vary, to get the most accurate depiction of uptake specificity.

SO, what do I need to do next?

(1) Prepare for the coming data deluge: I think I know how to start the data analysis once I've processed the raw read data into a comprehensible table of ~2e7 rows x 32 columns (though probably in a computationally slow fashion... meh). What I'm less prepared for is the initial processing of the data. I designed the experiment to sequence 4 non-degenerate bases upstream and 6 non-degenerate bases downstream of the USS, so I will probably do a simple, crude quality filter, where I demand that the first 4 and last 6 bases are exactly correct. This will tend to eliminate poor reads and force the alignment of the 32 degenerate bases in the middle to already be correct. This filter will exclude reads where erroneous oligo synthesis or fragment construction introduced indels into the USS, which will simplify things initially. At this point, I will also need to apply a base quality filter to ensure I ignore base calls that have low confidence. Sequencing error is a problem for this analysis, so it's important that I use stringent filters. Even if I only ended up with 10% of the raw reads, I'd have an enormous amount of data for sequence motif analysis.

(2) Prepare for abject failure. It's possible I screwed up the design of the constructs or that there is some unanticipated challenge sequencing through those 32 bases or using this approach. I'll know in a few days, but need to think about what the next step would be, if indeed I misplaced some bases in my reverse engineering scheme.

(3) Prepare for a seeming failure that isn't really. I may have screwed nothing up but still get back data from Illumina's pipeline saying something's terribly wrong. I don't fully understand the details, but the base-calling step in Illumina's pipeline (which reads the raw image files from each sequencing cycle) may go screwball because of the extremely skewed base composition my constructs will have during each cycle. E.g. the first base for every single cluster should be "A", whereas 76% of clusters should have "A" at the fifth base. Apparently, this can create some weird artifacts in the data processing step, which I need to be prepared for (and not despair when I find out "it didn't work"). Mostly this will involve working with my friend and his sequencing facility to re-run the base-calling with an alternate pipeline.

(4) Work on updating the grant for re-submission. There are several paragraphs of our grant application that need to be modified to show our progress. To some extent, this will involve the soon-to-come data, but I can begin by identifying the parts that will need changes and including some different figures.

(5) Work out the molecular biology. I've done a bunch of uptake experiments in a bunch of settings, and I should now have enough to put together some kind of little story, even without the sequence data. With a mind towards a paper, I need to work out just what I need to do. Do I need to do another sequencing experiment with a different construct (more or less degenerate) or under different conditions? Or is a single periplasmic enrichment enough? If the latter, I will certainly want to show a bunch of other experimental data, but which experiments? And which need to be repeated? My first step in this direction is to go through my notebook and figure out what I have...

(6) Work out a large-scale periplasmic prep. I circumvented this for the degenerate USS experiment by doing PCR. Since I know my exact construct, I could use PCR to amplify it and didn't need a good yield, nor a particularly pure prep (since genomic DNA won't amplify using the construct's primers). However, if I want to look at uptake across a chromosome, I will need to fractionate periplasmic DNA away from chromosomal DNA both to a high level of purity and with a high yield. I've accomplished each of these individually, but so far have not managed a large-scale periplasmic prep that leaves me with enough pure DNA to make sequencing libraries reliably. I refuse to use whole-genome amplification for this experiment.

Hi blog!

So I’ve returned to work after a 3.5 week vacation to Bali! Woo! Time to catch up on some long non-blogging. But before that...

Here’s where my crazy Uncle Jimmy took me and the missus for a few days… Gili Gede (meaning “Little Island Big”) off the coast of Lombok (which appears to be experiencing a gold rush, so go now before it’s all built up!).
The ferry between Bali and Lombok took us across the legendary "Wallace Line", though my total ignorance of systematics means I couldn't really tell the difference in flora and fauna between the islands. (Though we did see a HUGE monitor lizard near a brick-making operation.)

I stayed out at the end of this pier on the lower left at the “Secret Island Resort”. (Note the vessel that took us snorkeling to the immediate South... the "Scorpio".)
And here's an actual shot of the infamous "Rocky Docky" with the lovely Heather in the foreground:
The reef was right over the edge of the pier, so snorkeling to see all the corals and fishies was optional. You could just look over the edge! Excellent!

Anyways, I've been getting my head back into things and am writing a few blog posts about what’s going on in my world-of-science. So expect a deluge of catch-up posts during the next day or two…

Tuesday, June 22, 2010

Manuscript plans?

So I’m a bit over one year into my postdoc. What have I got to show for myself? Well, plenty of work, but not any papers, or even written manuscripts, so that’s a bit of a problem. Can I turn my first set of genome sequencing data into a manuscript?

Seems likely. I collected >5 Gigabases of Illumina sequence data from several Haemophilus influenzae chromosomes, and this could be used as the basis of a manuscript. I obtained data from a donor strain (86-028NP NovR NalR) and a recipient strain (Rd, RR722) as control data (in order to evaluate the ability of the sequencing and read alignment to correctly identify polymorphisms). I also obtained data from two individual transformants and a pool of four transformants to identify donor alleles in transformed recipient chromosomes. I even found some things out.

Does this a paper make? One outstanding issue is that, in spite of being a lot of data that took a fair amount of work to get a handle on, it doesn't contain a tremendous amount of biologically relevant information. Yes, I obtained extremely accurate and comprehensive data for the four transformants sequenced. But it was still only four transformants. There are some biologically meaningful results; they just aren’t terribly novel or statistically robust. The bigger biologically meaningful results will have to wait until we can collect more data.

So to turn this into something publishable, the approach and method need to be important enough (and made explicit enough) to be of value to others. So far, I have not done anything in my analysis that is truly novel, but I have managed to produce the bare-bones of a “pipeline” for measuring allele frequencies from pools, and identifying recombination tracts in transformants. The data we got was also extremely high coverage, so we were able to see the limits of the technology fairly well: i.e. depth-of-coverage variation, errors, and issues with read alignment.

Though everything I’ve done so far uses “off-the-shelf” bioinformatics tools, there are so many people trying to do similar things, it might be useful to write a paper that is sort of an “application” of the technology and tools I’ve been using. It took me months to piece everything together, so maybe I could save someone else some time by having everything in one place. But with each passing day, the value of such a paper is probably diminishing, so I’d best get started!

There are still a few analyses I’d like to do that would give the paper a little more spice:
  1. Structural variant analysis: This is something that will involve our collaborators at UVA, who are experts. We can see these pretty well (at least the larger ones), but something systematic has yet to be done.
  2. Reciprocal read mapping: I’ve mapped all the data to both the donor and recipient genomes, but I have not really fully leveraged this fact. The read alignment artifacts that arise mapping data from one strain onto the other could be handled much better if I were able to assign individual reads to either of the two reference genomes, based on the mapping quality. I’d really like to do this. It’d be novel, mainly because most people sequencing are doing SNP discovery. I already have all my SNPs discovered, so doing an extra good job at calling SNP frequencies using reciprocal alignment would be at least something new. This will take a bit of work, however, and I’ll need to figure out the best computational way to do it. Aside from doing uber-detailed error analysis for a technical paper, I think this is really the best chance to make a novel contribution bioinformatically.
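To make the idea concrete, here's a minimal sketch of the kind of per-read assignment I have in mind, in Python. Everything here is hypothetical scaffolding (the function name, the threshold, and the toy MAPQ values); in practice the mapping qualities would come from the two BAM files.

```python
# Sketch: pick a "home" reference for each read by comparing its mapping
# quality (MAPQ) against the donor and recipient genomes. The threshold
# and toy values below are hypothetical, not from the real data.

def assign_read(mapq_donor, mapq_recipient, min_diff=10):
    """Assign a read to the reference where it maps substantially better;
    otherwise call it ambiguous."""
    if mapq_donor - mapq_recipient >= min_diff:
        return "donor"
    if mapq_recipient - mapq_donor >= min_diff:
        return "recipient"
    return "ambiguous"

# Toy reads: (name, MAPQ vs donor, MAPQ vs recipient)
reads = [("r1", 60, 12), ("r2", 8, 55), ("r3", 40, 37)]
assignments = {name: assign_read(d, r) for name, d, r in reads}
print(assignments)  # {'r1': 'donor', 'r2': 'recipient', 'r3': 'ambiguous'}
```

The threshold would need tuning against the control data, and the "ambiguous" bin is exactly where the two genomes are locally identical or where repeats confuse the aligner.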

Here is a rough outline of the manuscript, as I’m viewing it so far:

1. Introduction
  • Many bacteria become naturally competent. Natural transformation is important in evolution.
  • Previous sequencing studies have focused on only a handful of defined constructs. (Bacillus, Helicobacter, Actinobacillus).
  • For any organism, the total extent of recombined fragments in individual transformants has never been directly evaluated, and the factors dictating the chance of transformation are only poorly understood as a result.
  • Haemophilus influenzae is a model system for natural transformation. The mechanism is well-defined. Transformation is efficient in the lab strain.
  • The extensive natural genetic variation between H. influenzae strains provides tens of thousands of markers to identify recombination tracts in individual transformants. Not only single-nucleotide variation, but structural variation in the form of indels and other rearrangements. The “supragenome” hypothesis.
  • We investigated the use of massively parallel sequencing (or “next generation sequencing”, NGS) to characterize natural transformation at a whole-genome scale.
  • Our results show the Illumina platform to be an excellent method to obtain nearly exhaustive information on recombination tracts in individual transformants. Our approach uses the alignment of sequence reads to both donor and recipient reference sequences. We obtained donor and recipient genome sequence as controls for evaluating sequencing error, depth of coverage, and polymorphism identification. We also obtained the sequence data from two individual transformants and a pool of four transformants.
  • Individual recombination tracts are longer than previously appreciated and can bring hundreds of polymorphisms from donor to recipient chromosomes (single-nucleotide variants, insertions, deletions, and insertional deletions). However, recombination tracts often appear interrupted by or terminated at sites of structural variation between the two genomes. This suggests that such variants are barriers to strand exchange and/or are preferred mismatch repair substrates.
2. Materials and Methods
  • Strains
  • DNA
  • Transformations
  • Library preparation
  • Illumina sequencing and initial data processing pipeline
  • Reference genome alignment by MUMmer, MAUVE
  • Reciprocal read alignment by BWA
  • SAMtools pileup
  • Galaxy pileup parser
  • Variant frequency analysis
  • Assignment of reads (unimplemented)
  • Donor segment calling
  • Analysis of structural variation by HYDRA
3. Results
  • Genetic transformation of competent cells: Marker to marker variation. Dependence on sequence identity. Congression and linkage
  • Illumina sequencing and read alignment: Table of sequencing results and fraction of mapped reads. Variation in depth-of-coverage. Sources of sequencing error and read mapping artifacts. Reciprocal read alignment? Varying alignment stringency?
  • Comparison of donor and recipient strains. Identification of SNPs and structural variants between the donor and recipient strains. Comparison to whole-genome alignment methods.
  • Identification of donor alleles in transformed recipient chromosomes. Accounting for SV alleles. Identifying novel alleles.
  • Identification of allele frequencies in a pool of four transformants.
  • Identification of donor segments and putative recombination tracts
  • Enrichment of SVs at donor segment breakpoints
4. Discussion
  • A first look at transformation… still few transformants
  • Excellent method. Limitations are circumvented by very high coverage, knowledge of both donor and recipient genome sequences, and the use of reciprocal read alignment (unimplemented)
  • Big recombination tracts. Evidence of mismatch repair. SVs as blocks to recombination tract progression.
  • Speculations: Hotspots? Role of uptake specificity? Supragenome transfer?
  • Future: Aside from collecting more transformants, making a transformation frequency map to investigate the “cis-acting” factors controlling the efficiency of transformation. Long-term utility in understanding the population genetics of human pathogens.
Still a rough outline, but something to start with....

Friday, June 11, 2010

Degenerate Uptake: Pilot Study

Something to blog about! (Wow; it's been over a month... sorry to my three loyal blog readers.)

I’ve gotten around to doing a pilot-scale experiment on the specificity of H. influenzae DNA uptake for the “uptake signal sequence” (USS). The USS is a ~29 base pair motif highly abundant in the H. influenzae genome, and sites that match the consensus USS are known to be preferred substrates for DNA uptake by competent cells. The presence of many USS in the chromosome is presumed to be why H. influenzae competent cells prefer H. influenzae DNA over DNA from other organisms.

However, little is known about how the structure of the USS contributes to uptake of USS-containing fragments: Limited analyses of mutations of a DNA fragment containing a consensus USS suggest that some but not all informative positions in the USS motif are important to uptake, indicating that other forces (perhaps later steps in transformation) contribute to the structure of the USS motif.

To carefully dissect uptake specificity for the USS motif, we have devised an enrichment experiment:
(1) A complex pool of DNA fragments containing a degenerate USS library is incubated with competent cells.
(2) The fragments preferentially taken up by cells are purified from the periplasm.
(3) DNA sequencing is used to compare the input and periplasm-purified pools of sequences.

Details and Pilot-scale Results:

I’ve previously discussed the design of the input DNA pools. The control 200 bp construct is designed to already contain the sequences needed for Illumina single-end sequencing, along with a 32 bp consensus USS site near the middle of the fragment. The test construct is the same, except the USS is degenerate, having a 24% chance of a non-consensus base at each position. Thus in the degenerate-USS pool, the average site has ~7-8 mismatches from the consensus sequence.
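That ~7-8 figure is just binomial arithmetic; here's a quick sketch (assuming all 32 degenerate positions behave independently):

```python
# The ~7-8 mismatch expectation, worked out explicitly. Assumes 32
# degenerate positions, each independently non-consensus with
# probability 0.24.
from math import comb

DEGENERACY = 0.24  # chance of a non-consensus base at each position
LENGTH = 32        # degenerate positions in the construct

mean_mismatches = DEGENERACY * LENGTH  # 0.24 * 32 = 7.68

def binom_pmf(k, n=LENGTH, p=DEGENERACY):
    """Probability that a given molecule carries exactly k mismatches."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(round(mean_mismatches, 2))                 # 7.68
print(f"perfect consensus: {binom_pmf(0):.1e}")  # only ~1 in 6500 molecules
```

A side benefit of the degeneracy level: essentially every molecule in the pool is a distinct mutant, while perfect-consensus molecules are rare enough not to dominate the uptake signal.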

The expectation is that, while the consensus-USS construct (USS-C) will be taken up by cells well, the degenerate-USS construct (USS-D) will be taken up more poorly, since it contains many suboptimal sequences (i.e. it is less uniformly delicious). Indeed this is the case, with USS-C being taken up about 10 times better than USS-D at sub-saturating DNA concentrations (see below). The notion is that comparing the USS-D input to that taken up by cells will provide a precise measurement of uptake specificity for the USS (i.e. which sequences are tastiest). We think this will tell us a lot about the mechanism of uptake.

It occurred to me a couple weeks ago that before moving on to the data collection (i.e. the DNA sequencing), I should first make sure that the USS-D fragments recovered from the periplasmic purification are taken up better than the original USS-D input (i.e. the competent cells selected more delicious sequences). This would provide the clearest indication that the experiment worked and the material is worth sequencing. It is!

I compared the uptake of USS-C and USS-D before and after periplasmic purification of taken up DNA from rec-2 competent cells across a range of DNA concentrations. Here are the results:
A and B show the % DNA uptake for USS-C and USS-D, respectively, for different amounts of added DNA (to 200 ml competent cultures). C and D show the same data: C is a dose-response curve, and D is a double-reciprocal plot (since I used 2 ng of hot label, along with an additional amount of cold label for these experiments).

Input USS-C and periplasm-purified USS-C were quite similar, while periplasm-purified USS-D was taken up substantially better than input USS-D.

Notably, at low (sub-saturating) concentrations of DNA, periplasm-purified USS-D is taken up less well than USS-C, while at high (saturating) concentrations similar amounts of DNA are taken up. Also of note is that the input USS-D does not saturate until higher concentrations than the other three samples.

This is all good news. I left out a fair number of details, but this pilot-scale experiment is extremely encouraging. Next week, I plan to repeat the experiment, but this time on an appropriate scale for recovering samples for sequencing. I will also investigate how periplasm-purified USS-D samples behave when recovered from uptake experiments with varying amounts of DNA. I expect that at sub-saturating concentration, the cells will be less “picky”, such that periplasm-purified USS-D will be taken up less well than that purified from saturating concentration. This would provide a useful experimental condition, as in the sequence analysis we would be able to investigate the role of competition in shaping USS specificity.

I think this might end up working swimmingly... Onward!

Tuesday, April 13, 2010

Mining some old array data

So in an effort to re-examine some of the lab’s old array data, I made a fairly simple R script to plot the change in expression of competence genes, putative purR-regulated genes, and genes involved in utilizing secondary sugars. We no longer have our expensive license for fancy-pants software, but all I needed to do was some arithmetic on columns, and then find the rows of interest, so it’s R-tastic!

I looked at a time course dataset, in which expression was monitored over the course of growth in sBHI and after transfer to MIV. I also looked at a single one-off array comparing purR- to purR+ strains growing in late-log +cAMP.

Here’s the results for the time course. All values are normalized to the first time point. Blue are sBHI timepoints, and red are MIV timepoints. MIV cultures were split from the sBHI cultures at t=0 minutes.

It’s pretty clear that the competence genes are strongly induced in MIV, but are also induced in late-log phase, as expected. Putative PurR-regulated genes are strongly and quickly induced in MIV, indicating that purine pools are quickly depleted, and the purine biosynthetic pathway is activated quite quickly (much faster than the competence genes, it appears). The “non-PTS” genes (several genes induced by CRP when cAMP levels are high) appear to be briefly weakly induced in MIV, as well as being weakly induced in late-log.

Here’s the same sets of genes plotted as the ratio of expression in purR- vs purR+ cultures (late-log, induced with cAMP). Here, I plot the ratios from both array elements for each gene (open and closed circles) and colored them just so they’d be easy to see. Also note, I normalized everything to the median ratio to account for dye effects (under the assumption that the median gene is not PurR regulated). Again, strong induction of the putative purine-regulated genes, a weak repression of the competence genes (presumably due to purine repression), and not much happening with the non-PTS sugars.
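The median normalization itself is trivial; here's a toy Python version (made-up ratios, not the real array values), just to make the assumption explicit:

```python
# Toy version of the median normalization: divide every ratio by the
# median ratio, assuming the median gene is not PurR-regulated (so the
# dye effect cancels out). All values below are made up.

def median_normalize(ratios):
    s = sorted(ratios)
    n = len(s)
    med = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    return [r / med for r in ratios]

raw = [2.0, 2.2, 1.8, 8.0, 2.1]  # 2-fold dye bias; the 8.0 is real signal
print([round(x, 2) for x in median_normalize(raw)])
# [0.95, 1.05, 0.86, 3.81, 1.0]
```

After normalization the uninteresting genes sit near 1 and the genuinely induced gene keeps its fold-change, which is all this analysis needs.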

Conclusion: Nothing we didn’t already suspect, but it’s good to see that things behaved as expected. One point of note is that the hypothesized regulation of rec2 by PurR isn’t something that jumps out of this, but if purine repression acts upstream of rec2, we wouldn’t be able to see the effects of deleting PurR here anyways…

Friday, April 2, 2010

SNP densities

So I’ve been writing yet another grant, which has been distracting me from blogging (this isn't supposed to be a monthly blog, but this will hopefully be the last grant application for a while).

But I’ve also been doing several analyses lately. Here’s one. I took the sequences of an ~300 kb restriction fragment from three H. influenzae isolates (Rd, 86-028NP, and PittGG). They’re all similarly divergent from each other (~2.5%), and I wondered how well the level of divergence of Rd vs NP and Rd vs GG correlated along the chromosome...

So I aligned the sequences in Mauve, took its SNP calling output, and did a couple simple sliding window analyses inside R (using the zoo package for rolling means). Here’s what divergence looked like averaged over 5 kb windows (click to enlarge):
The divergence between Rd and each of the two other isolates is quite well correlated along the fragment (r2= 0.8, using linear modeling). But since NP and GG are similarly divergent, I made two other plots.
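I did the windowing in R with zoo's rolling means; a bare-bones Python equivalent looks like this (non-overlapping 5 kb windows for simplicity, and toy SNP coordinates rather than the real data):

```python
# Bare-bones window analysis: count SNPs in 5 kb windows and express
# the count as percent divergence. Non-overlapping windows by default;
# a smaller `step` gives the rolling-mean-style version.

def window_divergence(snp_positions, seq_len, window=5000, step=5000):
    """Percent divergence in windows of `window` bp along a sequence."""
    out = []
    for start in range(0, seq_len - window + 1, step):
        n = sum(start <= p < start + window for p in snp_positions)
        out.append(100.0 * n / window)
    return out

# One SNP every 40 bp is 2.5% divergence, matching the genome-wide level:
toy_snps = list(range(0, 15000, 40))
print(window_divergence(toy_snps, 15000))  # [2.5, 2.5, 2.5]
```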

First, here’s a comparison of the density of SNPs that are shared by NP and GG and those that are unique to either NP or GG:
The correlation is a lot worse (r2=0.4).

And if I further break the “unshared” line into NP- and GG-specific SNPs (i.e. positions that differ between Rd and NP but not GG, and vice versa):
The correlation is worse still (r2=0.2).

Similar results applied to smaller windows, but the plots looked a lot messier. Note that it’s not entirely straightforward to measure SNP density... What does one do at indels? I just ignored them, so the results above are rough. Part of the reason I focused on only a co-linear segment of chromosome was to minimize this problem, but there are still several indels between each of the three strains.

Indels aside, what’s this mean? One of the goals of my transformation frequency mapping is to be able to distinguish the effects of sequence divergence on transformation from the effects of other local chromosomal properties (base composition, sequence motifs, etc.). Since NP and GG have correlated SNP densities relative to Rd, transformation frequencies across the Rd chromosome are expected to also be correlated. Discrepancies in transformation frequency by NP and GG donors could indicate that SNPs specific to the isolates are somehow modulating transformation independent of divergence per se.

Distinguishing chromosome “position effects” from sequence divergence will probably require a third donor DNA. Deciding what this would be requires some thought. All of the sequenced H. influenzae strains are similarly divergent from Rd (and for the most part from each other), and phylogeny poorly distinguishes separate clades (i.e. they kind of give a star phylogeny).

So I should use either a strain much more closely related to Rd or one more distantly related (perhaps another species). Using a closely related strain has the advantage that transformation frequencies are expected to be higher and divergence will play less of a role, making the focus more on divergence-independent factors, but I would also have far fewer markers.

Based on MLST comparisons, several strains are sisters of Rd (RM7033, RM7429, RM7271). These assignments hold up in several phylogenetic analyses and put the three at ~0.5% divergence from Rd. So I would expect that RM7033 (for example) would have ~6000 SNPs relative to Rd (far more than separate our Rd from the other sequenced Rd), ample to have markers across the chromosome...

Tuesday, March 16, 2010

Chromosome "Position Effect"

I got some data for the transformation frequency at five different markers. They varied. This isn’t anything ground-breaking; reports of a "position effect" for transformation go back decades and a couple of recent studies in other organisms bear it out. The underlying causes of variation in transformation rate at different positions likely stem from two sources: the physical structure of the chromosome and the sequence composition of the recombination substrates. The former case is reasonably well-worked out for analogous processes in eukaryotes: For example, in yeast, heterochromatic regions are recalcitrant to recombination, but these sites become recombinogenic in mutants with defective heterochromatin assembly. Sequence composition has also been shown to affect the efficiency of recombinational strand exchange in several different contexts, both genetic and biochemical. This latter type of variation is not traditionally considered a "position effect", but is difficult to distinguish from the former.

Anyways, I wanted preliminary data showing that I can, in fact, detect differences in transformation at different genomic positions, since a big part of my proposed work will involve measuring to very high resolution this position effect...

I used MAP7 donor DNA to transform three independent Rd competent cell preps. MAP7 is highly similar to Rd, except that it carries several point mutations that confer antibiotic resistance. There are likely other unselected differences between Rd and MAP7, but few. Thus, differences between marker transformation rates are likely to predominantly reflect chromosome position effects, rather than sequence divergence between donor and recipient.

This latter point isn’t strictly true: in order to see transformation, a genetic change has to be made, and the selected MAP7 point mutations are genetic differences. But because in our preliminary sequencing data, we saw long stretches of donor-specific DNA with dozens to hundreds of SNPs, I don’t think these single-nucleotide differences are contributing too hugely to the observed variation in transformation rate.

Here’s the data for the five markers individually. Vertical bars indicate the mean transformation frequency per viable cell to the indicated antibiotic resistance allele. The inset circle shows a rough map of the location of the MAP7 markers. (Sorry about the lack of an origin.)
Indeed, I see a ~5-fold range of transformation frequencies, from ~1/500 to ~1/100. Since this is only an arbitrary sampling of five sites, the range of variation across the chromosome could be much higher.

As previously discussed, these values underestimate the transformation frequency per competent cell. Competent cultures typically have both competent and non-competent cells, and the “fraction competence” is typically measured by looking at co-transformation frequencies. These are often higher than expected, even for unlinked markers, a phenomenon termed “congression” and interpreted as evidence of a binary distinction between competent and non-competent cells in the culture.

The technical value of this is that I can elevate the observed transformation frequency at one locus by selecting for transformation at another unlinked locus, since this eliminates all non-competent cells from the culture, providing potentially higher sensitivity on our proposed sequencing experiments. It also dampens differences in culture-to-culture variation caused by big differences in fraction competence (not shown).

However, there is at least one old report using Bacillus that suggests congression does not simply reflect a binary distinction between competent and non-competent cells. If cells only came in those two flavors, we would predict that any pair of unlinked markers would show the same level of congression, yet they report that different pairs of markers had different congression frequencies (beyond what linkage explains). They go on to suggest an interesting model for their observations, but my concern is more technical:

Does selecting for transformation at different loci affect the transformation rate at a second unlinked locus?

If the answer is yes, then selection for transformants at a locus would be a poor way to elevate the transformation rate at other unlinked loci, since it would be biased in an unknown way. I also measured co-transformation of Nal resistance and each of the other four. Nal is “unlinked” from all the others (i.e. DNA fragments from standard DNA preps will always be too short to contain the NalR allele with another antibiotic resistance allele), so I can measure “congression” four times.

Here is the data. So, for example, the first bar was calculated as: f(kanR nalR) / f(kanR). This normalizes each bar to the nalR rate (i.e. “the frequency of nalR among kanR transformants”).
The first thing to note is that the scale has changed relative to the transformation/cfu plot. For each of the 3 cultures, there was a ~3.5-fold increase in the observed transformation rate, which would be expected if ~1/3 of cells in each culture were competent.
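For anyone wanting to check the arithmetic, here's the congression logic with made-up numbers (the frequencies below are hypothetical, not my measured values):

```python
# Congression arithmetic with hypothetical frequencies. If a fraction c
# of cells is competent and markers A and B are unlinked, then
# f(AB) = f(A) * f(B) / c, so c = f(A) * f(B) / f(AB), and selecting for
# A elevates the observed B frequency by 1/c.

def fraction_competent(f_a, f_b, f_ab):
    return f_a * f_b / f_ab

f_kan, f_nal = 0.004, 0.006   # made-up per-viable-cell frequencies
c = 0.29                      # suppose ~29% of cells are competent
f_both = f_kan * f_nal / c    # expected co-transformation frequency

print(round(fraction_competent(f_kan, f_nal, f_both), 2))  # 0.29
print(round((f_both / f_kan) / f_nal, 2))  # 3.45-fold elevation (1/c)
```

So a ~3.5-fold elevation of nalR among transformants at another locus corresponds to a fraction competence of ~0.29, or roughly 1/3 of the culture.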

The second thing to note is that selecting for any of the four markers had no effect on the NalR transformation frequency. So the answer to the above question is no. Phew! The Bacillus result was cool, but I’m glad it isn’t the case here. A binary competent/non-competent model is perfectly reasonable in our system (though this does not exclude the possibility of variation among competent cells). With this in hand, I can now plot the co-transformation data with respect to NalR. If selecting for NalR only eliminates non-competent cells but does not change the underlying transformation frequencies per competent cell at the other unlinked markers, then life is good.

Here’s the data. So for example, the first bar was calculated as f(kanR nalR) / f(nalR). This normalizes each bar to its own rate (i.e. “the frequency of kanR among nalR transformants”). For the nalR/competent cells, I used the average of all 12 points in the previous plot.
This data closely resembles that of the first figure, except all the values are ~3.5-fold higher.

Woo! Next I should probably collect congression data for linked markers, and repeat the experiments with more divergent donor DNA.

Sunday, March 14, 2010

Repression of competence induction by purines

My illustrious colleagues have been re-examining some of the lab’s old data regarding the repression of competence by purines. The work has been slowly ongoing for years, and it may be close to being a complete story. I want to try and express what I think their model is and what seem to be its predictions, so they can tell me whether my understanding is straight…

First, a schematic depiction of how I interpret what we already know about the induction of the competence regulon:
What we knew: Transferring cells growing in rich medium to competence medium induces 15 operons driven from the novel CRP-S promoter, and then cells become naturally transformable. Cells in competence medium have elevated cyclic AMP levels, directing the CRP protein to induce expression of genes with canonical CRP-N promoters, including the sxy gene. Sxy protein alters the binding specificity of CRP to also bind at CRP-S promoters, thereby inducing competence gene expression.

But Sxy levels are also regulated at translation, in addition to at transcription. The wild-type sxy mRNA transcript contains a stem-loop structure that inhibits its translation. Mutations that disrupt the stem-loop structure in the 5’-UTR are hypercompetent (e.g. the sxy-1 mutation). In wild-type cells, unknown factor(s) disrupt the stem-loop to induce the translation of sxy transcript in competence medium.

Now a schematic depiction of how I interpret the model for purine repression of competence:
The observation: Addition of purines to competence medium represses competence. Purine biosynthesis is repressed by the PurR protein when cellular pools of purines are high. Deletion of the purR gene reduces competence (presumably indirectly, by increasing cellular pools of purine), but mutations disrupting the sxy 5’UTR’s stem-loop suppress the purR mutant defect (Rosie’s last post).

A hypothesis: The sxy transcript stem-loop is stabilized in the presence of purines (either directly or indirectly), blocking the production of Sxy protein and thus the activation of the competence regulon. When purine pools are depleted, the stem-loop is disrupted. This predicts that addition of purines and purR mutations will inhibit sxy translation more than sxy transcription.

A corollary hypothesis: Purines block DNA translocation by PurR-dependent repression of the rec-2 gene, whose promoter contains a putative PurR binding site. A potential test of this hypothesis would be to treat sxy-1 competent cultures with purines. We would predict that if PurR directly represses rec-2, DNA translocation would be inhibited (but DNA uptake would not). Obviously, checking rec-2 transcription relative to other competence genes would make sense here as well, but the functional test would be most compelling.

Is that the basic notion? I know there’s a bunch of other experiments that have been done that I need to find out about…

Thursday, March 11, 2010

What have I got?

Okay, grant planning part 2... Below is a dense description of the preliminary data I have/will have for writing this next grant...


Transformation frequency depends on chromosome position:
DNA from a multiply marked derivative of Rd (MAP7) was briefly incubated with competent Rd cultures. The resulting transformation frequency at each of four loci was evaluated by selecting for cells that acquired the corresponding MAP7-specific antibiotic resistance allele. MAP7 DNA transformed each Rd locus at a different frequency. (Repeat experiment in progress… stay tuned but looks good.)

Sequence divergence decreases transformation frequency:
DNA from an antibiotic-resistant derivative of NP (1350NN) transformed Rd competent cells less efficiently than did DNA from MAP7, and vice versa. NP differs from Rd by ~2.4% per alignable base position (and an additional 10% of each genome is absent from the other, contained in indel polymorphisms) while the transformation frequencies at two loci were affected ~2 to 4-fold. (Data in hand.)

Co-transformation frequencies are non-random due to congression and linkage:
(Wish I didn’t have to describe this, but it’s too fundamental. Data mostly in hand.)

Transformants acquire hundreds of donor-specific alleles:
Several large DNA fragments recombined into the chromosomes of four individual Rd competent cells, as revealed by genome sequencing. Each of the four transformants was selected for resistance to one of two antibiotics encoded in the 1350NN strain (two NalR and two NovR), and the corresponding donor-specific allele was present in each of the four. In all, 24 donor segments (contiguous stretches of donor-specific alleles) were found across the 4 transformants, with an average of 1.4% of each recipient chromosome replaced with donor DNA (~25 kb and ~600 SNPs each). Mismatch repair is likely responsible for the disruption of contiguous stretches of donor-specific alleles in the transformants; assuming that closely adjoined segments arose this way, a total of 10 (instead of 24) independent transformation events occurred across the four transformants (6 of which were unselected; notably, two of these overlapped in independent transformants). (Data in hand; re-analysis in progress.)
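The segment-calling step reduces to grouping runs of donor-specific alleles along the chromosome; here's a stripped-down sketch of the logic (positions, allele calls, and the merge distance are illustrative, not the real calls):

```python
# Stripped-down donor segment caller: scan sorted marker positions and
# extend a segment while successive donor ('D') alleles fall within
# `merge_within` bp; recipient ('R') alleles inside that span (putative
# mismatch repair) are simply skipped. All values are illustrative.

def call_segments(markers, merge_within=2000):
    """markers: sorted (position, allele) pairs, allele 'D' or 'R'."""
    segments = []
    for pos, allele in markers:
        if allele != "D":
            continue
        if segments and pos - segments[-1][1] <= merge_within:
            segments[-1][1] = pos        # extend the current segment
        else:
            segments.append([pos, pos])  # start a new segment
    return [tuple(s) for s in segments]

toy = [(100, "R"), (500, "D"), (900, "D"), (1500, "R"), (2200, "D"),
       (9000, "D"), (9400, "R")]
print(call_segments(toy))  # [(500, 2200), (9000, 9000)]
```

The choice of merge distance is where the 24-segments-versus-10-events question lives: a larger value joins adjoined segments into single transformation events.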

DNA uptake signal sequences (USS) are densely distributed in the two genomes:
Both the Rd and NP chromosomes contain USSs nearly every kilobase and most are syntenic. (Cursory data only. Need a better analysis.)

Sequence preferences in DNA uptake can be captured by periplasmic DNA purification:
DNA fragments containing uptake signal sequences are efficiently taken up into cells, and taken up fragments can be cleanly purified away from both free DNA and chromosomal DNA. The use of rec-2 and rec-1 mutations will facilitate separating sequence biases at different stages of natural transformation. (Data in hand, except rec-1.)

  • Four/five marker transformation rates
  • Rd vs NP transformation rates
  • SNP spacing histogram with embedded SV table
  • Genome sequencing figure (pool data)
  • USS analysis
  • Molecular biology figure (uptake data)


Tuesday, March 9, 2010

Another day, another attempt to get a dollar

Sigh... another grant due soon; this time, it's my last attempt to get an NIH postdoctoral fellowship. My last reviews mainly took issue with my proposal, which they found to be overly ambitious and somewhat unfocused. So below is my first attempt at a summary/specific aims page, followed by a couple of preliminary data collection things I'd like to do before it's due (on April 8)...

The introduction:
Naturally competent bacteria take up intact DNA from their surroundings and can incorporate it into their chromosomes by homologous recombination. Akin to sexual recombination in eukaryotes, this natural transformation pathway moves alleles and genes between otherwise clonal lineages; and human bacterial pathogens have used this pathway to share antibiotic resistance genes, antigenic determinants, and virulence factors. To better elucidate the mechanism of transformation and to inform population/epidemiological studies, the proposed work will use the opportunistic Gram-negative bacterium Haemophilus influenzae to disentangle the sequence biases intrinsic to the DNA uptake and DNA recombination mechanisms by combining classical microbiology with modern DNA sequencing.

The specific aims:
  1. Define the genetic consequences of natural competence to H. influenzae. Transformation frequencies vary for different sequences and at different chromosomal locations, and this could strongly influence the rate of sequence evolution and adaptation along the genome. I will transform competent cultures of the standard lab strain with the genomic DNA of a clinical isolate and use deep sequencing to measure transformation across the lab strain’s chromosome for all the ~40,000 sites differing in the clinical isolate. This will provide an unparalleled dataset for investigating the sequence factors that promote and limit genetic exchange between bacterial cells.
  2. Measure the contribution of DNA uptake specificity to natural transformation. In several human pathogens, including H. influenzae, the uptake machinery prefers DNA fragments containing short “uptake sequences”, and abundant sequence motifs in many bacterial chromosomes suggest that biased DNA uptake has had a profound influence on genome evolution. I will purify the intact DNA molecules taken up into the periplasm and cytosol of competent cultures and use deep sequencing to measure the sequence biases of the uptake machinery. In combination with Aim 1, this will disentangle the contributions of DNA uptake from those of DNA recombination during natural transformation.
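The core measurement in Aim 1 reduces to a simple calculation: at each of the ~40,000 donor/recipient SNPs, the fraction of sequencing reads carrying the donor base estimates the transformation rate at that site. A minimal sketch, with invented read counts and positions:

```python
# Hypothetical sketch of the per-SNP transformation measurement. Positions and
# read counts are invented; real counts would come from aligned sequencing
# reads of the pooled transformed culture.

def donor_fraction(counts):
    """counts: {position: (donor_reads, recipient_reads)} -> {position: donor fraction}."""
    return {pos: d / (d + r) for pos, (d, r) in counts.items() if d + r > 0}

counts = {1052: (3, 97), 20441: (25, 75), 99180: (0, 100)}
print(donor_fraction(counts))  # {1052: 0.03, 20441: 0.25, 99180: 0.0}
```

Plotting these fractions along the chromosome would directly expose hotspots and cold spots of transformation.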
The platitudes: The proposed work will link molecular studies of transformation to the growing genome sequence data being collected from many isolates of many bacterial species. By establishing my approach with completely sequenced chromosomes and a well-defined experimental system, later studies could include a greater diversity of sequences or mimic increasingly natural conditions. As a directly applicable outcome, the work will also produce the beginnings of a new type of genetic resource for mapping traits that differ between natural bacterial isolates (as in eukaryotic quantitative genetics) by generating fully genotyped recombinants. In the future, such studies will give empirical underpinnings to population genomic studies of bacterial genetic exchange, as well as provide new testable hypotheses for investigating the molecular mechanism of transformation.

Preliminary data I would like: (besides what I’ve got)
  • Properly replicated transformation frequencies for several markers. (I did this before, but it hasn’t been properly replicated.)
  • Follow a few molecules through uptake and recombination?
  • Population genetic inferences of “recombination” in H. influenzae? (I did this before, but it sucked.)


Friday, March 5, 2010

Multiplexing sans barcodes

Previously, I’d said I wanted to go over how we might obtain many recombinant genotypes by deep sequencing pools of recombinants, since our tiny genome is TOO EASILY SEQUENCED using modern methods, making the sequencing of individual clones inefficient. The challenge, then, is assigning donor DNA segments in the pools to individual clones. To a first approximation, this isn’t really necessary, since one of our main motivations for sequencing recombinants is simply to determine whether the locations and endpoints of donor DNA segments are biased: i.e. whether there are recombination hotspots, or whether certain types of donor-recipient differences are recalcitrant to recombination.

However, our preliminary data showed that donor segments were often clustered in individual recombinants, probably due to mismatch repair disrupting larger donor fragments during transformation. We were only able to pin this down because we individually sequenced 2 of the 4 transformants that we’d pooled. To illustrate, here’s a zoom of the region containing one of our selected sites at gyrB. Two of the four clones carry the causal allele (the red dot), but the pool data indicate several additional segments:

Are they in different clones? The same clone? How do we disentangle, without sequencing individuals (as was done here; shown as sets of colored bars at the top)?
Several methods for handling pooled data exist. The one typically referred to is “barcoding”, where samples are processed individually and have unique sequence codes added during library construction, so that individual sequence reads can be assigned to individual clones. This is a powerful method, but extremely expensive and labor-intensive. It surely has useful contexts, but for our purposes, we don’t really need to assign every read to every clone… only donor segments.

An alternate approach, outlined below, would ensure that any given clone appears in two different, otherwise non-overlapping pools. In its simplest form, this means pooling by rows and also by columns (other more involved ways are here and here). I recently did a transformation experiment where, afterwards, I grew up independent transformants in 64 wells of a 96-well culture plate.

They were arrayed in a checkerboard grid of 8x8 clones (yellow = NalR, blue = NovR). If I prep DNA from all these clones, I could then produce Row Pools 1-8 and Column Pools A-H, and each would have four clones of each resistant type. One issue would be distinguishing which endpoints belong together when segments overlap; another would be deciding which segments belong in the same clone.

If a donor segment appeared in clone 3C, for example, and it had unique endpoints (i.e. that donor segment is present only in clone 3C), then we would see those unique endpoints solely in pool 3 and pool C.

So we would have no difficulty assigning the segment to clone 3C.

On the other hand, if the segment was NOT unique but present in, say, clones 3C and 7E, we’d be unable to assign it to particular clones due to "ghost" signals; instead, we’d know only that there were two identical segments, either in 3C and 7E, or in 3E and 7C.

(We’d know there were two segments, rather than one, since we’d still know the frequency of the segment in the different pools.)
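The decoding logic for the two cases above can be sketched in a few lines. Pool labels and the function name are hypothetical; the point is just that one row hit plus one column hit gives a unique well, while two of each gives the two diagonal pairings described above:

```python
# Sketch of row/column pool decoding for a segment with unique endpoints.
# Pool labels are hypothetical; frequencies (not modeled here) would be
# needed to choose between ambiguous pairings or handle >2 copies.

def candidate_pairings(rows, cols):
    """Given the row and column pools where a segment appears, return the
    possible sets of wells containing it."""
    if len(rows) == 1 and len(cols) == 1:
        return [[(rows[0], cols[0])]]  # unambiguous: exactly one well
    if len(rows) == 2 and len(cols) == 2:
        (r1, r2), (c1, c2) = rows, cols
        # two copies, in one of the two diagonal well pairs
        return [[(r1, c1), (r2, c2)], [(r1, c2), (r2, c1)]]
    raise NotImplementedError("more copies would need pool frequencies")

print(candidate_pairings(['3'], ['C']))
# [[('3', 'C')]]
print(candidate_pairings(['3', '7'], ['C', 'E']))
# [[('3', 'C'), ('7', 'E')], [('3', 'E'), ('7', 'C')]]
```

This is why the scheme breaks down only when many independent segments share identical endpoints, which is the caveat noted below.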

So this is a good plan. We could first sequence by rows, giving us 64 more clones worth of data. And as long as there aren’t a whole bunch of identical endpoints for independent donor segments, we could then sequence pooled columns to assign segments to clones. If there were tons of identical endpoints, this would be such a shocking result, we’d need to re-think our next step anyways…