Monday, November 30, 2009

Recombinant Genomes: First Pass

We’ve now got deep sequencing data of donor, recipient, and transformant genomes. And it is, indeed, “deep”, at least in quantity. Here’s the post where I illustrate the details of how the sequenced DNA was obtained. There were five “lanes” of sequencing obtained, of which I will talk about the first four here:

Lane 1: Rd (the recipient genome)

Lane 2: 1350NN (the donor genome 86-028NP plus NovR and NalR alleles of gyrB and gyrA, respectively)

Lane 3: Transformant A (the genome of a NovR transformed clone)

Lane 4: Transformant B (the genome of a NalR transformed clone)

So I’ve done a first-pass analysis of the sequencing data using Maq as the mapping algorithm. I left everything at default settings, and so far have only analyzed the datasets with respect to the recipient genome, Rd.

The expectation is that when I map Lane 1 to Rd, there will be few differences detected by Maq; when Lane 2 is mapped to Rd, the bulk of SNPs between Rd and 86-028NP will be detected; and when Lanes 3 and 4 are mapped to Rd, the donor DNA segments transformed into the recipient background will be identified. For lanes 3 and 4, since we know the locus that we selected for donor information, we have controls for whether or not an appropriate donor segment was detected.

Before continuing, I must specify that this is a very preliminary analysis: I do not consider the quality of the SNP calls, beyond whatever Maq does to make the calls (and culling ambiguous SNPs, as described below). I have not mapped anything with respect to the donor genome. I have not considered any polymorphism between the strains other than simple single-nucleotide substitutions (since Maq only does ungapped alignment). I also missed regions of very high SNP density, since the Maq default will not map any read with >2 mismatches from the reference. Finally, I have only cursorily examined depth of coverage across each genome (it is ~500-fold on average, ranging from ~100 to ~1000 over each dataset).

However, even with these caveats, the approximate sizes and locations of transformed DNA segments were pretty clear...

Here are the number of “SNPs” called by Maq between each dataset and Rd:

Lane 1: 933 (Rd)
Lane 2: 30,284 (1350NN)
Lane 3: 1,870 (TfA-NovR)
Lane 4: 1,881 (TfB-NalR)

Rd versus Rd

The first obvious issue is that when the Rd DNA was mapped against the Rd genome sequence, Maq called 933 “SNPs”. What is all this supposed “variation”?

108 had “N” as the reference base
432 had an ambiguous base as the query

So 540 / 933 “SNPs” are easily explained artifacts-- either ambiguous positions in the complete Rd genome sequence reference, or ambiguous base calls by Maq from the Illumina GA2 dataset.

The remaining “SNPs” may also be persistent sequencing/mapping artifacts, or they may be true genetic differences between the “Rd” we sequenced (our lab’s wild type) and the original DNA sample sequenced back in the 1990s.

To simplify matters I culled any position called a “SNP” between Rd and Rd, as well as all other ambiguous positions, from the remaining datasets.

Rd versus 1350NN

Before turning to the tranformants, I used Lane 2 to make a list of detectable SNP positions between the donor DNA and the recipient chromosome. Of the 30,284 “SNPs” detected by Maq, 29,002 were neither in the Rd SNP set nor had an ambiguous base in either the reference or query.

Note that I am not using SNPs identified by comparison of the two complete genome sequences, but rather those that were unambiguously determined by this sequencing experiment. Rd remains the only reference genome I have used.

I used this set of SNPs as the set of “Donor-specific alleles” to map transforming DNA segments.

The transformants

To identify donor-specific alleles in the transformants, I took the intersection of the “Donor-specific alleles” and the unambiguous SNPs identified by Maq for the two transformants, yielding the following number of SNPs in each transformant:

TfA: 890
TfB: 975

This suggests that about 3.0-3.5% of each transformant genome consists of donor DNA. This value is consistent with what we might expect, based on the co-transformation frequency of the two donor markers into the recipient genome when I did the original transformation that produced the sequenced clones.

Here are plots of the TfA and TfB genomes (genotype 0 = recipient allele, genotype 1 = donor allele):
(Note that images can be enlarged by clicking)

This makes me happy. I was kind of hoping our thinking was wrong and that there’d be all kinds of kookiness going on, but in many ways, having our expectations met is a vastly better situation, since it means that the designs of our other planned experiments are probably sound.

Note that the right-most "donor segment" represents only a single "donor-specific allele" that is identical in both recombinants, as well as being surrounded by Rd vs. Rd SNPs. It is highly likely that this singleton is an artifact. All other donor segments are supported by many SNPs, including the overlapping segments ~200,000 bp. This latter shared segment may suggest a hotspot of recombination.

Control loci

The following two plots zoom in on the segments containing the control selected loci. The first has red lines bounding the gyrB gene in TfA, while the second has blue lines bounding the gyrA genome in TfB. Also in each plot, the masked positions (those that were left out due to ambiguity or presence in the Rd vs. Rd comparison) are show (for the first, in orange; for the second, in grey):
Within each marker gene, there are a few recipient-specific SNPs. These have to do with the fact that the PCR fragment I used to add the NovR and NalR alleles of gyrB and gyrA contained recipient SNPs and some of these ended up in the donor genome.

Okay! That’s almost it for now. There’s a looong way to go, but happily I suspect that there is indeed biology to learn even from these “meagre” couple of genome sequences.

My next task will be to account better for where I am blind. I used stringent criteria to determine allele identity here. I am quite confident in what was found, but I’m not sure sure about what I didn’t find. That is, I suspect false negatives, but not false positives, based on how I’ve done this so far.

Tuesday, November 24, 2009

Incoming Data Deluge

I am waiting with anticipation... I am about to be drowning in data... >64 million quality filtered paired-end sequencing reads, constituting 5.4 Gigabases! Woo!!

The files are currently transferring to an FTP site that I will then download from. Tomorrow afternoon, I may legitimately be a genomicist!

Thanks to my friend and his tech for getting this done...

Friday, November 20, 2009


On Wednesday, I put up some tables of data with bad calculations of molecules per cell. Rosie's concerns about the data were misplaced; they should've been concerns about my brain!

I'd been saying that 8 ng of 200 bp DNA / 200 ul of competent cells was ~ 1 molecule per cell. But I was off. Not just because I calculated it for pg instead of ng, but because I'd screwed up other parts of the math...

Here's a more correct way to approximate the number of molecules per cell starting with one of Rosie's Universal Constants:

(1e-18 g / one 1 kb DNA) * (one 200bp DNA / five 1kb DNA) =
(2e-19 g / one 200 bp DNA) * (1e9 ng / 1 g) =
2e-10 ng / one 200 bp DNA.
So that's about how much a single 200 bp DNA molecule weighs.

Then the inverse gives molecules per ng:
1 / (2e-10 ng / one 200 bp DNA) =
(5e+9 molecules DNA / ng DNA) * (8 ng) =
4e+10 molecules DNA / 8 ng DNA.

Adding 8 ng of DNA to 2e+8 cells (in 200 ul of competent cells), then gives:
200 molecules per cell...

That's better. And more in-line with the lab's previous work...
I have changed the Tables in Wednesday's post to reflect this correction.

Thursday, November 19, 2009

Bacterial Genome Sequence Links

I keep meaning to put up these links, so I can find them in the future. They get you to NCBI's genome FTP site. If you click on the genome you want, you get a listing of a bunch of file formats. For the complete genomes, the choice for the FastA of the complete sequence is the .fna file. But there’s a bunch of others, which makes getting the correctly formatted data for application X easier:

Bacterial Genomes FTP

For the draft assemblies, the way to get the data is at this link:
Whole Genome Shotgun FTP

But to find out which accession you want is a little irritating and the best way is via searching this list:
Whole Genome Shotgun HTML

Using the supplied links from this last list is not so useful for downloading the data as is the FTP link, since you would have to click a maddening amount to get all the contigs from a particular whole genome shotgun sequencing project.

Wednesday, November 18, 2009

Saturation and Chaser

EDITED: Used bad mol per cell calculation earlier.

How much DNA does an average cell in a competent culture take up? I did two experiments today, which were aimed at trying to decide how much degeneracy we can tolerate in our degenerate USS experiment. There's several issues, but one simple one is just to know how much of an "optimal" USS is taken up in the presence of competing "suboptimal" USS...

In the first experiment, I added different amounts of hot USS-1 DNA (the new-fangled ones designed for Illumina sequencing... ~200 bp) to a fixed amount of competent cells to find out how much DNA was enough to saturate the cells with DNA. Here's the results for several amounts of USS-1 incubated with 200 ul of competent cells for 20 minutes:
Looks like less than 8 ng (or ~200 molecule per cell) is sub-saturating and greater than 8 ng is near saturation.

In the second experiment, I looked at how well USS-1 is taken up in the presence of a competitor mutant USS (USS-V6, which is taken up 10X worse than USS-1). Based on the results of the first experiment, I used two different amounts of hot USS-1, either sub-saturating (4ng) or saturating (16ng). For each of these, I had four samples in which different amounts of cold USS-V6 were added. Here's the results for subsaturating USS-1:

And now for saturating USS-1:

Though adding the competitor did reduce uptake, uptake decreased only by ~1/4, even when there was several-fold more competitor. Also, it appears that using saturating amounts of USS-1 yields a different decline in uptake with excess inhibitor than when USS-1 is sub-saturating.

So what's this all mean? I think it means that even when "optimal" USS are only 1/10 as common as "suboptimal" USS, uptake proceeds quite nicely. What does this mean practically for making our fancy oligo purchase? Not too much, except that I think we can err on the side of more degeneracy without too many problems. (This will be good for several reasons, probably the most important of which is that our preliminary data collection plans will work best with more degeneracy.)

Tuesday, November 17, 2009

USS uptake with Illumina adaptors

(I'm somehow deeply amused by the above image. I think it has something to do with the arrow going from the DNA molecule to the sun.)

I've finally gotten around to doing some uptake experiments with the control sequences I got for our degenerate USS experiments. The important thing was to make sure that the constructs would behave as our older ones do with our new-fangled design that will allow for Illumina sequencing directly from the purified periplasmic DNA (i.e. no library construction steps)...

And it worked quite well! I also scaled these experiments down, so that I could more easily do some saturation curves tomorrow.

Results today for uptake of 8 ng of 200 bp DNA (~160 million molecules) by ~200 million competent cells (in 200 ul):

O.G.-USS1 -> 33.5%
New-USS1 -> 35.4%
New-USSV6 -> 3.7%
New-USSR -> 0.7%

Right on target!

Missing Posts

Whew! I keep falling behind on my blogging.... Here are the posts that I haven’t written:

Using R:

(1) Simulation of degenerate USS experiment, assuming the USS motif defines uptake specificity.

(2) Scoring genomes for USS sites.

(3) Scoring alignments for USS sites. (This last bit is finally begun, but far from complete.)

I've got the figures for these posts; now I just need text!

Friday, November 6, 2009

Weening myself off of Excel

In some sense, the computing I did today isn’t really useful, since I already worked out these things using Microsoft Excel. But I’ve been ordered by my bioinformatics consultants to stop with the Excel already. So as practice, I worked out some of the expected features of degenerate oligos again, but this time using R.

The main motivation for doing this besides practice is that I am fairly sure we should be ordering degenerate oligos with more degeneracy than we have previously considered. I won't make that argument here, but just repeat some analytical graphs I'd previously made.

It took a while (since I’m learning), but was still much more straight-forward than doing it in a spreadsheet. The exercise was extremely useful, as I learned a bunch of stuff (especially about plots in R), while doing the following:

Problem #1: Given a percentage of degeneracy per base, d, in an n length oligo, what is the proportion of oligos with k mismatches?
Answer #1: Use the binomial distribution. For a 32mer with different levels of degeneracy (shown in legend):
Problem #2: Given a million instances of such an oligo, how well would each possible oligo with k mismatches be observed?
Answer #2: Simply adjust each of the above values by dividing the number of classes within each of k mismatches (i.e. choose(n, k)):
Problem #3: If some number of bases, m, in the n-length oligo are “important”, what proportion of oligos with k mismatches will have x “hits”?
Answer #3: Use the hypergeometric distribution. The below plot is as for Problem #1 for 0.12 degeneracy, but with the # of hits broken down for each k:
I didn't try super-hard to make the perfect graphs, but it did take some effort to make a stacked bar plot...

Funding versus Science

Yesterday, I ended up doing none of the things I intended to do, but cleaned them instead: (1) No bench-work, but a clean bench (and fridge and freezer spaces)!; (2) No emails, but a clean inbox!; (3) No thinking, but a clean mind! (Well, moreso than usual.)

My ears were also itching, because I realized that the NIH panel that's review my postdoc grant application was meeting, and was extra-worried, since now the Genome BC grant almost certainly will depend on the outcome of those scores.

Anyways, towards the end of the day, I switched over to browsing journals, which I rarely do these days...

I found a lot of good stuff I probably already should've known about, but I also found this opinion piece in PLoS about "what's wrong with funding of research". Pretty much sounds about right. I am not even in a position to have the stresses described in the article, but feel like if I don't get some kind of postdoc fellowship of my own, I'll be that much less likely to get hired as an independent researcher, since so much of what defines success is the ability to get money. But all of the PIs I know are overwhelmed by exactly these issues. It's especially daunting to realize that a full-sized NIH R01 grant can barely support a lab with 2-3 people.

On the other hand, my recent experiences with Rosie do bring home the fact that grant-writing is a good way to think rigorously about one's plans, and the Genome BC experience in particular (whether the grant is funded or not) helped our brains wrap around exactly what we'll be doing in the next couple of months.

Wednesday, November 4, 2009


Whew! Rosie and I just made what I consider a heroic effort to produce a grant application to Genome BC to use DNA sequencing to measure recombination biases during H. influenzae natural transformation.

It was heroic not only because we finally decided to apply only late last week, but because our co-funding support is tenuous at best. Genome BC requires that we match their funds at least equally with funds from another source. Rosie has funding, but it was applied for too long ago and the proposal only indirectly relates to our planned sequencing. We also applied for a CIHR grant recently, but will not have reviews until after the Genome BC committee meets in early January. The best hope of adequate co-funding comes from my NIH postdoctoral fellowship grant application (a resubmission) late this summer, for which I should have scores (or lack thereof) within a couple of weeks. If the application gets a good score, we can tell Genome BC that the major threat to the success of our application is ameliorated.

Almost immediately after submitting the grant application with Rosie last night, I had to turn to editing my buddy's manuscript (which I am an author on), which takes on the weighty topic of detecting rearrangements in complex mammalian genomes from limited sequencing data. I just turned my edits over to the corresponding author, and now, after letting the excess nitrogen out of my bloodstream, I need to decide what to do next.

Since aforementioned buddy also has taken several DNA samples off my hands for sequencing, I think I'd best turn to purifying uptake DNA from the periplasm of competent cells. I've already gotten things fairly well under way (see here, here, and here), but there's a bunch of uptake experiments waiting to be done. So tonight, I'll inoculate some cultures, so I can try a large-scale periplasmic DNA prep tomorrow, and tomorrow I'll also order some more radiolabel for doing more sensitive uptake experiments.