Monday, November 30, 2009

Recombinant Genomes: First Pass

We’ve now got deep sequencing data of donor, recipient, and transformant genomes. And it is, indeed, “deep”, at least in quantity. Here’s the post where I illustrate the details of how the sequenced DNA was obtained. There were five “lanes” of sequencing obtained, of which I will talk about the first four here:

Lane 1: Rd (the recipient genome)

Lane 2: 1350NN (the donor genome 86-028NP plus NovR and NalR alleles of gyrB and gyrA, respectively)

Lane 3: Transformant A (the genome of a NovR transformed clone)

Lane 4: Transformant B (the genome of a NalR transformed clone)

So I’ve done a first-pass analysis of the sequencing data using Maq as the mapping algorithm. I left everything at default settings, and so far have only analyzed the datasets with respect to the recipient genome, Rd.

The expectation is that when I map Lane 1 to Rd, there will be few differences detected by Maq; when Lane 2 is mapped to Rd, the bulk of SNPs between Rd and 86-028NP will be detected; and when Lanes 3 and 4 are mapped to Rd, the donor DNA segments transformed into the recipient background will be identified. For lanes 3 and 4, since we know the locus that we selected for donor information, we have controls for whether or not an appropriate donor segment was detected.

Before continuing, I must specify that this is a very preliminary analysis: I do not consider the quality of the SNP calls, beyond whatever Maq does to make the calls (and culling ambiguous SNPs, as described below). I have not mapped anything with respect to the donor genome. I have not considered any polymorphism between the strains other than simple single-nucleotide substitutions (since Maq only does ungapped alignment). I also missed regions of very high SNP density, since the Maq default will not map any read with >2 mismatches from the reference. Finally, I have only cursorily examined depth of coverage across each genome (it is ~500-fold on average, ranging from ~100 to ~1000 over each dataset).

However, even with these caveats, the approximate sizes and locations of transformed DNA segments were pretty clear...

Here are the numbers of “SNPs” called by Maq between each dataset and Rd:

Lane 1: 933 (Rd)
Lane 2: 30,284 (1350NN)
Lane 3: 1,870 (TfA-NovR)
Lane 4: 1,881 (TfB-NalR)

Rd versus Rd

The first obvious issue is that when the Rd DNA was mapped against the Rd genome sequence, Maq called 933 “SNPs”. What is all this supposed “variation”?

108 had “N” as the reference base
432 had an ambiguous base as the query

So 540 / 933 “SNPs” are easily explained artifacts: either ambiguous positions in the complete Rd genome sequence reference, or ambiguous base calls by Maq from the Illumina GA2 dataset.

The remaining 393 “SNPs” may also be persistent sequencing/mapping artifacts, or they may be true genetic differences between the “Rd” we sequenced (our lab’s wild type) and the original DNA sample sequenced back in the 1990s.

To simplify matters I culled any position called a “SNP” between Rd and Rd, as well as all other ambiguous positions, from the remaining datasets.
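The culling step can be sketched roughly like this. It is a toy illustration, not the script I actually ran: the `(position, reference_base, query_base)` tuples, the function names, and the example positions are all made up, and it assumes "ambiguous" means anything other than a plain A, C, G, or T.

```python
# Toy sketch of the culling step. The tuple layout and all positions are
# hypothetical; "ambiguous" is assumed to mean any base other than A/C/G/T.
UNAMBIGUOUS = set("ACGT")


def is_unambiguous(ref_base, query_base):
    """True if both the reference and the query base are plain A/C/G/T."""
    return ref_base.upper() in UNAMBIGUOUS and query_base.upper() in UNAMBIGUOUS


def cull_snps(calls, masked_positions):
    """Drop calls at masked positions and calls with an ambiguous base."""
    return {
        pos: (ref, qry)
        for pos, ref, qry in calls
        if pos not in masked_positions and is_unambiguous(ref, qry)
    }


# Every position called in the Rd-vs-Rd comparison gets masked, whatever the reason.
rd_calls = [(100, "N", "A"), (250, "A", "R"), (400, "C", "T")]
rd_mask = {pos for pos, _, _ in rd_calls}

# Calls from another lane (made-up values).
other_calls = [
    (100, "N", "G"),  # masked: also called in Rd vs. Rd
    (400, "C", "T"),  # masked: also called in Rd vs. Rd
    (520, "G", "A"),  # kept
    (610, "T", "Y"),  # dropped: ambiguous query base
    (700, "A", "C"),  # kept
]
kept = cull_snps(other_calls, rd_mask)
print(kept)
```

Masking by position (rather than by position plus base) is the conservative choice: a site that misbehaves in the Rd control is distrusted in every other lane too.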

Rd versus 1350NN

Before turning to the transformants, I used Lane 2 to make a list of detectable SNP positions between the donor DNA and the recipient chromosome. Of the 30,284 “SNPs” detected by Maq, 29,002 were neither in the Rd SNP set nor had an ambiguous base in either the reference or query.

Note that I am not using SNPs identified by comparison of the two complete genome sequences, but rather those that were unambiguously determined by this sequencing experiment. Rd remains the only reference genome I have used.

I used this set of SNPs as the set of “Donor-specific alleles” to map transforming DNA segments.

The transformants

To identify donor-specific alleles in the transformants, I took the intersection of the “Donor-specific alleles” and the unambiguous SNPs identified by Maq for the two transformants, yielding the following numbers of SNPs in each transformant:

TfA: 890
TfB: 975
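For the curious, the intersection amounts to something like the sketch below. The dictionaries, positions, and the function name are invented for illustration, and it assumes a transformant SNP only counts when both the position and the called base match the donor-specific allele.

```python
# Toy sketch of the intersection step. All positions and bases are made up,
# and matching on (position, base) rather than position alone is an assumption.
def donor_alleles_in(transformant_calls, donor_alleles):
    """Return sorted positions where the transformant carries the donor base."""
    return sorted(
        pos
        for pos, base in transformant_calls.items()
        if donor_alleles.get(pos) == base
    )


# position -> donor base, for the culled "Donor-specific alleles" set
donor_alleles = {520: "A", 700: "C", 910: "T", 1500: "G"}

# position -> called base, for each transformant's unambiguous SNPs
tfa_calls = {520: "A", 700: "C", 1234: "G"}  # 1234 is not a donor-specific site
tfb_calls = {910: "T", 1500: "A"}            # 1500 differs from the donor base

tfa_donor = donor_alleles_in(tfa_calls, donor_alleles)
tfb_donor = donor_alleles_in(tfb_calls, donor_alleles)
print("TfA donor positions:", tfa_donor)
print("TfB donor positions:", tfb_donor)
```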

This suggests that about 3.0-3.5% of each transformant genome consists of donor DNA. This value is consistent with what we might expect, based on the co-transformation frequency of the two donor markers into the recipient genome when I did the original transformation that produced the sequenced clones.
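The percentage is just the share of donor-specific SNP positions that show the donor allele in each transformant. Treating that share as a stand-in for the fraction of the genome is an approximation that assumes the SNPs are spread reasonably evenly around the chromosome. The variable names here are mine:

```python
# Share of the 29,002 donor-specific SNP positions that carry the donor
# allele in each transformant (counts taken from the text above).
donor_specific_total = 29002
transformant_hits = {"TfA": 890, "TfB": 975}

percent_donor = {
    name: 100.0 * hits / donor_specific_total
    for name, hits in transformant_hits.items()
}

for name, pct in percent_donor.items():
    print(f"{name}: {pct:.2f}% of donor-specific SNP positions")
```

This gives about 3.07% for TfA and 3.36% for TfB.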

Here are plots of the TfA and TfB genomes (genotype 0 = recipient allele, genotype 1 = donor allele):

This makes me happy. I was kind of hoping our thinking was wrong and that there’d be all kinds of kookiness going on, but in many ways, having our expectations met is a vastly better situation, since it means that the designs of our other planned experiments are probably sound.

Note that the right-most "donor segment" represents only a single "donor-specific allele" that is identical in both recombinants, as well as being surrounded by Rd vs. Rd SNPs. It is highly likely that this singleton is an artifact. All other donor segments are supported by many SNPs, including the overlapping segments at ~200,000 bp. This latter shared segment may suggest a hotspot of recombination.
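One simple way to turn the 0/1 genotypes into segments, and to flag singletons like the one above, is to collect runs of consecutive donor alleles along the donor-specific SNP positions. This is only a sketch of that idea with made-up positions, not the procedure behind the plots, and the function name is hypothetical. Note that the run boundaries are the outermost donor SNPs, so they give a minimum segment size.

```python
# Toy sketch: group consecutive donor alleles (genotype 1) into runs.
# Positions and genotypes are invented for illustration.
def donor_runs(genotypes):
    """genotypes: (position, genotype) pairs sorted by position,
    where 1 = donor allele and 0 = recipient allele.
    Returns a list of (first_pos, last_pos, n_snps), one per run of 1s."""
    runs = []
    current = []
    for pos, genotype in genotypes:
        if genotype == 1:
            current.append(pos)
        elif current:
            runs.append((current[0], current[-1], len(current)))
            current = []
    if current:  # close a run that reaches the last SNP
        runs.append((current[0], current[-1], len(current)))
    return runs


toy = [(100, 0), (200, 1), (260, 1), (310, 1), (500, 0), (900, 1), (1200, 0)]
runs = donor_runs(toy)
for first, last, n_snps in runs:
    note = "singleton, possible artifact" if n_snps == 1 else "supported by several SNPs"
    print(f"{first}-{last}: {n_snps} donor SNP(s), {note}")
```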

Control loci

The following two plots zoom in on the segments containing the control selected loci. The first has red lines bounding the gyrB gene in TfA, while the second has blue lines bounding the gyrA gene in TfB. Also in each plot, the masked positions (those that were left out due to ambiguity or presence in the Rd vs. Rd comparison) are shown (for the first, in orange; for the second, in grey):

Within each marker gene, there are a few recipient-specific SNPs. These have to do with the fact that the PCR fragment I used to add the NovR and NalR alleles of gyrB and gyrA contained recipient SNPs and some of these ended up in the donor genome.

Okay! That’s almost it for now. There’s a looong way to go, but happily I suspect that there is indeed biology to learn even from these “meagre” couple of genome sequences.

My next task will be to account better for where I am blind. I used stringent criteria to determine allele identity here. I am quite confident in what was found, but I’m not so sure about what I didn’t find. That is, I suspect false negatives, but not false positives, based on how I’ve done this so far.


  1. Holy crap dude! You are doing it! Awesome-tacular.

  2. So, am I correct to understand that the recombination lengths are about 30kb? How big are the recombination tracts?