Friday, August 28, 2009

Homoplasy versus Recombination

Making up for lost blogging time!

Just to get started doing something with this alignment, I grabbed a single 68,580 bp alignment block out of the XMFA file I generated (as described in the last post); the block itself is just a simple FastA-formatted alignment. That way I could use whatever programs handle a plain FastA alignment, without yet worrying about converting between formats or coping with contig and rearrangement boundaries.

Here’s a picture from MAUVE of the aligned block I used (KW20 Rd reference coordinates 401,890 to 467,839). It’s the orange block shown below.
Based on some recent reading, I tried a couple different programs designed to detect signals of recombination from multiple alignment files. There is a serious caveat here, in that I do not understand exactly how these programs work, and since they were designed for use on sexual eukaryotes, it’s quite possible that the assumptions of the programs will not be accurate when applied to Haemophilus. Nevertheless, I sallied forth, just to see if they’d at least work with my sequences. I can come back and try to understand the meaning of the output more slowly...


First, I tried LDhat, a suite of programs aimed at detecting recombination in a population sample. I didn’t have any difficulty downloading and compiling it, which was nice. I wish I could tell you more about it, but I never actually made it work, so there’s not much point yet. I got the “Convert” program to make the requisite files from an only slightly modified FastA file, and I eventually got the “Pairwise” program to work. This program generates a “look-up table” that the other programs require in order to run.

At first, the program kept choking on my data and could not produce the requisite look-up table. So I fed it a pre-made file from LDhat’s website, which I first converted to a 15-taxa file using their “LKgen” utility. Then “Pairwise” worked and generated a new look-up table that should, in theory, have worked with the other programs. I next tried the “Interval” program, but I could never get past this point. It refused to do the analysis because the look-up table was somehow lacking: it was not “exhaustive”, whatever that means, and there was apparently a disparity between “npt” (the number of points in the likelihood grid, I think) and the data.

No clue. Moving on. (I’ll try to return to this some other time, since the program “Rhomap” looks like what I’m looking for).


Next I turned to PhiTest, which was also easily downloaded and compiled. This program uses a principle that seems pretty sensible: it examines “incompatibilities” in phylogenetic signals. If two lineages of bacteria diverge and never recombine, then adjacent polymorphisms will most likely be “compatible”, that is, both polymorphic sites will have the same phylogenetic signal (i.e. they will support the same tree topology). On the other hand, “incompatible” sites could have two possible histories: one in which there was recurrent mutation in different lineages, and one in which there was recombination between lineages. This is illustrated nicely in a figure from the paper reporting PhiTest:

"Figure 1:
The dual nature of incompatibility. Two possible histories for a pair of incompatible sites are shown: (a) two incompatible sites explained by a recombination event and (b) two incompatible sites explained by a convergent mutation. Mutations in the first site are indicated by open circles and mutations in the second site are indicated by solid circles. To explain the incompatibility between the pair of sites either a recombination event must be invoked or a homoplasy must have occurred in the history of one of the sites."

Figure 1a:
Figure 1b:
So in (a) the black mutation only occurred once and was subsequently shared between the central lineages, whereas in (b) the black mutation occurred independently in the two lineages.

They more rigorously define “compatibility” thusly: “Two sites i and j are compatible if and only if there is a genealogical history that can be inferred parsimoniously that does not involve any recurrent or convergent mutations (known as homoplasies as in Figure 1b). If the two sites are not compatible, they are termed incompatible. Under an infinite-sites model (KIMURA 1969) of sequence evolution, the possibility of a homoplasy does not exist, and so incompatibility for a pair of sites implies that at least one recombination event must have occurred, as in Figure 1a.”
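Under the infinite-sites assumption, this pairwise compatibility check reduces to the classic four-gamete test: two biallelic sites are incompatible exactly when all four combinations of their alleles occur among the sequences. A minimal sketch of that test (my own illustration, not code from any of these packages):

```python
def four_gamete_incompatible(site_i, site_j):
    """True if two biallelic alignment columns are incompatible, i.e.
    all four allele combinations (gametes) occur among the sequences.
    Under infinite sites this implies at least one recombination event;
    if recurrent mutation is allowed, it could instead be a homoplasy."""
    gametes = set(zip(site_i, site_j))
    return len(gametes) == 4

# Toy columns from an alignment of six sequences:
print(four_gamete_incompatible("AAATTT", "AATTAA"))  # True: AA, AT, TT, TA all present
print(four_gamete_incompatible("AAATTT", "AAATTT"))  # False: only AA and TT occur
```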

So the idea of recombination detection algorithms is to exploit the notion of incompatibility, but it is extremely important to note that there are really two possibilities for any incompatibility: (a) it is a true “homoplasy”, i.e. a recurrent mutation; (b) it indicates recombination. The basic idea of the NSS method (the “Neighbor Similarity Score”, one of the statistics PhiTest also reports) is to look at neighboring pairs of sites and find regions where incompatibilities are clustered together, suggesting perhaps a hotspot.

The other, possibly more eukaryotic, aspect is that sites that are more distant will tend to be less compatible, since they’d more likely have had crossovers between them. Our provisional model for natural transformation would not necessarily have this feature, however, since we do not expect our recombination events to be crossover-like.

The authors of PhiTest developed a different measure from NSS, the “Pairwise Homoplasy Index” (PHI), computed over pairs of SNPs in an alignment. This gives a test statistic based on the minimum number of homoplasies in the alignment, essentially under a parsimony criterion. They could then apply permutation tests to measure the significance of a given PHI for an interval. Happily, the program will also output the significance under the NSS and MaxChi tests.

For my alignment, PhiTest gave the probability of no recombination as 0 for all three models. So it looks like there’s quite a bit of homoplasy in my alignment, evidence of recombination. (It’s worth noting that PhiTest and LDhat both explicitly ignore gaps in the alignment.)

But that’s not good enough; what I really want is a plot of recombination rates across the alignment block. The figure from my last post was the output of that analysis. It shows only the ~9500 polymorphic sites, ignoring all the gaps and invariant positions. For each pair of SNPs, it gives the probability of recombination (assuming that all homoplasies are due to recombination and not recurrent mutation).

It pretty much looks like there’s evidence of recombination almost everywhere! There are just a couple of areas that look like they have extensive compatibility, so it may actually be a lot easier for us to detect places where there’s been little recombination than places where there’s been lots.

I’ll need to go back to the raw data and figure out which set of SNPs are showing evidence of low recombination and see where those spots are in the alignment. They could be horizontally transferred segments with few USS?

I’m not sure what to make of this. It complicates things. To what extent can I trust these algorithms to find recombination and not recurrent mutation? The divergence between my sequences isn’t that high (only ~3%), so recurrent mutation doesn’t seem likely to be a huge problem, but still...

Anyways, I went ahead and used the other utility that comes with PhiTest, called Profile. This does PhiTest on sliding windows, providing a P-value for each segment. Here’s what it looked like, when I did a “Profile” of non-overlapping 500 bp chunks of my alignment:
The x-axis is the position in the alignment, and the y-axis is the P-value of the PhiTest for that particular 500 bp chunk. The line indicates a P-value threshold of 0.05, so any segment below that line has evidence of recombination, based on PHI. Segments with higher P-values may either have compatible SNPs or, in the case of the very high P-values, contain very few SNPs.
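Profile's sliding-window idea can be caricatured in a few lines of Python: slice the alignment into fixed-width windows and summarize incompatibility within each. Here I just report the fraction of biallelic site pairs failing the four-gamete test per window; this is my own toy stand-in, not the actual PHI statistic or its permutation P-value:

```python
def gamete_count(seqs, i, j):
    """Number of distinct two-site haplotypes at alignment columns i and j."""
    return len({(s[i], s[j]) for s in seqs})

def window_incompatibility(seqs, width=500):
    """For each non-overlapping window, report the fraction of biallelic
    site pairs that fail the four-gamete test (all 4 gametes present).
    A toy summary, not PhiTest's PHI statistic or its P-value."""
    length = len(seqs[0])
    profile = []
    for start in range(0, length, width):
        cols = [i for i in range(start, min(start + width, length))
                if len({s[i] for s in seqs}) == 2]   # biallelic sites only
        pairs = [(i, j) for a, i in enumerate(cols) for j in cols[a + 1:]]
        if not pairs:
            profile.append((start, None))            # no informative pairs here
            continue
        bad = sum(1 for i, j in pairs if gamete_count(seqs, i, j) == 4)
        profile.append((start, bad / len(pairs)))
    return profile

# Tiny 4-sequence example with one window:
print(window_incompatibility(["AAAA", "AATT", "TTAA", "TTTT"], width=4))
```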

This seems promising, if we can trust that we’re measuring something meaningful. I have a strange feeling that using homoplasy to measure recombination is going to be fraught with difficulties, but it’s pretty much the best principle we have for detecting recombination in these population-level samples. Luckily, both PhiTest and the other methods are sophisticated enough to evaluate the homoplasy index for a given pair of SNPs relative to other surrounding SNPs, so when incompatibilities are detected, they stand out from other possible pairings. This should help with the problem of true homoplasy, but then again, maybe it doesn't. I'll need to talk to my friend Corbin about this stuff.

Whew! Okay, enough on this for the moment...

More alignments!

I realized something disturbing a couple days ago, when I crashed my computer by running out of hard-drive space... The Haemophilus influenzae genome is actually replicating inside my computer! Not just on my lab bench! Several of the programs I've been using (namely the alignment programs) have been spitting out files containing the complete genome, and I think I must've had several hundred before I got rid of a bunch of intermediary files. In silico infection! A scourge!

Anyways, I guess that last entry deserves more explanation, but it'll take me more than a post to do it. First off, the motivation for doing these kinds of analyses:

There are 4 complete Haemophilus influenzae genomes, 11 with draft assemblies (in 4-50 contigs each), and 17 for which sequencing is in progress. On top of that, hundreds of strains have been sequenced at seven housekeeping genes (MLST studies).

Within this “population genomic” data, we should be able to do some estimates of recombination rates between isolates, as well as some sophisticated analyses of genetic variation at USS motifs. But how? Here’s the basic idea:

(1) Align the complete and draft genomes.
(2) Use programs written to detect recombination signals in population genetic data.
(3) Examine genetic variation in and around USS motifs (scored for information content across each genome) and see if they correlate with the determined recombination signals.

There are several issues with this at each step, but I’ll take the first one first. I’ll start with a post describing how I produced the multiple alignment file that I fed to the PhiTest program...

The first trick was downloading the files...
I already had the four complete sequences, but hadn’t had the easiest time getting the contigs of the draft assemblies. Do they really want me to download each contig by clicking one at a time? My problem was solved when I found the whole-genome shotgun (WGS) FTP site at NCBI. Of course, these files were named by their WGS code, so I simply hand-searched for the correct four-letter code, based on this list, and downloaded the corresponding file. If I’d been more clever, I would’ve done this from the command line with a script containing the list of WGS data I wanted.
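Had I scripted it, the download step is mostly string-building: take the list of four-letter WGS codes, construct each file's URL, and fetch it. The directory layout and file-naming below are my guess at the pattern on the FTP site, so treat the URL template as an assumption to verify, and the codes as hypothetical placeholders:

```python
# Sketch only: the FTP path template is an assumption based on how the
# files appeared to be organized, not a documented NCBI interface.
BASE = "ftp://ftp.ncbi.nlm.nih.gov/genbank/wgs"  # assumed layout

def wgs_url(code):
    """Build the (assumed) URL for a WGS GenBank flatfile from its
    four-letter project code, e.g. 'AAZD' -> .../wgs.AAZD.gbff.gz"""
    return f"{BASE}/wgs.{code.upper()}.gbff.gz"

codes = ["AAZD", "AAZE"]  # hypothetical four-letter codes from the list
for c in codes:
    print(wgs_url(c))
    # urllib.request.urlretrieve(wgs_url(c), f"{c}.gbff.gz")  # the actual fetch
```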

I used the search term "txid727[orgn]" in the Genome database or the Taxonomy Browser to call up Haemophilus influenzae genomes. Using the Genome database yielded some extra entries that were plasmid sequences, and using the Taxonomy Browser yielded many isolates where the sequencing is still in progress. Anyways, I could get to a Genome page for each sequenced isolate, but when I clicked to get the contigs, it provided separate accessions for each contig. Boo! There had to be a better way, which took some digging, but was here: all the WGS data at NCBI in one place. Whew! Luckily, I had found the requisite four-letter codes from my other searches, so I downloaded the appropriate *.gbff.gz files and expanded them with gunzip.

Anyway, this gave me a bunch of GenBank files, each with all the contigs for one strain. The contigs themselves were simply ordered by size, largest first and smallest last, and the coordinate system corresponded to this ordering. While this is fine, it would probably help me when looking at alignments if I ordered the contigs based on the reference KW20 Rd genome. It would be a meaningless gesture in some ways, but would certainly facilitate looking at pictures.

I did this using MAUVE’s Move Contigs function, reordering each set of WGS contigs pairwise against the reference KW20 Rd. (Again, I should've been using the command line, but found it easier to just click my way through... shameful.) Here’s an example of what Move Contigs did comparing PittII to KW20 Rd:

Alignment #1: Keeps the original order of the contigs (from largest to smallest).
KW20 Rd is on top; PittII is on bottom. The colored blocks indicate contiguous aligned blocks. The red lines indicate contig boundaries (KW20 has only two, for the start and stop of the chromosome). The diagonal line between the two genomes show which blocks belong together. It’s quite obvious from all the crosses that the order of the contigs is off.

The program then reordered the contigs and did another alignment, then did this a third time (it keeps iterating until the order stops improving). At that point it decided this was as good as it could do and stopped.

Alignment #3: Reordered contigs maximizing “synteny” between the reference and PittII.
Clearly, reordering the contigs has made the picture much easier to read. There are far fewer alignment blocks and fewer diagonally crossing central connector lines. Of course, some of the contigs did span breakpoints, so there are obviously rearrangements between the genomes. At each contig boundary, there is always the possibility of some rearrangement between the strains, so this contig order is at best provisional. Another thing I had to keep in mind while tooling around with these alignment programs is that the genome is circular, but the programs don’t treat the strings that way, so there are often things at the right or left ends that obviously are still syntenic, but just don’t look like it in these linear visualizations.

Here’s the alignments again, without the colored blocks, just so that the reordering of the contigs is quite obvious:

Alignment #1:
Alignment #3:
Notably, the little tiny contigs that failed to find something in the reference to align with remained at the right end.

Alright, so going through 11 of these MoveContig steps provided me with 11 new files that had the contigs in a different order (based on KW20 Rd) than what I got from GenBank. The only unfortunate thing about this was that MoveContigs only outputted FastA, instead of GenBank, so I lost all the annotations. Oh well. I would still be able to use the KW20 Rd annotations. I think that the annotations can still be added back, but I haven’t figured that out yet.

To show that this exercise wasn’t a waste, below I show the stripped-down alignments from the big 15-way multiple alignment (I hid the 3 other completely sequenced strains from view, and only the top several contig reorderings are visible).

Before Contig Reordering:
After Contig Reordering:
Much easier to look at... But still sort of meaningless.

Alright! So that’s a 15-way multiple genome alignment. And indeed, it’s associated with a gigantic alignment file in the XMFA format: an extension of the FastA format that allows for a description of alignments that have rearrangements, contig breakpoints, etc. between them.

The most difficult challenges:
(1) Handling the breakpoints, both contig boundaries and real rearrangement breakpoints.
(2) Handling indels. Those within an aligned block should be scorable, but how?

So really, the first thing I can work with without much difficulty is a contiguous aligned block. I grabbed one of these from the XMFA file to produce a single FastA file, test1.fasta, which I planned to use for some preliminary analyses...


Wednesday, August 26, 2009

Measuring recombination?

Whew! I've been slacking on the blog posts! Above: Probability of recombination between pairs of SNPs in a 15-strain alignment block of ~65 kb. Yellow indicates low probability of recombination...

The figure above was generated by PhiTest, a program that tries to identify regions of recombination between different taxa in a multiple alignment. Last night, I used MAUVE to generate a 15-way multiple alignment between the genome sequences of 15 different Haemophilus influenzae isolates. (This took a while, but still was impressively quick.)

I grabbed a reasonably large alignment block (65,480 bp, including gaps) and stuck it into PhiTest. It did everything reasonably quickly and gave me the unshocking result that there was a p = 0E+00 chance of no recombination occurring in the interval. So I decided to use the -g switch to get a graphical output giving the chance of recombination between every pairwise combination of SNPs (of which there were 9422; the program ignored indels).
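That pairwise output is big for a simple combinatorial reason: with 9422 SNPs there are over 44 million SNP pairs, so even a handful of bytes per pair adds up fast. A quick back-of-envelope check (the 20 bytes per pair is my guess, not the program's actual record size):

```python
# How many SNP pairs does the -g output have to describe?
snps = 9422
pairs = snps * (snps - 1) // 2
print(pairs)  # 44,382,331 pairwise combinations

# Assuming, say, ~20 bytes of text per pair in the output file:
print(pairs * 20 / 1e9, "GB")  # roughly 0.9 GB for this one block
```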

This was fun, because it wrote such a gigantic file that my computer died due to lack of hard-drive space! That doesn't happen often! But it was mainly because I have several dozen gigabytes of stuff on my laptop that doesn't belong. I nudged a fraction of this aside and repeated the analysis to give the plot above.

Thursday, August 13, 2009

Uptake and Transformation with "Biorupted" samples

To get an idea of how uptake and transformation would work with different sized donor chromosome fragments, I took MAP7 DNA and sheared it in a “bioruptor”. I ended up with several samples with different size distributions, three of which I used for a pair of experiments:

LARGE: >40 kb (unsonicated)
MEDIUM: 1-10kb (1 X 10 min sonication)
SMALL: 100-400 bp (5 X 10 min sonication)

My naïve assumption was that % uptake would go down as the fragment size decreased, since fewer fragments would contain “uptake signal sequences” (USS), which have an average spacing in the genome of ~1 kb (i.e. roughly one USS per kilobase).

I also thought that transformation rates would also go down for smaller fragments, but would not necessarily correlate that well with uptake, since additional steps of translocation and recombination could also potentially influence the efficiency of transformation. So for example, transformation might drop off more quickly than uptake, if degradation would affect smaller fragments more than larger fragments. (This seemed to be the case in Pifer and Smith, 1985.)

Keeping in mind that these are just one-off experiments and need to be repeated (like pretty much every experiment I’ve reported in this blog), the above predictions look like they’re true, but I’m not certain if my reasons are necessarily correct...

First, I’ll show the uptake data. I end-labeled the MEDIUM and SMALL donor fragments using Klenow and did a simple uptake experiment (comparing total radiolabel to that in cell pellets after 30 min of uptake) using wild type cells and as donors, either MEDIUM or SMALL chromosome fragments, and either saturating (500 ng) or sub-saturating (100 ng) amounts of input DNA per 0.5 ml of wild-type competent cells. Here’s the data:
So clearly, several-fold less chromosomal DNA is taken up when smaller fragments (100-400 bp) are used than when larger fragments (1kb-10kb) are used. As explained above, one reasonable explanation for this is that fewer fragments in the SMALL sample contain USS, so maybe only 1/10 to 1/2 will have a USS, whereas in the MEDIUM sample most fragments will contain at least 1 USS.
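That "fewer fragments contain USS" intuition can be firmed up with a simple Poisson model: if USS occur at roughly one per kilobase and were placed at random, a fragment of length L carries at least one with probability 1 − e^(−L/1000). The random-placement assumption is mine, and real USS are surely not Poisson-distributed, so this is back-of-envelope only:

```python
import math

def p_has_uss(length_bp, density_per_bp=1 / 1000):
    """Probability a fragment contains >= 1 USS, assuming USS are placed
    at random (Poisson) at ~1 per kb. A rough model, not reality."""
    return 1 - math.exp(-density_per_bp * length_bp)

for L in (100, 400, 1000, 10000):
    print(L, round(p_has_uss(L), 3))
```

For the SMALL range (100-400 bp) this gives roughly 10-33% of fragments carrying a USS, versus 63% up to essentially all fragments for the MEDIUM range (1-10 kb), which is at least consistent with the several-fold uptake difference.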

But there is an alternative explanation that I’d like to be able to distinguish (but am pretty sure I can’t with this one experiment). It could be that a given competent cell only takes up a fixed number of DNA fragments, independent of fragment size. So since the SMALL sample is composed of ~10-100X more fragments per unit mass, it could be that I’ve simply saturated the system with fragments in the case of SMALL, but not in the case of MEDIUM. This issue was addressed by Deich and Smith, 1980, and they concluded that indeed this was the case (that the number of molecules taken up was independent of fragment size), but while they do mention USS, they do not bring up USS density as a potential reason for their data.

I’d hoped that by doing a second DNA concentration (100 ng) that I might get hints as to which of the two above models is correct (or if they are both correct and both contribute to the observation), but I don’t really think I can say too much without repeating this several times and getting some error bars on that graph. Furthermore, I’m not really entirely sure what the expectations are for the two models. I’ll have to think on this some more....
Okay, what about transformation rates using my “biorupted” fragments? Below are two graphs reporting the transformation rates of the KanR and NovR alleles from MAP7 to KW20 for the three different DNA pools (LARGE, MEDIUM, and SMALL):
(Note: In the case of the NovR/CFU SMALL sample, the number reported is actually the limit of detection, so NovR/CFU(small) is less than 4.4e-6. I didn't observe any NovR transformants for the SMALL fragments, despite having a decent limit of detection.)

First, I'll look at the difference between the LARGE and MEDIUM fragments. Both markers showed ~6-fold decrease in transformation rate in the briefly sonicated sample, compared to the large intact fragments. Possible explanations:
  1. Less USS per fragment: I doubt this is a major reason for the difference. While a larger percentage of fragments are expected to contain no USS in the MEDIUM sample, it shouldn’t be that large of a difference, since the mean spacing of USS motifs is ~1 kb.
  2. Degradation by translocation or cytosolic nucleases: This seems to be a reasonable explanation. From an old set of experiments using a defined plasmid donor, Pifer and Smith, 1985 estimated that an average of ~1.5 kb of a leading 3’ end is degraded during translocation. Maybe the medium-sized fragments simply don’t survive translocation as well as large taken up fragments.
  3. Recombination efficiency: Maybe both the LARGE and MEDIUM fragments make it into the cytosol, but homology search and recombination are much better for larger fragments.
Things look a little more interesting when looking at the change between MEDIUM and SMALL fragments: While the KanR rate only changed modestly (less than 2-fold), the NovR rate went down below my limit of detection. Possible explanations:
  1. Distance to USS: The nearest USS to the NovR allele is more than 400 bp away, but less than 400 bp for the KanR allele: I like this explanation. I really really need to figure out the identity of these antibiotic resistance alleles. We’re pretty sure it’s a mutation in the gyrB gene, but I don’t know the actual change. I looked at gyrB and it does contain a USS core motif and two other core motifs a few hundred bases before the start codon, but the gene is ~2.5 kb, so the actual gyrB mutation could easily be too far away from these USS. When I looked at the putative gene responsible for KanR (the ribosomal S7 gene), there was a single USS near the start, but none within. Again, without knowing the causative lesion, I can’t tell whether this is within the size distribution of the SMALL fragments.
  2. Differences in degradation rates at the two loci: This is possible. The other thing these data suggest is that, despite the ~1.5 kb average degradation reported by Pifer and Smith, 1985, there’s still plenty of small fragments that can recombine, since the KanR rates between MEDIUM and SMALL are not really dramatically different.
  3. Recombination signals: Also possible. And probably the hardest to tell, since the other effects need to be canceled out.
The main take-away (assuming that the results replicate) is that not only do different markers transform at different rates, the change in transformation rate for different sized-fragments also varies for different markers. The underlying reasons for this are probably interesting. So what next? I need to repeat this, but next time I’d like to:
  1. Extend this to additional markers to see if this variability also applies to other loci.
  2. Measure linkage between Kan and Nov. I’ve previously seen the known linkage between Kan and Nov using large fragments, but would expect linkage to vanish for small fragments when the KanR and NovR alleles never share the same fragment.
And again, I really need to know what the lesions are that are responsible for the MAP7 antibiotic resistances. I’ve looked around a fair amount, but it seems that many antibiotic resistances can be produced by mutations in more than one different gene, so narrowing it down isn’t that straightforward.

Monday, August 10, 2009

Back to the grind

Whew! Getting that albatross (= grant application) off my neck feels good, but now I need to formulate a plan for the next several weeks of lab work. What are my priorities? There’s quite a lot of things I could be doing, so I might as well make a list...

(A) Generate purified DNA from 86-028NP, PittEE, and PittGG donor DNA: Should be pretty easy. I tried this a couple times in the past couple weeks, but had a few problems:
  1. Got a goopy mess. I don’t know if it was my lysis or something about my extractions, but when I would try to pull the aqueous layers off of my phenol/chloroform extractions, this goopy junk at the interface kept interfering with my ability to complete the preps. I will try try again, but this next time I will do a few things differently: (i) start with fewer cells per volume, (ii) do a straight phenol extraction first, (iii) let lysis proceed longer.
  2. PittEE didn’t grow. I’d gone to the lab stocks, streaked for single colonies, grown up overnight cultures from single colonies, and stored them in glycerol in the -80. But when I returned to PittEE frozen stocks, they don’t grow up! Weird. Maybe PittEE is particularly bad at surviving outside log-phase, so I had merely frozen down a pile of dead cells. I’ll have to return to the lab stocks and restreak for single colonies.
  3. I never checked the putative nalidixic acid resistant phenotype of 86-028NP. I really hope that marker will work to select for transformants from 86-028NP to KW20, as this would really help several steps of the project. In our strain database, 86-028NP is listed as NalR, but this might only be a clinically relevant phenotype. I need to check it on plates.

(B) Check the uptake and transformation phenotypes of WT, rec-2, rec-1, and pilB: I’ve already gone through this exercise with wt and rec-2, but as a negative control, pilB mutants should neither take up DNA nor be transformed by it, and for my translocation experiments, rec-1 should take up DNA normally but fail to transform. I have all these strains frozen down as M-IV competent cells, so I just need to do the experiments to verify that my mutants have the expected phenotypes. The rec-1 mutant in particular is important, since I plan to use it for cytosolic DNA purifications, but it could be somewhat leaky, something I need to know. (The only rec-1 mutant available is the original mutant isolated back in 1972. It’s older than me!)

(C) Quantify uptake and transformation for different-sized donor DNA: A couple of weeks ago, I used a “Bioruptor” to sonicate MAP7 DNA to different size distributions, and I tested the transformation rate of antibiotic resistance markers into competent cells for some of these different-sized DNAs. I also want to check how well they’re taken up. Once I’ve done this, I’ll report back the full set of results. The expectation is that while even small fragments will be taken up well, only large fragments will transform well. There’s also an interesting expected relationship between the amount of uptake and the size distribution of the DNA. If the DNA fragments are quite small, then only a fraction are expected to contain uptake signals, but many of the fragments that do will be taken up. On the other hand, large DNA fragments will mostly all contain uptake signals, but fewer molecules will be taken up, though these will be larger... Hmmm....

(D) Design the degenerate USS construct. This will require its own post, but suffice it to say, we have a pretty good plan on how to design this construct so that we will be able to directly sequence the degenerate USS motif by short single-end Illumina sequencing without any processing steps upfront.

(E) Transformation experiment: If indeed 86-028NP can grow on nalidixic acid plates, then I will just go ahead and do half of the experimental side of one of my specific aims. All I need to do is transform KW20 competent cells (which I have) with DNA taken from 86-028NP (which I’ll make), select for colonies that are NalR, pick several of them, grow them up, and extract their DNA. This would provide us with plenty of material to produce sequencing libraries and send off for sequencing. Then we can really get this project off the ground.

(F) Find somewhere that will do sequencing with a short turn-around: The local Illumina sequencers (at UBC’s Genome Center) offer what looks like excellent sequencing services for a reasonable cost, but their turn-around time is a bit too slow for us to hope to have any preliminary data for Rosie’s grant applications. If we want to get anything done more quickly, we will have to find someone else to get started.

(G) Design and produce a large DNA fragment from the 86-028NP genome to use for developing a cytosolic ssDNA prep: Since the most difficult experimental part of our plans is likely to be purifying (and cloning) ssDNA from the cytosol, I’d like to start with a defined construct I can use for purification experiments. I think that a good choice would be a large clone from 86-028NP, because this might be useful for other experiments as well. I will want something at least several kb on a plasmid. I will need to think about whether or not it matters what this fragment contains. As a first guess, I would want it to contain the putative NalR allele and several kb of flank.

What else should I be doing?

Thursday, August 6, 2009


...there... However it turns out, I'll finally have this grant I've been working on out of my hands tomorrow. I'm actually thrilled to get back into the lab when it's all over...