Friday, August 28, 2009

Homoplasy versus Recombination

Making up for lost blogging time!

Just to get started doing something with this alignment, I just grabbed a single 68,580 bp alignment block out of the XMFA file I generated (as described in the last post), which was itself a simple FastA formatted file. So I could use whatever programs could handle a simple FastA alignment and not worry about converting around to different formats yet, or coping with contig and rearrangement boundaries.

Here’s a picture from MAUVE of the aligned block I used (KW20 Rd reference coordinates 401,890 to 467,839). It’s the orange block shown below.
Based on some recent reading, I tried a couple different programs designed to detect signals of recombination from multiple alignment files. There is a serious caveat here, in that I do not understand exactly how these programs work, and since they were designed for use on sexual eukaryotes, it’s quite possible that the assumptions of the programs will not be accurate when applied to Haemophilus. Nevertheless, I sallied forth, just to see if they’d at least work with my sequences. I can come back and try to understand the meaning of the output more slowly...

LDhat

First, I tried LDhat, which is composed of a suite of programs aimed at detecting recombination in a population sample. I didn’t have any difficulty downloading and compiling the program, which was nice. I wish I could tell you more about it, but I never actually made it work, so there’s not much point yet. I got the “Convert” program to make the requisite files from an only slightly modified FastA, and I eventually got the “Pairwise” program to work. This program generates a “look-up table” which is required for the other programs to operate.

At first, the program kept choking on my data and was unable to produce the requisite look-up table for my data. So I fed it a pre-made file that can be found at LDhat’s website. The look-up table I used was first converted to a 15 taxa file using their “LKgen” utility. Then “Pairwise” worked and generated me a new look-up table that should theoretically have worked with the other programs. I next used the “Interval” program, but I could never get past here. It wouldn’t do the analysis, because the look-up table was somehow lacking. It was not “exhaustive”, whatever that means, and there was apparently a disparity between “npt” (which I assume stands for Nonparametric test) and the data.

No clue. Moving on. (I’ll try to return to this some other time, since the program “Rhomap” looks like what I’m looking for).

PhiTest

Next I turned to PhiTest, which was also easily downloaded and compiled. This program uses a principle that seems pretty sensible. It examines “incompatibilities” in phylogenetic signals. If two lineages of bacteria diverge and never recombine, then adjacent polymorphisms will most likely be “compatible”, that is both polymorphic sites will have the same phylogenetic signal (i.e. they will support the same tree topology). On the other hand, “incompatible” sites could have two possible histories, one in which there was recurrent mutation in different lineages and one in which there was recombination between lineages. This is illustrated nicely in a figure from the paper reporting PhiTest:

"Figure 1:
The dual nature of incompatibility. Two possible histories for a pair of incompatible sites are shown: (a) two incompatible sites explained by a recombination event and (b) two incompatible sites explained by a convergent mutation. Mutations in the first site are indicated by open circles and mutations in the second site are indicated by solid circles. To explain the incompatibility between the pair of sites either a recombination event must be invoked or a homoplasy must have occurred in the history of one of the sites."

Figure 1A:
Figure 1b:
So in (a) the black mutation only occurred once and was subsequently shared between the central lineages, whereas in (b) the black mutation occurred independently in the two lineages.

They more rigorously define “compatibility” thusly: “Two sites i and j are compatible if and only if there is a genealogical history that can be inferred parsimoniously that does not involve any recurrent or convergent mutations (known as homoplasies as in Figure 1b). If the two sites are not compatible, they are termed incompatible. Under an infinite-sites model (KIMURA 1969) of sequence evolution, the possibility of a homoplasy does not exist, and so incompatibility for a pair of sites implies that at least one recombination event must have occurred, as in Figure 1a.”

So the idea of recombination detection algorithms is to exploit the notion of incompatibility, but it is extremely important to note that there are really two possibilities for any incompatibility: (a) It is a true “homoplasy”, i.e. a recurrent mutation; (b) it indicates recombination. The basic idea of the NSS method (which is apparently what LDhat is using) is to look at neighboring pairs of sites and find regions where incompatibilities are clustered together, suggesting perhaps a hotspot.

The other, possibly more eukaryotic, aspect is that sites that are more distant will tend to be less compatible, since they’d more likely have had crossovers between them. Our provisional model for natural transformation would not necessarily have this feature, however, since we do not expect our recombination events to be crossover-like.

The authors of PhiTest produced a different measure than the NSS model that looked at the “Pairwise Homoplasy Index” for each pair of SNPs in an alignment. This gives a test statistic of the minimum number of homoplasies in the alignment, essentially based on a parsimony criterion. They could then apply permutation tests to measure the significance of a given PHI for an interval. Happily, the program will also output the significance under the NSS and MaxChi models.

For my alignment, PhiTest gave the probability of no recombination as 0 for all three models. So it looks like there’s quite a bit of homoplasy in my alignment, evidence of recombination. (It’s worth noting that PhiTest and LDhat both explicitly ignore gaps in the alignment.)

But that’s not good enough, what I really want is a plot of recombination rates across the alignment block. The figure I made from this post was the output of that analysis. It is only showing the ~9500 polymorphic sites and ignoring all the gaps and identical sequences. It’s showing, for each pair of SNPs, the probability of recombination (assuming that all homoplasies are due to recombination and not recurrent mutation).

It pretty much looks like there’s evidence of recombination almost everywhere! There’s just a couple areas that look like they have extensive compatibility, so it may actually be a lot easier for us to detect places where there’s been little recombination than places where there’s been lots.

I’ll need to go back to the raw data and figure out which set of SNPs are showing evidence of low recombination and see where those spots are in the alignment. They could be horizontally transferred segments with few USS?

I’m not sure what to make of this. It complicates things. To what extent can I trust these algorithms to find recombination and not recurrent mutation? The divergence between my sequences isn’t that high (only ~3%), so recurrent mutation doesn’t seem likely to be a huge problem, but still...

Anyways, I went ahead and used the other utility that comes with PhiTest, called Profile. This does PhiTest on sliding windows, providing a P-value for each segment. Here’s what it looked like, when I did a “Profile” of non-overlapping 500 bp chunks of my alignment:
The x-axis is the position in the alignment, and the y-axis is the p-value of the PhiTest for that particular 500 bp chunk. The line indicates a P-value threshold of 0.05, so any segment below that line has evidence of recombination, based on PHI. Segments that have higher P-values may either have compatible SNPs, or for the very high P-values, be sites with very little SNPs.

This seems promising, if we can trust that we’re measuring something meaningful. I have a strange feeling that using homoplasy to measure recombination is going to be fraught with difficulties, but it’s pretty much the best principle that we can work with to detect recombination in these population-level samples. Luckily, both PhiTest and the other methods are sophisticated enough that they are evaluating the homoplasy index for a given pair of SNPs relative to other surrounding SNPs, so when incompatibilities are detected, they are standing out from other possible pairings. This should help with the problem of true homoplasy, but then again, maybe it doesn't. I'll need to talk to my friend Corbin about this stuff.

Whew! Okay, enough on this for the moment...

3 comments:

  1. Does the length of recombination (conversion) tracts have a similar effect as the distance between crossovers (linkage) in sexual eukaryotes? I realize that the proportionality will be different - linkage effects act over longer distances, but is the direction the same or reversed?

    ReplyDelete
  2. Gosh. I've been thinking about this, and I have a feeling it's really really different. Having dug around these programs more, I'm pretty sure that the program I need to make work is the "Pairwise" program in the LDhat package. It seems they implemented a gene conversion model in this program, which is really what we want. However, I am still flummuxed by how they measure statisitical significance. While the permutation tests they do makes plenty of sense for crossover recombination, I'm not sure it does for the gene conversion case.

    The whole thing seems a bit tricky... BUT one of the authors of LDhat is an old colleague of mine, so I'm going to harass him for some help, if I can track him down...

    ReplyDelete
  3. Hi
    Long after this post , and just in case sombody else lokks for the information: For the LDHat suite, interval is the only programm you can use without creating a lookup table. For other programs like rhomap you have to make a look-up table with complete -seq -loc then rhomap will run smoothly as far as you can answer for the different parameters.
    Best regards.
    JLL

    ReplyDelete