Friday, April 2, 2010

SNP densities

So I’ve been writing yet another grant, which has been distracting me from blogging (this isn't supposed to be a monthly blog, but this will hopefully be the last grant application for a while).

But I’ve also been doing several analyses lately. Here’s one. I took the sequences of an ~300 kb restriction fragment from three H. influenzae isolates (Rd, 86-028NP, and PittGG). They’re all similarly divergent from each other (~2.5%), and I wondered how well the level of divergence of Rd vs NP and Rd vs GG correlated along the chromosome...

So I aligned the sequences in Mauve, took its SNP-calling output, and did a couple of simple sliding-window analyses in R (using the zoo package for rolling means). Here’s what divergence looked like averaged over 5 kb windows:
Divergence of Rd from NP and divergence of Rd from GG are quite well correlated (r² = 0.8, using linear modeling). But since NP and GG are similarly divergent, I made two other plots.
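For anyone wanting to try this kind of windowed comparison, here is a minimal Python sketch (not the R/zoo code behind the plots above). It assumes the SNP positions have already been pulled out of the alignment as 0-based coordinates; the function names `window_density` and `r_squared` and the 1 kb step are illustrative choices, not anything provided by Mauve or zoo.

```python
from bisect import bisect_left


def window_density(snp_positions, seq_length, window=5000, step=1000):
    """Return SNPs per base for each sliding window along the sequence.

    snp_positions: 0-based SNP coordinates (any order).
    Windows are half-open intervals [start, start + window).
    """
    positions = sorted(snp_positions)
    densities = []
    for start in range(0, seq_length - window + 1, step):
        lo = bisect_left(positions, start)
        hi = bisect_left(positions, start + window)
        densities.append((hi - lo) / window)
    return densities


def r_squared(x, y):
    """Squared Pearson correlation (the R^2 of a simple linear fit of y on x)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)
```

Passing the Rd vs NP and Rd vs GG density lists to `r_squared` gives a number comparable to the r² quoted above. Overlapping windows are not independent of each other, so the value is best read as descriptive rather than as a significance test.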

First, here’s a comparison of the density of SNPs that are shared by NP and GG and those that are unique to either NP or GG:
The correlation is a lot worse (r² = 0.4).

And here I further break the “unshared” line into NP-specific and GG-specific SNPs (i.e. positions that differ between Rd and NP but not GG, and vice versa):
The correlation is worse still (r² = 0.2).
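The shared versus strain-specific split can be sketched in Python as below. This is only an illustration under stated assumptions: the three sequences are equal-length, uppercase, gapped alignment rows with Rd as the reference; "shared" is taken to mean that NP and GG carry the same non-Rd base; and positions where all three strains differ go into a separate `triallelic` bin. The function name `classify_snps` and the bin names are hypothetical.

```python
def classify_snps(rd, np_seq, gg_seq):
    """Sort gap-free alignment columns into SNP classes relative to Rd.

    Returns a dict mapping class name to a list of 0-based alignment columns.
    Columns containing a gap ('-') in any strain are skipped.
    """
    classes = {"shared": [], "np_only": [], "gg_only": [], "triallelic": []}
    for i, (r, n, g) in enumerate(zip(rd, np_seq, gg_seq)):
        if "-" in (r, n, g):
            continue  # indel column: ignored, as in the analysis above
        if n != r and g != r:
            # Both donors differ from Rd; same base = shared SNP.
            classes["shared" if n == g else "triallelic"].append(i)
        elif n != r:
            classes["np_only"].append(i)
        elif g != r:
            classes["gg_only"].append(i)
    return classes
```

Each list of columns can then be turned into a windowed density and compared in the same way as the total divergence.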

Smaller windows gave similar results, but the plots looked a lot messier. Note that it’s not entirely straightforward to measure SNP density... What does one do at indels? I just ignored them, so the results above are rough. Part of the reason I focused on only a co-linear segment of the chromosome was to minimize this problem, but there are still several indels between each pair of the three strains.
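One possible way to handle indels, different from simply ignoring them as was done for the plots above, is to divide the mismatches in each window by the number of columns that could actually be compared. The Python sketch below assumes two equal-length gapped alignment rows with '-' as the gap character; `gap_aware_divergence` is a hypothetical name, and non-overlapping windows are used only to keep the example short.

```python
def gap_aware_divergence(ref, other, window=5000, step=5000):
    """Per-window divergence, counting only columns with no gap in either row.

    Returns one value per window: mismatches / compared columns,
    or None when a window contains no comparable columns.
    """
    results = []
    for start in range(0, len(ref) - window + 1, step):
        compared = 0
        mismatches = 0
        pairs = zip(ref[start:start + window], other[start:start + window])
        for a, b in pairs:
            if a == "-" or b == "-":
                continue  # indel column: excluded from the denominator too
            compared += 1
            if a != b:
                mismatches += 1
        results.append(mismatches / compared if compared else None)
    return results
```

With this normalization, a window that is mostly a large insertion no longer looks artificially conserved, although it rests on fewer bases and is correspondingly noisier.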

Indels aside, what’s this mean? One of the goals of my transformation frequency mapping is to be able to distinguish the effects of sequence divergence on transformation from the effects of other local chromosomal properties (base composition, sequence motifs, etc.). Since NP and GG have correlated SNP densities relative to Rd, transformation frequencies across the Rd chromosome are expected to also be correlated. Discrepancies in transformation frequency between NP and GG donors could indicate that SNPs specific to the isolates are somehow modulating transformation independent of divergence per se.

Distinguishing chromosome “position effects” from sequence divergence will probably require a third donor DNA. Deciding what this would be requires some thought. All of the sequenced H. influenzae isolates are similarly divergent from Rd (and for the most part from each other), and phylogeny poorly distinguishes separate clades (i.e. they kind of give a star phylogeny).

So I should use either a strain much more closely related to Rd or one more distantly related (perhaps another species). Using a closely related strain has the advantage that transformation frequencies are expected to be higher and divergence will play less of a role, making the focus more on divergence-independent factors, but I would also have far fewer markers.

Based on MLST comparisons, several strains are sisters of Rd (RM7033, RM7429, RM7271). These assignments are made in several phylogenetic analyses and put the three at ~0.5% divergent from Rd. So I would expect that RM7033 (for example) would have ~6000 SNPs relative to Rd (far more than our Rd or the other sequenced Rd), ample to have markers across the chromosome...
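As a quick sanity check on numbers like these, the arithmetic linking divergence, comparable sequence length, and marker count can be written out in Python. This is only a back-of-the-envelope sketch with hypothetical function names; it does not claim to show how the ~6000 figure was actually derived, and it ignores indels and uneven SNP spacing.

```python
def expected_snps(divergence, aligned_bp):
    """Expected SNP count for a given per-base divergence and aligned length."""
    return divergence * aligned_bp


def implied_aligned_bp(snp_count, divergence):
    """Length of comparable sequence implied by a SNP count and a divergence."""
    return snp_count / divergence


def mean_spacing_bp(snp_count, aligned_bp):
    """Average distance between markers if SNPs were evenly spread."""
    return aligned_bp / snp_count


# ~2.5% divergence over the ~300 kb fragment analysed above
fragment_snps = expected_snps(0.025, 300_000)

# ~6000 SNPs at ~0.5% divergence
sister_aligned_bp = implied_aligned_bp(6000, 0.005)
sister_spacing = mean_spacing_bp(6000, sister_aligned_bp)
```

Under these assumptions, ~6000 SNPs at ~0.5% divergence corresponds to roughly 1.2 Mb of comparable sequence and an average of about one marker every 200 bp within it.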
