Tuesday, April 13, 2010

Mining some old array data

So in an effort to re-examine some of the lab’s old array data, I made a fairly simple R script to plot the change in expression of competence genes, putative purR-regulated genes, and genes involved in utilizing secondary sugars. We no longer have our expensive license for fancy-pants software, but all I needed to do was some arithmetic on columns and then find the rows of interest, so it’s R-tastic!

I looked at a time course dataset, in which expression was monitored over the course of growth in sBHI and after transfer to MIV. I also looked at a single one-off array comparing purR- to purR+ strains growing in late-log +cAMP.

Here are the results for the time course. All values are normalized to the first time point. Blue are sBHI timepoints, and red are MIV timepoints. MIV cultures were split from the sBHI cultures at t=0 minutes.
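For anyone wanting to redo the normalization, here’s a minimal sketch of the "normalize to the first time point" step. It’s in Python rather than R and is not the script I actually ran; the function name, the log2 scale, and the numbers are all assumptions made up for illustration.

```python
import math

def normalize_to_first(values):
    """Return log2(value / first value) for one gene's time series.

    Using a log2 scale is an assumption; plain ratios would work the same way.
    """
    first = values[0]
    return [math.log2(v / first) for v in values]

# Made-up expression values for one hypothetical gene (not real array data)
example = [100.0, 200.0, 400.0, 50.0]
normalized = normalize_to_first(example)
print(normalized)
```

On this scale the first time point is always 0, so induction shows up as positive values and repression as negative ones.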

It’s pretty clear that the competence genes are strongly induced in MIV, but are also induced in late-log phase, as expected. Putative PurR-regulated genes are strongly and rapidly induced in MIV, indicating that purine pools are quickly depleted and the purine biosynthetic pathway is activated soon after transfer (much faster than the competence genes, it appears). The “non-PTS” genes (several genes induced by CRP when cAMP levels are high) appear to be briefly and weakly induced in MIV, as well as being weakly induced in late-log.

Here are the same sets of genes plotted as the ratio of expression in purR- vs purR+ cultures (late-log, induced with cAMP). Here, I plotted the ratios from both array elements for each gene (open and closed circles) and colored them just so they’d be easy to see. Also note, I normalized everything to the median ratio to account for dye effects (under the assumption that the median gene is not PurR regulated). Again, strong induction of the putative purine-regulated genes, a weak repression of the competence genes (presumably due to purine repression), and not much happening with the non-PTS sugar genes.
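The median normalization is simple enough to sketch. Again, this is Python rather than my R script, and the function name and numbers are invented for illustration. The idea is just to divide every ratio by the median ratio, so that the median gene ends up at a ratio of 1.

```python
from statistics import median

def median_normalize(ratios):
    """Divide each purR-/purR+ ratio by the median ratio.

    This assumes the median gene is not PurR regulated, so any shift
    of the median away from 1 is treated as a dye effect.
    """
    m = median(ratios)
    return [r / m for r in ratios]

# Made-up ratios for five hypothetical array elements (not real data)
raw = [0.5, 1.0, 2.0, 4.0, 8.0]
norm = median_normalize(raw)
print(norm)
```

After this step a gene at the median has a ratio of exactly 1, and everything else is read relative to that.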

Conclusion: Nothing we didn’t already suspect, but it’s good to see that things behave as expected. One point of note is that the hypothesized regulation of rec2 by PurR isn’t something that jumps out of this, but if purine repression acts upstream of rec2, we wouldn’t be able to see the effects of deleting PurR here anyways…

Friday, April 2, 2010

SNP densities

So I’ve been writing yet another grant, which has been distracting me from blogging (this isn't supposed to be a monthly blog, but this will hopefully be the last grant application for a while).

But I’ve also been doing several analyses lately. Here’s one. I took the sequences of an ~300 kb restriction fragment from three H. influenzae isolates (Rd, 86-028NP, and PittGG). They’re all similarly divergent from each other (~2.5%), and I wondered how well the level of divergence of Rd vs NP and Rd vs GG correlated along the chromosome...

So I aligned the sequences in Mauve, took its SNP calling output, and did a couple of simple sliding window analyses inside R (using the zoo package for rolling means). Here’s what divergence looked like averaged over 5 kb windows:
The divergences between Rd and the two other isolates are quite well correlated (r2=0.8, using linear modeling). But since NP and GG are similarly divergent, I made two other plots.

First, here’s a comparison of the density of SNPs that are shared by NP and GG and those that are unique to either NP or GG:
The correlation is a lot worse (r2=0.4).

And here’s what happens if I further break the “unshared” line into NP- and GG-specific SNPs (i.e. positions that differ between Rd and NP but not GG, and vice versa):
The correlation is worse still (r2=0.2).
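To make the shared vs. specific categories concrete, here’s a rough sketch of how aligned positions could be sorted into classes. It’s Python, not what I actually ran on the Mauve output, and the function name, the toy sequences, and the handling of three-way differences are all assumptions. Gapped columns are skipped, which matches ignoring indels.

```python
def classify_snps(rd, np_seq, gg):
    """Sort alignment columns into SNP classes relative to Rd.

    shared:      NP and GG carry the same base, which differs from Rd
    np_specific: NP differs from Rd, GG matches Rd
    gg_specific: GG differs from Rd, NP matches Rd
    all_differ:  all three bases differ (kept separate; an assumption)
    Columns containing a gap ("-") are ignored.
    """
    classes = {"shared": [], "np_specific": [], "gg_specific": [], "all_differ": []}
    for i, (r, n, g) in enumerate(zip(rd, np_seq, gg)):
        if "-" in (r, n, g):
            continue  # indel column: skipped
        if n == r and g == r:
            continue  # no SNP at this position
        if n == g:
            classes["shared"].append(i)
        elif g == r:
            classes["np_specific"].append(i)
        elif n == r:
            classes["gg_specific"].append(i)
        else:
            classes["all_differ"].append(i)
    return classes

# Toy aligned sequences (made up, not real H. influenzae sequence)
rd_toy = "ACGTACGTA-"
np_toy = "ATGTCCGAAA"
gg_toy = "ATGTACTCAA"
result = classify_snps(rd_toy, np_toy, gg_toy)
print(result)
```

The position lists for each class can then be fed into whatever windowing scheme is used for the density plots.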

Similar results applied to smaller windows, but the plots looked a lot messier. Note that it’s not exactly totally straightforward to measure SNP density... What does one do at indels?? I just ignored them, so the results above are rough. Part of the reason I focused on only a co-linear segment of chromosome was to minimize this problem, but there are still several indels between each of the three strains.
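The windowing and the r2 values can be sketched in a few lines. This is a Python stand-in for the zoo rolling means and linear model I used in R, not the actual script. The function names, the step size, and the toy positions are assumptions, and r2 here is the squared Pearson correlation, which is what a simple one-predictor linear regression reports.

```python
from bisect import bisect_left

def window_density(snp_positions, seq_length, window, step):
    """SNPs per bp in each window of `window` bp, sliding by `step` bp."""
    pos = sorted(snp_positions)
    densities = []
    for start in range(0, seq_length - window + 1, step):
        # Count SNPs with start <= position < start + window
        n = bisect_left(pos, start + window) - bisect_left(pos, start)
        densities.append(n / window)
    return densities

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Toy example: a 20 bp "chromosome" with made-up SNP positions
toy_density = window_density([1, 2, 3, 12, 18], seq_length=20, window=10, step=5)
print(toy_density)
print(r_squared([1, 2, 3], [3, 5, 7]))
```

For the real data the window would be 5000 bp, and the two density vectors (e.g. Rd vs NP and Rd vs GG) would go into `r_squared`. Overlapping windows aren't independent of each other, so the r2 is best read as descriptive.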

Indels aside, what’s this mean? One of the goals of my transformation frequency mapping is to be able to distinguish the effects of sequence divergence on transformation from the effects of other local chromosomal properties (base composition, sequence motifs, etc.). Since NP and GG have correlated SNP densities relative to Rd, transformation frequencies across the Rd chromosome are expected to also be correlated. Discrepancies in transformation frequency between NP and GG donors could indicate that SNPs specific to the isolates are somehow modulating transformation independent of divergence per se.

Distinguishing chromosome “position effects” from sequence divergence will probably require a third donor DNA. Deciding what this would be requires some thought. All of the sequenced H. influenzae strains are similarly divergent from Rd (and for the most part each other), and phylogeny poorly distinguishes separate clades (i.e. they kind of give a star phylogeny).

So I should use either a strain much more closely related to Rd or one more distantly related (perhaps another species). Using a closely related strain has the advantage that transformation frequencies are expected to be higher and divergence will play less of a role, making the focus more on divergence-independent factors, but I would also have far fewer markers.

Based on MLST comparisons, several strains are sisters of Rd (RM7033, RM7429, RM7271). These assignments are made in several phylogenetic analyses and put the three at ~0.5% divergent from Rd. So I would expect that RM7033 (for example) would have ~6000 SNPs from Rd (far more than our Rd or the other sequenced Rd), ample to have markers across the chromosome...