Friday, May 29, 2009

Enumerating their differences

I took a break from working on GBrowse today to try and do some genome-wide alignments of the sequenced Haemophilus influenzae isolates, and it's gone remarkably smoothly. The more I play around with the command-line, the more I'm finding it an excellent and efficient way to compute. Still hurts my head after a while.

I went to GenBank and downloaded FASTA and GenBank files for the four completely sequenced strains: our reference KW20-Rd and three clinical isolates found in ear infections...

I first used a program called MAUVE, recommended by my predecessor postdoc, which produced the lovely looking colorful plot above, comparing KW20-Rd and 86-028NP. It was extremely easy to use (though I only tried default settings). The only thing I had to do was change the GenBank file extensions from .gb to .gbk and everything worked great.

The colored blocks indicate syntenic regions and the relative sequence similarity within a block is indicated on the y-axis. When a block is in an inverted orientation, it falls below the line. When there's white-space, that's indicating a large insertion absent from the other strain. I can sort of piece together the rearrangements separating the two strains by eye, but even easier is a button in the MAUVE window that took me to a GRIMM analysis page that simply gave me a minimal path of inversions between the sequences. Sweet! It counts six big inversions (some overlapping) to give the rearrangements needed to get from KW20-Rd to 86-028NP.

My next task was to try and actually enumerate the differences between the strains. MAUVE makes a pretty picture (and also has a nice browser-like visual annotation based on the GenBank file), but I'd also simply like to count all the SNPs, indels, and other rearrangment breakpoints between the strains. I couldn't figure out how to do this in MAUVE and didn't feel like trying to parse its alignment file.

So I got an altogether different genome-wide alignment suite of programs called MUMmer. At first, I was scared of it, since it involved command-line arguments, and I'm sort of fried on that for the week. But after installing and compiling the suite of programs, I found it incredibly easy to use. For this I used FASTA files, and invoked the dnadiff program, which uses a bunch of the MUMmer utilities for pairwise comparisons. It did each pairwise analysis in under 10 seconds, providing me with a series of different output files. I was impressed.

I spent most of the day learning what these files contained and playing with the settings, and it's pretty much got everything there I want, as long as I can parse the files correctly. I decided to start with the SNPs between the reference KW20-Rd genome and the other three. There as one little glitch in the nice .report output file, in which it didn't call all classes of SNPs (it was excluding G-T differences from the report), so I dug into the code and successfully altered it to give the full list. Here's what I found:
There's >40K SNPs between the reference and each of these others. As would be expected, the transition/transversion ratios are ~2. The number of SNPs going from X->Y and Y->X were effectively identical for all comparisons, so I added them together in the above plot. It also looks like toggling between G and C is a particularly difficult transversion for some reason. Anyways, piles of SNPs, as expected.

For my next trick, I'll look at the distribution of indel and rearrangement polymorphism...


  1. Wow, mucho progress! I should go away all the time...

  2. I should add that in the code for, there was a hilarious comment about 3/4 of the way through the >700 line file, right before the bit I needed to change:

    #-- If you are reading this, you need to get out more...