Friday, August 28, 2009

More alignments!

I realized something disturbing a couple days ago, when I crashed my computer by running out of hard-drive space... The Haemophilus influenzae genome is actually replicating inside my computer! Not just on my lab bench! Several of the programs I've been using (namely the alignment programs) have been spitting out files containing the complete genome, and I think I must've had several hundred before I got rid of a bunch of intermediary files. In silico infection! A scourge!

Anyways, I guess that last entry deserves more explanation, but it'll take me more than a post to do it. First off, the motivation for doing these kinds of analyses:

There’s 4 complete Haemophilus influenzae genomes, 11 with draft assemblies (in 4-50 contigs), and 17 for which sequencing is in progress. On top of that, hundreds of strains have been sequenced at seven house-keeping genes (MLST studies).

Within this “population genomic” data, we should be able to do some estimates of recombination rates between isolates, as well as some sophisticated analyses of genetic variation at USS motifs. But how? Here’s the basic idea:

(1) Align the complete and draft genomes.
(2) Use programs written to detect recombination signals in population genetic data.
(3) Examine genetic variation in and around USS motifs (scored for information content across each genome) and see if they correlate with the determined recombination signals.

There are several issues with this at each step, but I’ll take the first one first. I’ll start with a post describing how I produced the multiple alignment file that I fed to the PhiTest program...

The first trick was downloading the files...
I already had the four complete sequences, but hadn’t had the easiest time getting the contigs of the draft assemblies. Do they really want me to download each contig by clicking one at a time? My problem was solved when I found the whole-genome shotgun (WGS) FTP site at NCBI. Of course, these files were named by their WGS code, so I simply hand-search for the correct four letter code, based on this list and downloaded the corresponding file. If I’d been more clever, I would’ve done this from the command line with a script containing the list of WGS data I wanted.

I used the search term "txid727[orgn]" in the Genome Database or the Taxonomy Browser to call up Haemphilus influenzae genomes. Using the Genome database yielded some spare entries that were plasmid sequences, and using Taxonomy browser yielded many isolates where the sequencing is still in progress. Anyways, I could get to a Genome page for each sequenced isolate, but when I clocked to get the contigs, it provided separate accessions for each contig. Boo! There had to be a better way, which took some digging, but was here. This is all the WGS data at NCBI. Whew! Luckily, I found the requisite four-letter codes from my other searches and downloaded the appropriate *.gbff.gz files and expanded them with gunzip.

Anyway, this gave me a bunch of GenBank files with all the contigs in each one. The contigs themselves were simply ordered by size, with the largest first and the smallest last, and the coordinate system corresponded to this ordering. While this is fine, it would probably help me looking at alignments, if I ordered the contigs based on the reference KW20 Rd genome. It would be a meaningless gesture in some ways, but would certainly facilitate looking at pictures.

I did this by using MAUVE’s MoveContig function. I reordered each set of WGS contigs pairwise against the reference KW20 Rd. (Again, I should've been using the command-line, but found it easier to just click my way through... shameful.) Here’s an example of what MoveContigs did comparing PittII to Kw20 Rd:

Alignment #1: Keeps the original order of the contigs (from largest to smallest).
KW20 Rd is on top; PittII is on bottom. The colored blocks indicate contiguous aligned blocks. The red lines indicate contig boundaries (KW20 has only two, for the start and stop of the chromosome). The diagonal line between the two genomes show which blocks belong together. It’s quite obvious from all the crosses that the order of the contigs is off.

The program then reordered the contigs and did another alignment. Then it did this a third time (it keeps iterating until it’s done). It decides this is as good as it can do and stops.

Alignment #3: Reordered contigs maximizing “synteny” between the reference and PittII.
Clearly reordering the contigs has made the picture clearer to see. There are far fewer alignment blocks and fewer diagonally crossing central connector lines. Of course, some of the contigs did span some breakpoints, so there’s obviously rearrangements between the genomes. At each contig boundary, there is always the possibility of some rearrangement between the strains, so this contig order is at best provisional. Another thing I had to keep in mind while tooling around with these alignment programs is that the genome is circular, but the programs don’t treat the strings that way, so there’s often things at the right or left ends that obviously are still syntenic, but just don’t look like it with these linear visualizations.

Here’s the alignments again, without the colored blocks, just so that the reordering of the contigs is quite obvious:

Alignment #1:
Alignment #3:
Notably, the little tiny contigs that failed to find something in the reference to align with remained at the right end.

Alright, so going through 11 of these MoveContig steps provided me with 11 new files that had the contigs in a different order (based on KW20 Rd) than what I got from GenBank. The only unfortunate thing about this was that MoveContigs only outputted FastA, instead of GenBank, so I lost all the annotations. Oh well. I would still be able to use the KW20 Rd annotations. I think that the annotations can still be added back, but I haven’t figured that out yet.

To show that this exercise wasn’t a waste, below I show the stripped-down alignments when I did the big 15-way multiple alignments (I hid the 3 other completely sequenced strains from view, and only the top several contig reorderings are visible.

Before Contig Reordering:
After Contig Reordering:
Much easier to look at... But still sort of meaningless.

Alright! So that’s a 15-way multiple genome alignment. And indeed, it’s associated with a gigantic alignment file in the XMFA format: an extension of the FastA format that allows for a description of alignments that have rearrangements / contig breakponts / etc between them.

The most difficult challenges:
(1) Handling the breakpoints, contig boundaries and real rearrangement breakpoints alike.
(2) Handling indels. Those within an aligned block should be scorable, but how?

So really, the first thing I can work with without much difficulty are contiguous aligned blocks. I grabbed one of these from the XMFA file to produce a single FastA file, which I called test1.fasta, which I planned to use for some preliminary analyses...



  1. Are we pretty cnfident that there are no assembly errors within individual contigs?

  2. I think we have to assume that the contig assembly is right... at least as much as we trust the "complete" genomes. It's possible that they could've closed some of the gaps in their contigs, but were using strict criteria. I noticed while slogging through the alignments in MAUVE that a large number of the contig breakpoints were at rRNA operons, and it makes sense that they wouldn't be able to close these gaps with the 454 technology they used.