
Anyways, I guess that last entry deserves more explanation, but it'll take me more than a post to do it. First off, the motivation for doing these kinds of analyses:
There’s 4 complete Haemophilus influenzae genomes, 11 with draft assemblies (in 4-50 contigs), and 17 for which sequencing is in progress. On top of that, hundreds of strains have been sequenced at seven house-keeping genes (MLST studies).
Within this “population genomic” data, we should be able to do some estimates of recombination rates between isolates, as well as some sophisticated analyses of genetic variation at USS motifs. But how? Here’s the basic idea:
(1) Align the complete and draft genomes.
(2) Use programs written to detect recombination signals in population genetic data.
(3) Examine genetic variation in and around USS motifs (scored for information content across each genome) and see if they correlate with the determined recombination signals.
There are several issues with this at each step, but I’ll take the first one first. I’ll start with a post describing how I produced the multiple alignment file that I fed to the PhiTest program...
The first trick was downloading the files...
I already had the four complete sequences, but hadn’t had the easiest time getting the contigs of the draft assemblies. Do they really want me to download each contig by clicking one at a time? My problem was solved when I found the whole-genome shotgun (WGS) FTP site at NCBI. Of course, these files were named by their WGS code, so I simply hand-search for the correct four letter code, based on this list and downloaded the corresponding file. If I’d been more clever, I would’ve done this from the command line with a script containing the list of WGS data I wanted.
I used the search term "txid727[orgn]" in the Genome Database or the Taxonomy Browser to call up Haemphilus influenzae genomes. Using the Genome database yielded some spare entries that were plasmid sequences, and using Taxonomy browser yielded many isolates where the sequencing is still in progress. Anyways, I could get to a Genome page for each sequenced isolate, but when I clocked to get the contigs, it provided separate accessions for each contig. Boo! There had to be a better way, which took some digging, but was here. This is all the WGS data at NCBI. Whew! Luckily, I found the requisite four-letter codes from my other searches and downloaded the appropriate *.gbff.gz files and expanded them with gunzip.
Anyway, this gave me a bunch of GenBank files with all the contigs in each one. The contigs themselves were simply ordered by size, with the largest first and the smallest last, and the coordinate system corresponded to this ordering. While this is fine, it would probably help me looking at alignments, if I ordered the contigs based on the reference KW20 Rd genome. It would be a meaningless gesture in some ways, but would certainly facilitate looking at pictures.
I did this by using MAUVE’s MoveContig function. I reordered each set of WGS contigs pairwise against the reference KW20 Rd. (Again, I should've been using the command-line, but found it easier to just click my way through... shameful.) Here’s an example of what MoveContigs did comparing PittII to Kw20 Rd:
Alignment #1: Keeps the original order of the contigs (from largest to smallest).

The program then reordered the contigs and did another alignment. Then it did this a third time (it keeps iterating until it’s done). It decides this is as good as it can do and stops.
Alignment #3: Reordered contigs maximizing “synteny” between the reference and PittII.

Here’s the alignments again, without the colored blocks, just so that the reordering of the contigs is quite obvious:
Alignment #1:


Alright, so going through 11 of these MoveContig steps provided me with 11 new files that had the contigs in a different order (based on KW20 Rd) than what I got from GenBank. The only unfortunate thing about this was that MoveContigs only outputted FastA, instead of GenBank, so I lost all the annotations. Oh well. I would still be able to use the KW20 Rd annotations. I think that the annotations can still be added back, but I haven’t figured that out yet.
To show that this exercise wasn’t a waste, below I show the stripped-down alignments when I did the big 15-way multiple alignments (I hid the 3 other completely sequenced strains from view, and only the top several contig reorderings are visible.
Before Contig Reordering:


Alright! So that’s a 15-way multiple genome alignment. And indeed, it’s associated with a gigantic alignment file in the XMFA format: an extension of the FastA format that allows for a description of alignments that have rearrangements / contig breakponts / etc between them.
The most difficult challenges:
(1) Handling the breakpoints, contig boundaries and real rearrangement breakpoints alike.
(2) Handling indels. Those within an aligned block should be scorable, but how?
So really, the first thing I can work with without much difficulty are contiguous aligned blocks. I grabbed one of these from the XMFA file to produce a single FastA file, which I called test1.fasta, which I planned to use for some preliminary analyses...
TO BE CONTINUED
Are we pretty cnfident that there are no assembly errors within individual contigs?
ReplyDeleteI think we have to assume that the contig assembly is right... at least as much as we trust the "complete" genomes. It's possible that they could've closed some of the gaps in their contigs, but were using strict criteria. I noticed while slogging through the alignments in MAUVE that a large number of the contig breakpoints were at rRNA operons, and it makes sense that they wouldn't be able to close these gaps with the 454 technology they used.
ReplyDelete