Tuesday, October 27, 2009

Sets of SNPs

So I’ve got pure DNA from my transformants, pretty much ready to send off for our first Illumina runs. I’m just doing a few simple checks with PCR and digestion to make sure everything is kosher.

But there is this little fear in my head about what I’ll do when I get the massive datasets. I got a hold of some example Illumina GA2 paired-end data taken from E. coli K12. Since I don’t have any data of my own yet from H. influenzae, this seemed like a good dataset to start learning how to do the initial data analysis.

I decided to go with the most widely-used reference-mapping software, called “Maq” for “Mapping and Alignment with Quality”. I can see why it’s widely-used; it is a breeze to install and use, which is the primary requirement for end-users to like a given piece of software. I’ve started dealing with just single-end reads from only one lane. The data is several years old, so there are “only” a few million reads of 35 base-pairs. Nevertheless, this represents nearly 20X sequence coverage of the E. coli K12 genome.

I’ll keep the Maq overview brief, since I went through it in lab meeting and Maq’s documentation is largely quite good (with the caveat that there are some challenges interpreting the meaning of all the columns in the different outputs). In short, Maq takes a “FastQ” file (which is the sequence data, including quality scores) and a “FastA” file (the sequence reference), converts them into a binary format (which is apparently helpful for making the mapping algorithm fast), and then maps individual reads to the reference, allowing up to 2 mismatches. The “pile-up” at each position is used to decide the “consensus base” at that position, based on the majority base and associated quality scores. The mappings can be extracted from the binary output using additional commands.
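Under my reading of the man page, the steps above boil down to a handful of commands. A sketch, with file names that are placeholders of my own (this needs the maq binary on the PATH, so I'm not showing output):

```shell
# Convert the reference and the reads to Maq's binary formats
maq fasta2bfa ref.fasta ref.bfa
maq fastq2bfq reads.fastq reads.bfq

# Map reads to the reference (ungapped; up to 2 mismatches by default)
maq map aln.map ref.bfa reads.bfq

# Build the consensus from the pile-up, then pull out the SNP calls
maq assemble cns.cns ref.bfa aln.map
maq cns2snp cns.cns > out.snp
```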

Here, I’ll focus on the .snp output (obtained with cns2snp, per the man page), since this will be the most straightforward way for us to define recombinant segments in our transformants. I’m still keeping things simple, so there are plenty of other issues I could get tangled up in beyond the ones I’ll discuss here.

So I “Maqed” this one FastQ dataset from E. coli K12 (I’m calling it s11) against two different reference sequences:
  • K12 -> NC_000913.fna
  • O57 -> NC_002695.fna
The first is a sort of control for how the consensus base caller is working. Since the sequencing is from K12 DNA, most consensus bases called by Maq should match the reference. Occasionally some “SNPs” might appear. These could have several sources:
  1. Spurious due to low coverage and high error
  2. Actual mutations between the original completely sequenced K12 and the K12 used for the Illumina sequencing
  3. Clonal variation where the consensus-calling favors the variant sequence.
The second is to call SNPs between the two strains. I would expect that the number of SNPs called against O57 would vastly exceed that of K12.

This is easily decided with a little UNIX command on the two files to count the lines (which each correspond to a SNP call):
> wc -l K12s11.snp O57s11.snp
6051 K12s11.snp
64217 O57s11.snp
Indeed, there are roughly 10X more SNPs running s11 against O57 than against K12. The several thousand SNPs called against K12 are likely mostly errors. Maq doesn’t always call the consensus base as A, C, G, or T; it can also call any of the other IUPAC ambiguity codes, so many of these “SNPs” are probably quite low-confidence.
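To put a number on that, one could count the calls whose consensus base isn’t a plain A, C, G, or T. A quick awk sketch on a made-up three-line stand-in for a .snp file (in Maq’s .snp output, column 4 holds the consensus base; the real file would be K12s11.snp):

```shell
# Toy stand-in for a Maq .snp file: chromosome, position,
# reference base, consensus base (real files have more columns)
printf 'ref\t1002\tA\tG\nref\t2550\tC\tY\nref\t3801\tT\tW\n' > toy.snp

# Count calls whose consensus base is an IUPAC ambiguity code
awk '$4 !~ /^[ACGT]$/ { n++ } END { print n+0 }' toy.snp
# → 2
```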

That’s all fine and good, but now what? How can I tell how well this s11 dataset did at calling all the SNPs between K12 and O57? I resorted to using Mummer’s Dnadiff program (previously discussed here) to compare the two reference sequences to each other and extract the SNPs. If the s11 dataset really did a good job sequencing the E. coli genome, there should be a strong overlap between the SNPs called by Mummer and those called by Maq.
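For reference, the Dnadiff run itself is a one-liner. The -p prefix and file names here are my choices (O57 as the reference, so the .snps positions come out in O57 coordinates), and it needs MUMmer’s dnadiff on the PATH:

```shell
# Compare the two reference genomes; among other report files,
# this writes the SNP list to O57vsK12.snps
dnadiff -p O57vsK12 NC_002695.fna NC_000913.fna
```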

(Here’s a GenomeMatcher DotPlot run with Mummer)
Again, UNIX came to the rescue, thanks to this page I found that provides UNIX commands for working with “sets”.

First, I had to get the appropriate three columns from the two SNP files:
  1. the reference genome position (in O57)
  2. the reference base (also from O57)
  3. the query base (the SNP in K12; the consensus base call for Maq or the SNP call for Mummer).
Here's how I made those files:
> cut -f 2-4 O57s11.snp > maqSnp.txt
> cut -f 1-3 O57vsK12.snps > mumSnp.txt
This provided me with my two sets, one called maqSnp.txt and the other mumSnp.txt. Here are their lengths:
> wc -l maqSnp.txt mumSnp.txt
64217 maqSnp.txt
76111 mumSnp.txt
Notably, Mummer called many more SNPs than Maq. I think this is largely because the SNP output from Mummer includes single-nucleotide indels, which Maq misses since it does an ungapped alignment. I’m not sure how to deal with this, but in our real experiments they should still be discoverable, since we’ll map our reads to both donor and recipient genomes. Also, there are numerous “SNPs” in the Maq file with non-A, C, G, T consensus bases, which will largely be absent from the Mummer comparison.
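One way to set the indels aside would be to drop the rows where show-snps marks the missing base of an insertion or deletion with a “.” in the reference- or query-base column. A small sketch on made-up data standing in for mumSnp.txt:

```shell
# Toy stand-in for mumSnp.txt: position, ref base, query base;
# "." marks the absent base at an indel
printf '150\tA\tG\n275\t.\tT\n390\tC\t.\n512\tG\tA\n' > toymum.txt

# Keep only substitutions (both bases present), then count them
awk '$2 != "." && $3 != "."' toymum.txt | wc -l
# counts the 2 substitution rows
```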

So, then it was a piece of cake to count the SNPs in the intersection of the two sets. Indeed, there were several ways to do it, my favorite of which was:
> grep -xF -f maqSnp.txt mumSnp.txt | wc -l
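An equivalent route is comm, which requires its inputs to be sorted. A sketch with two tiny toy sets in place of the real files (process substitution here assumes a bash-like shell):

```shell
# Toy sets standing in for maqSnp.txt and mumSnp.txt
printf '100\tA\tG\n200\tC\tT\n300\tG\tA\n' > setA.txt
printf '200\tC\tT\n300\tG\tA\n400\tT\tC\n' > setB.txt

# comm -12 keeps only lines common to both (sorted) inputs
comm -12 <(sort setA.txt) <(sort setB.txt) | wc -l
# counts the 2 shared lines
```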

That’s a whole lotta overlap! Happy! When I get the real transformant data, this will help me tremendously.

1 comment:

  1. That final grep command would've been slightly slicker if I'd added a -c to the command and then ditched the pipe:

    > grep -cxF -f maqSnp.txt mumSnp.txt