Tuesday, October 27, 2009

Sets of Snps

So I’ve got pure DNA from my transformants, pretty much ready to send off for our first Illumina runs. I’m just doing a few simple checks with PCR and digestion to make sure everything is kosher.

But there is this little fear in my head about what I’ll do when I get the massive datasets. I got a hold of some example Illumina GA2 paired-end data taken from E. coli K12. Since I don’t have any data of my own yet from H. influenzae, this seemed like a good dataset to start learning how to do the initial data analysis.

I decided to go with the most widely-used reference-mapping software, called “Maq” for “Mapping and Alignment with Quality”. I can see why it’s widely-used; it is a breeze to install and use, which is the primary requirement for end-users to like a given piece of software. I’ve started dealing with just single-end reads from only one lane. The data is several years old, so there are “only” a few million reads of 35 base-pairs. Nevertheless, this represents nearly 20X sequence coverage of the E. coli K12 genome.

I’ll keep the Maq overview brief, since I went through it in lab meeting and Maq’s documentation is largely quite good (with the caveat that there are some challenges interpreting the meaning of all the columns in the different outputs). In short, Maq takes a “FastQ” file (which is the sequence data, including quality scores) and a “FastA” file (the sequence reference), converts them into a binary format (which is apparently helpful for making the mapping algorithm fast), and then maps individual reads to the reference, allowing up to 2 mismatches. The “pile-up” at each position is used to decide the “consensus base” at that position, based on the majority base and associated quality scores. The mappings can be extracted from the binary output using additional commands.

Here, I’ll focus on the .snps output (obtained by cns2snp... from the man page), since this will be the most straight-forward way for us to define recombinant segments in our transformants. I’m keeping things simple still, so there are a lot of other issues I could get tangled up in, other than the ones I’ll discuss here.

So I “Maqed” this one FastQ dataset from E. coli K12 (I’m calling it s11) against two different reference sequences:
  • K12 -> NC_000913.fna
  • O57 -> NC_002695.fna
The first is a sort of control for how the consensus base caller is working. Since the sequencing is from K12 DNA, most consensus bases called by Maq should match the reference. Occasionally some “SNPs” might appear. These could have several sources:
  1. Spurious due to low coverage and high error
  2. Actual mutations between the original completely sequenced K12 and the K12 used for the Illumina sequencing
  3. Clonal variation where the consensus-calling favors the variant sequence.
The second is to call SNPs between the two strains. I would expect that the number of SNPs called against O57 would vastly exceed that of K12.

This is easily decided with a little UNIX command on the two files to count the lines (which each correspond to a SNP call):
> wc -l K12s11.snp O57s11.snp
6051 K12s11.snp
64217 O57s11.snp
Indeed, there are 10X more SNPs running s11 against O57 than against K12. The several thousand SNPs called against K12 are likely mostly errors. Maq doesn’t always assign the consensus base with A, C, G, or T, but with any of the other IUPAC nucleotide codes, so many of these “SNPs” are probably quite low confidence.

That’s all fine and good, but now what? How can I tell how well this s11 dataset did at calling all the SNPs between K12 and O57? I resorted to using Mummer’s Dnadiff program (previously discussed here) to compare the two reference sequences to each other and extract the SNPs. If the s11 dataset really did a good job sequencing the E. coli genome, there should be a strong overlap between the SNPs called by Mummer and those called by Maq.

(Here’s a GenomeMatcher DotPlot run with Mummer)
Again, UNIX came to the rescue, thanks to this page I found that provides UNIX commands for working with “sets”.

First, I had to get the appropriate three columns from the two SNP files:
  1. the reference genome position (in O57)
  2. the reference base (also from O57)
  3. the query base (the SNP in K12; the consensus base call for Maq or the SNP call for Mummer).
Here's how I made those files:
> cut -f 2 -f 3 -f 4 O57s11.snp > maqSnp.txt
> cut -f 1 -f 2 -f 3 O57vsK12.snps > mumSnp.txt
This provided me with my two sets, one called maqSnp.txt and the other mumSnp.txt. Here’s their lengths:
> wc -l maqSnp.txt mumSnp.txt
64217 maqSnp.txt
76111 mumSnp.txt
Notably, Mummer called many more SNPs than Maq. I think this is largely because the SNP output from Mummer includes single-nucleotide indels, which Maq misses since is does an ungapped alignment. I’m not sure how to deal with this, but in our real experiments, they should still be discoverable, since we’ll map our reads to both donor and recipient genomes. Also, there are numerous “SNPs” in the Maq file that are non-A, C, G, T consensus bases, which will largely be absent from the Mummer comparison.

So, then it was a piece-of-cake to count the SNPs in the intersection of the two sets. Indeed there were several ways to do it, my favorite of which was:
> grep -xF -f maqSnp.txt mumSnp.txt | wc -l

That’s a whole lotta overlap! Happy! When I get the real transformant data, this will help me tremendously.


Friday, October 23, 2009

Transformants Produced!

I'll set aside the hack bioinformatics posts for now and give an update on my transformation experiments... We’ve gotten access to a few lanes of Illumina GA2 sequencing for some preliminary studies, and right now I’m drying the genomic DNA samples that we plan to sequence.

The notion is to sequence several independent transformants of Haemophilus influenzae to get some idea of how much donor DNA taken up by cells finds its way into recipient chromosomes. This pilot study will go a long way in informing our planned large-scale experiments and give us a chance to learn how to handle the data.

Here’s what I did to produce the material...
...some transformations, of course!

First, I PCR amplified the gyrA and gyrB alleles from the MAP7 strain (which confer nalidixic acid and novobiocin resistance, respectively). MAP7 is a derivative of our recipient strain KW20 containing several point mutation that confer antibiotic resistances.

I used these PCR products to transform our donor strain 86-028NP to provide two selectable markers in the donor. I’ve been calling this strain 1350NN.

Then I extracted DNA from this strain and used it as the donor DNA to transform KW20 competent cells. By selecting for one or both markers, I can ensure that clones chosen for DNA extraction and sequencing were indeed derived from competent cells that got transformed.

Our baseline expectation is that there will be a large segment (10-50kb) of donor alleles in the transformants at selected sites and 2-3 additional large segments elsewhere in the genome.

Originally, we were going to do this transformation with only a single marker, but we realized that having two would allow us to measure the frequency of co-transformation.

Here’s what the transformation rates looked like:
I used MAP7 DNA as a donor as a control. Since MAP7 is more closely related to KW20 than 86-028NP, it is perhaps unsurprising that transformation rates were higher when using MAP7 as donor.

As for co-transformation, here’s the frequency of double transformants versus expected:
That corresponds to ~25-35% of the cells in the competent cell preparation actually being competent. I’ve been wracking my brain unsuccessfully trying to figure out how to do a back-of-theenvelope calculation as to how many independent molecules we expect to transform any given recipient. I just can’t figure out a concise or reasonable way to do it. Suffice it to say, I estimate a minimum of 20 kb of donor DNA in each transformant (1% of the genome), up to perhaps 100 kb (5% of the genome).

There’s only one way to find out…

Thursday, October 15, 2009

Computing Bootcamp

Whew, I’ve really fallen behind on my blogging... Last week, a good friend of mine came into town for a “northern retreat”, in which he hoped to get work done on a paper. Instead, he and I drank enormous amounts of beer and did an enormous amount of computing with the Haemophilus infuenzae genome (at least by my standards). While the beer probably didn’t help anything, the computing did.

I’ll go over some of what we did in future posts, but right here I just want to outline some of the computing lessons I learned looking over his shoulder over the week. Many of these lessons have been given to me before and are likely quite basic for the real computationalists out there, but somehow I’ve emerged from the computing emersion with a lot more competence and confidence than I had before...

Here's three useful things I'm getting better at:

(1) Avoid using the mouse. The more that can be accomplished from the command line and using keystrokes, the better. From the command line, tab-completion and cursor-control of the command history make issuing commands far more efficient. The coolest keystroke I’ve now picked up the habit of in the Mac Leopard OS is Cmd-Tab, which takes you to the last active open application (and repeated Cmd-Tabs cycle through the open applications in order of their previous usage). This is perfect for toggling between the command-line and a text-editor where one can keep track of what one is doing.

(2) Toggle between the command-line and a text-editor constantly. Rather than trying to write a whole script and then run it from the command-line, it was far easier and faster to simply try commands out, sending them to the standard output, and cobble together the script line-by-line, adding the working commands to a text document. This has three useful effects: (1) Bugs get worked out before they even go into a script, (2) It forces one to document one’s work, as in a lab notebook. This also ended up being quite useful for my lab meeting this week, in which I decided to illustrate some stuff directly from the terminal. (3) It is forcing me to work “properly”, that is sticking with UNIX commands as much as possible.

(3) Learn how the computer is indexing your data. This point is probably the most important, but also the one that is taking me the most effort. I’ll illustrate with an example (which I’ll get into in more scientific detail later):

The output of one of our scripts was a giant 3 column X 1.8 million row table. I wanted to look at a subset of this huge table, in which the values in some of the cells exceeded some threshold. At first I was doing this (in R) by writing fairly complicated loops, which would go through each line in the file, see if any cells fit my criteria, and then return a new file that only including those rows I was interested in. When I’d run the loop, it would take several minutes for finish. And writing the loop was somewhat cumbersome.

But the extremely valuable thing I learned was that R already had all the data in RAM indexed in a very specific way. Built-in functions (which are extremely fast) allowed me to access a subset of the data using a single simple line of code. Not only did this work dramatically faster, but was much more intuitive to write down. Furthermore, it made it possible for me to index the large dataset in several different ways and instantly call up whichever subset I wanted to plot or whatnot. I ended up with a much leaner and straightforward way of analyzing the giant table and I didn’t need to make a bunch of intermediary files or keep track of as many variables.

Next time, I’ll try and flesh out some of the details what I was doing...

Thursday, October 1, 2009

Corrected Logos

Yesterday, I attempted to make some logos out of the degenerate USS simulated data that Rosie sent me. Turns out, I was doing it wrong. After checking out the original logo paper, I was able to figure out how to make my own logos in Excel. I wasn't supposed to plot the "information content" of each base; I was supposed to take the total information content (in bits) of each position (as determined by the equation in the last post) and then, for each base, multiply that amount by the observed frequency of that base to get the height of that element in the logo. So, below I put the proper logos for the selected and unselected degenerate USS sets (background corrected):

Here's the selected set:
Here's the unselected set: