
Repository for my contemplations of transmission genetics and natural transformation in Haemophilus influenzae, as well as notes on becoming a hack bioinformaticist

Ugh. Too much to blog about. The pace of computer work is totally different from that of lab work (though in the end they seem equally labor-intensive), and I've made so many figures and gotten so many numbers in the past week that I barely know where to start...
Okay, this is pretty cool. I will probably discover that it’s just some artifact of the mapping, but digging into the transformant data some more reveals what appears to be a high number of mutations within recombined segments (alleles that have neither donor nor recipient identity)...



Within each marker gene, there are a few recipient-specific SNPs. These have to do with the fact that the PCR fragment I used to add the NovR and NalR alleles of gyrB and gyrA contained recipient SNPs and some of these ended up in the donor genome.
On Wednesday, I put up some tables of data with bad calculations of molecules per cell. Rosie's concerns about the data were misplaced; they should've been concerns about my brain!
I keep meaning to put up these links, so I can find them in the future. They get you to NCBI's genome FTP site. If you click on the genome you want, you get a listing of a bunch of file formats. For the complete genomes, the choice for the FastA of the complete sequence is the .fna file. But there’s a bunch of others, which makes getting the correctly formatted data for application X easier:
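In the meantime, here's a minimal R sketch of pulling the complete-genome FastA for H. influenzae Rd (KW20) straight from that site. The FTP path is written from memory, so treat it as a guess rather than gospel; the accession is NC_000907.

# Fetch the complete-genome FastA (.fna) for H. influenzae Rd KW20.
# The FTP path below is an assumption based on the old genomes/Bacteria layout.
url <- "ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Haemophilus_influenzae_Rd_KW20/NC_000907.fna"
download.file(url, destfile = "NC_000907.fna", mode = "wb")
readLines("NC_000907.fna", n = 1)   # should print a ">" FastA header line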
Looks like less than 8 ng (or ~200 molecules per cell) is sub-saturating and greater than 8 ng is near saturation.

(I'm somehow deeply amused by the above image. I think it has something to do with the arrow going from the DNA molecule to the sun.)
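For the record, the molecules-per-cell arithmetic itself is only a few lines of R. This is just a sketch of the calculation: the genome size is the real ~1.83 Mb, but the cell count below is a placeholder, not the titer from my actual competent cell prep.

# Rough molecules-per-cell arithmetic: genome equivalents of DNA added per cell.
genome_bp <- 1.83e6                 # H. influenzae Rd genome size (bp)
g_per_bp  <- 650 / 6.022e23         # ~650 g/mol per bp, divided by Avogadro's number
dna_ng    <- 8                      # DNA added to the transformation (ng)
cells     <- 1e8                    # placeholder -- substitute the real cell count

genomes <- (dna_ng * 1e-9) / (genome_bp * g_per_bp)
genomes                             # total genome equivalents added
genomes / cells                     # genome equivalents per cell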
In some sense, the computing I did today isn’t really useful, since I already worked out these things using Microsoft Excel. But I’ve been ordered by my bioinformatics consultants to stop with the Excel already. So as practice, I worked out some of the expected features of degenerate oligos again, but this time using R.
Problem #2: Given a million instances of such an oligo, how well would each possible oligo with k mismatches be observed?
Problem #3: If some number of bases, m, in the n-length oligo are “important”, what proportion of oligos with k mismatches will have x “hits”?
I didn't try super-hard to make the perfect graphs, but it did take some effort to make a stacked bar plot...
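For anyone curious, here's roughly what those calculations look like in R. It's only a sketch: the oligo length and the number of "important" positions below are made-up placeholders, and only the 12% per-position degeneracy is the real figure. Problem #2 reduces to a binomial-style calculation, Problem #3 to a hypergeometric, and the hypergeometric proportions stack straight into a bar plot.

# Sketch of the degenerate-oligo expectations (parameters are illustrative).
n   <- 32      # oligo length (placeholder)
deg <- 0.12    # per-position degeneracy, as in the simulations
N   <- 1e6     # number of oligo molecules

# Problem #2: expected copies of one particular k-mismatch variant among N
# molecules.  A position matches the consensus with probability 1 - deg and is
# any one specific wrong base with probability deg/3.
k <- 0:6
expected_copies <- N * (1 - deg)^(n - k) * (deg / 3)^k
round(setNames(expected_copies, paste0(k, " mismatches")), 2)

# Problem #3: if m of the n positions are "important", the number of a given
# oligo's k mismatches that land on important positions ("hits") is hypergeometric.
m    <- 10     # "important" positions (placeholder)
hits <- 0:6
hit_props <- sapply(1:6, function(k) dhyper(hits, m, n - m, k))
colnames(hit_props) <- paste0("k=", 1:6)

# Each column sums to 1, so the proportions stack directly into a bar plot.
barplot(hit_props, xlab = "mismatches in oligo (k)",
        ylab = "proportion of oligos",
        legend.text = paste(hits, "hits"))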


So I’ve got pure DNA from my transformants, pretty much ready to send off for our first Illumina runs. I’m just doing a few simple checks with PCR and digestion to make sure everything is kosher.
I’ll keep the Maq overview brief, since I went through it in lab meeting and Maq’s documentation is largely quite good (with the caveat that there are some challenges interpreting the meaning of all the columns in the different outputs). In short, Maq takes a “FastQ” file (which is the sequence data, including quality scores) and a “FastA” file (the sequence reference), converts them into a binary format (which is apparently helpful for making the mapping algorithm fast), and then maps individual reads to the reference, allowing up to 2 mismatches. The “pile-up” at each position is used to decide the “consensus base” at that position, based on the majority base and associated quality scores. The mappings can be extracted from the binary output using additional commands.
> wc -l K12s11.snp O57s11.snp
6051 K12s11.snp
64217 O57s11.snp
Indeed, there are 10X more SNPs running s11 against O57 than against K12. The several thousand SNPs called against K12 are likely mostly errors. Maq doesn’t always assign the consensus base as A, C, G, or T, but sometimes as one of the other IUPAC nucleotide codes, so many of these “SNPs” are probably quite low confidence.
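As a quick check on that last point, a few lines of R can count how many of the K12 calls have an ambiguous (non-A/C/G/T) consensus base. This assumes the .snp file is tab-delimited with the consensus base in column 4, i.e. the same columns the cut commands below pull out.

# Count Maq SNP calls whose consensus base is an ambiguity code rather than
# A, C, G, or T.  Column 4 is assumed to be the consensus base.
snps <- read.table("K12s11.snp", sep = "\t", as.is = TRUE)
ambiguous <- !(snps[[4]] %in% c("A", "C", "G", "T"))
sum(ambiguous)    # number of non-A/C/G/T consensus calls
mean(ambiguous)   # fraction of all calls against K12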
Again, UNIX came to the rescue, thanks to this page I found that provides UNIX commands for working with “sets”.
> cut -f 2,3,4 O57s11.snp > maqSnp.txt
> cut -f 1,2,3 O57vsK12.snps > mumSnp.txt
This provided me with my two sets, one called maqSnp.txt and the other mumSnp.txt. Here’s their lengths:
> wc -l maqSnp.txt mumSnp.txt
64217 maqSnp.txt
76111 mumSnp.txt
Notably, Mummer called many more SNPs than Maq. I think this is largely because the SNP output from Mummer includes single-nucleotide indels, which Maq misses since it does an ungapped alignment. I’m not sure how to deal with this, but in our real experiments they should still be discoverable, since we’ll map our reads to both donor and recipient genomes. Also, there are numerous “SNPs” in the Maq file that are non-A, C, G, T consensus bases, which will largely be absent from the Mummer comparison.
> grep -xF -f maqSnp.txt mumSnp.txt | wc -l
57179
So: ~57,000 of the SNPs are called by both Maq and Mummer.
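(For what it's worth, the same intersection is nearly a one-liner in R too, reading the two files produced by the cut commands above:)

# Intersect the two SNP lists by exact line match, mirroring grep -xF -f.
maq <- readLines("maqSnp.txt")
mum <- readLines("mumSnp.txt")
length(intersect(maq, mum))   # note: intersect() drops duplicate lines,
                              # so this can differ slightly from the grep count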

I'll set aside the hack bioinformatics posts for now and give an update on my transformation experiments... We’ve gotten access to a few lanes of Illumina GA2 sequencing for some preliminary studies, and right now I’m drying the genomic DNA samples that we plan to sequence.
First, I PCR amplified the gyrA and gyrB alleles from the MAP7 strain (which confer nalidixic acid and novobiocin resistance, respectively). MAP7 is a derivative of our recipient strain KW20 containing several point mutations that confer antibiotic resistances.
Our baseline expectation is that there will be a large segment (10-50kb) of donor alleles in the transformants at selected sites and 2-3 additional large segments elsewhere in the genome.
As a control, I also used MAP7 DNA as a donor. Since MAP7 is more closely related to KW20 than 86-028NP is, it is perhaps unsurprising that transformation rates were higher when using MAP7 as donor.
That corresponds to ~25-35% of the cells in the competent cell preparation actually being competent. I’ve been wracking my brain unsuccessfully trying to figure out how to do a back-of-the-envelope calculation of how many independent molecules we expect to transform any given recipient. I just can’t figure out a concise or reasonable way to do it. Suffice it to say, I estimate a minimum of 20 kb of donor DNA in each transformant (1% of the genome), up to perhaps 100 kb (5% of the genome).
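(Those percentages are just the donor-DNA estimates divided by the ~1.83 Mb Rd genome; here's the trivial R version, to keep the arithmetic honest:)

# Donor DNA per transformant as a fraction of the ~1.83 Mb recipient genome.
genome_bp <- 1.83e6
donor_bp  <- c(min = 20e3, max = 100e3)
round(100 * donor_bp / genome_bp, 1)   # roughly 1% and 5% of the genome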
Whew, I’ve really fallen behind on my blogging... Last week, a good friend of mine came into town for a “northern retreat”, in which he hoped to get work done on a paper. Instead, he and I drank enormous amounts of beer and did an enormous amount of computing with the Haemophilus influenzae genome (at least by my standards). While the beer probably didn’t help anything, the computing did.
(1) Avoid using the mouse. The more that can be accomplished from the command line and using keystrokes, the better. From the command line, tab-completion and cursor-control of the command history make issuing commands far more efficient. The coolest keystroke habit I’ve picked up in Mac OS X Leopard is Cmd-Tab, which takes you to the last active open application (and repeated Cmd-Tabs cycle through the open applications in order of their previous usage). This is perfect for toggling between the command-line and a text-editor where one can keep track of what one is doing.
(2) Toggle between the command-line and a text-editor constantly. Rather than trying to write a whole script and then run it from the command-line, it was far easier and faster to simply try commands out, sending them to the standard output, and cobble together the script line-by-line, adding the working commands to a text document. This has three useful effects: (1) Bugs get worked out before they even go into a script. (2) It forces one to document one’s work, as in a lab notebook; this also ended up being quite useful for my lab meeting this week, in which I decided to illustrate some stuff directly from the terminal. (3) It forces me to work “properly”, that is, sticking with UNIX commands as much as possible.
(3) Learn how the computer is indexing your data. This point is probably the most important, but also the one that is taking me the most effort. I’ll illustrate with an example (which I’ll get into in more scientific detail later):
Yesterday, I attempted to make some logos out of the degenerate USS simulated data that Rosie sent me. Turns out, I was doing it wrong. After checking out the original logo paper, I was able to figure out how to make my own logos in Excel. I wasn't supposed to plot the "information content" of each base; I was supposed to take the total information content (in bits) of each position (as determined by the equation in the last post) and then, for each base, multiply that amount by the observed frequency of that base to get the height of that element in the logo. So, below I put the proper logos for the selected and unselected degenerate USS sets (background corrected):
Here's the unselected set:
Woo!
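For future reference, the letter-height calculation described above boils down to a few lines of R. This is a bare-bones sketch using a made-up frequency matrix, and it skips the background and small-sample corrections I used for the real figures:

# Letter heights for a sequence logo: information content of each position
# times the observed frequency of each base at that position.
# 'freqs' is an illustrative 4 x L matrix of base frequencies (columns sum to 1).
freqs <- matrix(c(0.70, 0.10, 0.10, 0.10,
                  0.25, 0.25, 0.25, 0.25,
                  0.10, 0.00, 0.00, 0.90),
                nrow = 4, dimnames = list(c("A", "C", "G", "T"), NULL))

# Shannon uncertainty at each position (0 * log2(0) treated as 0).
H <- apply(freqs, 2, function(f) -sum(ifelse(f > 0, f * log2(f), 0)))

# Information content per position (max 2 bits for DNA), then letter heights.
Rseq    <- 2 - H
heights <- sweep(freqs, 2, Rseq, `*`)   # each base's height in the logo
heights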
Previously, I had shown some preliminary analysis of Rosie’s simulated uptake data of chromosomal DNA fragments. Rosie also sent me simulated uptake data of degenerate USS sequences (using 12% degeneracy per position of this USS consensus sequence):
Notably, even the unselected sequences have rather high scores, when compared to the same analysis of genomic DNA fragments. This is unsurprising, since the sequences under selection in this degenerate USS simulation are all rather close to the consensus USS.
(I think the reason for the difference in the “selected” distributions is due to a different level of stringency when Rosie produced the two simulated datasets.)
Rseq = the information at a particular position in the alignment (at most 2 bits for DNA): Rseq(l) = 2 - H(l), where H(l) = -sum over the four bases b of f(b,l) log2 f(b,l)
Here’s the unselected set:
Looking closely, it is clear that there are differences in the amount of “information” at each position. So in the strong consensus positions of the USS, the selected set has higher “information” than the unselected set, while at weak consensus positions, that’s less true.
Here’s the unselected set of 100 sequences:
(Note that I somehow lost the first position when I did this, so the motif starts at the first position of the core.)