Monday, December 28, 2009

Vacation Post: Uncoiled!

Number 6... the "Top 10 New Species of 2009"

Number 10 is pretty mind-boggling...

Friday, December 11, 2009

Update on problems with analysis

Ugh. Too much to blog about. The pace of computer work is totally different from lab work (though in the end they seem equally labor-intensive), and I've made so many figures and gotten so many numbers in the past week that I barely know where to start...

Well, I guess I'll just blather about a couple problems I've been working through, but leave out all the charts and graphs and figures for the time being:

500-fold sequence coverage still "misses" parts of the genome:

We got a ton of data. For our controls, it was surely massive overkill. Nevertheless, "read depth" (how many read mappings cover a particular position) still varies by a substantial amount. There are numerous sources of this variation (%GC being one that is quite apparent), but I am most worried about variation in "read depth" due to the alignment algorithm I'm using to map reads to the genome.
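
To make "read depth" concrete, here's a toy computation of it from a list of mapped read start positions. This is not Maq's code, just the concept; it assumes fixed-length 42 bp reads, ungapped mappings, and hypothetical positions:

```python
# Toy computation of per-position "read depth" from mapped read start
# positions -- not Maq's actual code, just the concept. Assumes fixed
# 42 bp reads and ungapped mappings; all positions are hypothetical.
READ_LEN = 42

def depth_from_starts(starts, genome_len):
    """Return a list of read depths, one per genome position."""
    depth = [0] * genome_len
    for s in starts:  # s = 0-based mapped start position of a read
        for pos in range(s, min(s + READ_LEN, genome_len)):
            depth[pos] += 1
    return depth

# Four hypothetical read mappings on a 100 bp "genome":
d = depth_from_starts([0, 0, 10, 50], 100)
print(d[5], d[20], d[95])  # positions covered by 2, 3, and 0 reads
```

Anything that biases where reads can map (%GC, mismatch limits, indels) shows up directly as dips in this profile.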

As I try to root out artifacts in my discovery of putative recombination-associated mutations, I confront the fact that "read depth" is on average reduced when recombinant donor segments are mapped back to the recipient genome, so the novel mutations I found in these strains are on average supported by far fewer sequence reads than the average base... Most of them still look pretty solid to me (though there are several obvious artifacts), but I don't have a good rationale for whether or not to trust them.

I'm trying several things to work this out, starting with an examination of the reciprocal mappings (using the donor genome as the reference).

So far, my analysis has a huge risk of false negatives:

Several problems here.

(a) Part of this problem and the last one is that I am using an alignment package that does not account for gaps (Maq). This means even a single-nucleotide indel reduces "read depth" dramatically on either side (out to ~42 bases, the read length). See above.

(b) Another issue I'm facing with several of Maq's downstream outputs is that "read depth" is capped at 255. Presumably, they were conserving memory and only assigned a byte to this number. But what I haven't quite figured out is whether the SNP output (for example) is ignoring any possible SNPs where coverage exceeded 255. My cursory look at the more raw output (the "pileup") suggests this might well be the case. This could mean that I'm missing a lot, since the mean "read depth" per genome position in our datasets is ~500.
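
One quick way to gauge how much the cap matters would be to count pileup positions sitting at exactly 255. A sketch (the tab-separated layout and the depth column index are assumptions for illustration, not Maq's documented format):

```python
# Sketch of a sanity check for the 255 "read depth" cap: count pileup
# positions sitting at the cap, which would hint that the true depth
# was truncated there. Column layout is assumed, not Maq's spec.
def count_capped(pileup_lines, depth_col=3, cap=255):
    capped = 0
    for line in pileup_lines:
        fields = line.rstrip("\n").split("\t")
        if int(fields[depth_col]) >= cap:
            capped += 1
    return capped

# Toy lines: chromosome, position, reference base, depth
lines = ["ref\t1\tA\t255", "ref\t2\tC\t120", "ref\t3\tG\t255"]
print(count_capped(lines))  # 2 of the 3 toy positions are at the cap
```

With a mean depth of ~500, a large capped fraction would confirm that the SNP caller is seeing truncated counts genome-wide.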

(c) Finally, I've been ignoring all Maq's self-SNP and "heterozygous" SNP calls in my downstream analysis using R. I presume that SNPs called in my mapping of the recipient genome to the complete recipient sequence are simply mutations between our wild-type Rd strain and the sequenced one. (As an aside, several hundred of the SNPs called by Maq were actually giving the correct base for an ambiguous base in the "complete" genome. I'd like to find a way to revise the archived Rd sequence to get rid of all the ambiguous bases.) And I don't have a solid plan for dealing with the "heterozygous" calls. Because the Maq consensus model allows for two haplotypes (it was built for diploid data), positions with mixed base signals are called heterozygotes. This is actually pretty interesting and could reflect real phenomena like clonal variation, but most of these calls are probably due to multiply-mapping reads and/or persistent sequencing errors.

Solutions: The solutions to these problems will initially largely be a matter of doing everything all again with different alignment software. My plan is to use the BWA aligner and SAMtools. BWA allows for gaps (so the "read depth" issue should be partially solved), and SAMtools not only keeps everything in a new agreed-upon standard format, but has several other tools, including what looks to be a better SNP caller (at least it has more modifiable settings). I would also like to try to do some de novo assembly, perhaps with Velvet, since we have such absurd coverage and a simple enough genome.

In the meantime, my R-fu has been improving, though I am convinced that I am missing some really basic principles that would make my code run a lot faster.

Wednesday, December 2, 2009

Mutagenic recombination?

Okay, this is pretty cool. I will probably discover that it’s just some artifact of the mapping, but digging into the transformant data some more reveals what appears to be a high number of mutations within recombined segments (alleles that have neither donor nor recipient identity)...

For this analysis, I used much more stringent criteria for calling SNPs, so that low quality SNP calls would not contaminate the result. This is particularly important here, since we might expect to get lower quality SNP calls in the recombinant segments, due to the relatively high divergence between the donor and recipient genomes and the limitations of current mapping algorithms.

For TfA, there were 802 unambiguous donor alleles and 19 high-quality novel alleles, while for TfB, there were 902 unambiguous donor alleles and 21 high-quality novel alleles.

The two plots below indicate the presence of unambiguous donor alleles in blue bars going to 1 (which defined the recombinant segments), and the presence of unambiguous mutant alleles in red bars going to -1.

That looks pretty striking! Mutations are clearly clustered into the recombined segments!

A few of the "novel alleles" in the two genomes are shared: 4 of 6 are in the first overlapping donor segment, and the other two are outside the donor segments. It is too early to be confident in this result, but still! It is very suggestive.

I’ve never really taken the supposed causal connection between recombination and mutation too seriously, since the evidence mostly seems correlative to me, but if this result holds up, I think it will be a uniquely clear-cut example of mutations induced by recombination.

Before being confident in the result, I need to map the data back to the donor genome, and cross-check the result. If that works, some simple PCR and traditional sequencing should readily confirm or refute this deep sequencing result.

Monday, November 30, 2009

Recombinant Genomes: First Pass

We’ve now got deep sequencing data of donor, recipient, and transformant genomes. And it is, indeed, “deep”, at least in quantity. Here’s the post where I illustrate the details of how the sequenced DNA was obtained. We obtained five “lanes” of sequencing, of which I’ll talk about the first four here:

Lane 1: Rd (the recipient genome)

Lane 2: 1350NN (the donor genome 86-028NP plus NovR and NalR alleles of gyrB and gyrA, respectively)

Lane 3: Transformant A (the genome of a NovR transformed clone)

Lane 4: Transformant B (the genome of a NalR transformed clone)

So I’ve done a first-pass analysis of the sequencing data using Maq as the mapping algorithm. I left everything at default settings, and so far have only analyzed the datasets with respect to the recipient genome, Rd.

The expectation is that when I map Lane 1 to Rd, there will be few differences detected by Maq; when Lane 2 is mapped to Rd, the bulk of SNPs between Rd and 86-028NP will be detected; and when Lanes 3 and 4 are mapped to Rd, the donor DNA segments transformed into the recipient background will be identified. For lanes 3 and 4, since we know the locus that we selected for donor information, we have controls for whether or not an appropriate donor segment was detected.

Before continuing, I must specify that this is a very preliminary analysis: I do not consider the quality of the SNP calls, beyond whatever Maq does to make the calls (and culling ambiguous SNPs, as described below). I have not mapped anything with respect to the donor genome. I have not considered any polymorphism between the strains other than simple single-nucleotide substitutions (since Maq only does ungapped alignment). I also missed regions of very high SNP density, since the Maq default will not map any read with >2 mismatches from the reference. Finally, I have only cursorily examined depth of coverage across each genome (it is ~500-fold on average, ranging from ~100 to ~1000 over each dataset).

However, even with these caveats, the approximate sizes and locations of transformed DNA segments were pretty clear...

Here are the number of “SNPs” called by Maq between each dataset and Rd:

Lane 1: 933 (Rd)
Lane 2: 30,284 (1350NN)
Lane 3: 1,870 (TfA-NovR)
Lane 4: 1,881 (TfB-NalR)

Rd versus Rd

The first obvious issue is that when the Rd DNA was mapped against the Rd genome sequence, Maq called 933 “SNPs”. What is all this supposed “variation”?

108 had “N” as the reference base
432 had an ambiguous base as the query

So 540 / 933 “SNPs” are easily explained artifacts-- either ambiguous positions in the complete Rd genome sequence reference, or ambiguous base calls by Maq from the Illumina GA2 dataset.

The remaining “SNPs” may also be persistent sequencing/mapping artifacts, or they may be true genetic differences between the “Rd” we sequenced (our lab’s wild type) and the original DNA sample sequenced back in the 1990s.

To simplify matters I culled any position called a “SNP” between Rd and Rd, as well as all other ambiguous positions, from the remaining datasets.

Rd versus 1350NN

Before turning to the transformants, I used Lane 2 to make a list of detectable SNP positions between the donor DNA and the recipient chromosome. Of the 30,284 “SNPs” detected by Maq, 29,002 were neither in the Rd SNP set nor had an ambiguous base in either the reference or query.

Note that I am not using SNPs identified by comparison of the two complete genome sequences, but rather those that were unambiguously determined by this sequencing experiment. Rd remains the only reference genome I have used.

I used this set of SNPs as the set of “Donor-specific alleles” to map transforming DNA segments.

The transformants

To identify donor-specific alleles in the transformants, I took the intersection of the “Donor-specific alleles” and the unambiguous SNPs identified by Maq for the two transformants, yielding the following number of SNPs in each transformant:

TfA: 890
TfB: 975

This suggests that about 3.0-3.5% of each transformant genome consists of donor DNA. This value is consistent with what we might expect, based on the co-transformation frequency of the two donor markers into the recipient genome when I did the original transformation that produced the sequenced clones.
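
The intersection step itself is just set arithmetic. A sketch with toy positions standing in for the real sets (~29,000 donor-specific and ~1,900 transformant positions), plus the percentage calculation using the real counts from above:

```python
# Sketch of the intersection step with toy positions (the real sets
# are far larger). Donor-specific alleles come from Lane 2; a
# transformant's donor alleles are the positions it shares with them.
donor_specific = {101, 205, 350, 467, 889}  # hypothetical Lane 2 positions
tfa_snps = {101, 350, 467, 7777}            # hypothetical TfA SNP positions

tfa_donor = donor_specific & tfa_snps       # set intersection
print(sorted(tfa_donor))                    # [101, 350, 467]

# With the real counts, the fraction of the 29,002 detectable
# donor-specific positions recovered in each transformant:
print(round(890 / 29002 * 100, 1))  # TfA: ~3.1%
print(round(975 / 29002 * 100, 1))  # TfB: ~3.4%
```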

Here are plots of the TfA and TfB genomes (genotype 0 = recipient allele, genotype 1 = donor allele):

This makes me happy. I was kind of hoping our thinking was wrong and that there’d be all kinds of kookiness going on, but in many ways, having our expectations met is a vastly better situation, since it means that the designs of our other planned experiments are probably sound.

Note that the right-most "donor segment" represents only a single "donor-specific allele" that is identical in both recombinants, as well as being surrounded by Rd vs. Rd SNPs. It is highly likely that this singleton is an artifact. All other donor segments are supported by many SNPs, including the overlapping segments at ~200,000 bp. This latter shared segment may suggest a hotspot of recombination.
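
One way to formalize this kind of segment call (my sketch, not the actual analysis): group donor-allele positions separated by less than a gap threshold into segments, and flag single-SNP segments as suspect. The gap value here is a guess that would need tuning to the real SNP spacing:

```python
# My sketch (not the actual analysis) of calling donor segments from
# donor-allele positions, flagging single-SNP segments as suspect.
MAX_GAP = 5000  # assumed threshold; tune to the real SNP spacing

def call_segments(positions, max_gap=MAX_GAP):
    segments = []  # list of (start, end, n_snps)
    for pos in sorted(positions):
        if segments and pos - segments[-1][1] < max_gap:
            start, _, n = segments[-1]
            segments[-1] = (start, pos, n + 1)  # extend current segment
        else:
            segments.append((pos, pos, 1))      # open a new segment
    return segments

segs = call_segments([1000, 1200, 1900, 50000])  # hypothetical positions
print(segs)                            # [(1000, 1900, 3), (50000, 50000, 1)]
print([s for s in segs if s[2] == 1])  # singletons -> candidate artifacts
```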

Control loci

The following two plots zoom in on the segments containing the control selected loci. The first has red lines bounding the gyrB gene in TfA, while the second has blue lines bounding the gyrA gene in TfB. Also in each plot, the masked positions (those that were left out due to ambiguity or presence in the Rd vs. Rd comparison) are shown (for the first, in orange; for the second, in grey):
Within each marker gene, there are a few recipient-specific SNPs. This is because the PCR fragments I used to add the NovR and NalR alleles of gyrB and gyrA contained recipient SNPs, and some of these ended up in the donor genome.

Okay! That’s almost it for now. There’s a looong way to go, but happily I suspect that there is indeed biology to learn even from these “meagre” couple of genome sequences.

My next task will be to account better for where I am blind. I used stringent criteria to determine allele identity here. I am quite confident in what was found, but I'm not so sure about what I didn't find. That is, I suspect false negatives, but not false positives, based on how I've done this so far.

Tuesday, November 24, 2009

Incoming Data Deluge

I am waiting with anticipation... I am about to be drowning in data... >64 million quality filtered paired-end sequencing reads, constituting 5.4 Gigabases! Woo!!

The files are currently transferring to an FTP site that I will then download from. Tomorrow afternoon, I may legitimately be a genomicist!

Thanks to my friend and his tech for getting this done...

Friday, November 20, 2009


On Wednesday, I put up some tables of data with bad calculations of molecules per cell. Rosie's concerns about the data were misplaced; they should've been concerns about my brain!

I'd been saying that 8 ng of 200 bp DNA / 200 ul of competent cells was ~ 1 molecule per cell. But I was off. Not just because I calculated it for pg instead of ng, but because I'd screwed up other parts of the math...

Here's a more correct way to approximate the number of molecules per cell starting with one of Rosie's Universal Constants:

(1e-18 g / one 1 kb DNA) * (one 200bp DNA / five 1kb DNA) =
(2e-19 g / one 200 bp DNA) * (1e9 ng / 1 g) =
2e-10 ng / one 200 bp DNA.
So that's about how much a single 200 bp DNA molecule weighs.

Then the inverse gives molecules per ng:
1 / (2e-10 ng / one 200 bp DNA) = 5e+9 molecules DNA / ng DNA.
So for 8 ng:
(5e+9 molecules DNA / ng DNA) * (8 ng) = 4e+10 molecules DNA.

Adding 8 ng of DNA to 2e+8 cells (in 200 ul of competent cells), then gives:
200 molecules per cell...

That's better. And more in-line with the lab's previous work...
I have changed the Tables in Wednesday's post to reflect this correction.
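
For future reference, the whole corrected arithmetic fits in a few lines of script:

```python
# The corrected arithmetic as a quick script, starting from Rosie's
# Universal Constant (one 1 kb dsDNA molecule weighs ~1e-18 g):
g_per_1kb_molecule = 1e-18
g_per_200bp = g_per_1kb_molecule / 5     # 200 bp is 1/5 of 1 kb
ng_per_200bp = g_per_200bp * 1e9         # 2e-10 ng per molecule
molecules_per_ng = 1 / ng_per_200bp      # 5e+9 molecules per ng
molecules_in_8ng = 8 * molecules_per_ng  # 4e+10 molecules in 8 ng
cells = 2e8                              # in 200 ul of competent cells
print(molecules_in_8ng / cells)          # ~200 molecules per cell
```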

Thursday, November 19, 2009

Bacterial Genome Sequence Links

I keep meaning to put up these links, so I can find them in the future. They get you to NCBI's genome FTP site. If you click on the genome you want, you get a listing of a bunch of file formats. For the complete genomes, the choice for the FastA of the complete sequence is the .fna file. But there’s a bunch of others, which makes getting the correctly formatted data for application X easier:

Bacterial Genomes FTP

For the draft assemblies, the way to get the data is at this link:
Whole Genome Shotgun FTP

But to find out which accession you want is a little irritating and the best way is via searching this list:
Whole Genome Shotgun HTML

Using the supplied links from this last list is less useful for downloading the data than the FTP link, since you would have to click a maddening number of times to get all the contigs from a particular whole genome shotgun sequencing project.

Wednesday, November 18, 2009

Saturation and Chaser

EDITED: Used bad mol per cell calculation earlier.

How much DNA does an average cell in a competent culture take up? I did two experiments today, which were aimed at trying to decide how much degeneracy we can tolerate in our degenerate USS experiment. There are several issues, but one simple one is just to know how much of an "optimal" USS is taken up in the presence of competing "suboptimal" USS...

In the first experiment, I added different amounts of hot USS-1 DNA (the new-fangled ones designed for Illumina sequencing... ~200 bp) to a fixed amount of competent cells to find out how much DNA was enough to saturate the cells with DNA. Here's the results for several amounts of USS-1 incubated with 200 ul of competent cells for 20 minutes:
Looks like less than 8 ng (or ~200 molecules per cell) is sub-saturating and greater than 8 ng is near saturation.

In the second experiment, I looked at how well USS-1 is taken up in the presence of a competitor mutant USS (USS-V6, which is taken up 10X worse than USS-1). Based on the results of the first experiment, I used two different amounts of hot USS-1, either sub-saturating (4ng) or saturating (16ng). For each of these, I had four samples in which different amounts of cold USS-V6 were added. Here's the results for subsaturating USS-1:

And now for saturating USS-1:

Though adding the competitor did reduce uptake, uptake decreased only by ~1/4, even when there was several-fold more competitor. Also, it appears that using saturating amounts of USS-1 yields a different decline in uptake with excess inhibitor than when USS-1 is sub-saturating.

So what's this all mean? I think it means that even when "optimal" USS are only 1/10 as common as "suboptimal" USS, uptake proceeds quite nicely. What does this mean practically for making our fancy oligo purchase? Not too much, except that I think we can err on the side of more degeneracy without too many problems. (This will be good for several reasons, probably the most important of which is that our preliminary data collection plans will work best with more degeneracy.)

Tuesday, November 17, 2009

USS uptake with Illumina adaptors

(I'm somehow deeply amused by the above image. I think it has something to do with the arrow going from the DNA molecule to the sun.)

I've finally gotten around to doing some uptake experiments with the control sequences I got for our degenerate USS experiments. The important thing was to make sure that the constructs would behave as our older ones do with our new-fangled design that will allow for Illumina sequencing directly from the purified periplasmic DNA (i.e. no library construction steps)...

And it worked quite well! I also scaled these experiments down, so that I could more easily do some saturation curves tomorrow.

Results today for uptake of 8 ng of 200 bp DNA (~4e+10 molecules, or ~200 per cell) by ~200 million competent cells (in 200 ul):

O.G.-USS1 -> 33.5%
New-USS1 -> 35.4%
New-USSV6 -> 3.7%
New-USSR -> 0.7%

Right on target!

Missing Posts

Whew! I keep falling behind on my blogging.... Here are the posts that I haven’t written:

Using R:

(1) Simulation of degenerate USS experiment, assuming the USS motif defines uptake specificity.

(2) Scoring genomes for USS sites.

(3) Scoring alignments for USS sites. (This last bit is finally begun, but far from complete.)

I've got the figures for these posts; now I just need text!

Friday, November 6, 2009

Weaning myself off of Excel

In some sense, the computing I did today isn’t really useful, since I already worked out these things using Microsoft Excel. But I’ve been ordered by my bioinformatics consultants to stop with the Excel already. So as practice, I worked out some of the expected features of degenerate oligos again, but this time using R.

The main motivation for doing this besides practice is that I am fairly sure we should be ordering degenerate oligos with more degeneracy than we have previously considered. I won't make that argument here, but just repeat some analytical graphs I'd previously made.

It took a while (since I’m learning), but was still much more straight-forward than doing it in a spreadsheet. The exercise was extremely useful, as I learned a bunch of stuff (especially about plots in R), while doing the following:

Problem #1: Given a percentage of degeneracy per base, d, in an n length oligo, what is the proportion of oligos with k mismatches?
Answer #1: Use the binomial distribution. For a 32mer with different levels of degeneracy (shown in legend):
Problem #2: Given a million instances of such an oligo, how well would each possible oligo with k mismatches be observed?
Answer #2: Simply adjust each of the above values by dividing by the number of distinct oligos within each k-mismatch class (i.e. choose(n, k)):
Problem #3: If some number of bases, m, in the n-length oligo are “important”, what proportion of oligos with k mismatches will have x “hits”?
Answer #3: Use the hypergeometric distribution. The below plot is as for Problem #1 for 0.12 degeneracy, but with the # of hits broken down for each k:
I didn't try super-hard to make the perfect graphs, but it did take some effort to make a stacked bar plot...
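
For the record, the three calculations are simple enough to sketch in a few lines (shown here in plain Python; the R equivalents are dbinom and dhyper). The 32mer length, the 0.12 degeneracy, and the count of "important" bases are just the example values from above:

```python
# The three calculations above, sketched with only the standard
# library (in R: dbinom and dhyper). Example values from the post.
from math import comb

def binom_pmf(k, n, d):
    """P(k mismatches) in an n-mer with per-base degeneracy d."""
    return comb(n, k) * d**k * (1 - d)**(n - k)

def per_oligo_count(k, n, d, total=1_000_000):
    """Expected count of each DISTINCT k-mismatch oligo among `total`."""
    return total * binom_pmf(k, n, d) / comb(n, k)

def hypergeom_pmf(x, k, m, n):
    """P(x of the k mismatched bases fall on the m 'important' bases)."""
    return comb(m, x) * comb(n - m, k - x) / comb(n, k)

n, d = 32, 0.12
print(binom_pmf(0, n, d))          # fraction of oligos with no mismatches
print(per_oligo_count(2, n, d))    # expected copies of each 2-mismatch oligo
print(hypergeom_pmf(1, 2, 10, n))  # 1 of 2 mismatches hits 10 important bases
```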

Funding versus Science

Yesterday, I ended up doing none of the things I intended to do, but cleaned them instead: (1) No bench-work, but a clean bench (and fridge and freezer spaces)!; (2) No emails, but a clean inbox!; (3) No thinking, but a clean mind! (Well, moreso than usual.)

My ears were also itching, because I realized that the NIH panel that's reviewing my postdoc grant application was meeting, and I was extra worried, since now the Genome BC grant almost certainly will depend on the outcome of those scores.

Anyways, towards the end of the day, I switched over to browsing journals, which I rarely do these days...

I found a lot of good stuff I probably already should've known about, but I also found this opinion piece in PLoS about "what's wrong with funding of research". Pretty much sounds about right. I am not even in a position to have the stresses described in the article, but feel like if I don't get some kind of postdoc fellowship of my own, I'll be that much less likely to get hired as an independent researcher, since so much of what defines success is the ability to get money. But all of the PIs I know are overwhelmed by exactly these issues. It's especially daunting to realize that a full-sized NIH R01 grant can barely support a lab with 2-3 people.

On the other hand, my recent experiences with Rosie do bring home the fact that grant-writing is a good way to think rigorously about one's plans, and the Genome BC experience in particular (whether the grant is funded or not) helped our brains wrap around exactly what we'll be doing in the next couple of months.

Wednesday, November 4, 2009


Whew! Rosie and I just made what I consider a heroic effort to produce a grant application to Genome BC to use DNA sequencing to measure recombination biases during H. influenzae natural transformation.

It was heroic not only because we finally decided to apply only late last week, but because our co-funding support is tenuous at best. Genome BC requires that we match their funds at least equally with funds from another source. Rosie has funding, but it was applied for too long ago and the proposal only indirectly relates to our planned sequencing. We also applied for a CIHR grant recently, but will not have reviews until after the Genome BC committee meets in early January. The best hope of adequate co-funding comes from my NIH postdoctoral fellowship grant application (a resubmission) late this summer, for which I should have scores (or lack thereof) within a couple of weeks. If the application gets a good score, we can tell Genome BC that the major threat to the success of our application is ameliorated.

Almost immediately after submitting the grant application with Rosie last night, I had to turn to editing my buddy's manuscript (which I am an author on), which takes on the weighty topic of detecting rearrangements in complex mammalian genomes from limited sequencing data. I just turned my edits over to the corresponding author, and now, after letting the excess nitrogen out of my bloodstream, I need to decide what to do next.

Since aforementioned buddy also has taken several DNA samples off my hands for sequencing, I think I'd best turn to purifying uptake DNA from the periplasm of competent cells. I've already gotten things fairly well under way (see here, here, and here), but there's a bunch of uptake experiments waiting to be done. So tonight, I'll inoculate some cultures, so I can try a large-scale periplasmic DNA prep tomorrow, and tomorrow I'll also order some more radiolabel for doing more sensitive uptake experiments.

Tuesday, October 27, 2009

Sets of SNPs

So I’ve got pure DNA from my transformants, pretty much ready to send off for our first Illumina runs. I’m just doing a few simple checks with PCR and digestion to make sure everything is kosher.

But there is this little fear in my head about what I’ll do when I get the massive datasets. I got a hold of some example Illumina GA2 paired-end data taken from E. coli K12. Since I don’t have any data of my own yet from H. influenzae, this seemed like a good dataset to start learning how to do the initial data analysis.

I decided to go with the most widely-used reference-mapping software, called “Maq” for “Mapping and Assembly with Qualities”. I can see why it’s widely-used; it is a breeze to install and use, which is the primary requirement for end-users to like a given piece of software. I’ve started dealing with just single-end reads from only one lane. The data is several years old, so there are “only” a few million reads of 35 base-pairs. Nevertheless, this represents nearly 20X sequence coverage of the E. coli K12 genome.

I’ll keep the Maq overview brief, since I went through it in lab meeting and Maq’s documentation is largely quite good (with the caveat that there are some challenges interpreting the meaning of all the columns in the different outputs). In short, Maq takes a “FastQ” file (which is the sequence data, including quality scores) and a “FastA” file (the sequence reference), converts them into a binary format (which is apparently helpful for making the mapping algorithm fast), and then maps individual reads to the reference, allowing up to 2 mismatches. The “pile-up” at each position is used to decide the “consensus base” at that position, based on the majority base and associated quality scores. The mappings can be extracted from the binary output using additional commands.
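
Maq's actual consensus model is a proper statistical one built on error probabilities, but the gist of quality-weighted majority calling at a single position can be sketched like this (hypothetical pile-up values):

```python
# Not Maq's actual algorithm (Maq uses a statistical model of error
# probabilities) -- just the gist of quality-weighted majority
# consensus calling at one position of the pile-up.
from collections import defaultdict

def consensus_base(calls):
    """calls: list of (base, phred_quality) for reads at one position."""
    weight = defaultdict(int)
    for base, qual in calls:
        weight[base] += qual  # each read votes, weighted by its quality
    return max(weight, key=weight.get)

# Hypothetical pile-up: three high-quality A's outvote a low-quality G.
print(consensus_base([("A", 30), ("A", 28), ("G", 5), ("A", 32)]))  # A
```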

Here, I’ll focus on the .snps output (obtained by cns2snp... from the man page), since this will be the most straight-forward way for us to define recombinant segments in our transformants. I’m keeping things simple still, so there are a lot of other issues I could get tangled up in, other than the ones I’ll discuss here.

So I “Maqed” this one FastQ dataset from E. coli K12 (I’m calling it s11) against two different reference sequences:
  • K12 -> NC_000913.fna
  • O57 -> NC_002695.fna
The first is a sort of control for how the consensus base caller is working. Since the sequencing is from K12 DNA, most consensus bases called by Maq should match the reference. Occasionally some “SNPs” might appear. These could have several sources:
  1. Spurious due to low coverage and high error
  2. Actual mutations between the original completely sequenced K12 and the K12 used for the Illumina sequencing
  3. Clonal variation where the consensus-calling favors the variant sequence.
The second is to call SNPs between the two strains. I would expect that the number of SNPs called against O57 would vastly exceed that of K12.

This is easily decided with a little UNIX command on the two files to count the lines (which each correspond to a SNP call):
> wc -l K12s11.snp O57s11.snp
6051 K12s11.snp
64217 O57s11.snp
Indeed, there are 10X more SNPs running s11 against O57 than against K12. The several thousand SNPs called against K12 are likely mostly errors. Maq doesn’t always assign the consensus base with A, C, G, or T, but with any of the other IUPAC nucleotide codes, so many of these “SNPs” are probably quite low confidence.

That’s all fine and good, but now what? How can I tell how well this s11 dataset did at calling all the SNPs between K12 and O57? I resorted to using Mummer’s Dnadiff program (previously discussed here) to compare the two reference sequences to each other and extract the SNPs. If the s11 dataset really did a good job sequencing the E. coli genome, there should be a strong overlap between the SNPs called by Mummer and those called by Maq.

(Here’s a GenomeMatcher DotPlot run with Mummer)
Again, UNIX came to the rescue, thanks to this page I found that provides UNIX commands for working with “sets”.

First, I had to get the appropriate three columns from the two SNP files:
  1. the reference genome position (in O57)
  2. the reference base (also from O57)
  3. the query base (the SNP in K12; the consensus base call for Maq or the SNP call for Mummer).
Here's how I made those files:
> cut -f 2,3,4 O57s11.snp > maqSnp.txt
> cut -f 1,2,3 O57vsK12.snps > mumSnp.txt
This provided me with my two sets, one called maqSnp.txt and the other mumSnp.txt. Here’s their lengths:
> wc -l maqSnp.txt mumSnp.txt
64217 maqSnp.txt
76111 mumSnp.txt
Notably, Mummer called many more SNPs than Maq. I think this is largely because the SNP output from Mummer includes single-nucleotide indels, which Maq misses since it does an ungapped alignment. I’m not sure how to deal with this, but in our real experiments, they should still be discoverable, since we’ll map our reads to both donor and recipient genomes. Also, there are numerous “SNPs” in the Maq file that are non-A, C, G, T consensus bases, which will largely be absent from the Mummer comparison.

So, then it was a piece of cake to count the SNPs in the intersection of the two sets. Indeed, there were several ways to do it, my favorite of which was:
> grep -xF -f maqSnp.txt mumSnp.txt | wc -l

That’s a whole lotta overlap! Happy! When I get the real transformant data, this will help me tremendously.
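
The grep one-liner is just a set intersection on whole lines; the same idea in Python, with toy stand-ins for the maqSnp.txt / mumSnp.txt rows:

```python
# The same set intersection as the grep -xF -f one-liner, sketched in
# Python. The rows below are hypothetical stand-ins for the real
# (position, ref base, query base) lines in maqSnp.txt / mumSnp.txt.
def line_set(lines):
    return {line.rstrip("\n") for line in lines}

maq = ["100\tA\tG", "250\tC\tT", "999\tG\tA"]   # toy Maq rows
mum = ["100\tA\tG", "999\tG\tA", "1234\tT\tC"]  # toy Mummer rows

shared = line_set(maq) & line_set(mum)
print(len(shared))  # 2 of the 3 toy SNPs overlap
```

Matching whole lines (position plus both bases) means a shared position with a different base call doesn't count as overlap, just like grep's -x flag.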


Friday, October 23, 2009

Transformants Produced!

I'll set aside the hack bioinformatics posts for now and give an update on my transformation experiments... We’ve gotten access to a few lanes of Illumina GA2 sequencing for some preliminary studies, and right now I’m drying the genomic DNA samples that we plan to sequence.

The notion is to sequence several independent transformants of Haemophilus influenzae to get some idea of how much donor DNA taken up by cells finds its way into recipient chromosomes. This pilot study will go a long way in informing our planned large-scale experiments and give us a chance to learn how to handle the data.

Here’s what I did to produce the material...
...some transformations, of course!

First, I PCR amplified the gyrA and gyrB alleles from the MAP7 strain (which confer nalidixic acid and novobiocin resistance, respectively). MAP7 is a derivative of our recipient strain KW20 containing several point mutations that confer antibiotic resistances.

I used these PCR products to transform our donor strain 86-028NP to provide two selectable markers in the donor. I’ve been calling this strain 1350NN.

Then I extracted DNA from this strain and used it as the donor DNA to transform KW20 competent cells. By selecting for one or both markers, I can ensure that clones chosen for DNA extraction and sequencing were indeed derived from competent cells that got transformed.

Our baseline expectation is that there will be a large segment (10-50kb) of donor alleles in the transformants at selected sites and 2-3 additional large segments elsewhere in the genome.

Originally, we were going to do this transformation with only a single marker, but we realized that having two would allow us to measure the frequency of co-transformation.

Here’s what the transformation rates looked like:
I used MAP7 DNA as a control donor. Since MAP7 is more closely related to KW20 than 86-028NP is, it is perhaps unsurprising that transformation rates were higher when using MAP7 as donor.

As for co-transformation, here’s the frequency of double transformants versus expected:
That corresponds to ~25-35% of the cells in the competent cell preparation actually being competent. I’ve been wracking my brain unsuccessfully trying to figure out how to do a back-of-the-envelope calculation as to how many independent molecules we expect to transform any given recipient. I just can’t figure out a concise or reasonable way to do it. Suffice it to say, I estimate a minimum of 20 kb of donor DNA in each transformant (1% of the genome), up to perhaps 100 kb (5% of the genome).
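
For what it's worth, one framing of the competent-fraction estimate (my framing, and the frequencies below are hypothetical, chosen only to land in the ~25-35% range): if a fraction c of cells is competent, and competent cells acquire marker A at rate a and marker B at rate b, then the observed frequencies are fA = c*a, fB = c*b, and fAB = c*a*b, so c = fA*fB/fAB.

```python
# Hypothetical numbers, my framing: fraction of cells competent from
# single- and double-marker transformation frequencies, assuming
# fA = c*a, fB = c*b, fAB = c*a*b  =>  c = fA * fB / fAB.
def competent_fraction(fA, fB, fAB):
    return fA * fB / fAB

print(competent_fraction(1e-3, 2e-3, 6.7e-6))  # ~0.3 (30% competent)
```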

There’s only one way to find out…

Thursday, October 15, 2009

Computing Bootcamp

Whew, I’ve really fallen behind on my blogging... Last week, a good friend of mine came into town for a “northern retreat”, in which he hoped to get work done on a paper. Instead, he and I drank enormous amounts of beer and did an enormous amount of computing with the Haemophilus influenzae genome (at least by my standards). While the beer probably didn’t help anything, the computing did.

I’ll go over some of what we did in future posts, but right here I just want to outline some of the computing lessons I learned looking over his shoulder over the week. Many of these lessons have been given to me before and are likely quite basic for the real computationalists out there, but somehow I’ve emerged from the computing immersion with a lot more competence and confidence than I had before...

Here are three useful things I'm getting better at:

(1) Avoid using the mouse. The more that can be accomplished from the command line and using keystrokes, the better. From the command line, tab-completion and cursor-control of the command history make issuing commands far more efficient. The coolest keystroke I’ve now picked up the habit of in the Mac Leopard OS is Cmd-Tab, which takes you to the last active open application (and repeated Cmd-Tabs cycle through the open applications in order of their previous usage). This is perfect for toggling between the command-line and a text-editor where one can keep track of what one is doing.

(2) Toggle between the command-line and a text-editor constantly. Rather than trying to write a whole script and then run it from the command-line, it was far easier and faster to simply try commands out, sending their output to the terminal, and cobble the script together line-by-line, adding each working command to a text document. This has three useful effects: (1) bugs get worked out before they even go into a script; (2) it forces one to document one’s work, as in a lab notebook (this also ended up being quite useful for my lab meeting this week, in which I decided to illustrate some stuff directly from the terminal); (3) it is forcing me to work “properly”, that is, sticking with UNIX commands as much as possible.

(3) Learn how the computer is indexing your data. This point is probably the most important, but also the one that is taking me the most effort. I’ll illustrate with an example (which I’ll get into in more scientific detail later):

The output of one of our scripts was a giant 3-column by 1.8-million-row table. I wanted to look at a subset of this huge table, in which the values in some of the cells exceeded some threshold. At first I was doing this (in R) by writing fairly complicated loops, which would go through each line in the file, see if any cells fit my criteria, and then return a new file including only those rows I was interested in. When I’d run the loop, it would take several minutes to finish. And writing the loop was somewhat cumbersome.

But the extremely valuable thing I learned was that R already had all the data in RAM, indexed in a very specific way. Built-in functions (which are extremely fast) allowed me to access a subset of the data using a single simple line of code. Not only did this run dramatically faster, but it was also much more intuitive to write. Furthermore, it made it possible for me to index the large dataset in several different ways and instantly call up whichever subset I wanted to plot. I ended up with a much leaner and more straightforward way of analyzing the giant table, and I didn’t need to make a bunch of intermediary files or keep track of as many variables.
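For readers without R handy, the loop-versus-index contrast can be sketched in Python with numpy (a toy six-row stand-in for the 1.8-million-row table; in R the equivalent one-liner would be something like big.table[big.table[,2] > threshold, ]):

```python
import numpy as np

# Toy stand-in for the 3-column x 1.8-million-row table.
table = np.array([
    [1, 0.2, 10],
    [2, 5.0, 11],
    [3, 0.1, 12],
    [4, 7.5, 13],
    [5, 0.3, 14],
    [6, 9.1, 15],
])

threshold = 1.0

# Loop version: walk every row, keep those passing the threshold.
# Fine for six rows; painfully slow and clumsy for millions.
kept_loop = []
for row in table:
    if row[1] > threshold:
        kept_loop.append(row)

# Vectorized version: a boolean index does the same thing in one line,
# using the array already sitting in RAM.
kept = table[table[:, 1] > threshold]
```

The boolean mask (table[:, 1] > threshold) can also be stored and reused, which is what made it so easy to call up different subsets of the same table on demand.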

Next time, I’ll try and flesh out some of the details of what I was doing...

Thursday, October 1, 2009

Corrected Logos

Yesterday, I attempted to make some logos out of the degenerate USS simulated data that Rosie sent me. Turns out, I was doing it wrong. After checking out the original logo paper, I was able to figure out how to make my own logos in Excel. I wasn't supposed to plot the "information content" of each base; I was supposed to take the total information content (in bits) of each position (as determined by the equation in the last post) and then, for each base, multiply that amount by the observed frequency of that base to get the height of that element in the logo. So, below I put the proper logos for the selected and unselected degenerate USS sets (background corrected):

Here's the selected set:
Here's the unselected set:
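The per-position arithmetic described above (total information content of a position, multiplied by each base's observed frequency) can be sketched like this; the example frequencies are hypothetical:

```python
import math

def letter_heights(freqs):
    """Logo letter heights for one alignment position.

    freqs: dict mapping base -> observed frequency at this position.
    Height of each letter = total information of the position (in bits)
    times that base's observed frequency, per the original logo paper.
    """
    # Shannon entropy of the observed distribution, in bits.
    s_obs = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    # Maximum entropy for DNA is log2(4) = 2 bits; information = Smax - Sobs.
    r_seq = 2.0 - s_obs
    return {base: r_seq * p for base, p in freqs.items()}

# Hypothetical strongly-conserved position: mostly A.
heights = letter_heights({"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05})
```

Note that this is the uncorrected (uniform-background) version; the background problem discussed in the previous post still applies.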

Wednesday, September 30, 2009

Struggling with the Background

Previously, I had shown some preliminary analysis of Rosie’s simulated uptake data of chromosomal DNA fragments. Rosie also sent me simulated uptake data of degenerate USS sequences (using a 12% degeneracy per position of this USS consensus sequence):


So what can I do with this dataset? Well, first, since Rosie also provided me with the “scores” for each sequence, I could plot a histogram of the scores for the 100 selected and 100 unselected sequences, showing that the uptake algorithm seems to work pretty well.

Here's the histogram:
Notably, even the unselected sequences have rather high scores, when compared to the same analysis of genomic DNA fragments. This is unsurprising, since the sequences under selection in this degenerate USS simulation are all rather close to the consensus USS.

Here’s the histogram from the genomic uptake simulation again (just to compare):
(I think the reason for the difference in the “selected” distributions is due to a different level of stringency when Rosie produced the two simulated datasets.)

That’s all fine and good, but now what? With the genomic dataset, I could use the UCSC genome browser to plot the locations of all the fragments I was sent, but this dataset instead consists of two alignment blocks of sequence that look markedly similar.

The obvious thing to do was to make Weblogos of the two different datasets… The unselected set should have very little information in it, while the selected set should contain information. In doing this, I discovered a rather important issue… The on-line version of Weblogo does NOT, I repeat, does NOT account for the background distribution.

This is a problem. It means that whenever you make a Weblogo (on the webserver) from your alignment block, it assumes that each base is equally likely to occur at a random position. This is why the y-axis in Weblogo plots always has a maximum of 2 bits for DNA sequence. Why is this a problem? First of all, if one is using an AT-rich genome, as we are, then the information content of any G or C is underestimated and that of any A or T is overestimated.

So how does Weblogo calculate the information content of each position in an alignment block? From the Weblogo paper (link found at the Weblogo website):

Rseq = Smax - Sobs = log2(N) - ( -Σn pn log2(pn) )

where:
Rseq = the information at a particular position in the alignment
Smax = the maximum possible entropy
Sobs = the entropy of the observed distribution of symbols (bases)
N = the number of distinct symbols (4 for DNA)
pn = the observed frequency of symbol n

The log2 is there to put everything in terms of bits.

So for DNA (4 bases), the maximum entropy at a position is 2 bits. Makes perfect sense: 2 bits a base. However, this only makes sense if each base is equally probable for a randomly drawn sequence. Now for purposes of gaining an intuition for different motifs, this isn’t really a big deal, although it does complicate comparing motifs between genomes.

When this isn’t the case (probably much of the time), different measures have been used, namely the “relative entropy” of a position. This is based on the log ratio of the observed probability to the background probability. Apparently, the off-line version of Weblogo can account for non-uniform base composition, but I haven’t tried installing it yet, nor any other software out there that handles variation in GC content.

Why? Because what we need for our degenerate sequences is a different background distribution at each position! So, the first position in the core is 88% A, but the third position is 88% G!

To illustrate the problem, here is a Weblogo of the selected set:
Here’s the unselected set:
Looking closely, it is clear that there are differences in the amount of “information” at each position. So in the strong consensus positions of the USS, the selected set has higher “information” than the unselected set, while at weak consensus positions, that’s less true.

But the scaling of each base here is completely wrong. There isn’t nearly a bit of information at the first position in the unselected set. We expected 88% A. The fact that there are mostly As in the alignment block at the first position is NOT informative. In fact, if all was well, we’d get zero bits at all the unselected positions!

What to do? I tried to make my own logo, using the known true background distribution at each position. I won’t belabor the details too much at the moment, except to say that I had to figure out what a “pseudocount” was and how to incorporate it into the weight matrix, so as to not ever take the logarithm of zero.

Here’s the selected set of 100 sequences:
Here’s the unselected set of 100 sequences:
(Note that I somehow lost the first position when I did this, so the motif starts at the first position of the core.)

This actually looks quite a bit better, or at least more sensible. A few things worth noting:
  • I think if we did thousands of unselected sequences, we’d pretty much get zero information from that alignment, which is what we would want, since that’s just the background distribution.
  • Some values are negative. This is expected. Since these are scaled to log-odds ratios, when the frequency of seeing a certain base is less than the expected background frequency, a negative number emerges.
  • The scale is extremely reduced. Every position is worth less than 0.3 bits. This is also expected. One description I’ve seen of information content is how “surprised” one should be when making an observation (there’s even a unit of measure called a “surprisal”!). Since we are drawing from an extremely non-uniform distribution that already favors the bases cells are expected to take up best, we are basically squashing our surprise way down. That is, getting an A at the first position of the core is highly favored, but it’s the most likely base to get anyways, even in the absence of selection.
  • The unimportant bases in the USS have the most information content in the selected set. At first this bothered me, but then I realized it was utterly expected for the same reason as above. For example, at position #18 above (sorry, it’s position 19 in the Weblogos), the selection algorithm doesn’t really care what base is there. That means that the selected set will let mutations at that position (from A to something else) come through, which will be surprising, when compared to the background distribution!
(ADDED LATER: Actually, this last point is wrong. The reason for so much information at the weak positions is related to the matrix that was used to select the sequences, not from surprise. I'll try and get a proper dataset later and redo this analysis. To some extent, the positions will still have some information, as partially explained in my erroneous explanation above, but not nearly so much.)
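For concreteness, here is a minimal sketch of the background-corrected, pseudocount-adjusted calculation described above (in Python rather than Excel; the counts and background frequencies are hypothetical, roughly mimicking an unselected position with an 88% A background):

```python
import math

def rel_entropy_heights(counts, background, pseudocount=1.0):
    """Per-base letter heights for one position of a background-corrected logo.

    counts: dict base -> observed count at this position
    background: dict base -> expected background frequency AT THIS POSITION
                (for the degenerate USS, the background differs per position)
    A pseudocount is added to every count so we never take log2(0).
    Each base's height is freq * log2(freq / background); summed over bases
    this is the relative entropy (in bits) of observed versus background.
    Heights can be negative when a base is rarer than its background expects.
    """
    total = sum(counts.values()) + pseudocount * len(counts)
    freqs = {b: (c + pseudocount) / total for b, c in counts.items()}
    return {b: freqs[b] * math.log2(freqs[b] / background[b]) for b in counts}

# Hypothetical unselected position: 88% A background, ~88 As seen in 100 seqs.
bg = {"A": 0.88, "C": 0.04, "G": 0.04, "T": 0.04}
h = rel_entropy_heights({"A": 88, "C": 4, "G": 4, "T": 4}, bg)
# As intended, all heights come out near zero: seeing mostly As is no
# surprise against this background.
```

This reproduces the behaviors noted in the bullets: near-zero information when observations match the background, and small negative values where a base falls short of its expected frequency.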

Whew! I’ve gotta quit now. There’s a lot more to think about here.