Tuesday, June 30, 2009

Periplasm Prep Planning II

I tried a modification of a periplasmic protein prep to purify uptake DNA, which didn't work. There are several possible reasons why the experiment might have failed, but one simple one could be that I failed to dissociate DNA from the membranes and cells when I did the chloroform extraction.

I know! Maybe I should try an extraction that has already been used for purifying uptake DNA...

Kahn, Barany, and Smith (1983) PNAS 80:6927. Rather than describe the paper at length here, I just want to show Table 1 and Figure 4b, which relate to my extraction plans:

The first two columns describe the extraction conditions (rows 1-5). Competent cell cultures were incubated with a radiolabeled plasmid and pelleted after DNA uptake (5 or 60 min). Cell pellets were then resuspended in a 1:1 mixture of the indicated aqueous and organic solutions (columns A and B), gently mixed, and centrifuged to separate the phases; the radioactive counts in each fraction were then measured.

The remaining columns indicate the relative amount of uptake in the different fractions and the identity of the radiolabeled DNA, either transformed into the chromosome (C) or still a double-stranded donor DNA molecule (D).

In the fourth condition (row 4), the aqueous phase consists of mostly donor DNA! So chromosomal contamination is in the pellet, and the desired donor molecules are in the aqueous phase. Sounds like a scheme. That’s what I’ll proceed with tomorrow.

Why not condition 1, just TE and phenol? Looks good, right? Because of Figure 4B:

The bottom line is that the phenol condition degraded the donor molecules (lane D), whereas the phenol/acetone condition did not (lane I). Here are the gory details:

Lanes A-D describe phenol extraction of intact donor DNA molecules (Table 1, row 1):
A: input plasmid donor molecule.
B: total DNA from cells 4 min into uptake.
C: DNA extracted and dialyzed out of the pellet.
D: DNA extracted into the aqueous layer (TE).

Lanes E-J describe the phenol/acetone extraction of intact donor DNA molecules:
E: input again.
F: input cut with HindIII.
G: total DNA after 8 min uptake.
H: same as G, but digested with HindIII.
I: phenol/acetone extracted DNA after 8 mins.
J: same as I, but digested with HindIII.

The point of the HindIII digestion was as an additional test for whether molecules were donor or chromosomal. Chromosomal DNA is resistant to HindIII (this being Haemophilus influenzae after all), while donor DNA is not. Also, the HindIII digests show that the recovered DNA is double-stranded, since ssDNA won’t get cut.

Condition 4 it is, then:
Aqueous: TE/1.5 M CsCl
Organic: phenol/acetone, 1:1

That's what I'll try next.

Hmm, I’ll have to remember how to clean the DNA of CsCl after the extraction...I seem to remember doing something in particular at one point...

Periplasm Prep


Experimental plan for the day: Medium-scale periplasm prep test-- to purify double-stranded DNA in the “protected state” (the periplasm).

I have two PCR products: (1) a “good” uptake sequence, USS-1, and (2) a “bad” uptake sequence, USS-R. I want to compare their uptake into rec-2 cells, which can bring DNA through the outer membrane, but not the inner membrane. This means that if I can specifically enrich USS-1, but not USS-R--and can see the difference on a gel--then I’ve got a functioning periplasmic DNA prep. Naturally, it’ll probably take several attempts to get working...

I will use a modification of this paper and see if it nets some DNA where it should be. In outline, I’ll: Add chloroform to washed cell pellets. Soak. Extract periplasm with TE. Clean and concentrate. Run on a gel.

Based on other studies with radiolabeled USS-1 uptake, I expect that for 20 ng added to a 1 ml culture, ~50% will be taken up. To see uptake DNA without radiolabel on a gel and for reasonable controls, I will need larger cultures of competent cells than I aliquoted and froze last week.

Here’s my protocol so far:
1) Defrost two tubes of rec-2 (0.3 OD/ml aliquot) into fresh sBHI@37 (2X25 ml); wait ~2-2.5 hrs.
2) At OD600=0.3 / ml, transfer cells to M-IV by filtration.
3) Incubate 100 min @37 to induce natural competence. (negative control: frozen tube of rec-2 (0.1 OD/ml) into fresh sBHI (25 ml).)
4) Split cultures 2X and incubate with 20ng 222bp PCR fragments (USS-1, USS-R, none) / 1 ml M-IV culture (~10^9 cells) for 15-30 min @37, DNase I, EDTA to kill DNase I and other nucleases. Also add USS-1 to non-competents.
5) Spin, wash pellet 3X PBS, chloroform (20-40ul), incubate 20 min @RT (chloroform pellet DNA extraction?, save washes).
6) Extract with 100-200ul cold TE, proteinase? RNase?, p/c extraction, PCR clean-up column (or ppt?) to concentrate.
7) 1.2% agarose gel. Lanes:

Size standard
USS-1 input (2X dilution)
USS-R input (2X dilution)
rec-2 + USS-1 -> chloroform extract
rec-2 + USS-R -> chloroform extract
rec-2 + no dna -> chloroform extract
non-competent rec-2 + USS-1 -> chloroform extract


Didn't work. A few little mishaps aside (mainly that the chloroform and cells really didn't mix well), I got no USS out of the prep, but did have a fair amount of chromosomal contamination. So clearly, I didn't really get the periplasm specifically, but since I didn't see any USS come through, it may also be that my competent cells weren't really competent.

I'll try to go through this again tomorrow, but instead of going straight for the periplasm preparation, I'll just lyse the cells, extract the DNA, run it over a mini-prep column, and run it on a gel. I'm not going to worry about the periplasm specifically, but simply that the cells are taking up DNA. When the radiolabel shows up, I can repeat the uptake assay the lab has typically done.


Friday, June 26, 2009

Plan for the next few weeks

One proposal down, one to go... The next one isn’t due until August 8th, so I’ve got just over a month to get it done. This time, however, I am going to manage my time better, since I need to get some preliminary data and still keep learning how to use a computer.

So here’s my plan for the next several weeks:

(1) Work on the proposal for a limited time each day (~1-2 hrs). I’ll start by developing a detailed outline of what I want to say and the order I want to say it, rather than leaping straight into writing. Based on my experience with this last one and in the past, I find I am an extremely inefficient writer (both with my time and with my words), so hopefully I can improve by having more focused daily goals.

(2) Work on the computational stuff only 1-2 hrs per day. Still need to fix the browser to display the Hin genome. Still want to work out the best way to align the genomes and report differences... particularly enumerating structural variation (non-SNPs). I also need to keep a mind towards what file formats I expect to get from sequencing. It might be particularly useful to try and simulate the kinds of results I might expect from sequencing periplasmic uptake DNA, etc.

(3) The rest of the time will be dedicated to lab work. The priority is to use defined fragments (USS-1 and USS-R) to work out a periplasmic DNA purification protocol. I’ve got cleaned amplified USS-1 and USS-R fragments, and I’m making competent cells of wild-type and rec-2 today. I tried to grow up a pilA mutant to use as a no uptake control, but something was wonky with the strain. All I need now is label. And a calibrated Geiger counter. I’ll get these things done today.

(The image above was made using the Mac-specific application GenomeMatcher. It represents a BLAST alignment between KW20 and 086-28NP across an interval containing several inversions. It also has a bunch of useful-seeming utilities that I'd like to figure out. Now if I could just get it to use my MUMmer program, like it’s supposed to...)

Thursday, June 25, 2009

But which would win in a fight?

The budding and fission yeasts provide several interesting comparisons. Some of these illustrate my considerable confusion about the “evolution of sex”. Here’s one:

While Saccharomyces cerevisiae (the budding yeast) prefers to spend time as a diploid in the G1 phase of the cell cycle (with two unreplicated genomes), Schizosaccharomyces pombe (the fission yeast) prefers being a G2 haploid (with one replicated genome). These “preferences” are inferred from two pieces of evidence from each of the yeasts:

(1) Budding yeast cells mate under rich conditions (become diploid), and sporulate under starvation (become haploid). Fission yeast cells remain haploids in rich conditions, while mating and immediately sporulating under starvation (zygotic meiosis).

(2) In cycling cells, budding yeast spends most of the cycle in G1. S-phase and mitosis are separated by only a short G2. By contrast, fission yeast cells reside in G2 most of the time, with only a short gap between mitosis and S-phase.

The interesting thing here is that, while the two yeasts prefer to maintain different ploidy levels (2N versus 1N), they both prefer to have two copies of the genome present (2C). Why might this be?

One suggestion is that this ensures there will always be a template for recombination in the event of a DNA double-stranded break (DSB). DSBs are a particularly challenging form of DNA damage for cells: the correct broken ends must be pasted back together, and doing so with high accuracy requires an intact homologous (identical-by-descent) template.

Another possibility relates to newly introduced recessive mutations. In both cases, a recessive mutation will not normally affect the phenotype of the cell it occurs in, since there are two copies of the genome. But cell division has different consequences in diploid versus haploid cells. In a diploid, mitosis will maintain the heterozygosity of the locus sustaining the new mutation, while in a haploid the wild-type and mutant alleles will immediately segregate in mitosis. So natural selection will act differently on populations that are predominantly diploid or haploid.

An important point regarding the G2 preference of fission yeast: Even a new lethal recessive mutation will not usually kill a particular cell. A cell sustaining such a mutation in G2 segregates the lethal allele to only one of its progeny sister cells, maintaining the wild-type allele in the other.

Diploid cells can still segregate the wild-type and mutant alleles, even in the absence of meiosis. “Loss-of-heterozygosity” (LOH) can occur, if there is a crossover between the heterozygous locus and the centromere of homologous chromosomes. Such crossovers, though rare, occur at measurable rates due to recombinational repair of DSBs and collapsed replication forks. So even in the absence of proper sex, a diploid can still expose its new mutations to natural selection, just at a substantially lower rate than haploids (over many cell generations only some LOH will occur in the budding yeast, whereas segregation happens at the first mitosis for new fission yeast mutations).

This comparison illustrates contrasting life style choices. While being haploid or diploid has distinct population genetic consequences, the preference of both yeasts to exist with two genomes in a cell says something interesting... I’ll need to think a little more about what that something is before articulating it clearly...

Tuesday, June 23, 2009

Proposaling continues

I continue on the saga of writing this soon-to-be-due proposal. I keep struggling with things that I think I understand in my head, but can't succinctly talk about in the text. Luckily, when I hit a wall, I can do some referencing and figure-making...

Thursday, June 18, 2009

A taste of DNA

I gave Rosie a draft of my grant application, so decided to dither with human USS motifs again.

First, for fun, I looked at where several of them were located... to see which genes taste best... and ran across USS in all sorts of random genes. (For example: a kinesin, an adductin, a phosphatidic acid phosphatase, a phosphatidic acid kinase(!), a cadherin-associated protein, a few hypothetical genes and transcription factors, etc. etc.) There were also several located outside genes and in gene-poor regions. It would be funny to do a GO annotation analysis, but most certainly a waste of time...

But to address Rosie’s comment: I found 762 10mer USS motifs in the human genome using a Tagscan search. This looks like it meets random expectations, since using Tagscan to count arbitrary 10mer motifs having similar base composition (and a CpG) gave similar numbers.

However, my analytical calculation of the expected number seemed way off. Where did I go wrong? (One reason why I might actually care about this is so that I could do a more precise analytical calculation of the number of USS motifs I’d expect in the Haemophilus genome.)

I had calculated the chance of a random 10mer being the USS motif as G * (25%)^10, for a 25% chance of drawing the correct base at each position (where G is genome size = 3.16e9 bases):

So G * (25%)^10 = G * 1 / 1,048,576 = G * 9.54 e -7 ≈ 3,014 instances.

But since the human genome has only 41% GC content, I might adjust this estimate to (20.5%)^5 * (29.5%)^5, for a 20.5% chance of drawing the correct base at G/C positions and a 29.5% chance of drawing the correct base at A/T positions:

So G * (20.5%)^5 * (29.5%)^5 = G * 8.09 e -7 ≈ 2,556 instances

These counts are for only a single strand of DNA. Since there’s also the reverse complement, we have to multiply these values by 2. I can’t think of a reason why this wouldn’t be valid. So...

50% GC : ~6,027
41% GC : ~5,112

So even accounting for %GC failed to bring this calculation down to what Tagscan found.
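Since this arithmetic is easy to fumble (I've already fumbled it once above), here's a quick sketch that just redoes the expected-count calculation from the numbers in the text; nothing here is new data:

```python
# Expected number of exact 10-mer USS matches in a random genome,
# counting both strands. G and the motif composition (5 G/C bases,
# 5 A/T bases) are taken from the text above.
G = 3.16e9  # human genome size in bases

def expected_uss(gc_content, gc_positions=5, at_positions=5, genome=G):
    """Expected exact matches (both strands) for a motif with the
    given numbers of G/C and A/T positions."""
    p_gc = gc_content / 2        # chance of drawing one specific G or C
    p_at = (1 - gc_content) / 2  # chance of drawing one specific A or T
    per_site = p_gc**gc_positions * p_at**at_positions
    return 2 * genome * per_site  # x2 for the reverse complement

print(round(expected_uss(0.50)))  # uniform base composition
print(round(expected_uss(0.41)))  # human-like 41% GC
```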

What about dinucleotide composition?

It’s well known that mammalian genomes have a dearth of CpG dinucleotides, since these are used as sites of gene regulation by cytosine DNA methylation. Methylated cytosines tend to deaminate into thymines, so there is mutational pressure for CpG dinucleotides to become TpG.

Anyways, I managed to find a nice table in this paper reporting dinucleotide composition in several genomes. The paper the table’s authors cite for humans dates from 1962, so these are probably not particularly precise numbers, but they’re sufficient for my purposes. It looks like the CpG dinucleotide is ~4-fold rarer than would be expected for a random genome with human-like base composition.

This paucity of CpG dinucleotides in the human genome could account for the discrepancy. So just for giggles, I ran a few additional Tagscans for 10mers with the correct base composition but lacking a CpG dinucleotide, giving the following numbers: 9143, 6698, and 4343 motifs. Those numbers look a lot more on-target (not terribly precise, but more accurate).

But how can I actually use the known distribution of dinucleotides in the human genome to arrive at an even more accurate estimate of how many USS motifs I’d expect to see? I thought up some crummy ways to sorta-kinda account for the CpG deficit, but would ideally like to be able to use arbitrary dinucleotide frequencies and arbitrary %GC to produce an expected value. I have a terrible feeling I need to use Markov or Ising models and just don’t have the heart for it right now...
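Actually, for a single fixed motif, the Markov idea is less painful than it sounds: a first-order Markov chain gives P(motif) = p(first base) × Π p(next base | current base), with p(y|x) = f(xy)/f(x) from the dinucleotide table. A sketch--note that the dinucleotide frequencies below are placeholders (independence plus a 4-fold CpG deficit), not real human-genome values:

```python
# First-order Markov estimate of the expected count of a fixed motif:
# p(motif) = p(first base) * product over steps of p(next | current),
# where p(y|x) = dinuc_freq[xy] / base_freq[x].
# The frequencies below are ILLUSTRATIVE PLACEHOLDERS -- independence
# with CpG depleted 4-fold -- not measured human-genome values.

base_freq = {'A': 0.295, 'C': 0.205, 'G': 0.205, 'T': 0.295}

dinuc_freq = {x + y: base_freq[x] * base_freq[y]
              for x in 'ACGT' for y in 'ACGT'}
dinuc_freq['CG'] /= 4.0  # the CpG deficit discussed above

def markov_expected(motif, genome_size):
    p = base_freq[motif[0]]
    for x, y in zip(motif, motif[1:]):
        p *= dinuc_freq[x + y] / base_freq[x]  # p(y | x)
    return 2 * genome_size * p  # both strands

# The USS core has exactly one CpG step, so the estimate drops 4-fold
# relative to the independence calculation:
print(round(markov_expected('AAAGTGCGGT', 3.16e9)))
```

With real mono- and dinucleotide tables plugged in, this would give a cleaner expectation than my sorta-kinda corrections.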
As an aside, UNIX is pretty awesome. To figure out the number of motifs Tagscan was finding (their output is a list of matches), I simply typed:
> wc -l filename

And that gave me the number of lines in the file. The first row was the header, and the last row was blank, so subtracting 2 from the number gave me the total number of motifs.

If I had Tagscan (and the human genome) on my computer, I could fairly easily set up a script to iterate through a whole bunch of 10mers with specified parameters and draw up a distribution. If I was really cool, I'd exclusively use UNIX commands to do this. This would then allow me to ask what the significance of the number of USSs would be. (Obviously p > 0.05, but that would be one way to do a real statistical test, even if I never figured out how to work out the expected value by an analytical method.)
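If I did get the genome locally, the null-distribution idea might look something like this (pure Python, exact matching only--Tagscan allows more flexible queries, so this is just a sketch, and the 100 kb random "genome" is a stand-in for the real sequence):

```python
import random

def count_matches(genome, motif):
    """Count exact occurrences of motif and its reverse complement."""
    rc = motif[::-1].translate(str.maketrans('ACGT', 'TGCA'))
    return genome.count(motif) + genome.count(rc)

def shuffled_motifs(motif, n, rng=random.Random(0)):
    """n random permutations of the motif's letters -- same base
    composition, arbitrary order (CpG content not controlled here)."""
    letters = list(motif)
    out = []
    for _ in range(n):
        rng.shuffle(letters)
        out.append(''.join(letters))
    return out

# Toy demo on a random 100 kb "genome"; swap in the real sequence.
rng = random.Random(1)
genome = ''.join(rng.choice('ACGT') for _ in range(100_000))
null = [count_matches(genome, m) for m in shuffled_motifs('AAAGTGCGGT', 20)]
obs = count_matches(genome, 'AAAGTGCGGT')
print(obs, sorted(null))
```

Ranking the observed count against the null distribution is then the "real statistical test" mentioned above.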

Tuesday, June 16, 2009

They're eating our DNA!!!

I am in the midst of proposal-writing, which makes blogging tougher, but when I hit a writing-block at the end of the day, I decided to dally with a random thing I'd been meaning to figure out.

Several friends of mine, when I tell them about my plans with the naturally competent bacteria, have said, "Dude, you should feed them human DNA!"

Sound silly? Maybe it is, but I went ahead and used TAGSCAN to search the human genome for the 10-bp core USS motif: AAAGTGCGGT and found 762 instances of the USS core. BLAST and BLAT didn't want to deal with me due to how short the query USS motif is.

That means Haemophilus influenzae might find nearly 800 bits of our genomes tasty! I remember from my reading that there's nearly 200 micrograms of DNA per milliliter of our lung mucus, which seems like a heck of a lot. Much of this DNA is probably human...

(For context, Haemophilus averages better than one USS per 2 kilobases, while the human genome averages only about one USS per 4 megabases. So while humans have half as many USSs, that's a few-thousand-fold lower density.)

It isn't surprising the human genome contains at least some "USS". A random 10mer string would have a 1/(4^10), or a little less than 1 / 1,000,000, chance of being the USS motif. But the human genome is more than 3 billion base pairs. So randomly, we might expect to find more than 6000 USS (3 billion / 1 million X 2). (The 2 is to also count the reverse complement.)

If I did that right, then the observed number of USS motifs in the human genome is rather less than expected. That's interesting...

However, that was a crude estimate of the expected number of USS motifs. The USS core sequence has 5 GC bases and 5 AT bases, giving a GC base composition of 50%, whereas the human genome has only 41% GC content. There's also the issue of dinucleotide frequencies. For example, the CpG dinucleotide is underrepresented in the human genome, but happens to appear in the USS. I've been trying to figure out a rational way to incorporate this type of information into my estimate of expected, but so far have failed to do so properly. Regardless, my estimate of expected is certainly too high. The question is: how much so?

To produce control numbers, maybe tomorrow I'll run a few other TAGSCANs for arbitrary 10mers with the same GC content and a CpG and see how many come up. I haven't really looked at the distribution of USS, except that the number of USS per chromosome is highly correlated to chromosome size (R^2 = 0.91). I might also predict that USS will fall into more GC-rich regions of the genome.
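For the record, the R^2 quoted above is just the squared Pearson correlation between chromosome size and USS count. A sketch of that calculation--the per-chromosome numbers below are illustrative placeholders, not my actual counts:

```python
# R^2 (squared Pearson correlation) between chromosome size and USS
# count, computed from scratch.
def r_squared(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov * cov / (vx * vy)

# PLACEHOLDER data -- substitute the real per-chromosome sizes (Mb)
# and Tagscan counts.
sizes = [249, 243, 198, 191, 181]
counts = [60, 55, 48, 50, 42]
print(round(r_squared(sizes, counts), 2))
```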

But for fun, assuming that the result held, what might it mean if the USS motif is significantly underrepresented in the human genome? I can hardly imagine that Haemophilus could be responsible ("it ate them!!!"), but maybe it could work in the other direction? Perhaps the USS motif is only coincidentally somewhat rare in humans, and as such makes a good sequence to use for preferring conspecific DNA uptake? If the USS motif was extremely abundant in the human genome, then naturally competent Haemophilus might not take up conspecific DNA as easily? Hmmm...

Regardless, think about that when you fall asleep, folks... There's bacteria inside you, and THEY'RE EATING YOUR DNA...


Putting my proposal together continues, but in a bout of procrastination I did go ahead and run a few random 10mers with the same base composition (and a CpG dinucleotide) through TAGSCAN, and it looks like the observed number of USS motifs is about as expected for 10mers of similar composition.

The USS motif:

Randomized USS motifs:

Nothing to see here... Move along...


Tuesday, June 9, 2009


Howzabout that “supragenome”?

Polymorphic gene content

Genome sequencing efforts have captured substantial variation in gene content among closely related bacteria. For example, this nice study by Hogg et al. 2007 compared the genome sequences of thirteen Haemophilus influenzae isolates: They found a “core genome” of ~1500 genes, along with an accessory (or contingency) genome of ~1300 genes. Any given isolate carried a subset of the accessory genome, amounting to a few hundred extra genes beyond the core in each isolate.

So any two isolates have substantial amounts of DNA that is unshared. For example, Hogg et al. report that the Kw20/Rd and 86-028NP isolates differ by nearly 400,000 bp within only ~250 indels (Table 5-- mean size: ~1.5 kb, median size: ~300 bp). The genomes are < 2 Megabases long, so that’s about 20% of these chromosomes that is non-homologous.

(As an aside, the Hogg et al. paper’s methods section introduced me to MUMmer, which I previously discussed. The indels and other rearrangements are essentially defined by breaks in the alignment produced by the nucmer utility. More on this in the future...)

There are several possible arguments related to uptake specificity and the "supragenome":

Uptake specificity for variation?

Large variation in orthologous gene content suggests to Hogg et al. and others a “distributed genome hypothesis”, in which natural transformation can shuffle (or re-assort, or segregate) the accessory genes between isolates. This would then presumably allow for the rapid acquisition and loss of different genes (diversification) from within a given genetic background and thus perhaps rapid adaptation to environmental changes (or shifting host defenses). The “distributed genome hypothesis” is then implicitly related to the “sex hypothesis” for the maintenance of natural competence.

Uptake specificity for conservation?

On the other hand, natural transformation could also maintain the “core genome”. Thus, if there is plenty of conspecific DNA uptake, any bit of “core genome” that a cell had lost could be replaced by uptake from its neighbors. This nice study by Treangen et al. 2008 using several neisserial genome sequences showed that DNA uptake sequences (DUS, the neisserial equivalent of USS) exist at a higher density in “core” regions of the genome than in the substantial alignment gaps between isolates (containing indel polymorphism).

Again, there is some indication of the “sex hypothesis” for the maintenance of natural competence, but it works in the opposite direction, maintaining the core rather than shuffling the accessory. I think the argument goes like this: (1) The core genome likely defines the more essential portions of the genome, since by definition any accessory genes are not required to live. (2) DUS could have been selected for within this partition of the genome, since it would help to maintain the more essential gene functions within a population. (3) Therefore the high number of DUS sequences could be a product of natural selection to maintain the integrity of the “core genome”.

Uptake specificity for no reason in particular?

However, there’s another possibility the authors partially explore that does not involve selection for DUS distributed throughout the genome, but represents almost the opposite model. Instead of selection, perhaps DUS accumulate due to happenstance intrinsic biases in the uptake and/or recombination machinery by a neutral molecular drive. So sequence variants that arise with a higher chance of being taken up later are more likely to spread through populations than variants with a lower chance of uptake. Thus the “core genome” could partially be that way, i.e. conserved across isolates--not exclusively because of essentiality or usefulness--but also by virtue of containing lots of DUS. So rather than DUS being selected for in order to maintain the core genome, segments of DNA containing DUS are simply more easily replaced in lineages that lost them.

An affiliated idea suggests that if some accessory genes were from distant relatives and arrived by horizontal transfer by some mechanism besides natural competence, these sequences would not have had time to accumulate uptake sequences yet. Thus the paucity of DUS in the accessory genome might be in part due to the more recent arrival of that sequence in the genome, so the effects of drive have not yet become evident, rather than a specific selection pressure to maintain DUS in more important segments of the genome.

(The Treangen et al. paper introduced me to another genome alignment tool called M-GCAT. I’ve played with it a bit and managed to produce some figures effectively the same as what appears in their supplementary data (the picture above), along with alignment files resembling multi-FASTA format. Unfortunately, I’m unable to re-load analyses I’ve performed due to some kind of Python error. More on this later as well...)

How to analyze the core and accessory genomes myself?

I’ve clearly got a lot more thinking to do regarding these core and accessory genomes... Especially in light of the horizontal gene transfer issue.

But first I’d better figure out simply how to define the core and accessory genomes more specifically.

I’ve begun this by examining the gaps in the .rdiff and .qdiff output of dnadiff (a pairwise comparison of two genomes) to try and do some basic analysis myself. In a future post, I’ll report on my progress with this, but for now, I’ll just mention that most of the gaps are not strictly insertions or deletions, but are rather insertional deletions: alignment gaps that include both reference and query bases. But I still need to understand how dnadiff produced its .report output before I can get much further...

Friday, June 5, 2009

Joint molecules...

There are several classes of transformation event that could be mediated by natural competence and recombination. The first thing that I had to wrap my head around (after having come from the double-stranded break world of recombination) was to realize that uptake DNA recombining into a host chromosome is single-stranded. I’m more used to drawing recombination models involving two broken ends of a double-stranded molecule.

So to kick things off, in the above figure, I’ve drawn what I think the joint molecule intermediates look like between polymorphic donor ssDNA invaders and recipient dsDNA substrates for the four major classes of transformation products I can envision.

The red (red/pink) lines indicate an ssDNA that’s found its way from outside the cell to the host chromosome (in blue/light blue). Aligned regions are base-paired. Unaligned regions are not...

The amount of transformation will depend not only on how many of a given joint molecule are formed, but also how they are resolved and the rate and directionality of “mismatch” correction.

One important aspect of the strand invasions depicted above will be the extent of heterology. The more polymorphisms in the donor sequence, the less stable the joint molecule will be.

Each class of polymorphism deserves special mention:

(1) Single nucleotide polymorphisms: There are 12 possible heteroduplexes involving single nucleotides, if we keep track of donor and recipient. So any of the 4 bases could go to any of the remaining 3... 4 X 3 = 12. In my preliminary analysis, I found 39,029 SNPs between our reference Rd strain and the clinical isolate 86-028NP, falling into each one of these different classes (with a paucity of G->C and C->G changes). Even without precisely measuring the frequency of transformation for each individual SNP, we may be able to estimate the relative transformation frequencies for these different classes of SNPs.
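Enumerating the 12 directed classes and binning SNPs into them is a one-liner each. A sketch--the (recipient, donor) tuple format of the input is my own choice here, standing in for whatever a parsed SNP listing actually provides:

```python
from collections import Counter
from itertools import permutations

BASES = 'ACGT'

# The 12 directed SNP classes (recipient base -> donor base).
SNP_CLASSES = [r + '->' + d for r, d in permutations(BASES, 2)]

def bin_snps(snps):
    """snps: iterable of (recipient_base, donor_base) pairs, e.g.
    parsed from a genome comparison. Returns counts per class."""
    return Counter(r + '->' + d for r, d in snps
                   if r in BASES and d in BASES and r != d)

# Tiny demo:
print(len(SNP_CLASSES))                                # 12
print(bin_snps([('G', 'A'), ('G', 'A'), ('C', 'T')]))  # Counter({'G->A': 2, 'C->T': 1})
```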

(2) Insertions: If the donor has an insertion relative to the recipient, recombination could yield an insertion. In the case of insertions, the uptake sequence can be anywhere on the fragment. Base pairing of the substrate will require the formation of a loop. The size distribution of the input DNA will also dictate the size of possible insertional recombination. How the mismatch correction machinery handles insertional joint molecules like this is unclear to me. My preliminary analysis indicated 149 insertions in 86-028NP relative to Rd (mean size = 1162 bp, but the median only a few hundred).

(3) Deletions: I made this distinct from insertions for two reasons: (1) The uptake sequence (obviously) must flank the deleted segment. An absent piece of DNA can’t be used for uptake. This imposes directionality on deletions that’s distinct from insertions. (2) It’s unclear how the mismatch machinery would act on the two central joint molecules depicted... Is restoration repair more likely in one case than in the other? My preliminary analysis indicates 137 deletions in 86-028NP relative to Rd (mean size 1904 bp, again with a much lower median).

(4) Rearrangements: I’ll give this very short shrift for now. If a particular piece of ssDNA spans a rearrangement breakpoint, then strand invasion of each end into two separate chromosomal positions could mediate an inversion or several other kinds of rearrangement. My preliminary analysis with MUMmer suggests ~9 inversions and 26 transpositions between the two strains.

There are many, many ways in which rearrangements in the donor might produce fragments that could mediate a rearrangement of the recipient. It’s going to take some time to enumerate these. Indeed, even if the donor was perfectly colinear with the recipient, some fragments could still mediate rearrangements.

For example, looking at a close-up of the MAUVE output between Rd and 86-028NP shows that the rRNA gene cluster of 23S and 16S (in Red) often span the rearrangement breakpoints between the two strains:
This could arise in a traditional mutational way (i.e. recombination between inverted repeats would cause an inversion), but could also be mediated by transformation. For example, if a DNA fragment that contained a bit of rDNA and a bit of flank first invaded the wrong rDNA copy, the other end could then grab the original flank... causing a rearrangement.

That is, in the rearrangements-involving-repeats scenario, it would be quite possible to produce rearrangements distinct from either the donor or recipient DNA.

This is going to take a while to flesh out, as there are several possible outcomes to this sort of thing. It doesn't make things any easier that I also now have to consider that the chromosome is circular... sigh... I'd better try and remember what plectonemic and paranemic mean...

Nevertheless, the upside is that it will be possible to measure indel and rearrangement transformation rates with considerably more ease than the SNP class, due to the use of “spanning coverage”.

Searching for Cash (How I'll spend my summer vacation)

So it’s time to outline my plans for the next several weeks... Grant-writing and preliminary data gathering!

Two grant applications to do:
(1) Michael Smith Foundation for Health Research (due June 25 to Office of Research)
(2) National Institutes of Health (due August 8)

I’ll find out whether or not to submit a full application to (1) in the next several days (if my “letter of intent” was sufficiently cool-sounding, I guess). The proposal itself is short (only 3 pages), so I should be able to focus on having a well-written piece in the next couple of weeks. This will help me a lot with the other proposal too. (I’d also better solicit letters for that one soon, but I want to wait until they tell me to apply first!)

And for (2) much editing, writing, and analysis to do! Rosie and I had begun tackling my rejected NIH proposal after I first got here, but enough time has passed that we’re both out-of-the-loop on our own editing. There are several things to do, besides polish the writing:

First off, the basics of the proposal are these: (a) I want to feed genomic DNA from one Haemophilus influenzae strain to a competent cell culture of another strain. (b) I’ll purify the donor DNA from various cellular compartments (representing different stages in the natural transformation pathway). (c) Then, I’ll use sequencing to measure the relative abundance of different DNA sequences along the pathway. (d) This should give me a comprehensive view of transformation potential of one genome into another.

There are a huge number of possible analyses, each of which may or may not be interesting. One major issue is how to present the kinds of analyses I plan to do, and how to convince the reviewers both that this really is an important set of experiments and that I am capable of doing them.

Proposal writing: Alongside simply improving the writing, I need to specify more details on how I will conduct the data analysis and provide whatever preliminary data I can. These will also help us with the larger DNA uptake proposals we plan to submit in the Fall.

Background: I need to make the importance of the research plan, and the specific questions I’ll answer, much clearer and more accessible. I also need to better understand the history of uptake signal sequences and how they were discovered, i.e. what’s already known versus what I’m going to learn. Rosie and I have talked this section through pretty well. It just needs to get re-written now.

Preliminary analysis: In my first version, this was just more background, but since I'll have been here a few months by the time I resubmit, might as well show what I've been doing...
  1. I need a more direct comparison of the donor and recipient genomes I’ll be using. I’d previously culled data from this paper, but showing that I can do the comparison between the genomes myself will likely go a long way with the reviewers. I’ve got some pre-preliminary analyses here in this blog, but I'm still ferreting around with that data and still learning how the alignment algorithms work.
  2. I want to demonstrate that I can purify donor DNA from the periplasm and/or cytosol, so that reviewers know that I can obtain the material I want to sequence. I'm going to start with the silly way, then move on to more sophisticated periplasmic space enrichments, as necessary.
  3. It could be good to have preliminary co-transformation data to get a more realistic estimate of the sequence coverage needed to do our global transformation rate experiments. The biggest help here would be to know the identity of the antibiotic resistance alleles I've been using. And maybe to do whatever it takes to get really high rates...
Experimental methods: My first version seemed okay to me, but upon re-reading, it was extremely dry and didn't give too many specifics about the downstream data analysis.
  1. Should I keep the experiment using degenerate USS oligos in this grant proposal? It’s a nice experiment, but may require more explanation than I really have space for. I’ve got a better idea now about what the experiment might look like in real life, but it may draw attention away from the other components too much to talk about here.
  2. I need to more clearly and succinctly discuss the sequencing, particularly the distinction between “spanning” and “sequence” coverage.
  3. The periplasm/cytosol experiments may be real overkill as written. We may want to bring up the possibility of doing competitive experiments between multiple donor genomes, if our single donor experiment goes well. Analysis-wise, it might be nice to show a mock figure of some possible expectations. For example, I discuss but do not illustrate what an uptake blocking sequence would look like.
  4. The co-transformation experiment can be improved. Based on the way things are going with the Illumina GA2, we could very well get a lot more out of this than I’d thought. If we could fully sequence 40 independent transformants, we could say a lot...
  5. For the bulk transformation experiment, it may make sense to break it into two rounds: In one round, we’d have enough coverage to do a strong analysis of large indel and rearrangement transformation. This could also use a figure or illustration. Since we’ll have high spanning coverage, measuring these rates will use considerably less sequencing cash. If this goes well and we’ve developed a good analytical pipeline, we can then sequence a lot more and pick up the little rearrangements and all the SNPs. At this point, we’d also have a much better idea of how much sequencing we’d need to do to get to whatever level of sensitivity we want to get.
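The spanning-versus-sequence coverage distinction above boils down to simple arithmetic: sequence coverage counts only the bases actually read, while spanning coverage counts the full insert length physically covered by each read pair. Here’s a toy calculation with entirely made-up numbers (read count, read length, insert size, and genome size are all hypothetical placeholders, not our actual experimental parameters):

```shell
# Hypothetical run: 10 million read pairs, 2x36 bp reads,
# 200 bp inserts, on a ~1.9 Mb Haemophilus genome.
awk 'BEGIN {
  N = 1e7; rl = 36; ins = 200; G = 1.9e6
  printf "sequence coverage: %.0fx\n", N * 2 * rl / G   # bases read / genome
  printf "spanning coverage: %.0fx\n", N * ins / G      # inserts spanned / genome
}'
```

The gap between the two numbers is why detecting large indels and rearrangements (which only need a spanning read pair) should be so much cheaper than calling SNPs (which need the base itself sequenced).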
Supplementary essays: Based on the reviews of the grant, the components of my application other than the proposal appear to have been lacking. Here are some things I need to do to improve them in the next round:
  1. Have a better career plan. I made this essay very short, but the reviewers had some qualms that my research experience did not point toward a coherent future research direction. I’ve got plenty of notions in my head about how I might conduct an independent research program, but I need to figure out how to bridge my eukaryotic and prokaryotic research directions together more robustly.
  2. Submit 1 or 2 manuscripts from graduate school. I’d said I had two manuscripts in late stages of preparation, which was true. But it’d be nice to actually have these submitted. I’m meeting with my PhD adviser and a co-author next week in California to try and hammer out the last few details of one of the manuscripts. I may still be able to submit the other before the August deadline, since it’s written and only needs some figure-fixing and final edits. But it’s been a long time and will likely take me and my PhD adviser some time to get our brains wrapped around it again.
  3. Since I tout my desire to be a teacher, I need to formulate a precise training plan to learn to teach during my postdoc. Rosie has referred me to several possible leads on campus for this purpose...
  4. I need to re-orient my research experiences to seem less haphazard. It’s true that after graduate school, I spent a year exploring my options and helping some friends set up their labs, but there was indeed a method to my madness...
  5. Letters of support: I need to make sure these letters are perhaps a bit more gushing.

The Style Book. I’m about halfway through. It’s mostly filled with things that seem common sense, but reading it has helped me a bit, I think...


Wednesday, June 3, 2009

There's a lesson here somewhere

A recent email exchange with a local computer expert, explaining one reason I normally avoid directly seeking help until I really need it (this also applies to statisticians):
Me (paraphrased): Er... help... with things...
Me (a little later): It worked like a charm! Soon after I asked you for help, I worked it out... Figures.
Local Expert (soon thereafter): Glad to be of service!
Nevertheless this method is tried and true! Thanks, Alistair!

Here's another tool I recently checked out...

CGView made a rather nice figure with default settings: (from out to in) genes, %GC base composition, GC skew. There seem to be several ways to configure the output further. In particular, I was interested in the plot of GC skew. Changes in sign (between purple and green) can indicate origins and termini of replication, due to mutation rate differences between the leading and lagging strands of replication. It's not immediately obvious what position should be the starting coordinate for a particular genome, but for PittEE, I suppose I'd guess the origin as the hour hand a little before 6:30...
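For the record, the GC skew CGView plots is just (G−C)/(G+C) computed over sliding windows. A quick-and-dirty way to reproduce it at the command line, assuming a plain single-record FASTA file (the filename PittEE.fna and the 10 kb window size are my own placeholders):

```shell
# GC skew in non-overlapping 10 kb windows: strip the FASTA header,
# join the sequence into one line, cut it into windows, and for each
# window count G's and C's (awk's gsub returns the substitution count).
grep -v '^>' PittEE.fna | tr -d '\n' | fold -w 10000 | \
  awk '{ g = gsub(/[Gg]/, ""); c = gsub(/[Cc]/, "");
         printf "%d\t%.3f\n", NR * 10000, (g + c) ? (g - c) / (g + c) : 0 }'
```

Positive windows should correspond to one replichore and negative to the other, with the sign flips marking candidate origin/terminus positions.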

Details on how I made that:
I didn't have too many issues running the package. It needed a bit of BioPerl (Bio::SeqIO) to interpret the GenBank file, which I knew I'd downloaded in my GBrowse installation quest, but for some reason, I needed to add the correct Perl library path to the PERL5LIB environment variable:
> export PERL5LIB=$PERL5LIB:/sw/lib/perl5/5.8.6/

I still have a lot more to learn about working at the command line, since each time I start a new terminal, I need to run this command again if I want to use CGView.
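One way around re-typing the export in every new terminal is to put it in the shell's startup file, so it runs automatically at login. A sketch, assuming bash is the login shell (so ~/.bash_profile is the file that gets read; other shells use different files):

```shell
# Append the export to the startup file; single quotes keep
# $PERL5LIB from being expanded until the file is sourced.
echo 'export PERL5LIB=$PERL5LIB:/sw/lib/perl5/5.8.6/' >> ~/.bash_profile

# Apply it to the current session too, without opening a new terminal.
source ~/.bash_profile
```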

After that, I went to the directory containing the GenBank file for one of the complete genomes (PittEE.gbk), and typed:
> perl /path/cgview/cgview_xml_builder/cgview_xml_builder.pl -sequence PittEE.gbk -size small -output PittEE.xml
(/path is the directory where I put the downloaded CGView package. Mine was /path = /Users/my_name/bin)

This converted the GenBank file into an XML file suitable for reading by cgview.jar. Pretty cool! (I didn't have to think about Java being there, since it was already installed.)

Then I invoked Java:
> java -jar /path/cgview/cgview.jar -i PittEE.xml -o PittEE-map.png -f png
Presto! The figure above.