Tuesday, June 22, 2010

Manuscript plans?

So I’m a bit over one year into my postdoc. What have I got to show for myself? Well, plenty of work, but not any papers, or even written manuscripts, so that’s a bit of a problem. Can I turn my first set of genome sequencing data into a manuscript?

Seems likely. I collected >5 Gigabases of Illumina sequence data from several Haemophilus influenzae chromosomes, and this could be used as the basis of a manuscript. I obtained data from a donor strain (86-028NP NovR NalR) and a recipient strain (Rd, RR722) as control data (in order to evaluate the ability of the sequencing and read alignment to correctly identify polymorphisms). I also obtained data from two individual transformants and a pool of four transformants to identify donor alleles in transformed recipient chromosomes. I even found some things out.

Does this a paper make? One outstanding issue is that, in spite of being a lot of data, which has required a fair amount of work to get a handle on, there is not a tremendous amount of biologically relevant data. Yes, I obtained extremely accurate and comprehensive data for the four transformants sequenced. But it was still only four transformants. There are some biologically meaningful results; they just aren’t terribly novel or statistically robust. The bigger biologically meaningful results will have to wait until we can collect more data.

So to turn this into something publishable, the approach and method need to be important enough (and made explicit enough) to be of value to others. So far, I have not done anything in my analysis that is truly novel, but I have managed to produce the bare bones of a “pipeline” for measuring allele frequencies from pools and identifying recombination tracts in transformants. The data we got were also extremely high coverage, so we were able to see the limits of the technology fairly well: i.e. depth-of-coverage variation, errors, and issues with read alignment.
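
To make the pooled allele-frequency idea concrete, here is a minimal sketch (not my actual pipeline). It assumes that per-SNP read counts for the donor and recipient alleles have already been pulled out of the pileup, and that the four transformants contribute equally to the pool, so true frequencies should fall near multiples of 1/4. The function names and the example counts are made up for illustration.

```python
def donor_allele_frequency(donor_reads, recipient_reads):
    """Fraction of reads at a SNP that carry the donor allele.

    Returns None when no reads cover the site.
    """
    total = donor_reads + recipient_reads
    if total == 0:
        return None
    return donor_reads / total


def nearest_pool_count(freq, pool_size=4):
    """Number of pooled transformants (0..pool_size) whose donor-allele
    count best matches the observed frequency, assuming equal representation
    of each transformant in the pool."""
    return round(freq * pool_size)


# Hypothetical counts at one SNP: 52 donor-allele reads, 148 recipient-allele reads
freq = donor_allele_frequency(52, 148)
carriers = nearest_pool_count(freq)
print(f"donor allele frequency {freq:.2f}, consistent with {carriers} of 4 transformants")
```

With very high coverage the observed frequency should sit close to one of the expected steps; sites that land between steps would be worth flagging as possible alignment artifacts.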

Though everything I’ve done so far uses “off-the-shelf” bioinformatics tools, there are so many people trying to do similar things, it might be useful to write a paper that is sort of an “application” of the technology and tools I’ve been using. It took me months to piece everything together, so maybe I could save someone else some time by having everything in one place. But with each passing day, the value of such a paper is probably diminishing, so I’d best get started!

There are still a few analyses I’d like to do that would give the paper a little more spice:
  1. Structural variant analysis: This is something that will involve our collaborators at UVA, who are experts. We can see these pretty well (at least the larger ones), but something systematic has yet to be done.
  2. Reciprocal read mapping: I’ve mapped all the data to both the donor and recipient genomes, but I have not yet fully leveraged this fact. The read alignment artifacts that arise when mapping data from one strain onto the other could be handled much better if I were able to assign individual reads to one of the two reference genomes, based on the mapping quality. I’d really like to do this. It’d be novel, mainly because most people doing sequencing are doing SNP discovery. I already have all my SNPs discovered, so doing an extra good job at calling SNP frequencies using reciprocal alignment would be at least something new. This will take a bit of work, however, and I’ll need to figure out the best computational way to do it. Aside from doing uber-detailed error analysis for a technical paper, I think this is really the best chance to make a novel contribution bioinformatically.
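
One possible way to do the read assignment in item 2, as a rough sketch only (this step is not implemented yet). It assumes the mapping quality of each read against each reference has already been extracted into a dictionary keyed by read name, with unmapped reads simply absent. The function names and the `min_margin` threshold are hypothetical choices for illustration, not settings from any tool.

```python
def assign_read(mapq_donor, mapq_recipient, min_margin=10):
    """Assign one read to 'donor', 'recipient', or 'ambiguous'.

    mapq_* is the mapping quality against that reference, or None if the
    read did not map there. A read is assigned only when one reference
    beats the other by at least min_margin.
    """
    d = -1 if mapq_donor is None else mapq_donor
    r = -1 if mapq_recipient is None else mapq_recipient
    if d - r >= min_margin:
        return "donor"
    if r - d >= min_margin:
        return "recipient"
    return "ambiguous"


def assign_reads(donor_mapqs, recipient_mapqs, min_margin=10):
    """Assign every read seen in either alignment."""
    names = sorted(set(donor_mapqs) | set(recipient_mapqs))
    return {
        name: assign_read(donor_mapqs.get(name), recipient_mapqs.get(name), min_margin)
        for name in names
    }


# Made-up mapping qualities, purely to exercise the logic
donor_mapqs = {"r1": 60, "r2": 20, "r3": 37}
recipient_mapqs = {"r1": 0, "r2": 60, "r3": 37, "r4": 60}
assignments = assign_reads(donor_mapqs, recipient_mapqs)
```

Reads that map equally well to both references stay "ambiguous", which is the expected outcome for the large majority of reads that fall in regions identical between the two strains; only reads spanning polymorphisms should be informative.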

Here is a rough outline of the manuscript, as I’m viewing it so far:

1. Introduction
  • Many bacteria become naturally competent. Natural transformation is important in evolution.
  • Previous sequencing studies have focused on only a handful of defined constructs (in Bacillus, Helicobacter, and Actinobacillus).
  • For any organism, the total extent of recombined fragments in individual transformants has never been directly evaluated, and the factors dictating the chance of transformation are only poorly understood as a result.
  • Haemophilus influenzae is a model system for natural transformation. The mechanism is well-defined. Transformation is efficient in the lab strain.
  • The extensive natural genetic variation between H. influenzae strains provides tens of thousands of markers to identify recombination tracts in individual transformants. This includes not only single-nucleotide variation, but also structural variation in the form of indels and other rearrangements. The “supragenome” hypothesis.
  • We investigated the use of massively parallel sequencing (or “next generation sequencing”, NGS) to characterize natural transformation at a whole-genome scale.
  • Our results show the Illumina platform to be an excellent method to obtain nearly exhaustive information on recombination tracts in individual transformants. Our approach uses the alignment of sequence reads to both donor and recipient reference sequences. We obtained donor and recipient genome sequence as controls for evaluating sequencing error, depth of coverage, and polymorphism identification. We also obtained the sequence data from two individual transformants and a pool of four transformants.
  • Individual recombination tracts are longer than previously appreciated and can bring hundreds of polymorphisms from donor to recipient chromosomes (single-nucleotide, insertion, deletion, and insertional deletion). However, recombination tracts often appear interrupted by or terminated at sites of structural variation between the two genomes. This shows that such variation acts as a barrier to strand exchange and/or is a preferred mismatch repair substrate.
2. Materials and Methods
  • Strains
  • DNA
  • Transformations
  • Library preparation
  • Illumina sequencing and initial data processing pipeline
  • Reference genome alignment by MUMmer, MAUVE
  • Reciprocal read alignment by BWA
  • SAMtools pileup
  • Galaxy pileup parser
  • Variant frequency analysis
  • Assignment of reads (unimplemented)
  • Donor segment calling
  • Analysis of structural variation by HYDRA
3. Results
  • Genetic transformation of competent cells: Marker to marker variation. Dependence on sequence identity. Congression and linkage
  • Illumina sequencing and read alignment: Table of sequencing results and fraction of mapped reads. Variation in depth-of-coverage. Sources of sequencing error and read mapping artifacts. Reciprocal read alignment? Varying alignment stringency?
  • Comparison of donor and recipient strains. Identification of SNPs and structural variants between the donor and recipient strains. Comparison to whole-genome alignment methods.
  • Identification of donor alleles in transformed recipient chromosomes. Accounting for SV alleles. Identifying novel alleles.
  • Identification of allele frequencies in a pool of four transformants.
  • Identification of donor segments and putative recombination tracts
  • Enrichment of SVs at donor segment breakpoints
4. Discussion
  • A first look at transformation… still few transformants
  • Excellent method. Limitations are circumvented by very high coverage, knowledge of both donor and recipient genome sequences, and the use of reciprocal read alignment (unimplemented)
  • Big recombination tracts. Evidence of mismatch repair. SVs as blocks to recombination tract progression.
  • Speculations: Hotspots? Role of uptake specificity? Supragenome transfer?
  • Future: Aside from collecting more transformants, making a transformation frequency map to investigate the “cis-acting” factors controlling the efficiency of transformation. Long-term utility in understanding the population genetics of human pathogens.
Still a rough outline, but something to start with....

Friday, June 11, 2010

Degenerate Uptake: Pilot Study

Something to blog about! (Wow; it's been over a month... sorry to my three loyal blog readers.)

I’ve gotten around to doing a pilot-scale experiment on the specificity of H. influenzae DNA uptake for the “uptake signal sequence” (USS). The USS is a ~29 base pair motif highly abundant in the H. influenzae genome, and sites that match the consensus USS are known to be preferred substrates for DNA uptake by competent cells. The presence of many USS in the chromosome is presumed to be why H. influenzae competent cells prefer H. influenzae DNA over DNA from other organisms.

However, little is known about how the structure of the USS contributes to uptake of USS-containing fragments: limited analyses of mutations in a DNA fragment containing a consensus USS suggest that some but not all informative positions in the USS motif are important to uptake, indicating that other forces (perhaps later steps in transformation) contribute to the structure of the USS motif.

To carefully dissect uptake specificity for the USS motif, we have devised an enrichment experiment:
(1) A complex pool of DNA fragments containing a degenerate USS library is incubated with competent cells.
(2) The fragments preferentially taken up by cells are purified from the periplasm.
(3) DNA sequencing is used to compare the input and periplasm-purified pools of sequences.
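
As a sketch of what the comparison in step (3) might look like (I have not settled on the actual analysis), one simple option is a per-position log2 ratio of base frequencies in the recovered pool versus the input pool. The code below assumes the reads have already been trimmed to equal-length USS sequences; the function names, the pseudocount, and the toy reads are all invented for illustration.

```python
from collections import Counter
from math import log2

BASES = "ACGT"


def position_frequencies(seqs, pseudocount=1.0):
    """Per-position base frequencies for a list of equal-length sequences.

    A pseudocount keeps bases that were never observed from producing
    zero frequencies (and undefined log ratios).
    """
    length = len(seqs[0])
    total = len(seqs) + pseudocount * len(BASES)
    freqs = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        freqs.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return freqs


def log2_enrichment(input_seqs, recovered_seqs):
    """log2(recovered frequency / input frequency) per position and base."""
    f_in = position_frequencies(input_seqs)
    f_out = position_frequencies(recovered_seqs)
    return [{b: log2(o[b] / i[b]) for b in BASES} for i, o in zip(f_in, f_out)]


# Toy 4-base "reads", made up purely to exercise the functions
toy_input = ["AAGT", "ACGT", "AGGT", "ATGT"]
toy_recovered = ["AAGT", "AAGT", "AAGT", "ACGT"]
enrichment = log2_enrichment(toy_input, toy_recovered)
```

Positive values mark bases that the cells preferentially took up at that position and negative values mark bases that were selected against. Treating positions independently like this would miss any interactions between positions, so it is only a first pass.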

Details and Pilot-scale Results:

I’ve previously discussed the design of the input DNA pools. The control 200 bp construct is designed to already contain the sequences needed for Illumina single-end sequencing, along with a 32 bp consensus USS site near the middle of the fragment. The test construct is the same, except the USS is degenerate, having a 24% chance of a non-consensus base at each position. Thus in the degenerate-USS pool, the average site has ~7-8 mismatches from the consensus sequence.
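
The ~7-8 mismatch figure follows from a simple binomial calculation. The sketch below assumes every one of the 32 positions is degenerate and that positions vary independently, which is my simplifying assumption here; the function name is just for illustration.

```python
from math import comb

USS_LENGTH = 32    # bp, length of the USS site in the construct
P_MISMATCH = 0.24  # chance of a non-consensus base at each position


def mismatch_pmf(k, n=USS_LENGTH, p=P_MISMATCH):
    """Binomial probability that a site carries exactly k mismatches,
    assuming independent positions."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


expected_mismatches = USS_LENGTH * P_MISMATCH  # mean of the binomial: 7.68
fraction_perfect = mismatch_pmf(0)             # sites matching the consensus exactly
pmf = [mismatch_pmf(k) for k in range(USS_LENGTH + 1)]

print(f"expected mismatches per site: {expected_mismatches:.2f}")
print(f"fraction of perfect-consensus sites: {fraction_perfect:.2e}")
```

Under these assumptions, perfect-consensus sites are rare in the degenerate pool, so nearly all of the input molecules carry at least some suboptimal positions.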

The expectation is that, while the consensus-USS construct (USS-C) will be taken up by cells well, the degenerate-USS construct (USS-D) will be taken up more poorly, since it contains many suboptimal sequences (i.e. it is less uniformly delicious). Indeed this is the case, with USS-C being taken up about 10 times better than USS-D at sub-saturating DNA concentrations (see below). The notion is that comparing the USS-D input to that taken up by cells will provide a precise measurement of uptake specificity for the USS (i.e. which sequences are tastiest). We think this will tell us a lot about the mechanism of uptake.

It occurred to me a couple weeks ago that before moving on to the data collection (i.e. the DNA sequencing), I should first make sure that the USS-D fragments recovered from the periplasmic purification are taken up better than the original USS-D input (i.e. the competent cells selected more delicious sequences). This would provide the clearest indication that the experiment worked and the material is worth sequencing. It is!

I compared the uptake of USS-C and USS-D before and after periplasmic purification of taken up DNA from rec-2 competent cells across a range of DNA concentrations. Here are the results:
A and B show the % DNA uptake for USS-C and USS-D, respectively, for different amounts of added DNA (to 200 ml competent cultures). C and D show the same data: C is a dose-response curve, and D is a double-reciprocal plot (since I used 2 ng of hot label, along with an additional amount of cold label for these experiments).

Input USS-C and periplasm-purified USS-C were quite similar, while periplasm-purified USS-D was taken up substantially better than input USS-D.

Notably, at low (sub-saturating) concentrations of DNA, periplasm-purified USS-D is taken up less well than USS-C, while at high (saturating) concentrations similar amounts of DNA are taken up. Also of note is that the input USS-D does not saturate until higher concentrations than the other three samples.

This is all good news. I left out a fair number of details, but this pilot-scale experiment is extremely encouraging. Next week, I plan to repeat the experiment, but this time on an appropriate scale for recovering samples for sequencing. I will also investigate how periplasm-purified USS-D samples behave when recovered from uptake experiments with varying amounts of DNA. I expect that at sub-saturating concentrations, the cells will be less “picky”, such that periplasm-purified USS-D recovered under those conditions will be taken up less well than that recovered at saturating concentrations. This would provide a useful experimental condition, as in the sequence analysis we would be able to investigate the role of competition in shaping USS specificity.

I think this might end up working swimmingly... Onward!