So I’m a bit over one year into my postdoc. What have I got to show for myself? Well, plenty of work, but not any papers, or even written manuscripts, so that’s a bit of a problem. Can I turn my first set of genome sequencing data into a manuscript?
Seems likely. I collected >5 Gigabases of Illumina sequence data from several Haemophilus influenzae chromosomes, and this could be used as the basis of a manuscript. I obtained data from a donor strain (86-028NP NovR NalR) and a recipient strain (Rd, RR722) as control data (in order to evaluate the ability of the sequencing and read alignment to correctly identify polymorphisms). I also obtained data from two individual transformants and a pool of four transformants to identify donor alleles in transformed recipient chromosomes. I even found some things out.
Does this a paper make? One outstanding issue is that, although there is a lot of data, which took a fair amount of work to get a handle on, not much of it is biologically relevant. Yes, I obtained extremely accurate and comprehensive data for the four transformants sequenced. But it was still only four transformants. There are some biologically meaningful results; they just aren’t terribly novel or statistically robust. The bigger biologically meaningful results will have to wait until we can collect more data.
So to turn this into something publishable, the approach and method need to be important enough (and made explicit enough) to be of value to others. So far, I have not done anything in my analysis that is truly novel, but I have managed to produce the bare bones of a “pipeline” for measuring allele frequencies from pools and identifying recombination tracts in transformants. The data we got was also extremely high coverage, so we were able to see the limits of the technology fairly well: depth-of-coverage variation, errors, and issues with read alignment.
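To make the "allele frequencies from pools" step concrete, here is a minimal sketch of the arithmetic, not the actual pipeline. It assumes that the read counts supporting the donor and recipient alleles at a known SNP have already been extracted from the pileup, and that the pooled clones are equally represented in the library. The function names `donor_allele_frequency` and `nearest_pool_state` are made up for illustration.

```python
def donor_allele_frequency(donor_reads: int, recipient_reads: int) -> float:
    """Fraction of informative reads at one SNP that carry the donor allele."""
    total = donor_reads + recipient_reads
    if total == 0:
        raise ValueError("no informative reads at this position")
    return donor_reads / total


def nearest_pool_state(freq: float, n_clones: int = 4) -> int:
    """Most likely number of pooled clones carrying the donor allele.

    Assumes every clone contributes equally to the pool, so the expected
    frequencies are multiples of 1 / n_clones (0, 0.25, 0.5, ... for four).
    """
    return round(freq * n_clones)


# Hypothetical counts: 25 donor-allele reads and 75 recipient-allele reads.
freq = donor_allele_frequency(25, 75)
clones = nearest_pool_state(freq)
```

With a pool of four transformants, a frequency near 0.25 would point to a single clone carrying the donor allele at that site. Sequencing error and uneven pooling would blur this in real data, which is why very high coverage helps.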
Though everything I’ve done so far uses “off-the-shelf” bioinformatics tools, there are so many people trying to do similar things, it might be useful to write a paper that is sort of an “application” of the technology and tools I’ve been using. It took me months to piece everything together, so maybe I could save someone else some time by having everything in one place. But with each passing day, the value of such a paper is probably diminishing, so I’d best get started!
There are still a few analyses I’d like to do that would give the paper a little more spice:
- Structural variant analysis: This is something that will involve our collaborators at UVA, who are experts. We can see these pretty well (at least the larger ones), but something systematic has yet to be done.
- Reciprocal read mapping: I’ve mapped all the data to both the donor and recipient genomes, but I have not yet fully leveraged this fact. The read alignment artifacts that arise when mapping data from one strain onto the other could be handled much better if I were able to assign individual reads to either of the two reference genomes, based on the mapping quality. I’d really like to do this. It’d be novel, mainly because most people doing sequencing are focused on SNP discovery. I already have all my SNPs discovered, so doing an extra good job at calling SNP frequencies using reciprocal alignment would be at least something new. This will take a bit of work, however, and I’ll need to figure out the best computational way to do it. Aside from doing uber-detailed error analysis for a technical paper, I think this is really the best chance to make a novel contribution bioinformatically.
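One possible way to do the read assignment, sketched here only as a starting point since this step is not yet implemented: compare each read's mapping quality against the two references and assign it to the genome where it maps clearly better. The sketch assumes the per-read mapping qualities have already been pulled out of the two alignments into dictionaries keyed by read name. The function name `assign_reads` and the `min_diff` threshold are hypothetical, and treating an unmapped read as mapping quality 0 is an assumption.

```python
from typing import Dict


def assign_reads(
    donor_mapq: Dict[str, int],
    recipient_mapq: Dict[str, int],
    min_diff: int = 10,
) -> Dict[str, str]:
    """Label each read 'donor', 'recipient', or 'ambiguous'.

    A read is assigned to a reference only if its mapping quality there
    exceeds its mapping quality on the other reference by at least
    min_diff. Reads missing from one alignment are treated as quality 0.
    """
    assignments = {}
    for read in set(donor_mapq) | set(recipient_mapq):
        d = donor_mapq.get(read, 0)
        r = recipient_mapq.get(read, 0)
        if d - r >= min_diff:
            assignments[read] = "donor"
        elif r - d >= min_diff:
            assignments[read] = "recipient"
        else:
            assignments[read] = "ambiguous"
    return assignments


# Hypothetical mapping qualities for three reads.
calls = assign_reads(
    donor_mapq={"read1": 37, "read2": 5, "read3": 30},
    recipient_mapq={"read1": 37, "read2": 29},
)
```

Reads that map equally well to both genomes stay "ambiguous", which is the expected outcome across most of the genome where the two strains are identical. Only reads overlapping polymorphisms should end up assigned, and those are the ones that matter for calling SNP frequencies.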
Here is a rough outline of the manuscript, as I’m viewing it so far:
1. Introduction
- Many bacteria become naturally competent. Natural transformation is important in evolution.
- Previous sequencing studies have focused on only a handful of defined constructs (Bacillus, Helicobacter, Actinobacillus).
- For any organism, the total extent of recombined fragments in individual transformants has never been directly evaluated, and the factors dictating the chance of transformation are only poorly understood as a result.
- Haemophilus influenzae is a model system for natural transformation. The mechanism is well-defined. Transformation is efficient in the lab strain.
- The extensive natural genetic variation between H. influenzae strains provides tens of thousands of markers to identify recombination tracts in individual transformants: not only single-nucleotide variation, but also structural variation in the form of indels and other rearrangements. The “supragenome” hypothesis.
- We investigated the use of massively parallel sequencing (or “next generation sequencing”, NGS) to characterize natural transformation at a whole-genome scale.
- Our results show the Illumina platform to be an excellent method to obtain nearly exhaustive information on recombination tracts in individual transformants. Our approach uses the alignment of sequence reads to both donor and recipient reference sequences. We obtained donor and recipient genome sequence as controls for evaluating sequencing error, depth of coverage, and polymorphism identification. We also obtained the sequence data from two individual transformants and a pool of four transformants.
- Individual recombination tracts are longer than previously appreciated and can bring hundreds of polymorphisms from donor to recipient chromosomes (single-nucleotide, insertion, deletion, and insertional deletion). However, recombination tracts often appear interrupted by or terminated at sites of structural variation between the two genomes. This suggests that such variation acts as a barrier to strand exchange and/or is a preferred mismatch repair substrate.
2. Materials and Methods
- Strains
- DNA
- Transformations
- Library preparation
- Illumina sequencing and initial data processing pipeline
- Reference genome alignment by MUMmer, MAUVE
- Reciprocal read alignment by BWA
- SAMtools pileup
- Galaxy pileup parser
- Variant frequency analysis
- Assignment of reads (unimplemented)
- Donor segment calling
- Analysis of structural variation by HYDRA
3. Results
- Genetic transformation of competent cells: Marker to marker variation. Dependence on sequence identity. Congression and linkage
- Illumina sequencing and read alignment: Table of sequencing results and fraction of mapped reads. Variation in depth-of-coverage. Sources of sequencing error and read mapping artifacts. Reciprocal read alignment? Varying alignment stringency?
- Comparison of donor and recipient strains. Identification of SNPs and structural variants between the donor and recipient strains. Comparison to whole-genome alignment methods.
- Identification of donor alleles in transformed recipient chromosomes. Accounting for SV alleles. Identifying novel alleles.
- Identification of allele frequencies in a pool of four transformants.
- Identification of donor segments and putative recombination tracts
- Enrichment of SVs at donor segment breakpoints
4. Discussion
- A first look at transformation… still few transformants
- Excellent method. Limitations are circumvented by very high coverage, knowledge of both donor and recipient genome sequences, and the use of reciprocal read alignment (unimplemented)
- Big recombination tracts. Evidence of mismatch repair. SVs as blocks to recombination tract progression.
- Speculations: Hotspots? Role of uptake specificity? Supragenome transfer?
- Future: Aside from collecting more transformants, making a transformation frequency map to investigate the “cis-acting” factors controlling the efficiency of transformation. Long-term utility in understanding the population genetics of human pathogens.
Still a rough outline, but something to start with....
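As an illustration of what the "donor segment calling" step in the Methods could look like, here is a minimal sketch under simple assumptions: each known SNP in a transformant has already been called as either the donor or the recipient allele, and a donor segment is defined as a run of consecutive donor-allele markers. The function name `call_donor_segments` is hypothetical, and the segment coordinates are minimum extents, since the true tract ends lie somewhere between the flanking markers.

```python
from typing import List, Tuple


def call_donor_segments(
    markers: List[Tuple[int, bool]],
) -> List[Tuple[int, int, int]]:
    """Group consecutive donor-allele markers into segments.

    markers: (position, is_donor) pairs, one per known SNP.
    Returns (first_donor_pos, last_donor_pos, n_markers) per segment.
    """
    segments = []
    start = end = None
    count = 0
    for pos, is_donor in sorted(markers):
        if is_donor:
            if start is None:
                start = pos  # first donor marker opens a new segment
            end = pos
            count += 1
        elif start is not None:
            # A recipient marker closes the current segment.
            segments.append((start, end, count))
            start = end = None
            count = 0
    if start is not None:
        segments.append((start, end, count))
    return segments


# Hypothetical marker calls along a chromosome.
example = [
    (100, False), (250, True), (400, True), (900, True),
    (1500, False), (2000, True), (2600, False),
]
segments = call_donor_segments(example)
# segments == [(250, 900, 3), (2000, 2000, 1)]
```

Whether nearby segments separated by a few recipient markers belong to one interrupted recombination tract is a separate question that this sketch does not try to answer.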
(continued...)