Thursday, September 17, 2009

Reverse engineering

Okay, so now that we’ve laid out all the brilliant experiments we’re planning in our proposals, the reality of actually doing them is starting to sink in. We’ve also managed to put down some fairly concrete goals for the next several months.

One of our experiments involves measuring the specificity of DNA uptake by naturally competent H. influenzae for fragments containing “the genomic USS motif”. The H. influenzae genome contains an abundant sequence motif, and fragments bearing it are taken up better than fragments that don’t. This “uptake signal sequence” was originally defined by its functional role in DNA uptake, but has since been characterized mostly by bioinformatics, with no direct uptake specificity data. The limited data from previous lab members suggests only an imperfect correspondence between the properties of the genomic motif and the specificity of DNA uptake.

The idea, then, is to feed competent cells small DNA fragments bearing a degenerate (highly mutated) version of the USS consensus sequence, recover those that are preferentially taken up, and sequence the resulting pool. USSs are ~32 bases, well within the reach of single-end Illumina reads, if they are positioned properly next to a sequencing primer.
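To get a feel for what "degenerate" means here, the following is a minimal Python sketch of such a pool. Each position keeps the consensus base with some probability and otherwise receives one of the other three bases. The 32-base sequence is a made-up placeholder (not the real consensus), and the 12% per-position mutation rate is purely an assumption for illustration.

```python
import random

BASES = "ACGT"

def degenerate_copy(consensus, mut_rate, rng):
    """Return one molecule: each base is swapped for a different base with probability mut_rate."""
    out = []
    for base in consensus:
        if rng.random() < mut_rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical placeholder, NOT the real USS consensus.
PLACEHOLDER_CONSENSUS = "ACGT" * 8   # 32 bases
MUT_RATE = 0.12                      # assumed degeneracy per position

rng = random.Random(0)
pool = [degenerate_copy(PLACEHOLDER_CONSENSUS, MUT_RATE, rng) for _ in range(10000)]

mean_mm = sum(mismatches(s, PLACEHOLDER_CONSENSUS) for s in pool) / len(pool)
expected_mm = len(PLACEHOLDER_CONSENSUS) * MUT_RATE  # binomial mean: length x rate
print(f"expected mismatches per molecule: {expected_mm:.2f}, simulated: {mean_mm:.2f}")
```

The number of mismatches per molecule follows a binomial distribution, so the mutation rate directly sets how far from consensus the typical molecule in the input pool sits.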

I’ve previously discussed the expected properties of a degenerate USS pool. And though I think we need to consider this more, I will focus this post on the design of other parts of the construct that will allow us to circumvent subsequent sequencing library construction steps. Illumina sequencing uses specific sequences added to the ends of molecules to capture and sequence DNA of interest...

Properties needed for a USS-containing construct, where Illumina sequencing can be directly performed to sequence the USS:

(1) SIZE: ≥200 base pairs, dsDNA. 200 bp fragments with a USS are efficiently taken up by cells, and the size is sufficient for efficient cluster synthesis and sequencing using Illumina’s Genome Analyzer.
(2) CAPTURE SEQUENCES: One end of a strand of each fragment needs to be able to anneal to one of the two “Flow Cell Primers” (FP) in the Illumina flow cell, while the other end of the same molecule needs to contain the reverse complement of the other FP.
(3) SEQUENCING PRIMER BINDING SITE: The reverse complement of Illumina’s sequencing primer needs to be immediately downstream of the reverse complement of the USS. (This could work the other way, but getting the “sense” USS directly from the sequencing reads seems optimal).
(4) TAG SEQUENCE: The first few (four) bases of each read should be a fixed, non-degenerate sequence to facilitate the alignment of the degenerate USS reads.
(5) CONSTRUCTION: After consulting several oligo makers, we learned that we wouldn’t be able to get our degenerate constructs built into an oligo longer than 130 nt. This means that I will need to anneal two oligos together and extend with polymerase to generate a full-length construct.
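As an illustration of how the fixed tag in (4) could be used downstream, here is a small Python sketch that keeps only reads beginning with the tag and trims it off, leaving the 32-base USS in register. The tag (ATGC) and USS length come from the design notes in the appendix; the example reads are invented, and requiring an exact tag match is an assumption (a real pipeline might tolerate sequencing errors in the tag).

```python
TAG = "ATGC"    # fixed 4-base tag at the start of each read
USS_LEN = 32    # length of the degenerate USS that follows the tag

def extract_uss(read, tag=TAG, uss_len=USS_LEN):
    """Return the USS portion of a read, or None if the tag is absent or the read is too short."""
    if not read.startswith(tag):
        return None
    uss = read[len(tag):len(tag) + uss_len]
    return uss if len(uss) == uss_len else None

# Hypothetical reads, for illustration only.
reads = [
    "ATGC" + "A" * 32,   # correct tag, full-length USS: kept
    "ATGG" + "A" * 32,   # wrong tag: dropped
    "ATGC" + "A" * 10,   # truncated read: dropped
]

kept = [u for u in (extract_uss(r) for r in reads) if u is not None]
print(len(kept), "of", len(reads), "reads kept")
```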

The first trick was to actually find out what the normal Illumina adapter and primer sequences were. They were available on-line, and I think I’ve mostly reverse-engineered what the different bits do. And I think I have a reasonable design:

I’ll order two oligos, one 130 nt and the other 106 nt. (At the end of this post, I will list the exact sequences of each part and some notes.) They’ll have 36 bp of reverse complementarity at their 3’-ends, so that I can anneal them and extend to produce full-length construct.
To illustrate what all the different parts of the construct are for, here’s a color-coded version, for which I’ll schematically diagram the Illumina cluster synthesis and sequence priming.
The key features are that the flow cell primers (FP) are on opposite strands on opposite ends and the sequencing primer (SP) sits adjacent to the USS (with the 4-bp tag at the beginning). I am using plasmid sequence present in the lab’s other USS constructs for the Gaps (1 and 2).
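The anneal-and-extend step can be sketched in a few lines of Python. Only the lengths (130 nt, 106 nt, 36 bp overlap) come from the design; the sequences are random placeholders, and the function name `anneal_and_extend` is just something I made up for this sketch.

```python
import random

COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMP)[::-1]

def anneal_and_extend(oligo1, oligo2, overlap):
    """Top strand of the duplex formed when the 3' ends of two oligos pair and are filled in.

    Both oligos are written 5'->3'. The last `overlap` bases of oligo1 must be
    the reverse complement of the last `overlap` bases of oligo2.
    """
    if oligo1[-overlap:] != revcomp(oligo2[-overlap:]):
        raise ValueError("3' ends are not reverse complementary")
    # revcomp(oligo2) begins with the overlap; the remainder is what the
    # polymerase adds onto the 3' end of oligo1.
    return oligo1 + revcomp(oligo2)[overlap:]

# Placeholder 200-mer standing in for the real construct (random sequence).
rng = random.Random(1)
top = "".join(rng.choice("ACGT") for _ in range(200))

oligo1 = top[:130]            # 130 nt, top strand
oligo2 = revcomp(top[94:])    # 106 nt, bottom strand; shares bases 95-130 with oligo1

product = anneal_and_extend(oligo1, oligo2, overlap=36)
print(len(oligo1), len(oligo2), len(product))
```

The arithmetic is simply 130 + 106 - 36 = 200: the overlap is counted once in the finished duplex.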

To sequence the 200mer (either before or after recovery from competent cell periplasms), the DNA would be melted and annealed to an Illumina flow cell. Below are shown two different parts of a flow cell surface, where the two different strands of a single molecule might anneal.
DNA synthesis from FP1 or FP2 generates a covalently attached version of each strand.
The original molecule is melted off and washed out of the flow cell, and a special in situ PCR method generates clusters of single-strands covalently bound to the flow cell surface. In each cluster, the strands are oriented in both directions.
Sequencing then proceeds from the SP binding sites. In this design, the sequencing primer annealed at the SP binding site is extended along the complement of the USS (beginning with the four fixed bases), so the sequence actually generated is the USS as it appears in the construct.
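To convince myself of the strand logic, here is a rough Python sketch of a single-end read. It assumes the primer simply finds its binding site (the reverse complement of SP) on a surface-bound strand and is extended toward that strand's 5' end. All sequences except the ATGC tag are random placeholders, and `single_end_read` is a hypothetical helper, not anything from Illumina's software.

```python
import random

COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMP)[::-1]

def single_end_read(template, primer, read_len):
    """Bases synthesized when `primer` anneals to `template` and is extended.

    `template` is written 5'->3' and must contain revcomp(primer). Extension
    copies the template bases lying 5' of the binding site.
    """
    pos = template.find(revcomp(primer))
    if pos < 0:
        raise ValueError("primer binding site not found on template")
    newly_made = revcomp(template[:pos])   # new strand, read outward from the primer
    return newly_made[:read_len]

def rand_seq(n, rng):
    return "".join(rng.choice("ACGT") for _ in range(n))

rng = random.Random(2)
fp1, sp = rand_seq(25, rng), rand_seq(33, rng)   # placeholder flow cell and sequencing primers
tag, uss = "ATGC", rand_seq(32, rng)             # real tag, placeholder USS
rest = rand_seq(106, rng)                        # placeholder for everything downstream

top = fp1 + sp + tag + uss + rest
template = revcomp(top)    # the strand carrying the complements of SP and the USS

read = single_end_read(template, sp, read_len=36)
print(read == tag + uss)
```

So as long as the strand being read carries the reverse complement of the USS just upstream of the reverse complement of SP, the read comes out as tag plus "sense" USS.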
There are several small details to go over to make sure that this design will work. Because the oligos are so expensive, and the degenerate oligo will be precious, I also plan to buy several non-degenerate oligos corresponding to perfect consensus, randomized, and mutant USSs. These will act as controls for the annealing/extension step that generates the uptake substrates and as controls for measuring saturation curves to optimize the DNA uptake conditions. I will also be able to use PCR to regenerate the control constructs, while I should probably avoid amplifying the degenerate USS construct for fear of strongly biasing the representation of different sequences.

NEXT UP: Uh oh… What about yields? Dimensional analysis…

APPENDIX:

The different parts of the two oligos:

Notes on my reverse engineering:
  1. FP1 (25 nt): Composed of putative 20mer FP1 + first 5 bases of one adaptor (calling it A)
  2. SP1 (33 nt): Sequencing primer for single-end Illumina runs. Includes the 13 bases of the normal adaptor that normally results in a 13 bp inverted repeat palindrome on either side of adapted DNA fragments.
  3. USS (36 nt): Includes 4-base tag (ATGC) upstream of a 32-base genomic Gibbs consensus sequence with a set level of degeneracy at each position.
  4. G1 (36 nt): Additional sequence from pGEM7f, corresponding to the portion of the spacer region where the two oligos are intended to anneal.
  5. G2 (46 nt): More sequence from pGEM7f, corresponding to the spacer region present on only one of the two oligos.
  6. FP2’ (23 nt): Composed of the complement to the 20mer FP2 + first 3 bases of the other adaptor (calling it B).
  7. Total length after annealing and extension is 200 bases, where the USS is located from position 63 (after the 4-base tag) to position 94. In the flow cell, using SP1 as the sequencing primer should read off the complement of the USS sequence, so the actual sequence obtained will correspond to the USS (with the first four bases always ATGC).
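As a bookkeeping check, the part lengths listed above can be tallied in Python to get 1-based coordinates for each piece. The assignment of parts to oligos (FP1 + SP1 + USS + G1 on the first, G1 + G2 + FP2’ on the second) is my reading of the design rather than something stated outright. Note that the lengths as listed sum to 199 for the construct and 105 for the second oligo, one short of the 200 and 106 quoted, so one of the listed lengths downstream of G1 may be off by one; the sketch only does the arithmetic and does not resolve that.

```python
from itertools import accumulate

# Part lengths exactly as listed in the notes above (5' -> 3' on the top strand).
PARTS = [("FP1", 25), ("SP1", 33), ("USS", 36), ("G1", 36), ("G2", 46), ("FP2'", 23)]
TAG_LEN = 4   # the ATGC tag occupies the first 4 bases of the 36-nt USS block

def layout(parts):
    """Return {name: (start, end)} using 1-based, inclusive coordinates."""
    ends = list(accumulate(n for _, n in parts))
    return {name: (end - n + 1, end) for (name, n), end in zip(parts, ends)}

coords = layout(PARTS)
lengths = dict(PARTS)

total = sum(lengths.values())
oligo1_len = sum(lengths[p] for p in ("FP1", "SP1", "USS", "G1"))   # assumed composition
oligo2_len = sum(lengths[p] for p in ("G1", "G2", "FP2'"))          # assumed composition
uss_start = coords["USS"][0] + TAG_LEN   # first degenerate USS base
uss_end = coords["USS"][1]

for name, (start, end) in coords.items():
    print(f"{name:5s} {start:3d}-{end:3d}")
print("USS proper:", uss_start, "-", uss_end)
print("total:", total, "oligo 1:", oligo1_len, "oligo 2:", oligo2_len)
```

The tag lands at positions 59-62 and the 32 degenerate bases at 63-94, which matches the coordinates given in note 7.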
