Tuesday, August 17, 2010

Direct uptake specificity measurements

So I got a bit distracted from our plans with the Genome BC dough, since I got some more sequencing data! Woo! I'll return to the design of our 92 clones' worth of sequencing later. For now, I am trying to work through some more grant writing, incorporating our findings from the new sequence data.

Below, I'll (try to) quickly summarize what I've found with my (very) preliminary data analysis and then try to describe the thing I'm having a hard time writing clearly about...

First off: It worked! My scheming panned out quite nicely. I got ~15 million single-end sequence reads from each of two DNA pools. One was from the input 24% degenerate USS-containing fragment. The other was from the periplasm-purified DNA fragments. The experiment and its outcome were described here and here.

Filtering the data:

The first thing I had to do was somehow parse the raw data file. For my first pass, I used a quick and dirty, but probably quite stringent, quality filter:
  1. I excluded any read where the first four and last six bases were not exactly what they should be. (These sequenced positions were supposed to be non-degenerate, so they should have zero mismatches from the expected input sequence.) This excludes reads with known "bad" bases and forces the alignment to be correct. The unfiltered data contained a bunch of reads that clearly had insertions or deletions within the USS, probably due to the limitations of the original oligo synthesis.
  2. I excluded all reads that contained any Ns. This makes the analysis a bit simpler, especially when calculating the number of mismatches in the degenerate USS from consensus.
This left me with ~10 million reads in each dataset. That's a lot of loss, and I will certainly try to return to those discarded reads, but for now, leaving them out is a decent filter that simplifies the analysis (mainly by forcing a "true" alignment).


I piped a few greps together to pull all the sequence reads matching my criteria (the first four bases ATGC, the last six bases GGTCGA, and no Ns). From a UNIX terminal:
grep ^ATGC sequence.txt | grep GGTCGA$ | grep -v N > newfile.txt
That's it! (Someday, I should probably do something with all the quality scores, but for now, I'm ignoring them. This operation also lost all the read IDs.)

I soon found when analyzing these new files that my poor computer-fu was producing very slow-running scripts, so I took only the first 100,000 reads from each filtered dataset forward with a simple:
head -n 100000 newfile.txt
Making a position weight matrix:

I next calculated the frequency of each base at each position for the two datasets (generating two position weight matrices, or PWMs). Luckily, I have previously worked out how to use the input PWM to background-correct the periplasm-purified PWM here and here.
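As a stand-in for the linked details, here's a minimal sketch of one simple form of background correction (my shorthand, not necessarily the exact method from those posts): a per-position, per-base log2 ratio of periplasm-purified frequencies to input frequencies.

```r
# Sketch only: correct the recovered-fragment PWM for the input pool's
# base composition with a per-cell log2 ratio. Both arguments are
# position x base frequency matrices. (The real correction is described
# in the linked posts; this is just the simplest version.)
correct_pwm <- function(periplasm_pwm, input_pwm) {
  log2(periplasm_pwm / input_pwm)
}
```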

Details: On to some R...

I read in the data with scan. I've usually used things like read.table or read.csv, but scan looks like the way to go with a big giant vector:
sequences <- scan("newfile.txt", what="character" )
Next, I calculated the per position base frequencies:
basesread <- 42
newmatrix <- matrix(NA, basesread, 4)

for (i in 1:basesread) {
  newmatrix[i, ] <- table(substr(sequences, i, i))
}

Hmmm... Would've been nice to avoid that loop...
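For the record, here's one loop-free way to get the same numbers (a sketch, wrapped in a hypothetical base_freqs helper of my own naming): split every read into characters, stack them into a matrix, and tabulate each column at once.

```r
# Hypothetical helper: per-position base frequencies without the loop.
# Splits each read into single characters, stacks them into a
# reads x positions matrix, then tabulates each column.
base_freqs <- function(sequences, basesread) {
  base_mat <- matrix(unlist(strsplit(sequences, "")),
                     ncol = basesread, byrow = TRUE)
  counts <- apply(base_mat, 2, function(col)
    table(factor(col, levels = c("A", "C", "G", "T"))))
  t(counts) / length(sequences)  # positions x bases, as frequencies
}
```

Using factor() with fixed levels keeps all four bases in the table even when one base happens to be absent at a position, which the bare table() call above can't guarantee.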

From the two matrices, I got a background-corrected periplasmic uptake specificity motif that looks like this:
This excludes the fixed bases and one other anomalous base at the beginning of the read.
(Sorry I can't use Weblogo or other things that make nicer logos, as none accommodate position-specific base composition.)

Hey, look at that C at position 7! It just sticks way out! That must be a particularly important residue for uptake. A previous postdoc and a graduate student had both measured uptake efficiency for fragments with point mutations in a consensus USS. Those data offer a great independent validation of this basic result. Here's what that data looks like:
It looks like the two approaches are giving comparable results. Fabulous!

The next step:

So what else can we learn from this much larger deep sequencing dataset? I'm sure all sorts of things. For now, the only other thing I've done is calculate the number of mismatches between each of the 100,000 filtered sequence reads and the consensus USS the degenerate construct was based on.

Here's what the distribution of mismatch number looks like for the two datasets:

Awesome! Shifted, just as previously simulated!


At first I tried a bunch of extremely laborious and slow ways of doing this. Then I found a package called "cba" with a function called "sdists", which can do edit distances with vectors of character strings. (The sdists step is how I murdered the computers when I tried using the full dataset.)

consensus <- "ATGCCAAAGTGCGGTTAATTTTTACAGTATTTTTGGGTTCGA"
distances <- as.vector(sdists(consensus, sequences, weight=c(1,0,1)))

I guess I could just do it in chunks for the whole dataset to save on RAM...
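Alternatively, since the filter forces every read into the same fixed-length alignment, a plain Hamming distance in base R should give equivalent mismatch counts without needing cba at all (a sketch, using a helper name of my own; it ignores indels, unlike a true edit distance, but the filtering already threw those reads out):

```r
# Sketch: per-read mismatch count against the consensus, assuming all
# reads are the same length and already aligned (which the grep filter
# guarantees). Counts substitutions only; no indels are considered.
hamming <- function(reads, consensus) {
  cons_chars <- strsplit(consensus, "")[[1]]
  vapply(strsplit(reads, ""),
         function(s) sum(s != cons_chars), integer(1))
}
```

This is also trivially cheap on memory, so running it over the full ~10 million reads in chunks should be painless.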

Anyways, this vector of distances is going to serve as an important index to the sequence reads for the next set of analyses (which will look at co-variation between positions and how specific mismatches affect the motif).


Okay, now to try and describe the thing I'm having trouble saying in a succinct way. These information content measurements, as in the above motif, are in bits. This is "information content", but it can also be thought of as "surprise". This is why the USS motif derived from the genome has so much higher information content than the new experimentally determined uptake motif. (Here's the genomic USS:)
Because most of the bases at a given position in the degenerate USS pool were already the preferred consensus base, little "surprise" (as measured in bits) was elicited when the preferred base was enriched in the periplasm-purified pool. That's why the scale is so different between the two motifs.
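To make the "surprise" point concrete, here's a toy calculation of per-position information content as relative entropy against the pool's background composition (my illustration; all the frequencies are made up):

```r
# Relative entropy (in bits) of observed base frequencies f against
# background frequencies b. When the background already favors the
# preferred base, enrichment of that base carries little surprise.
position_bits <- function(f, b) sum(f * log2(f / b))

# Same observed frequencies, two hypothetical backgrounds:
position_bits(c(0.97, 0.01, 0.01, 0.01), rep(0.25, 4))               # ~1.8 bits
position_bits(c(0.97, 0.01, 0.01, 0.01), c(0.90, 0.03, 0.03, 0.04))  # ~0.05 bits
```

The same enrichment scores nearly 1.8 bits against a uniform background but almost nothing against a background that already favors the consensus base, which is exactly the scale difference between the two motifs.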

Anyways, the issue arises when thinking about some other possible experiments. For example, if uptake specificity were partially modified (as we think we can accomplish using engineered strains), the positions with a new specificity would become extremely informative relative to the others when using the normal degenerate fragments. But if a new construct were designed based on the altered specificity, those positions would no longer have such elevated information content relative to the other bases...

This isn't actually a bad thing--in fact we can exploit it--it's just something hard to describe clearly...

1 comment:

  1. So for the strains carrying a cross-complementing plasmid, testing them with the degenerate USS of the host will give an extremely sensitive measure of changed specificity.

    Based on the results you have, do you think we'd get useful information from giving cells a much more degenerate (or even completely degenerate) USS? I know I originally thought higher degeneracy would be bad, but I'm beginning to see the light.