Friday, December 11, 2009

Update on problems with analysis

Ugh. Too much to blog about. The pace of computer work is totally different from lab work (though in the end they seem equally labor-intensive), and I've made so many figures and gotten so many numbers in the past week, I barely know where to start...

Well, I guess I'll just blather about a couple problems I've been working through, but leave out all the charts and graphs and figures for the time being:

500-fold sequence coverage still "misses" parts of the genome:

We got a ton of data. For our controls, it was surely massive overkill. Nevertheless, "read depth" (how many read mappings cover a particular position) still varies substantially. There are numerous sources of this variation (%GC being one that is quite apparent), but I am most worried about variation in "read depth" due to the alignment algorithm I'm using to map reads to the genome.
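As a first sanity check on the %GC effect, something like the sketch below could bin the genome into windows and pair each window's GC fraction with its mean depth. This is a toy illustration, not my actual analysis: `genome` and `depths` are stand-ins for the real reference sequence and per-position coverage values.

```python
# Sketch: relate per-window %GC to mean read depth.
# `genome` and `depths` are toy stand-ins for the real reference
# sequence and per-position coverage values.

def window_stats(genome, depths, size=500):
    """Yield (gc_fraction, mean_depth) for non-overlapping windows."""
    for start in range(0, len(genome) - size + 1, size):
        seq = genome[start:start + size]
        win = depths[start:start + size]
        gc = sum(base in "GC" for base in seq) / size
        yield gc, sum(win) / size

# Toy example: an AT-rich window next to a GC-rich one.
genome = "AT" * 250 + "GC" * 250
depths = [400] * 500 + [600] * 500
stats = list(window_stats(genome, depths))
```

Plotting GC fraction against mean depth for real windows would make the %GC bias visible at a glance.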

As I try to root out artifacts among my putative recombination-associated mutations, I confront the fact that "read depth" is on average reduced where recombinant donor segments are mapped back to the recipient genome, so the novel mutations I found in these strains are supported by far fewer sequence reads than the average base... Most of them still look pretty solid to me (though there are several obvious artifacts), but I don't have a good rationale for deciding whether or not to trust them.

I'm trying several things to work this out, starting with examining the reciprocal mappings (using the donor chromosome as the reference).

So far, my analysis has a huge risk of false negatives:

Several problems here.

(a) Part of this problem and the last one is that I am using an alignment package (Maq) that does not account for gaps. This means even a single-nucleotide indel dramatically reduces "read depth" on either side of it (out to the read length, ~42 bases). See above.
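To convince myself how big this effect is, here's a toy simulation (plain Python, and nothing to do with Maq's actual algorithm): reads tiled across a "donor" carrying a 1-bp insertion are mapped back to the reference by exact, ungapped matching. Any read spanning the insertion fails to map, so coverage collapses for roughly a read length on either side of the indel:

```python
import random

random.seed(1)
READ_LEN = 20  # shorter than our real ~42 bp reads, but same idea

# Toy reference, and a "donor" with a single-base insertion at position 100.
ref = "".join(random.choice("ACGT") for _ in range(200))
ins = "A" if ref[100] != "A" else "C"   # make sure the inserted base differs
donor = ref[:100] + ins + ref[100:]

# Tile reads across the donor; map each by exact (ungapped) matching.
depth = [0] * len(ref)
for s in range(len(donor) - READ_LEN + 1):
    read = donor[s:s + READ_LEN]
    hit = ref.find(read)
    if hit >= 0:                        # reads spanning the indel never match
        for i in range(hit, hit + READ_LEN):
            depth[i] += 1

plateau = depth[50]            # depth well away from the indel
hole = min(depth[85:115])      # depth in the window around the indel
```

With gapped alignment (as in BWA), reads spanning a small indel would still map, and the hole would mostly disappear.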

(b) Another issue I'm facing with several of Maq's downstream outputs is that "read depth" is capped at 255. Presumably the developers were conserving memory and assigned only a single byte to this number. But what I haven't quite figured out is whether the SNP output (for example) ignores any possible SNPs where coverage exceeded 255. My cursory look at the rawer output (the "pileup") suggests this might well be the case. This could mean that I'm missing a lot, since the mean "read depth" per genome position in our datasets is ~500.

(c) Finally, I've been ignoring all Maq's self-SNP and "heterozygous" SNP calls in my downstream analysis using R. I presume that SNPs called when mapping the recipient reads to the complete recipient sequence are simply mutations between our wild-type Rd strain and the sequenced one. (As an aside, several hundred of the SNPs called by Maq were actually giving the correct base for an ambiguous base in the "complete" genome. I'd like to find a way to revise the archived Rd sequence to get rid of all the ambiguous bases.) And I don't have a solid plan for how to deal with the "heterozygous" calls. Because Maq's assembly model assumes a diploid sample (i.e., up to two haplotypes), positions with mixed base signals get called as heterozygotes. These are actually pretty interesting and could reflect real biology, like clonal variation, but most are probably due to multiply-mapping reads and/or persistent sequencing errors.
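Since Maq reports mixed positions with two-base IUPAC ambiguity codes in the consensus column, a first-pass triage of the SNP calls is easy to sketch. This is a hypothetical filter, not my actual R code, and the column meanings (reference base, consensus base) are assumptions to check against the real cns2snp output:

```python
# Two-base IUPAC ambiguity codes: Maq reports heterozygous consensus
# calls using these, so they are easy to flag.
HET_CODES = set("RYSWKM")

def classify_snp(ref_base, consensus_base):
    """Rough triage of one SNP call (column meanings assumed)."""
    if ref_base not in "ACGT":
        return "ambiguous-reference"   # e.g. an N in the archived Rd sequence
    if consensus_base in HET_CODES:
        return "heterozygous"
    return "homozygous"

calls = [("A", "G"), ("N", "A"), ("C", "Y")]
labels = [classify_snp(r, c) for r, c in calls]
```

Separating the "ambiguous-reference" bucket would also give a clean list of positions for fixing up the archived Rd sequence.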

Solutions: The solutions to these problems will initially be largely a matter of redoing everything with different alignment software. My plan is to use the BWA aligner and SAMtools. BWA allows gapped alignment (so the "read depth" issue should be partially solved), and SAMtools not only keeps everything in the new community-standard SAM format, but includes several other tools, among them what looks to be a better SNP caller (at least it has more adjustable settings). I would also like to try some de novo assembly, perhaps with Velvet, since we have such absurd coverage and a simple enough genome.

In the meantime, my R-fu has been improving, though I am convinced that I am missing some really basic principles that would make my code run a lot faster.


  1. So we should look for wetware resources to help with these problems. Can we contact the Maq developers to ask about the 255 cap? Are there local R experts? Are you in touch with the local bioinformatics people? I know there's an email list and a seminar/discussion group....

  2. And perhaps it's time to try to run up a big long-distance bill on the lab phone...