Monday, December 28, 2009

Vacation Post: Uncoiled!



Number 6...

...in the "Top 10 New Species of 2009"

Number 10 is pretty mind-boggling...

Friday, December 11, 2009

Update on problems with analysis

Ugh. Too much to blog about. The pace of computer work is totally different from lab work (though in the end they seem equally labor-intensive), and I've made so many figures and gotten so many numbers in the past week that I barely know where to start...

Well, I guess I'll just blather about a couple problems I've been working through, but leave out all the charts and graphs and figures for the time being:

500-fold sequence coverage still "misses" parts of the genome:

We got a ton of data. For our controls, it was surely massive overkill. Nevertheless, "read depth" (how many read mappings cover a particular position) still varies by a substantial amount. There are numerous sources of this variation (%GC being one that is quite apparent), but I am most worried about variation in "read depth" due to the alignment algorithm I'm using to map reads to the genome.
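Here's roughly how I've been poking at this in R (the file name and column layout are hypothetical stand-ins for the per-position depth export, not Maq's actual format):

    # Sketch (R): correlate read depth with local %GC in non-overlapping windows.
    # "rd_pileup.txt" is a hypothetical per-position export with columns
    # position, reference base, and read depth.
    pileup <- read.table("rd_pileup.txt", header = FALSE,
                         col.names = c("pos", "refbase", "depth"))

    win <- 1000                                  # window size in bp
    bin <- (pileup$pos - 1) %/% win              # assign each position to a window
    gc  <- tapply(pileup$refbase %in% c("G", "C"), bin, mean) * 100
    dep <- tapply(pileup$depth, bin, mean)

    plot(gc, dep, xlab = "% GC per 1 kb window", ylab = "mean read depth")
    cor(gc, dep)                                 # rough measure of the %GC effect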

As I try to root out artifacts in my discovery of putative recombination-associated mutations, I confront the fact that "read depth" is on average reduced where recombinant donor segments have been mapped back to the recipient genome, so the novel mutations I found in these strains are on average supported by far fewer sequence reads than the average base... Most of them still look pretty solid to me (though there are several obvious artifacts), but I don't have a good rationale for deciding whether or not to trust them.
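The comparison itself is simple enough in R (again with hypothetical file names standing in for the real exports):

    # Sketch (R): compare read depth at candidate novel-mutation sites with the
    # genome-wide depth distribution.
    pileup    <- read.table("rd_pileup.txt", header = FALSE,
                            col.names = c("pos", "refbase", "depth"))
    novel_pos <- scan("tfa_novel_positions.txt")   # hypothetical: one position per line
    novel_dep <- pileup$depth[match(novel_pos, pileup$pos)]

    summary(pileup$depth)   # genome-wide depth
    summary(novel_dep)      # depth at the candidate mutations

    # Quick visual check: where do the candidates fall in the overall distribution?
    hist(pileup$depth, breaks = 100, col = "grey", main = "", xlab = "read depth")
    abline(v = novel_dep, col = "red")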

I'm trying several things to work this out, starting with examining the reciprocal mappings (using the donor chromosome as the reference).

So far, my analysis has a huge risk of false negatives:

Several problems here.

(a) Part of both this problem and the previous one is that I am using an alignment package that does not account for gaps (Maq). This means that even a single-nucleotide indel dramatically reduces "read depth" on either side of it (out to ~42 bases, the read length). See above.
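A toy simulation makes the point (pure illustration, not our data):

    # Sketch (R): the "indel shadow" with an ungapped mapper. Reads of length 42
    # that span a 1-bp indel fail to map, so depth collapses over a window of
    # roughly one read length on either side of the indel position.
    set.seed(1)
    genome_len <- 5000; read_len <- 42; indel_pos <- 2500
    starts <- sample(1:(genome_len - read_len + 1), 50000, replace = TRUE)
    spans  <- starts <= indel_pos & (starts + read_len - 1) >= indel_pos
    starts <- starts[!spans]                      # these reads are simply lost

    depth <- tabulate(unlist(lapply(starts, function(s) s:(s + read_len - 1))),
                      nbins = genome_len)
    plot(depth, type = "l", xlab = "position", ylab = "read depth")
    abline(v = indel_pos, col = "red", lty = 2)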

(b) Another issue I'm facing with several of Maq's downstream outputs is that "read depth" is capped at 255. Presumably the developers were conserving memory and assigned only a single byte to this number. But what I haven't quite figured out is whether the SNP output (for example) ignores any possible SNPs where coverage exceeds 255. My cursory look at the rawer output (the "pileup") suggests this might well be the case. That could mean I'm missing a lot, since the mean "read depth" per genome position in our datasets is ~500.
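A quick way to gauge how much this could matter (file names and column positions are assumptions, not Maq's documented layout):

    # Sketch (R): how many positions exceed the 255 cap, and does the SNP output
    # ever report a call at one of them?
    pileup   <- read.table("rd_pileup.txt", header = FALSE,
                           col.names = c("pos", "refbase", "depth"))  # uncapped depth
    over_cap <- pileup$pos[pileup$depth > 255]
    length(over_cap) / nrow(pileup)          # fraction of the genome above the cap

    snps    <- read.table("maq_snps.txt", header = FALSE)   # hypothetical SNP export
    snp_pos <- snps$V1                                       # assumed position column
    sum(snp_pos %in% over_cap)               # SNP calls at above-cap positions
    # If this is ~0 while most of the genome sits above the cap, the SNP caller
    # really is discarding those positions.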

(c) Finally, I've been ignoring all of Maq's self-SNP and "heterozygous" SNP calls in my downstream analysis in R. I presume that SNPs called when I map reads from the recipient strain to the complete recipient sequence are simply mutations between our wild-type Rd strain and the sequenced one. (As an aside, several hundred of the SNPs called by Maq were actually giving the correct base for an ambiguous base in the "complete" genome. I'd like to find a way to revise the archived Rd sequence to get rid of all the ambiguous bases.) And I don't have a solid plan for dealing with the "heterozygous" calls. Because Maq's consensus caller allows at most two haplotypes (it assumes a diploid genome), positions with mixed base signals get called as heterozygotes. These are actually pretty interesting and could reflect cool stuff like clonal variation, but they are probably largely due to multiply-mapping reads and/or persistent sequencing errors.
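The filtering itself is nothing fancy; roughly this, with made-up file and object names:

    # Sketch (R): drop "heterozygous" (ambiguity-code) calls and self-SNPs.
    # Hypothetical SNP exports: one for a transformant, one for the
    # recipient-vs-recipient control mapping.
    tf_snps   <- read.table("tfa_snps.txt",  header = FALSE, col.names = c("pos", "call"))
    self_snps <- read.table("self_snps.txt", header = FALSE, col.names = c("pos", "call"))

    het_codes <- c("R", "Y", "S", "W", "K", "M", "B", "D", "H", "V", "N")

    tf_clean <- subset(tf_snps, !(call %in% het_codes))      # drop heterozygous calls
    tf_clean <- subset(tf_clean, !(pos %in% self_snps$pos))  # drop self-SNPs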


Solutions: The solutions to these problems will initially be largely a matter of doing everything over again with different alignment software. My plan is to use the BWA aligner and SAMtools. BWA allows for gaps (so the "read depth" issue should be partially solved), and SAMtools not only keeps everything in the new agreed-upon standard format (SAM), but also includes several other tools, including what looks to be a better SNP caller (at least it has more adjustable settings). I would also like to try some de novo assembly, perhaps with Velvet, since we have such absurd coverage and a simple enough genome.
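The re-mapping itself should only be a handful of commands, something like the following (wrapped in R system() calls only so it sits alongside the rest of my R code; the exact BWA and SAMtools options are from my reading of the manuals and should be double-checked, and the file names are placeholders):

    # Sketch (R): shell out to BWA and SAMtools to re-map one read set.
    ref   <- "Rd_KW20.fasta"
    reads <- "tfa_reads.fastq"

    system(paste("bwa index", ref))
    system(paste("bwa aln", ref, reads, "> tfa.sai"))
    system(paste("bwa samse", ref, "tfa.sai", reads, "> tfa.sam"))
    system("samtools view -bS tfa.sam > tfa.bam")
    system("samtools sort tfa.bam tfa.sorted")
    system(paste("samtools pileup -cf", ref, "tfa.sorted.bam > tfa.pileup"))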

In the meantime, my R-fu has been improving, though I am convinced that I am missing some really basic principles that would make my code run a lot faster.

Wednesday, December 2, 2009

Mutagenic recombination?

Okay, this is pretty cool. I will probably discover that it’s just some artifact of the mapping, but digging into the transformant data some more reveals what appears to be a high number of mutations within recombined segments (alleles that have neither donor nor recipient identity)...

For this analysis, I used much more stringent criteria for calling SNPs, so that low-quality SNP calls would not contaminate the result. This is particularly important here, since we might expect lower-quality SNP calls in the recombinant segments, due to the relatively high divergence between the donor and recipient genomes and the limitations of current mapping algorithms.
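Concretely, the stricter criteria amount to thresholds along these lines (the cutoffs shown are placeholders, not the exact values I used, and the column names are hypothetical):

    # Sketch (R): stricter SNP filtering on quality, depth, and ambiguity codes.
    snps <- read.table("tfa_snps.txt", header = FALSE,
                       col.names = c("pos", "ref", "call", "qual", "depth"))
    het_codes <- c("R", "Y", "S", "W", "K", "M", "B", "D", "H", "V", "N")

    strict <- subset(snps,
                     qual  >= 30 &              # consensus quality cutoff (placeholder)
                     depth >= 10 &              # minimum supporting reads (placeholder)
                     !(call %in% het_codes))    # no ambiguous/heterozygous calls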

For TfA, there were 802 unambiguous donor alleles and 19 high-quality novel alleles, while for TfB, there were 902 unambiguous donor alleles and 21 high-quality novel alleles.

The two plots below indicate the presence of unambiguous donor alleles in blue bars going to 1 (which defined the recombinant segments), and the presence of unambiguous mutant alleles in red bars going to -1. (Click to enlarge)

That looks pretty striking! Mutations are clearly clustered into the recombined segments!
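(For the record, a plot like this takes only a few lines of R; the position lists here come from hypothetical files, and the bar heights are just plus/minus 1 flags, not quantities.)

    # Sketch (R): donor alleles as blue bars up to 1, novel alleles as red bars
    # down to -1, along the genome.
    donor_pos <- scan("tfa_donor_positions.txt")   # hypothetical position lists
    novel_pos <- scan("tfa_novel_positions.txt")
    genome_len <- 1830000                          # approximate length of the Rd genome

    plot(NULL, xlim = c(1, genome_len), ylim = c(-1, 1),
         xlab = "genome position", ylab = "", yaxt = "n")
    segments(donor_pos, 0, donor_pos,  1, col = "blue")   # donor alleles
    segments(novel_pos, 0, novel_pos, -1, col = "red")    # novel alleles
    abline(h = 0)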

A few of the "novel alleles" in the two genomes are shared: four of the six are in the first overlapping donor segment, and the other two are outside the donor segments. It is still too early to be confident in this result, but still! It is very suggestive.

I’ve never really taken the supposed causal connection between recombination and mutation too seriously, since the evidence mostly seems correlative to me, but if this result holds up, I think it will be a uniquely clear-cut example of mutations induced by recombination.

Before being confident in the result, I need to map the data back to the donor genome, and cross-check the result. If that works, some simple PCR and traditional sequencing should readily confirm or refute this deep sequencing result.