Thursday, January 7, 2010

Using IGV to look at Illumina data

Out with the oughts in with the tens... Out with Maq, in with BWA!

Recap: I got a bunch of Illumina GA2 data from 5 DNA samples: Our recipient chromosome, Rd; our donor chromosome, NP; two transformed chromosomes; and a pool of four transformed chromosomes. I’d previously done a bunch of work with this data using the Maq alignment algorithm and SNP caller and hacked away at the SNPs in R.

Since then, I re-mapped our sequence reads using the BWA alignment algorithm, which was pretty much as easily installed and used (from the command line) as was Maq, though there are far more settings that might be manipulated. BWA has two big advantages over Maq:

1) It performs gapped alignment, so reads containing short indels can still be mapped to a reference sequence. This partially helped to overcome variation in read depth (coverage) due to mapping artifacts.
2) It outputs data in the SAM format, which is a newly-minted standard format for reference mappings of deep sequencing data. Thus I can use several downstream analysis tools that have been developed to work with SAM files and their binary equivalent BAM. Namely, the SAMtools package and the Broad Institute’s Integrated Genome Viewer (IGV) work with this format.

It still has one similar disadvantage as Maq, in that when a read maps to multiple locations in the genome, only one arbitrary mapping is kept, while all alternate mappings are discarded. This can create some odd artifacts at multicopy sequences in the genome. Largely this isn’t such a problem for me, but does cause some challenges in interpreting the correct base at every position.

To make a long story short, as expected, this re-analysis ended up yielding the same recombinant donor segments in our transformed chromosomes at a gross level, but I was also able to examine the genomes we sequenced in much greater detail by manually scanning the raw data in IGV.

To make the short story long...

I had an odd problem running IGV initially: To conserve memory, IGV only loads in a portion of the genome at a time. For large genomes with low coverage data, it works great with little RAM, so launching the application directly from the website is no problem. But because our dataset has such tremendous coverage of the genome, IGV was continuously crashing when I tried to run the 2Gb web version, and the lab computer only has 6Gb, so I couldn’t run the 10Gb web version.

So instead, I had to download the package and run it myself from the command-line. First I modified the relevant shell script for my computer called igv_mac-intel.sh by changing the switch -Xmx750m to -Xmx4800m using nano. This changed the memory used by IGV from 750 Mb to 4800 Mb. Then, after making the script executable (using chmod +x), I could run the script from the IGV directory by typing:
./igv_mac-intel.sh.

I did confront one other issue, which was that my FastA header was not identical to the name of the chromosome used in the SAM format file, so before loading the genome, I had to modify the FastA header from Genbank’s version to what I had used in naming the chromosome in the SAM file.

From there it was simple to load in a genome and then load in BAM files to look at the BWA alignments. (While this worked like a charm, I completely failed to add annotation tracks; IGV can apparently load GFF3 and BED formatted files, but mine wouldn’t load properly for whatever reason).

IGV’s visualization of read alignments really helped me understand the nature of this huge data set. I have now scanned through all the datasets against the Rd sequence reference and am part way through scanning through them all using the NP sequence reference. It’s been a rather cumbersome and slow chore, which I’ve spent the last three days doing, but it is also rewarding to look at the raw data, rather than something that I just have the computer spit out. I’ll get to that next.

I will use the rest of this post to go over some basic things about looking at this kind of data and will later show some interesting bits of our transformants.

Rd versus Rd

Here is what mapping our Rd sequence reads against the Rd reference genome looks like for a tiny segment of the genome. I picked a location that seems to have an “error-prone” region. There’s a whole slew of locations that look like this.
The top panel shows the portion of the Rd chromosome in view.

The next panel shows coverage, or “read depth” per position (so for this window, coverage ranges from ~800-1000 sequence reads mapped per position).

The three colored columns in the read depth panel are indicating potential sequence variants, based on a user-defined threshold percentage (I am using the early access version of IGV to have this control. I set it at 10%.) Just below the coverage plot is a running string of colored boxes, which indicate the bases in the Rd reference sequence.

So at the central base (position 904,862), the reference has a C (blue), whereas in our dataset, this position was covered 770 times, where 78% of reads were C, while ~21% were T (along with a few other stragglers).

In the lower panel is a pileup of the individual sequence reads. Grey indicates a base matching the reference, whereas colors indicate mismatches. As Ilumina sequencing is rather error-prone, as scattering of color is seen all over the place, but for most positions, the vast majority of reads match the reference base. The orientation of the read is also obvious in this view.

However, this is only showing a handful of the reads across this region. I can also compress the reads by right-clicking and selecting the “Collapse Track” option:
Now, the high degree of “errors” for several bases becomes evident. Other features also become apparent, namely the red and black labeled reads:

Red reads have an “orphaned” pair; i.e. the other read from the same molecule wasn’t mapped. This could be because the mate pair was a low-quality read, or could be because the mate pair was in sequence absent from the Rd genome.

Black reads have a paired read outside the user-defined maximum DNA fragment size (which I set at 300 bp). That is, the mate pair maps more than 300 bases away. As with most of the sequencing errors outside of the error-prone region, the red and black lines are fairly rare and evenly scattered. Manually checking where the black mate pairs were showed that most were just outside my threshold, so probably indicate no indel or rearrangement.

So what’s going on with the “error-prone” region? Because there’s not a lot of strangely mapping paired reads, I doubt this is a BWA artifact, but more likely is a sequencing artifact, possibly due to a bit of DNA that is “slippery” to the polymerase.

The interesting possibility that these are due to “clonal variation” (especially since the three variants shown in the coverage column are all ~25%) is unlikely in this instance, since there seem to be error-prone regions immediately adjacent to these three that just didn’t make my threshold.

NP versus Rd

The next image adds the NP sequencing data to the mix, still using Rd as the reference. Again, the error-prone bit shows up, further arguing that this isn’t due to clonal variation, since this is an independent culture and strain. However, on the left, several bona fide NP-specific SNPs were quite unambiguously identified.

The next image illustrates a couple things:
First, the region I previously showed had rather even coverage, whereas many, if not most, regions show much more variation in coverage. Here, coverage varies from ~200 to ~1800. In most instances, all DNA samples showed similar coverage for the same region, unless there is a structural variant.

Second, there is clearly a structural variant of some kind here between Rd and NP. The black discordant reads mostly map ~375 bp away, suggesting an ~75 bp insertion in Rd relative to NP. The red orphaned reads (mostly flanking the black reads) suggest that there is also an insertion in NP relative to Rd... so an insertional deletion then.

Okay, that’s it for now. Later, I’ll put up some more interesting stuff...
(continued...)