Thursday, October 15, 2009

Computing Bootcamp

Whew, I’ve really fallen behind on my blogging... Last week, a good friend of mine came into town for a “northern retreat”, in which he hoped to get work done on a paper. Instead, he and I drank enormous amounts of beer and did an enormous amount of computing with the Haemophilus influenzae genome (at least by my standards). While the beer probably didn’t help anything, the computing did.

I’ll go over some of what we did in future posts, but right here I just want to outline some of the computing lessons I learned looking over his shoulder during the week. Many of these lessons have been given to me before and are likely quite basic for the real computationalists out there, but somehow I’ve emerged from the computing immersion with a lot more competence and confidence than I had before...



Here are three useful things I'm getting better at:

(1) Avoid using the mouse. The more that can be accomplished from the command line and with keystrokes, the better. At the command line, tab-completion and arrow-key access to the command history make issuing commands far more efficient. The coolest keystroke habit I’ve picked up in Mac OS X Leopard is Cmd-Tab, which takes you to the last active open application (and repeated Cmd-Tabs cycle through the open applications in order of their previous usage). This is perfect for toggling between the command line and a text editor where one can keep track of what one is doing.

(2) Toggle between the command line and a text editor constantly. Rather than trying to write a whole script and then run it from the command line, it was far easier and faster to simply try commands out, sending their results to the standard output, and cobble the script together line by line, adding the working commands to a text document. This has three useful effects: (1) bugs get worked out before they even go into a script; (2) it forces one to document one’s work, as in a lab notebook (this also ended up being quite useful for my lab meeting this week, in which I decided to illustrate some stuff directly from the terminal); and (3) it is forcing me to work “properly”, that is, sticking with UNIX commands as much as possible.
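To make that concrete, here’s a made-up sketch of the workflow (the file names are hypothetical): a command gets tried at the prompt, its output eyeballed on the screen, and only then recorded in the growing script:

    # Try the command interactively first; results print to standard output
    cut -f1 genome_table.txt | sort | uniq -c | head

    # Once it does the right thing, append the working version to the script
    echo 'cut -f1 genome_table.txt | sort | uniq -c > feature_counts.txt' >> analysis.sh

The script then doubles as the lab-notebook record of exactly what was run.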

(3) Learn how the computer is indexing your data. This point is probably the most important, but also the one that is taking me the most effort. I’ll illustrate with an example (which I’ll get into in more scientific detail later):

The output of one of our scripts was a giant table: 3 columns by 1.8 million rows. I wanted to look at a subset of this huge table, containing only the rows where the values in certain cells exceeded some threshold. At first I was doing this (in R) by writing fairly complicated loops, which would go through each line in the file, check whether any cells fit my criteria, and then return a new file that included only the rows I was interested in. When I’d run the loop, it would take several minutes to finish. And writing the loop was somewhat cumbersome.

But the extremely valuable thing I learned was that R already had all the data in RAM, indexed in a very specific way. Built-in functions (which are extremely fast) allowed me to access a subset of the data with a single simple line of code. Not only did this run dramatically faster, it was also much more intuitive to write. Furthermore, it made it possible for me to index the large dataset in several different ways and instantly call up whichever subset I wanted to plot or whatnot. I ended up with a much leaner and more straightforward way of analyzing the giant table, and I didn’t need to make a bunch of intermediate files or keep track of as many variables.
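To give a flavor of the difference (with made-up names; the real table, column, and threshold will show up in a later post), here is roughly what the two approaches look like in R:

    # The slow way: loop over all 1.8 million rows, testing each one
    # and growing a vector of the row numbers that pass the cutoff
    keep <- c()
    for (i in 1:nrow(big.table)) {
        if (big.table[i, 3] > threshold) {   # column and cutoff are hypothetical
            keep <- c(keep, i)
        }
    }
    subset.slow <- big.table[keep, ]

    # The fast way: R’s built-in logical indexing does the same job in
    # one line, applying the comparison to the whole column at once
    subset.fast <- big.table[big.table[, 3] > threshold, ]

The second version runs in a blink because the comparison is vectorized over the whole column sitting in RAM, rather than being re-interpreted once per row.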

Next time, I’ll try to flesh out some of the details of what I was doing...

1 comment:

  1. Posts are getting very infrequent.... How about trying for short daily posts instead of long ambitious essays?
