Open source = choice = workflow

Once you find a computational environment that works for you, your work flows so well that you take the environment for granted and rarely give it a moments thought. That’s as things should be – ultimately your work is more important than how you achieve it. Once in a while though, as I’m typing away in a Linux terminal, I like to pause and reflect on how the wonderful world of open-source makes my working day easier and even more fun than it otherwise M$ight be.

One of my current projects uses some Perl to generate output that looks like this:

O00141 1 Q96PU5 308 42.19 O00141 1 Q96PU5 277 33.42 O14757 1 P30304 76* 89.62 O14757 1 P30304 293* 88.50 O14757 1 P30304 507* 86.99 O14757 1 P30304 178* 85.62 O14757 1 P30304 124* 82.41 O14757 1 P30304 192 78.51 O14757 1 P30304 268 78.26 O14757 1 P30304 249 77.43

Plain text, tab-delimited output is great. Those numbers in column 4 are positions in a protein sequence. You can think of the positions marked with a “*” as “positive” and the rest “negative”. The numbers in column 5 are scores associated with that position.

I’d like to sort this output into two files with scores for the positive and negative sites. How do I get the lines with a positive site?

grep "*" infile > pos_outfile

And the negative? The grep “-v” switch matches the inverse of what I ask for:

grep -v "*" infile > neg_outfile

Wait, there’s more. I can sort my output by score (column 5), from highest to lowest:

grep "*" infile | sort -nr -k5 > pos_outfile grep -v "*" infile | sort -nr -k5 > neg_outfile

With just one line, two new files for each class of site, sorted by score.
Why am I doing this? I want to see whether there’s a statistical difference in the distribution of scores for each type of site. The single command “R” takes me into the R statistical environment. I read in my files:

pos <- read.csv("pos_outfile", sep = "\t") neg <- read.csv("neg_outfile", sep = "\t")

My scores are in column 5 or [,5] in R notation. I can generate a quick boxplot:

> boxplot(pos[,5],neg[,5], range = 0, boxwex = 0.3, notch = TRUE, names = c("pos","neg"), col = "grey80")

boxplot.png
Take a look at the scores distribution (min, max, median and quartiles):

> fivenum(pos[,5]) [1] 41.76 72.51 77.83 86.90 99.01 > fivenum(neg[,5]) [1] 27.01 51.21 57.97 65.43 94.08

Or see how the distributions differ with a Mann-Whitney test:

> wilcox.test(pos[,5],neg[,5]) Wilcoxon rank sum test with continuity correction data: pos[, 5] and neg[, 5] W = 4643148, p-value < 2.2e-16

When I’m happy with my analyses in R, I can automate them with a script. If I so choose, I could even automate the entire process in any number of ways: shell scripts, perl wrappers etc., such that sequences go in one end, statistics come out the other and I take a coffee break inbetween.

All this happens through typing in one terminal. No need to load up massive, disk and RAM-guzzling software suites with fancy GUIs. It’s a learning curve for sure but once learned, you have it forever and the time spent pays itself back many-fold with time saved later.
There are literally thousands of tools like this. All available to use because a disparate group of enthusiasts around the world share the belief that software should be free, shared and available to all. So once in a while, take a break from your data analysis and give thanks to the people who make it all possible.

5 thoughts on “Open source = choice = workflow

  1. chris

    I occasionally have epiphanies like this. The truly jarring note, however, is when you’re forced to use a non-*nix box, as I was recently when I was having laptop issues. I had to use a shared windows box for a week, and it was, in a word, awful. The damned thing was useless! No data analysis tools, no text manipulation tools, no damned terminal. About the only saving grace was being able to download PuTTy and log in to a real computer.
    Then again, I suppose I could have done some sequence analysis in word, or some microarray stuff in Excel. Or not.

  2. kevin

    lolz tell me about it. I still can’t imagine how some of my colleagues do web development jus using a putty terminal and vim! (becos they are only familiar with winxp environment but none took the step to try a linux installation. I was equally handicapped when i was stuck with windows with all my customised shell scripts and CLI tools all missing..

  3. Pedro Beltrao

    I think that it is also a matter of getting used to it ;). I work in windows and have been for the past 5 years. I only need to install Perl and bioperl to get going usually. It is true that a lot of the work I do is in clusters (running linux) but i use the clusters mostly to speed it up not because I cannot use the tools on the windows machine. I am just more comfortable in windows so I keep sticking to it. With the rise of Google and all the free online tools available there is progressively less things that still make me prefer windows over linux.

  4. nsaunders Post author

    Yes, it will be interesting to see if online services make our choice of local OS/software irrelevant in a few years.

  5. Deepak

    As Pedro said, the rise of online services has made me miss Linux less these days than I used to a couple of years ago. I get irritated by Windows often, but since I would say that I spend more time in Firefox than in the OS it may not matter. However, the question I sometimes ask myself. If I was on OSX or on Ubuntu, would I try and evade the OS itself so often? If I was still coding, I would almost certainly not be on Windows.

Comments are closed.