Once you find a computational environment that works for you, your work flows so well that you take the environment for granted and rarely give it a moments thought. That’s as things should be – ultimately your work is more important than how you achieve it. Once in a while though, as I’m typing away in a Linux terminal, I like to pause and reflect on how the wonderful world of open-source makes my working day easier and even more fun than it otherwise M$ight be.
One of my current projects uses some Perl to generate output that looks like this:
O00141 1 Q96PU5 308 42.19 O00141 1 Q96PU5 277 33.42 O14757 1 P30304 76* 89.62 O14757 1 P30304 293* 88.50 O14757 1 P30304 507* 86.99 O14757 1 P30304 178* 85.62 O14757 1 P30304 124* 82.41 O14757 1 P30304 192 78.51 O14757 1 P30304 268 78.26 O14757 1 P30304 249 77.43
Plain text, tab-delimited output is great. Those numbers in column 4 are positions in a protein sequence. You can think of the positions marked with a “*” as “positive” and the rest “negative”. The numbers in column 5 are scores associated with that position.
I’d like to sort this output into two files with scores for the positive and negative sites. How do I get the lines with a positive site?
grep "*" infile > pos_outfile
And the negative? The grep “-v” switch matches the inverse of what I ask for:
grep -v "*" infile > neg_outfile
Wait, there’s more. I can sort my output by score (column 5), from highest to lowest:
grep "*" infile | sort -nr -k5 > pos_outfile grep -v "*" infile | sort -nr -k5 > neg_outfile
With just one line, two new files for each class of site, sorted by score.
Why am I doing this? I want to see whether there’s a statistical difference in the distribution of scores for each type of site. The single command “R” takes me into the R statistical environment. I read in my files:
pos <- read.csv("pos_outfile", sep = "\t") neg <- read.csv("neg_outfile", sep = "\t")
My scores are in column 5 or [,5] in R notation. I can generate a quick boxplot:> boxplot(pos[,5],neg[,5], range = 0, boxwex = 0.3, notch = TRUE, names = c("pos","neg"), col = "grey80")> fivenum(pos[,5])  41.76 72.51 77.83 86.90 99.01 > fivenum(neg[,5])  27.01 51.21 57.97 65.43 94.08
Or see how the distributions differ with a Mann-Whitney test:> wilcox.test(pos[,5],neg[,5]) Wilcoxon rank sum test with continuity correction data: pos[, 5] and neg[, 5] W = 4643148, p-value < 2.2e-16
When I’m happy with my analyses in R, I can automate them with a script. If I so choose, I could even automate the entire process in any number of ways: shell scripts, perl wrappers etc., such that sequences go in one end, statistics come out the other and I take a coffee break inbetween.
All this happens through typing in one terminal. No need to load up massive, disk and RAM-guzzling software suites with fancy GUIs. It’s a learning curve for sure but once learned, you have it forever and the time spent pays itself back many-fold with time saved later.
There are literally thousands of tools like this. All available to use because a disparate group of enthusiasts around the world share the belief that software should be free, shared and available to all. So once in a while, take a break from your data analysis and give thanks to the people who make it all possible.