One thing I’ve learned in my current job is that some familiarity with Linux tools for processing text files: awk, sed, grep, head/tail, cut/paste and so on, often provides a speedier solution than writing a script in (insert scripting language of choice here). I know this stuff is trivial to shell gurus, but I still get a little buzz out of it. A couple of real-life examples.
- Delimited text to fasta
- Delimited text to sorted files named by field
- By kinase: show the best substrates for a kinase
- By substrate: show the best kinases for a substrate
Your collaborator has developed “peptides on a chip” and sends you a text file describing said chip where column 1 = spot number, column 2 = peptide sequence:
1 ELDQGSLATSF 2 VDLAATPTDVR 3 TTNEEYLDLSQ 4 SSLDVYDGRFL 5 ALVHSYMTGRR
Your analysis requires that the peptide sequences be in fasta format. Easy, using awk:
awk '{print ">peptide" $1 "\n" $2}' infile > outfile
Result:
>peptide1 ELDQGSLATSF >peptide2 VDLAATPTDVR >peptide3 TTNEEYLDLSQ >peptide4 SSLDVYDGRFL >peptide5 ALVHSYMTGRR
Output from a program to predict phosphorylation sites for protein kinases contains 6 columns – kinase name, domain number, peptide name, residue position, sequence and score, like so:
kin2 1 pep2 7 RASTFAG 89.82 kin2 1 pep1 6 RRTSLGT 82.69 kin1 1 pep1 6 RAVSMDN 82.22 kin1 1 pep2 7 RASSDGE 80.06 kin2 1 pep1 6 RHSSYPA 80.09 kin1 1 pep2 7 RTTSFAE 82.27 kin2 1 pep1 6 RAASMDS 84.43 kin1 1 pep1 6 RRLTFRK 80.51 kin2 1 pep2 7 RASSADD 84.74 kin1 1 pep2 7 RRASVAA 84.59
You want to sort the output into separate files in two ways:
Awk to the rescue again, with sort along for the ride. Case 1 – best substrates for a kinase:
awk '{file = $1; command = "sort -nrk6 >>"; print $0 | command file}' infile
Result: two files named kin1 and kin2, showing substrates for the respective kinases, sorted by score. File kin1 looks like this:
kin1 1 pep2 7 RRASVAA 84.59 kin1 1 pep2 7 RTTSFAE 82.27 kin1 1 pep1 6 RAVSMDN 82.22 kin1 1 pep1 6 RRLTFRK 80.51 kin1 1 pep2 7 RASSDGE 80.06
Case 2 – best kinases for a substrate, is as simple as substituting $3 (substrate name) as the output file:
awk '{file = $3; command = "sort -nrk6 >>"; print $0 | command file}' infile
Result: two files named pep1 and pep2, showing kinases for the respective substrates, sorted as before. Here’s the contents of pep1:
kin2 1 pep1 6 RAASMDS 84.43 kin2 1 pep1 6 RRTSLGT 82.69 kin1 1 pep1 6 RAVSMDN 82.22 kin1 1 pep1 6 RRLTFRK 80.51 kin2 1 pep1 6 RHSSYPA 80.09
I should really have got to grips with the shell, but I was too in love with perl.
Nothing wrong with that – Perl was invented as “a better shell”. It’s just interesting to me how much can be achieved in a one-line command as opposed to the “overhead” of a small script: open editor, write, save, test, run.
I’m actually quite worried that if I dashed off a shell script I’d do lasting damage… I use BBEdit to code perl, which makes finding gotchas and checking syntax dead simple.
Yeah, developing a little bit of shell-fu has increased the speed at which I can get things done many-fold. One drawback, though, is reproducibility. When I get something working and try to wrap up the process into a nice makefile, trying to grep through my bash history to remember what I did can be frustrating. Even so, I’d never go back.
You must know how to write makefiles before, if you want to start using shell commands like this.
I’ve seen people do smart things in bioinformatics using makefiles and indeed rake. I wouldn’t say they were a prerequisite though.
I’d better stop using the shell immediately, then, since I can’t write makefiles for love or money…