Text to fasta and other delights of the shell
One thing I’ve learned in my current job is that some familiarity with Linux tools for processing text files: awk, sed, grep, head/tail, cut/paste and so on, often provides a speedier solution than writing a script in (insert scripting language of choice here). I know this stuff is trivial to shell gurus, but I still get a little buzz out of it. A couple of real-life examples.
- Delimited text to fasta
- Delimited text to sorted files named by field
- By kinase: show the best substrates for a kinase
- By substrate: show the best kinases for a substrate
Your collaborator has developed “peptides on a chip” and sends you a text file describing said chip where column 1 = spot number, column 2 = peptide sequence:
1 ELDQGSLATSF 2 VDLAATPTDVR 3 TTNEEYLDLSQ 4 SSLDVYDGRFL 5 ALVHSYMTGRR
Your analysis requires that the peptide sequences be in fasta format. Easy, using awk:
awk '{print ">peptide" $1 "\n" $2}' infile > outfile
Result:
>peptide1 ELDQGSLATSF >peptide2 VDLAATPTDVR >peptide3 TTNEEYLDLSQ >peptide4 SSLDVYDGRFL >peptide5 ALVHSYMTGRR
Output from a program to predict phosphorylation sites for protein kinases contains 6 columns – kinase name, domain number, peptide name, residue position, sequence and score, like so:
kin2 1 pep2 7 RASTFAG 89.82 kin2 1 pep1 6 RRTSLGT 82.69 kin1 1 pep1 6 RAVSMDN 82.22 kin1 1 pep2 7 RASSDGE 80.06 kin2 1 pep1 6 RHSSYPA 80.09 kin1 1 pep2 7 RTTSFAE 82.27 kin2 1 pep1 6 RAASMDS 84.43 kin1 1 pep1 6 RRLTFRK 80.51 kin2 1 pep2 7 RASSADD 84.74 kin1 1 pep2 7 RRASVAA 84.59
You want to sort the output into separate files in two ways:
Awk to the rescue again, with sort along for the ride. Case 1 – best substrates for a kinase:
awk '{file = $1; command = "sort -nrk6 >>"; print $0 | command file}' infile
Result: two files named kin1 and kin2, showing substrates for the respective kinases, sorted by score. File kin1 looks like this:
kin1 1 pep2 7 RRASVAA 84.59 kin1 1 pep2 7 RTTSFAE 82.27 kin1 1 pep1 6 RAVSMDN 82.22 kin1 1 pep1 6 RRLTFRK 80.51 kin1 1 pep2 7 RASSDGE 80.06
Case 2 – best kinases for a substrate, is as simple as substituting $3 (substrate name) as the output file:
awk '{file = $3; command = "sort -nrk6 >>"; print $0 | command file}' infile
Result: two files named pep1 and pep2, showing kinases for the respective substrates, sorted as before. Here’s the contents of pep1:
kin2 1 pep1 6 RAASMDS 84.43 kin2 1 pep1 6 RRTSLGT 82.69 kin1 1 pep1 6 RAVSMDN 82.22 kin1 1 pep1 6 RRLTFRK 80.51 kin2 1 pep1 6 RHSSYPA 80.09



I should really have got to grips with the shell, but I was too in love with perl.
rpg
January 8, 2009 at 12:27 pm
Nothing wrong with that – Perl was invented as “a better shell”. It’s just interesting to me how much can be achieved in a one-line command as opposed to the “overhead” of a small script: open editor, write, save, test, run.
nsaunders
January 8, 2009 at 12:47 pm
I’m actually quite worried that if I dashed off a shell script I’d do lasting damage… I use BBEdit to code perl, which makes finding gotchas and checking syntax dead simple.
rpg
January 9, 2009 at 7:03 am
Yeah, developing a little bit of shell-fu has increased the speed at which I can get things done many-fold. One drawback, though, is reproducibility. When I get something working and try to wrap up the process into a nice makefile, trying to grep through my bash history to remember what I did can be frustrating. Even so, I’d never go back.
Chris
January 9, 2009 at 7:23 am
You must know how to write makefiles before, if you want to start using shell commands like this.
dalloliogm
January 13, 2009 at 1:51 am
I’ve seen people do smart things in bioinformatics using makefiles and indeed rake. I wouldn’t say they were a prerequisite though.
nsaunders
January 13, 2009 at 8:51 am
I’d better stop using the shell immediately, then, since I can’t write makefiles for love or money…
chris
January 13, 2009 at 1:23 pm