One thing I’ve learned in my current job is that some familiarity with Linux tools for processing text files: awk, sed, grep, head/tail, cut/paste and so on, often provides a speedier solution than writing a script in (insert scripting language of choice here). I know this stuff is trivial to shell gurus, but I still get a little buzz out of it. A couple of real-life examples.
- Delimited text to fasta
- Delimited text to sorted files named by field
- By kinase: show the best substrates for a kinase
- By substrate: show the best kinases for a substrate
Your collaborator has developed “peptides on a chip” and sends you a text file describing said chip where column 1 = spot number, column 2 = peptide sequence:
1 ELDQGSLATSF 2 VDLAATPTDVR 3 TTNEEYLDLSQ 4 SSLDVYDGRFL 5 ALVHSYMTGRR
Your analysis requires that the peptide sequences be in fasta format. Easy, using awk:
awk '{print ">peptide" $1 "\n" $2}' infile > outfile
Result:
>peptide1 ELDQGSLATSF >peptide2 VDLAATPTDVR >peptide3 TTNEEYLDLSQ >peptide4 SSLDVYDGRFL >peptide5 ALVHSYMTGRR
Output from a program to predict phosphorylation sites for protein kinases contains 6 columns – kinase name, domain number, peptide name, residue position, sequence and score, like so:
kin2 1 pep2 7 RASTFAG 89.82 kin2 1 pep1 6 RRTSLGT 82.69 kin1 1 pep1 6 RAVSMDN 82.22 kin1 1 pep2 7 RASSDGE 80.06 kin2 1 pep1 6 RHSSYPA 80.09 kin1 1 pep2 7 RTTSFAE 82.27 kin2 1 pep1 6 RAASMDS 84.43 kin1 1 pep1 6 RRLTFRK 80.51 kin2 1 pep2 7 RASSADD 84.74 kin1 1 pep2 7 RRASVAA 84.59
You want to sort the output into separate files in two ways:
Awk to the rescue again, with sort along for the ride. Case 1 – best substrates for a kinase:
awk '{file = $1; command = "sort -nrk6 >>"; print $0 | command file}' infile
Result: two files named kin1 and kin2, showing substrates for the respective kinases, sorted by score. File kin1 looks like this:
kin1 1 pep2 7 RRASVAA 84.59 kin1 1 pep2 7 RTTSFAE 82.27 kin1 1 pep1 6 RAVSMDN 82.22 kin1 1 pep1 6 RRLTFRK 80.51 kin1 1 pep2 7 RASSDGE 80.06
Case 2 – best kinases for a substrate, is as simple as substituting $3 (substrate name) as the output file:
awk '{file = $3; command = "sort -nrk6 >>"; print $0 | command file}' infile
Result: two files named pep1 and pep2, showing kinases for the respective substrates, sorted as before. Here’s the contents of pep1:
kin2 1 pep1 6 RAASMDS 84.43 kin2 1 pep1 6 RRTSLGT 82.69 kin1 1 pep1 6 RAVSMDN 82.22 kin1 1 pep1 6 RRLTFRK 80.51 kin2 1 pep1 6 RHSSYPA 80.09


