Text to fasta and other delights of the shell

One thing I’ve learned in my current job is that some familiarity with Linux tools for processing text files: awk, sed, grep, head/tail, cut/paste and so on, often provides a speedier solution than writing a script in (insert scripting language of choice here). I know this stuff is trivial to shell gurus, but I still get a little buzz out of it. A couple of real-life examples.

  1. Delimited text to fasta
  2. Your collaborator has developed “peptides on a chip” and sends you a text file describing said chip where column 1 = spot number, column 2 = peptide sequence:

    1	ELDQGSLATSF
    2	VDLAATPTDVR
    3	TTNEEYLDLSQ
    4	SSLDVYDGRFL
    5	ALVHSYMTGRR
    

    Your analysis requires that the peptide sequences be in fasta format. Easy, using awk:

    awk '{print ">peptide" $1 "\n" $2}' infile > outfile
    

    Result:

    >peptide1
    ELDQGSLATSF
    >peptide2
    VDLAATPTDVR
    >peptide3
    TTNEEYLDLSQ
    >peptide4
    SSLDVYDGRFL
    >peptide5
    ALVHSYMTGRR
    
  3. Delimited text to sorted files named by field
  4. Output from a program to predict phosphorylation sites for protein kinases contains 6 columns – kinase name, domain number, peptide name, residue position, sequence and score, like so:

    kin2	1	pep2	7	RASTFAG	89.82
    kin2	1	pep1	6	RRTSLGT	82.69
    kin1	1	pep1	6	RAVSMDN	82.22
    kin1	1	pep2	7	RASSDGE	80.06
    kin2	1	pep1	6	RHSSYPA	80.09
    kin1	1	pep2	7	RTTSFAE	82.27
    kin2	1	pep1	6	RAASMDS	84.43
    kin1	1	pep1	6	RRLTFRK	80.51
    kin2	1	pep2	7	RASSADD	84.74
    kin1	1	pep2	7	RRASVAA	84.59
    

    You want to sort the output into separate files in two ways:

    • By kinase: show the best substrates for a kinase
    • By substrate: show the best kinases for a substrate

    Awk to the rescue again, with sort along for the ride. Case 1 – best substrates for a kinase:

    awk '{file = $1; command = "sort -nrk6 >>"; print $0 | command file}' infile
    

    Result: two files named kin1 and kin2, showing substrates for the respective kinases, sorted by score. File kin1 looks like this:

    kin1	1	pep2	7	RRASVAA	84.59
    kin1	1	pep2	7	RTTSFAE	82.27
    kin1	1	pep1	6	RAVSMDN	82.22
    kin1	1	pep1	6	RRLTFRK	80.51
    kin1	1	pep2	7	RASSDGE	80.06
    

    Case 2 – best kinases for a substrate, is as simple as substituting $3 (substrate name) as the output file:

    awk '{file = $3; command = "sort -nrk6 >>"; print $0 | command file}' infile
    

    Result: two files named pep1 and pep2, showing kinases for the respective substrates, sorted as before. Here’s the contents of pep1:

    kin2	1	pep1	6	RAASMDS	84.43
    kin2	1	pep1	6	RRTSLGT	82.69
    kin1	1	pep1	6	RAVSMDN	82.22
    kin1	1	pep1	6	RRLTFRK	80.51
    kin2	1	pep1	6	RHSSYPA	80.09
    

7 thoughts on “Text to fasta and other delights of the shell

  1. nsaunders Post author

    Nothing wrong with that – Perl was invented as “a better shell”. It’s just interesting to me how much can be achieved in a one-line command as opposed to the “overhead” of a small script: open editor, write, save, test, run.

  2. rpg

    I’m actually quite worried that if I dashed off a shell script I’d do lasting damage… I use BBEdit to code perl, which makes finding gotchas and checking syntax dead simple.

  3. Chris

    Yeah, developing a little bit of shell-fu has increased the speed at which I can get things done many-fold. One drawback, though, is reproducibility. When I get something working and try to wrap up the process into a nice makefile, trying to grep through my bash history to remember what I did can be frustrating. Even so, I’d never go back.

  4. nsaunders Post author

    I’ve seen people do smart things in bioinformatics using makefiles and indeed rake. I wouldn’t say they were a prerequisite though.

  5. chris

    I’d better stop using the shell immediately, then, since I can’t write makefiles for love or money…

Comments are closed.