What You’re Doing Is Rather Desperate

Notes from the life of a bioinformatics researcher

Text to fasta and other delights of the shell

One thing I’ve learned in my current job is that some familiarity with Linux tools for processing text files: awk, sed, grep, head/tail, cut/paste and so on, often provides a speedier solution than writing a script in (insert scripting language of choice here). I know this stuff is trivial to shell gurus, but I still get a little buzz out of it. A couple of real-life examples.

  1. Delimited text to fasta
  2. Your collaborator has developed “peptides on a chip” and sends you a text file describing said chip where column 1 = spot number, column 2 = peptide sequence:

    1	ELDQGSLATSF
    2	VDLAATPTDVR
    3	TTNEEYLDLSQ
    4	SSLDVYDGRFL
    5	ALVHSYMTGRR
    
    

    Your analysis requires that the peptide sequences be in fasta format. Easy, using awk:

    
    awk '{print ">peptide" $1 "\n" $2}' infile > outfile
    

    Result:

    >peptide1
    ELDQGSLATSF
    >peptide2
    VDLAATPTDVR
    >peptide3
    TTNEEYLDLSQ
    >peptide4
    SSLDVYDGRFL
    >peptide5
    ALVHSYMTGRR
    
    
  3. Delimited text to sorted files named by field
  4. Output from a program to predict phosphorylation sites for protein kinases contains 6 columns – kinase name, domain number, peptide name, residue position, sequence and score, like so:

    kin2	1	pep2	7	RASTFAG	89.82
    kin2	1	pep1	6	RRTSLGT	82.69
    kin1	1	pep1	6	RAVSMDN	82.22
    kin1	1	pep2	7	RASSDGE	80.06
    kin2	1	pep1	6	RHSSYPA	80.09
    kin1	1	pep2	7	RTTSFAE	82.27
    kin2	1	pep1	6	RAASMDS	84.43
    kin1	1	pep1	6	RRLTFRK	80.51
    kin2	1	pep2	7	RASSADD	84.74
    kin1	1	pep2	7	RRASVAA	84.59
    
    

    You want to sort the output into separate files in two ways:

    • By kinase: show the best substrates for a kinase
    • By substrate: show the best kinases for a substrate

    Awk to the rescue again, with sort along for the ride. Case 1 – best substrates for a kinase:

    
    awk '{file = $1; command = "sort -nrk6 >>"; print $0 | command file}' infile
    

    Result: two files named kin1 and kin2, showing substrates for the respective kinases, sorted by score. File kin1 looks like this:

    kin1	1	pep2	7	RRASVAA	84.59
    kin1	1	pep2	7	RTTSFAE	82.27
    kin1	1	pep1	6	RAVSMDN	82.22
    kin1	1	pep1	6	RRLTFRK	80.51
    kin1	1	pep2	7	RASSDGE	80.06
    
    

    Case 2 – best kinases for a substrate, is as simple as substituting $3 (substrate name) as the output file:

    
    awk '{file = $3; command = "sort -nrk6 >>"; print $0 | command file}' infile
    

    Result: two files named pep1 and pep2, showing kinases for the respective substrates, sorted as before. Here’s the contents of pep1:

    kin2	1	pep1	6	RAASMDS	84.43
    kin2	1	pep1	6	RRTSLGT	82.69
    kin1	1	pep1	6	RAVSMDN	82.22
    kin1	1	pep1	6	RRLTFRK	80.51
    kin2	1	pep1	6	RHSSYPA	80.09
    

Written by nsaunders

January 7, 2009 at 1:00 pm

7 Responses

Subscribe to comments with RSS.

  1. I should really have got to grips with the shell, but I was too in love with perl.

    rpg

    January 8, 2009 at 12:27 pm

  2. Nothing wrong with that – Perl was invented as “a better shell”. It’s just interesting to me how much can be achieved in a one-line command as opposed to the “overhead” of a small script: open editor, write, save, test, run.

    nsaunders

    January 8, 2009 at 12:47 pm

  3. I’m actually quite worried that if I dashed off a shell script I’d do lasting damage… I use BBEdit to code perl, which makes finding gotchas and checking syntax dead simple.

    rpg

    January 9, 2009 at 7:03 am

  4. Yeah, developing a little bit of shell-fu has increased the speed at which I can get things done many-fold. One drawback, though, is reproducibility. When I get something working and try to wrap up the process into a nice makefile, trying to grep through my bash history to remember what I did can be frustrating. Even so, I’d never go back.

    Chris

    January 9, 2009 at 7:23 am

  5. You must know how to write makefiles before, if you want to start using shell commands like this.

    dalloliogm

    January 13, 2009 at 1:51 am

  6. I’ve seen people do smart things in bioinformatics using makefiles and indeed rake. I wouldn’t say they were a prerequisite though.

    nsaunders

    January 13, 2009 at 8:51 am

  7. I’d better stop using the shell immediately, then, since I can’t write makefiles for love or money…

    chris

    January 13, 2009 at 1:23 pm


Comments are closed.