Text to fasta and other delights of the shell

One thing I’ve learned in my current job is that some familiarity with Linux tools for processing text files: awk, sed, grep, head/tail, cut/paste and so on, often provides a speedier solution than writing a script in (insert scripting language of choice here). I know this stuff is trivial to shell gurus, but I still get a little buzz out of it. A couple of real-life examples.

  1. Delimited text to fasta
  2. Your collaborator has developed “peptides on a chip” and sends you a text file describing said chip where column 1 = spot number, column 2 = peptide sequence:

    1	ELDQGSLATSF
    2	VDLAATPTDVR
    3	TTNEEYLDLSQ
    4	SSLDVYDGRFL
    5	ALVHSYMTGRR
    

    Your analysis requires that the peptide sequences be in fasta format. Easy, using awk:

    awk '{print ">peptide" $1 "\n" $2}' infile > outfile
    

    Result:

    >peptide1
    ELDQGSLATSF
    >peptide2
    VDLAATPTDVR
    >peptide3
    TTNEEYLDLSQ
    >peptide4
    SSLDVYDGRFL
    >peptide5
    ALVHSYMTGRR
    
  3. Delimited text to sorted files named by field
  4. Output from a program to predict phosphorylation sites for protein kinases contains 6 columns – kinase name, domain number, peptide name, residue position, sequence and score, like so:

    kin2	1	pep2	7	RASTFAG	89.82
    kin2	1	pep1	6	RRTSLGT	82.69
    kin1	1	pep1	6	RAVSMDN	82.22
    kin1	1	pep2	7	RASSDGE	80.06
    kin2	1	pep1	6	RHSSYPA	80.09
    kin1	1	pep2	7	RTTSFAE	82.27
    kin2	1	pep1	6	RAASMDS	84.43
    kin1	1	pep1	6	RRLTFRK	80.51
    kin2	1	pep2	7	RASSADD	84.74
    kin1	1	pep2	7	RRASVAA	84.59
    

    You want to sort the output into separate files in two ways:

    • By kinase: show the best substrates for a kinase
    • By substrate: show the best kinases for a substrate

    Awk to the rescue again, with sort along for the ride. Case 1 – best substrates for a kinase:

    awk '{file = $1; command = "sort -nrk6 >>"; print $0 | command file}' infile
    

    Result: two files named kin1 and kin2, showing substrates for the respective kinases, sorted by score. File kin1 looks like this:

    kin1	1	pep2	7	RRASVAA	84.59
    kin1	1	pep2	7	RTTSFAE	82.27
    kin1	1	pep1	6	RAVSMDN	82.22
    kin1	1	pep1	6	RRLTFRK	80.51
    kin1	1	pep2	7	RASSDGE	80.06
    

    Case 2 – best kinases for a substrate, is as simple as substituting $3 (substrate name) as the output file:

    awk '{file = $3; command = "sort -nrk6 >>"; print $0 | command file}' infile
    

    Result: two files named pep1 and pep2, showing kinases for the respective substrates, sorted as before. Here’s the contents of pep1:

    kin2	1	pep1	6	RAASMDS	84.43
    kin2	1	pep1	6	RRTSLGT	82.69
    kin1	1	pep1	6	RAVSMDN	82.22
    kin1	1	pep1	6	RRLTFRK	80.51
    kin2	1	pep1	6	RHSSYPA	80.09
    
Tags: ,

7 Comments to “Text to fasta and other delights of the shell”

  1. I should really have got to grips with the shell, but I was too in love with perl.

  2. Nothing wrong with that – Perl was invented as “a better shell”. It’s just interesting to me how much can be achieved in a one-line command as opposed to the “overhead” of a small script: open editor, write, save, test, run.

  3. I’m actually quite worried that if I dashed off a shell script I’d do lasting damage… I use BBEdit to code perl, which makes finding gotchas and checking syntax dead simple.

  4. Yeah, developing a little bit of shell-fu has increased the speed at which I can get things done many-fold. One drawback, though, is reproducibility. When I get something working and try to wrap up the process into a nice makefile, trying to grep through my bash history to remember what I did can be frustrating. Even so, I’d never go back.

  5. You must know how to write makefiles before, if you want to start using shell commands like this.

  6. I’ve seen people do smart things in bioinformatics using makefiles and indeed rake. I wouldn’t say they were a prerequisite though.

  7. I’d better stop using the shell immediately, then, since I can’t write makefiles for love or money…

Follow

Get every new post delivered to your Inbox.

Join 2,204 other followers

%d bloggers like this: