HOWTO: generate random sequences using EMBOSS and Perl

Once in a while, you need a “random” sequence.  For instance, you might perform a simple statistical analysis on a bunch of sequences from a database and decide that it looks interesting.  However, if the analysis looks the same when performed on a set of completely random sequences, it’s a fair bet that it isn’t interesting at all.

The EMBOSS package includes a couple of utilities called makeprotseq and makenucseq, for generating random protein or nucleotide sequences, respectively.

One drawback to these programs is that whilst you can generate multiple sequences using the -amount switch, they have to be the same length.  This is easily fixed by writing a small perl wrapper to the program and using Perl’s rand() function.

First, here’s how we run makeprotseq:

makeprotseq -amount 1 -length 100 -outseq stdout [-pepstatsfile myfile] -auto

This says:  make 1 sequence of length 100 residues, print it to the terminal (-outseq stdout) and don’t prompt me for any more options (-auto).  We can also supply the output from another EMBOSS program, pepstats.  For instance if we want our random sequences to resemble human proteins, we might download a fasta file containing all known human protein sequences from a database, run it through pepstats and supply that file to makeprotseq.

Now:  suppose we want to generate 10 000 sequences with lengths between 100 and 200.  Here’s a quick and dirty perl script to do that:

#!/usr/bin/perl -w
use strict;
use Getopt::Long;

my $numseq = 1;
my $min    = 0;
my $max    = 0;

GetOptions('numseq=i' => \$numseq, 'min=i' => \$min, 'max=i' => \$max);

if($min == 0 or $max == 0 or $max < $min) {
    die("Usage = --numseq X --min Y --max Z > outfile\n\twhere num. sequences X >= 1, length Y < Z\n");
$max += 1;

for(my $i = 1; $i <= $numseq; $i++) {
    my $length = $min + int(rand($max - $min));
    system("makeprotseq -amount 1 -length $length -outseq stdout -auto");

In this example we’re using the Getopt::Long module to read 3 options from the command line:  –numseq, the number of sequences to create; –min, the minimum sequence length and –max, the maximum sequence length.  If $min = 0, $max = 0 or $max is less than $min, the script dies with a message explaining the correct usage.

Next, we add 1 to $max.  Why?  This relates to the rand() function.  In Perl, rand() with no qualifiers generates a random number between 0 and 1.  If you use rand EXPR, you get a number between 0 and less than EXPR.  The maximum value of rand(200), for example, is 199.9999….  So in our script we add 1 and now $max = 201, the maximum value of rand($max) is 200.9999…. and we pass that to int() to return the integer part.  If $min were 100, we can have $length from 100 (100 + 0) to 200 (100 + (200-100)).

The number of sequences generated is taken care of using a for() loop with $numseq as the index:  remember that makeprotseq can only create multiple sequences with identical length, so we just ask it to make one sequence $numseq times, each with a different $length.  In the loop we run a system() call to makeprotseq, passing $length to the -length switch and printing the sequence(s) to the terminal or to a file using “> outfile”.

There’s one drawback:  all of the IDs in our fasta sequence file will be the same (>EMBOSS_001, by default).  For many analyses this doesn’t matter.  However, if your downstream analyses require unique sequence IDs you could easily fix this up by capturing the fasta output as a Bioperl Bio::Seq object and re-writing the ID using a loop.  I’ll leave that as “an exercise for the reader”.

3 thoughts on “HOWTO: generate random sequences using EMBOSS and Perl

  1. Here’s a variation using bioperl-run and Bio::Factory::EMBOSS which doesn’t use files (add use statements as necessary):

    Erg! No edit button! Okay, with code tags:

    $max += 1;
    # same as Neil's to above line; new stuff below
    use Bio::Factory::EMBOSS;
    use Bio::SeqIO;
    # get an EMBOSS factory - bioperl-run
    my $f = Bio::Factory::EMBOSS->new();
    my $out = Bio::SeqIO->new(-format => 'fasta',
    -fh => \*STDOUT);
    # get an EMBOSS application object from the factory
    my $prog = $f->program('makeprotseq');
    for(my $i = 1; $i run({-amount => 1,
    -length => $length,
    -outseq => 'raw::stdout',});
    $seqstr =~ s{\s}{}g; # get rid of any newlines/spaces
    my $seqobj = Bio::PrimarySeq->new(-alphabet => 'protein',
    -seq => $seqstr,
    -display_id => "seq_$i");

  2. Neil if someone is connected to internet when generating the sequence, I would recommend them to visit and register [ ]. This service allows to generate true-random-numbers:
    “QRBG is a fast, nondeterministic and novel random number generator whose
    randomness relies on intrinsic randomness of the quantum physical process of
    photonic emission in semiconductors and subsequent detection by the photo-
    electric e ect” [ ]
    Once they are registered, they can download the qrand [ extract the ] program and use it via the console to get a random sequence of numbers [they have to make changes in the qrbg.ini to reflect their username and password]. Now these number can go as array index for DNA [ Nucleotides ] or Protein [ Amino Acids ] array.

Comments are closed.