Once in a while, you need a “random” sequence. For instance, you might perform a simple statistical analysis on a bunch of sequences from a database and decide that it looks interesting. However, if the analysis looks the same when performed on a set of completely random sequences, it’s a fair bet that it isn’t interesting at all.
The EMBOSS package includes a couple of utilities called makeprotseq and makenucseq, for generating random protein or nucleotide sequences, respectively.
One drawback to these programs is that whilst you can generate multiple sequences using the -amount switch, they have to be the same length. This is easily fixed by writing a small perl wrapper to the program and using Perl’s rand() function.
First, here’s how we run makeprotseq:
makeprotseq -amount 1 -length 100 -outseq stdout [-pepstatsfile myfile] -auto
This says: make 1 sequence of length 100 residues, print it to the terminal (-outseq stdout) and don’t prompt me for any more options (-auto). We can also supply the output from another EMBOSS program, pepstats. For instance if we want our random sequences to resemble human proteins, we might download a fasta file containing all known human protein sequences from a database, run it through pepstats and supply that file to makeprotseq.
Now: suppose we want to generate 10 000 sequences with lengths between 100 and 200. Here’s a quick and dirty perl script to do that:
#!/usr/bin/perl -w
use strict;
use Getopt::Long;
my $numseq = 1;
my $min = 0;
my $max = 0;
GetOptions('numseq=i' => \$numseq, 'min=i' => \$min, 'max=i' => \$max);
if($min == 0 or $max == 0 or $max < $min) {
die("Usage = randseq.pl --numseq X --min Y --max Z > outfile\n\twhere num. sequences X >= 1, length Y < Z\n");
}
$max += 1;
for(my $i = 1; $i <= $numseq; $i++) {
my $length = $min + int(rand($max - $min));
system("makeprotseq -amount 1 -length $length -outseq stdout -auto");
}
In this example we’re using the Getopt::Long module to read 3 options from the command line: –numseq, the number of sequences to create; –min, the minimum sequence length and –max, the maximum sequence length. If $min = 0, $max = 0 or $max is less than $min, the script dies with a message explaining the correct usage.
Next, we add 1 to $max. Why? This relates to the rand() function. In Perl, rand() with no qualifiers generates a random number between 0 and 1. If you use rand EXPR, you get a number between 0 and less than EXPR. The maximum value of rand(200), for example, is 199.9999…. So in our script we add 1 and now $max = 201, the maximum value of rand($max) is 200.9999…. and we pass that to int() to return the integer part. If $min were 100, we can have $length from 100 (100 + 0) to 200 (100 + (200-100)).
The number of sequences generated is taken care of using a for() loop with $numseq as the index: remember that makeprotseq can only create multiple sequences with identical length, so we just ask it to make one sequence $numseq times, each with a different $length. In the loop we run a system() call to makeprotseq, passing $length to the -length switch and printing the sequence(s) to the terminal or to a file using “> outfile”.
There’s one drawback: all of the IDs in our fasta sequence file will be the same (>EMBOSS_001, by default). For many analyses this doesn’t matter. However, if your downstream analyses require unique sequence IDs you could easily fix this up by capturing the fasta output as a Bioperl Bio::Seq object and re-writing the ID using a loop. I’ll leave that as “an exercise for the reader”.


