Sliding windows with (Bio)perl

A sliding window of length N is a common algorithm in sequence analysis. Here’s how to do it the Bioperl way.

The mathematics
Let’s imagine that we have a sequence of length L. For instance if L=5, our sequence might be “ABCDE”. How many windows of size N can we make? Well, if N=1 we have 5 (A, B, C, D, E). N=2 gives 4 (AB, BC, CD and DE). N=3 gives 3 (ABC, BCD, CDE) and so on, up to N=5 which gives 1 (ABCDE). You might see a pattern there. The number of windows with size N for a sequence of length L is (L-N)+1.

The Bioperl
Our starting point then, is a sequence and its length. Bioperl handles sequence input using Bio::SeqIO and Bio::SeqI. Here’s how we read in a bunch of sequences from fasta format and get the sequence ID, sequence and length:

use strict;
use Bio::SeqIO;

my $infile = shift || die ("Usage = windows.pl <fasta file>\n");
my $seqio = Bio::SeqIO->new('-file' => $infile, '-format' => 'fasta');
while(my $seqobj = $seqio->next_seq) {
    my $id  = $seqobj->display_id;
#we don't actually need $seq
    my $seq = $seqobj->seq;
    my $len = $seqobj->length;
                                     }

The Perl
So far, so simple. Now let’s think about how to move a window of length N along our sequence. Our first window starts at position 1 and ends at position 1+N. We slide along one residue to position 2, with the window end at 2+N. We keep doing that as long as window start + N is <= to the sequence length. When window start + N = sequence length, we’ve reached the end of the sequence.
In Perl terms, what we need is a for() loop. The sequence length $len is fixed, the window size $winsize is fixed, so all that we increment is the window start position. Here’s how that looks for a window size of 7:

my $winsize = 7;
for(my $i = 1; $i <= $len-($winsize-1)); $i++) {
    my $window = $seqobj->subseq($i,$i+($winsize-1));
                                               }

A couple of things to note here. First, note that $len-6 inclusive is 7 residues (e.g. 10-6 is positions 4,5,6,7,8,9,10). Second, we use the subseq() method to get our window. The syntax is always $seqobj->subseq(start,end). Note that (start,end) can be equal – e.g. subseq(1,1) gives the first residue of a sequence.

Summary
Well that’s all there is to it. You can do whatever you like with each window – most likely you’ll be calculating a score of some kind. Here’s a summary script, window.pl, showing use of the Getopt module to parse command line options:

use strict;
use Bio::SeqIO;
use Getopt::Long;

my $infile = "";
my $winsize = 7;  #default
GetOptions('infile=s' => \$infile, 'winsize=i' => \$winsize);
die("Usage = window.pl --infile  <fasta file> [--winsize <int>]\n") if(!$infile);

my $seqio = Bio::SeqIO->new('-file' => $infile, '-format' => fasta);
while(my $seqobj = $seqio->next_seq) {
    my $id  = $seqobj->display_id;
    my $len = $seqobj->length;

for(my $i = 1; $i <= $len-($winsize-1)); $i++) {
    my $window = $seqobj->subseq($i,$i+($winsize-1));
# do clever stuff with window
                                               }
                                     }

6 thoughts on “Sliding windows with (Bio)perl

  1. RPM

    A sliding window of length N is a common algorithm in sequence analysis.

    Ziheng Yang (he of PAML fame) gave a talk at SMBE this year where he said sliding windows are only good for giving false positive (but he said it in a much more diplomatic manner). If you bootstrap your data, you have a good chance of finding a significant window by chance. Not sure what precautions can be taken to avoid such statistical problems.

    Note: it appears that he removed the slides concerning sliding windows from the version of the talk he posted on the PAML webpage.

  2. nsaunders Post author

    I imagine things like “moving averages” are quite prone to over-interpretation.

    Worth pointing out that sliding windows have uses other than statistical calculations. For instance, I’m currently looking for N=7 windows that contain [ST] residues at window position 4. Using the above code examples, it’s as simple as:


    for(my $i = 1; $i <= $len-($winsize-1)); $i++) {
    my $window = $seqobj->subseq($i,$i+($winsize-1));
    if(substr($window,3,1) =~ /[ST]/) {
    ## retain window and do stuff
    }
    }

  3. New

    I wonder how can we use the sliding window in perl to average columns overs a sliding window instead of lines??

  4. nsaunders Post author

    There’s a Perl challenge for someone. You could always transpose your columns to rows first, if you wanted to avoid rewriting existing “rows code”.

  5. Norman Warthman

    hi there,

    thank you for posting the code. However, there where 2 tiny glitches. find correct code below

    best

    Norman

    ————————
    use strict;
    use Bio::SeqIO;
    use Getopt::Long;

    my $infile = “”;
    my $winsize = 7; #default
    GetOptions(‘infile=s’ => \$infile, ‘winsize=i’ => \$winsize);
    die(“Usage = window.pl –infile [–winsize ]\n”) if(!$infile);

    my $seqio = Bio::SeqIO->new(‘-file’ => $infile, ‘-format’ => ‘Fasta’);

    while(my $seqobj = $seqio->next_seq) {
    my $id = $seqobj->display_id;
    my $len = $seqobj->length;

    for(my $i = 1; $i subseq($i,$i+($winsize-1));
    # do clever stuff with window

    }
    }
    ——————————–

Comments are closed.