A beginner’s guide to BioRuby development

I’m the “biologist-turned-programmer” type of bioinformatician, which makes me a hacker, not a developer. Most of the day-to-day coding that I do goes something like this:

Colleague: Hey Neil, can you write me a script to read data from file X, do Y to it and output a table in file Z?
Me: Sure… (clickety-click, hackety-hack…) …there you go.
Colleague: Great! Thanks.

I’m a big fan of the Bio* projects and have used them for many years, beginning with BioPerl and more recently, BioRuby. I’ve always wanted to contribute some code to them, but have never got around to doing so. This week, two thoughts popped into my head:

  • How hard can it be?
  • There isn’t much introductory documentation for would-be Bio* developers

The answer to the first question is: given some programming experience, not very hard at all. This blog post is my attempt to address the second thought, by writing a step-by-step guide to developing a simple class for the BioRuby library. When I say “beginner’s guide”, I’m referring to myself as much as anyone else.

Note: I use Ubuntu, so this guide is aimed at Linux users.

1. Fork your own copy of BioRuby at GitHub
BioRuby resides at GitHub. Forking is the process by which you obtain your own copy of the code. It’s as simple as creating a GitHub account, going to the BioRuby repository and clicking the button labelled “Fork”. You now have a copy of BioRuby at yourusername/bioruby.

The next step is to get the code to your local machine, where you can work on it. Since my GitHub username is neilfws I created a directory with that name, moved into it and cloned my BioRuby fork like so:

mkdir -p ~/projects/github/neilfws
cd ~/projects/github/neilfws
git clone git@github.com:neilfws/bioruby.git

2. Decide which missing feature you want to implement
I’m rather fond of a software package called STRIDE. It takes a PDB file as input and outputs a file containing information about protein secondary structure. It’s very similar to DSSP, but in my opinion generates nicer output and comes with fewer licensing issues.

BioPerl comes with code for parsing both STRIDE and DSSP, but BioRuby does not. So my task: begin working on a BioRuby STRIDE class.

3. Use existing code as a guide
One of the best ways to get started when coding is to look at how people wrote code to solve a similar problem. Parsing the output of bioinformatics software is a very common task and in BioRuby, the directory lib/bio/appl contains code for this purpose. The first few subdirectories look like this:


./bio
├── appl
│   ├── bl2seq
│   ├── blast
│   ├── blast.rb
│   ├── blat
│   ├── clustalw
│   ├── clustalw.rb
...

In some cases, code is split between two locations (e.g. blast.rb and blast/report.rb); in others there’s a single file, either in appl/ or in a subdirectory of appl/.

Examining the directories and files, it appears that the simplest approach for our STRIDE class is to put the code for parsing in a file named report.rb, inside the subdirectory lib/bio/appl/stride.

cd ~/projects/github/neilfws/bioruby/lib
mkdir bio/appl/stride
touch bio/appl/stride/report.rb

All set – off we go. From now on, I’ll be assuming that our working directory is bioruby/lib:

ls
# bio bio.rb

4. Start writing your module
First, we need to modify the file bio.rb, so that it will load our new STRIDE code. Open it in a text editor, scroll down to the section which begins with the comment “### Applications” and at the bottom of that section, add:

autoload :Stride, 'bio/appl/stride/report'

I decided to call the class Stride rather than STRIDE, but that’s not final and is easy enough to change later. This line tells Ruby to load the file that defines Stride the first time the Stride constant is referenced, rather than when bio.rb itself is loaded.
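To see what that deferred loading means in practice, here is a minimal sketch that is independent of BioRuby. The Demo module, the LazyDemo class and the lazy_demo.rb file are all hypothetical stand-ins, created in a temporary directory purely for illustration.

```ruby
require 'tmpdir'

# Hypothetical stand-in for a file such as bio/appl/stride/report.rb
dir = Dir.mktmpdir
File.write(File.join(dir, 'lazy_demo.rb'), <<~RUBY)
  module Demo
    class LazyDemo; end
  end
RUBY
$LOAD_PATH.unshift(dir)

module Demo
  # Register the constant; the file is not read at this point
  autoload :LazyDemo, 'lazy_demo'
end

pending_before = Demo.autoload?(:LazyDemo)  # path is still registered, file not loaded
klass          = Demo::LazyDemo             # first reference triggers the load
pending_after  = Demo.autoload?(:LazyDemo)  # nil once the constant is defined

puts pending_before.inspect
puts klass
puts pending_after.inspect
```

The same mechanism is why adding a class to bio.rb in this way should not slow down loading of the library for people who never touch it.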

That’s all we need to do with bio.rb. Next, open bio/appl/stride/report.rb for editing. The first thing to do is add a header section, which has a standard format:

#
# = bio/appl/stride/report.rb - STRIDE report classes
#
# Copyright::  Copyright (C) 2011
#              Neil Saunders <neilfws@somewhere.org>
# License::    The Ruby License
#
#  $Id:$
#
# == Description
#
#
# == Example
#
# == References
#

There’s not much to say about that except that later on, you should complete the Description, Example and References sections. They are described in README_DEV and are required if you want your code to enter the official BioRuby repository.

Now we can start the code for parsing. Ultimately, we want it to be able to do everything that the corresponding BioPerl module, Bio::Structure::SecStr::STRIDE::Res, can do. However, to keep it simple, we’ll start with the following functionality:

  • Read in a STRIDE output file
  • Calculate the total solvent-accessible area (the totSurfArea method in BioPerl)

First though, we need some sample STRIDE output. I installed STRIDE, downloaded the PDB file 1BG2 and generated STRIDE output, like so:

wget http://www.rcsb.org/pdb/files/1BG2.pdb.gz
gunzip 1BG2.pdb.gz
mkdir ~/projects/github/neilfws/bioruby/test/data/stride
stride 1BG2.pdb > ~/projects/github/neilfws/bioruby/test/data/stride/stride.out

Again, we’re using the standard location in the BioRuby tree for test data.

Without further ado – the Bio::Stride::Report class:

module Bio
  class Stride
    class Report
      attr_reader :tot_surf_area

      def initialize(str)
        @tot_surf_area = 0
        parse_report(str)
      end

      def parse_report(str)
        str.split("\n").each do |line|
          case line
          when /^ASG/
            asg = line.split(/\s+/)
            @tot_surf_area += asg[9].to_f
          end
        end
        @tot_surf_area = nil if @tot_surf_area == 0
      end
      private :parse_report

    end
  end
end

Nothing too difficult here. The first method expects a string and initializes a new object of class Bio::Stride::Report, with the instance variable @tot_surf_area set to 0. Then it calls a private method which moves line-by-line through the STRIDE output, looks for lines beginning with ASG, gets the solvent-accessible area for each residue from column 10 (index 9 after splitting on whitespace) and adds it to @tot_surf_area. If for some reason the total area is 0 (e.g. because there are no ASG lines), the total area is set to nil.
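The field lookup can be tried out on its own. This is a small sketch using two made-up ASG-style lines; the values and the XXXX identifier are invented for illustration and are not taken from the 1BG2 output.

```ruby
# Made-up ASG-style lines for illustration only (not real STRIDE results)
sample = <<~STRIDE
  REM  lines that do not start with ASG are ignored
  ASG  MET A    1    1    C          Coil    360.00    150.00     100.5      XXXX
  ASG  ALA A    2    2    H    AlphaHelix    -60.00    -45.00      20.25     XXXX
STRIDE

# Keep the ASG lines, split each on whitespace, take the 10th field (index 9)
areas = sample.each_line.grep(/^ASG/).map { |line| line.split(/\s+/)[9].to_f }
total = areas.sum

puts areas.inspect  # [100.5, 20.25]
puts total          # 120.75
```

Splitting on whitespace only works while every field on the line is non-empty; a parser based on fixed column positions would be an alternative if that assumption turned out not to hold.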

5. Check that your code works
My tip for testing: if you’re using IRB, start the console in bioruby/lib, the directory that contains bio/ and bio.rb. Otherwise, the relative paths passed to autoload (such as bio/appl/stride/report) can’t be resolved and nothing works.

This is what you should see in the IRB console if everything is working:

require './bio.rb'
=> true
s = Bio::Stride::Report.new(File.open("../test/data/stride/stride.out").read)
=> #<Bio::Stride::Report:0x7f01b7c8df38 @tot_surf_area=15731.6>
s.tot_surf_area
=> 15731.6
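If starting IRB in bioruby/lib is inconvenient, one possible workaround is to put that directory on Ruby’s load path explicitly. This is only a sketch: the path below assumes the fork was cloned to the location used earlier in this guide, so adjust it to suit.

```ruby
# Assumption: this matches where the fork was cloned earlier in the guide
lib = File.expand_path('~/projects/github/neilfws/bioruby/lib')

# Put the directory at the front of the load path, once
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)

# require 'bio'  # should now find bio.rb and the autoload paths from any directory
```

Starting the console with irb -I followed by the same directory should have a similar effect.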

6. Write some tests
Tests are good practice and are required for your code to be accepted into projects such as BioRuby. Again keeping things simple, here’s a unit test which checks that the total solvent-accessible area calculated for 1BG2 is the expected value. First, create the unit test file:

mkdir -p ../test/unit/bio/appl/stride
touch ../test/unit/bio/appl/stride/test_report.rb

Then, using the existing unit tests as a guide, edit the file to look like this:

#
# test/unit/bio/appl/stride/test_report.rb - Unit test for Bio::Stride::Report
#
# Copyright:: Copyright (C) 2011 Neil Saunders <neilfws@gmail.com>
# License::   The Ruby License
#
#  $Id:$
#

# loading helper routine for testing bioruby
require 'pathname'
load Pathname.new(File.join(File.dirname(__FILE__), ['..'] * 4,
                            'bioruby_test_helper.rb')).cleanpath.to_s

# libraries needed for the tests
require 'test/unit'
require 'bio/appl/stride/report'


module Bio
  class TestStrideReport < Test::Unit::TestCase

    def setup
      data    = Pathname.new(File.join(BioRubyTestDataPath, 'stride')).cleanpath.to_s
      report  = File.open(File.join(data, 'stride.out')).read
      @obj    = Bio::Stride::Report.new(report)
    end

    def test_tot_surf_area
      assert_equal(15731.6, @obj.tot_surf_area)
    end

  end
end

Running that, again assuming that we’re still in bioruby/lib, should give something like this result:

ruby ../test/unit/bio/appl/stride/test_report.rb 
# Loaded suite ../test/unit/bio/appl/stride/test_report
# Started
# ..
# Finished in 0.007696 seconds.

# 1 test, 1 assertion, 0 failures, 0 errors
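One thing to be aware of: the test compares floating-point numbers with assert_equal. That passed here, but a total built up by adding many floats can pick up small rounding errors, and Test::Unit provides assert_in_delta for that situation. The idea, sketched in plain Ruby with an arbitrary tolerance of 1e-9:

```ruby
sum = 0.1 + 0.2

exact_match  = (sum == 0.3)             # false: binary rounding error
within_delta = (sum - 0.3).abs <= 1e-9  # true: what a delta-based assertion checks

puts exact_match
puts within_delta
```

In the unit test above, the equivalent would be something like assert_in_delta(15731.6, @obj.tot_surf_area, 0.01), where the 0.01 tolerance is my assumption rather than a BioRuby convention.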

7. Push to your GitHub repository
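Terrific! We’ve created our own Bio::Stride::Report class and implemented a method. Time to commit and push our good work back to GitHub. We have been working in bioruby/lib, so move up to the repository root first; otherwise git add . would skip the new files under test/:

cd ..
git add .
git commit -a -m "Initial commit of Bio::Stride::Report class"
git push origin master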

8. Add more methods, tests and documentation
At this stage, your BioRuby fork is publicly available for you and anyone else to play around with. However, if you think your code is useful enough for consideration by the main BioRuby project, there’s some more work to do.

First, add more methods. Ideally, we’d like our Stride class to do everything that the BioPerl module can do.

Second, all of those methods require tests and your code in general needs to be of sufficient quality for use by other people.

Third – plenty of documentation, in the form of comments and RDoc markup.
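As a rough idea of what documented code might look like, here is a sketch with RDoc-style comments. DocumentedReport is a hypothetical, simplified class written only for this illustration, and the comment layout is my approximation of common RDoc usage; README_DEV is the place to check what BioRuby actually expects.

```ruby
# Hypothetical, simplified class used only to illustrate RDoc-style comments.
class DocumentedReport
  # Total solvent-accessible area summed over all ASG lines (Float),
  # or nil if the report contained no ASG lines.
  attr_reader :tot_surf_area

  # Creates a new report object from STRIDE output.
  # ---
  # *Arguments*:
  # * (required) _str_: String containing the full STRIDE output
  # *Returns*:: DocumentedReport object
  def initialize(str)
    areas = str.each_line.grep(/^ASG/).map { |l| l.split(/\s+/)[9].to_f }
    @tot_surf_area = areas.empty? ? nil : areas.sum
  end
end
```

Running rdoc over a file like this should turn the comments into browsable HTML documentation.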

9. Check for compliance with BioRuby guidelines
That link again: README_DEV, in the root of the BioRuby repository. The document contains information regarding the “typical coding style” used by BioRuby modules, as well as hints for providing good documentation and examples.

10. Send a pull request
When your code is ready for consideration, you can send a pull request to the BioRuby team. This is as simple as clicking the “Pull Request” button on your GitHub BioRuby fork page. Not something that I’ve done as yet but presumably, they’ll let you know whether they like it or not.

Summary
I went from having little confidence in my ability to modify BioRuby to a working class, albeit a simple one, in 2 days. This is testament in large part, I think, to the power of Git + GitHub as a development toolkit, as well as the ease of programming in Ruby.

What next? I hope, given time, to continue developing and eventually pluck up the courage to click that “Pull Request” button. If you like, you can follow along at my BioRuby fork.

8 thoughts on “A beginner’s guide to BioRuby development”

  1. Great post Neil. The more these scripts are codified into libraries the less repetition of code there is. Can I make a suggestion? Instead of making your commits on your master branch create a topic branch related to the feature you want to add. These topic branches make it simpler to see what the new commits are and then merge them in. There is a really nice description of this – http://bit.ly/ou67iq.

    I think another benefit of open source libraries is that it’s much easier to improve on them once formally codified. This is not always the case with quick scripts.

  2. Thanks for the great write-up.
    A future follow-up post might cover how to merge in the latest changes from the main bioruby repo to keep your fork up-to-date.

  3. Great post; this is something that is applicable to any of the Bio* projects. BioPerl, Biopython, and even BioSQL are also on GitHub, so one can follow the same path for any of those.

  4. For extending BioRuby to be able to deal with new types of programs, perhaps it would be a better idea to create a BioRuby gem
    https://github.com/helios/bioruby-gem

    This is instead of attempting to make your changes directly to the official code repository via pull requests. Given that bioruby-gem is designed exactly for this purpose (and is officially supported), things are a bit more set up for you already, and people can use your code even if it never makes it into an official bioruby release.
    ben
