I’m the “biologist-turned-programmer” type of bioinformatician, which makes me a hacker, not a developer. Most of the day-to-day coding that I do goes something like this:
Colleague: Hey Neil, can you write me a script to read data from file X, do Y to it and output a table in file Z?
Me: Sure… (clickety-click, hackety-hack…) …there you go.
Colleague: Great! Thanks.
I’m a big fan of the Bio* projects and have used them for many years, beginning with Bioperl and, more recently, BioRuby. I’ve always wanted to contribute some code to them, but have never got around to doing so. This week, two thoughts popped into my head:
- How hard can it be?
- There isn’t much introductory documentation for would-be Bio* developers
The answer to the first question is: given some programming experience, not very hard at all. This blog post is my attempt to address the second thought, by writing a step-by-step guide to developing a simple class for the BioRuby library. When I say “beginner’s guide”, I’m referring to myself as much as anyone else.
Note: I use Ubuntu, so this guide is aimed at Linux users.
1. Fork your own copy of BioRuby at GitHub
BioRuby resides at GitHub. Forking is the process by which you obtain your own copy of the code. It’s as simple as creating a GitHub account, going to the BioRuby repository and clicking the button labelled “Fork”. You now have a copy of BioRuby at yourusername/bioruby.
The next step is to get the code to your local machine, where you can work on it. Since my GitHub username is neilfws I created a directory with that name, moved into it and cloned my BioRuby fork like so:
mkdir -p ~/projects/github/neilfws
cd ~/projects/github/neilfws
git clone git@github.com:neilfws/bioruby.git
2. Decide which missing feature you want to implement
I’m rather fond of a software package called STRIDE. It takes a PDB file as input and outputs a file containing information about protein secondary structure. It’s very similar to DSSP, but in my opinion generates nicer output and comes with fewer licensing issues.
3. Use existing code as a guide
One of the best ways to get started when coding is to look at how people wrote code to solve a similar problem. Parsing the output of bioinformatics software is a very common task and in BioRuby, the directory lib/bio/appl contains code for this purpose. The first few subdirectories look like this:
│ ├── bl2seq
│ ├── blast
│ ├── blast.rb
│ ├── blat
│ ├── clustalw
│ ├── clustalw.rb
In some cases, code is split between two locations (e.g. blast.rb and blast/report.rb); in others, there’s a single file either in appl/ or in a subdirectory of appl/.
Examining the directories and files, it appears that the simplest approach for our STRIDE class is to put the code for parsing in a file named report.rb, inside the subdirectory lib/bio/appl/stride.
cd ~/projects/github/neilfws/bioruby/lib
mkdir bio/appl/stride
touch bio/appl/stride/report.rb
All set – off we go. From now on, I’ll be assuming that our working directory is bioruby/lib:
ls
# bio bio.rb
4. Start writing your module
First, we need to modify the file bio.rb, so that it will load our new STRIDE code. Open it in a text editor, scroll down to the section which begins with the comment “### Applications” and at the bottom of that section, add:
autoload :Stride, 'bio/appl/stride/report'
I decided to call the class Stride rather than STRIDE, but that’s not final and easy enough to change later. This line simply loads the Stride class when first called, rather than by default.
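To see what “loads when first called” means in practice, here is a small sketch that is independent of BioRuby. The module name AutoloadDemo and the file stride_demo.rb are made up for this illustration; the only real API used is Ruby’s own autoload and autoload? methods.

```ruby
require 'tmpdir'

# Hypothetical namespace standing in for Bio; not part of BioRuby.
module AutoloadDemo; end

registered_before = nil
registered_after = nil

Dir.mktmpdir do |dir|
  # Write a throwaway file that defines the class we want to load lazily.
  path = File.join(dir, 'stride_demo.rb')
  File.write(path, <<~RUBY)
    module AutoloadDemo
      class Stride
      end
    end
  RUBY

  # Register the constant; nothing is loaded yet.
  AutoloadDemo.autoload(:Stride, path)
  registered_before = AutoloadDemo.autoload?(:Stride) # the file path

  # The first reference to the constant triggers the load.
  klass = AutoloadDemo::Stride
  registered_after = AutoloadDemo.autoload?(:Stride)  # nil once loaded

  puts "before: #{registered_before}"
  puts "loaded: #{klass}, pending autoload: #{registered_after.inspect}"
end
```

The line in bio.rb works the same way, except that the file name is relative and so has to be found via Ruby’s load path.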
That’s all we need to do with bio.rb. Next, open bio/appl/stride/report.rb for editing. The first thing to do is add a header section, which has a standard format:
#
# = bio/appl/stride/report.rb - STRIDE report classes
#
# Copyright:: Copyright (C) 2011
#             Neil Saunders <email@example.com>
# License::   The Ruby License
#
# $Id:$
#
# == Description
#
#
# == Example
#
# == References
#
There’s not much to say about that except that later on, you should complete the Description, Example and References sections. They are described in README_DEV and are required if you want your code to enter the official BioRuby repository.
Now we can start the code for parsing. Ultimately, we want it to be able to do everything that the corresponding Bioperl module, Bio::Structure::SecStr::STRIDE::Res, can do. However, to keep it simple, we’ll start with the following functionality:
- Read in a STRIDE output file
- Calculate the total solvent-accessible area (the totSurfArea method in Bioperl)
First though, we need some sample STRIDE output. I installed STRIDE, downloaded the PDB file 1BG2 and generated STRIDE output, like so:
wget http://www.rcsb.org/pdb/files/1BG2.pdb.gz
gunzip 1BG2.pdb.gz
mkdir ~/projects/github/neilfws/bioruby/test/data/stride
stride 1BG2.pdb > ~/projects/github/neilfws/bioruby/test/data/stride/stride.out
Again, we’re using the standard location in the BioRuby tree for test data.
Without further ado – the Bio::Stride::Report class:
module Bio
  class Stride
    class Report
      attr_reader :tot_surf_area

      def initialize(str)
        @tot_surf_area = 0
        parse_report(str)
      end

      def parse_report(str)
        str.split("\n").each do |line|
          case line
          when /^ASG/
            # column 10 (array index 9) holds the solvent-accessible area
            asg = line.split(/\s+/)
            @tot_surf_area += asg[9].to_f
          end
        end
        @tot_surf_area = nil if @tot_surf_area == 0
      end
      private :parse_report
    end
  end
end
Nothing too difficult here. The first method expects a string and initializes a new object of class Bio::Stride::Report, with the instance variable @tot_surf_area = 0. Then it calls a private method which moves line-by-line through the STRIDE output, looks for lines beginning with ASG, gets solvent-accessible area for each residue from column 10 and adds it to @tot_surf_area. If for some reason the total area = 0 (e.g. because there are no ASG lines), total area is set to nil.
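If you’d like to try the column-10 logic without installing STRIDE, here is a minimal standalone sketch. The two ASG lines are made up for illustration (they imitate the layout of STRIDE output but the values are not from 1BG2), and total_area is a hypothetical helper, not part of BioRuby.

```ruby
# Made-up STRIDE-style lines, for illustration only (values are NOT from 1BG2).
SAMPLE = <<~STRIDE
  REM  example remark line, ignored by the parser
  ASG  ALA A    1    1    C          Coil    360.00    147.41     104.7      XXXX
  ASG  GLY A    2    2    C          Coil    -75.20    160.10      50.3      XXXX
STRIDE

# Hypothetical helper: sum the tenth whitespace-separated field
# (array index 9) of every line that starts with ASG.
# Returns nil when the total is zero, mirroring the class above.
def total_area(text)
  total = 0.0
  text.each_line do |line|
    next unless line.start_with?('ASG')
    total += line.split(/\s+/)[9].to_f
  end
  total.zero? ? nil : total
end

total = total_area(SAMPLE)
no_asg = total_area("REM  nothing to see here\n")

puts "total area: #{total}"
puts "no ASG lines: #{no_asg.inspect}"
```

Splitting on whitespace is simple, but it assumes that no field on an ASG line is ever empty; I haven’t checked whether that holds for every STRIDE file.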
5. Check that your code works
My tip for testing: if you’re using IRB, start the console in bioruby/lib, the directory that contains bio/ and bio.rb. Otherwise, autoload gets terribly confused by relative paths and nothing works.
This is what you should see in the IRB console if everything is working:
require './bio.rb'
=> true
s = Bio::Stride::Report.new(File.open("../test/data/stride/stride.out").read)
=> #<Bio::Stride::Report:0x7f01b7c8df38 @tot_surf_area=15731.6>
s.tot_surf_area
=> 15731.6
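An alternative that may save you from having to start IRB in one particular directory is to put the lib directory on Ruby’s load path yourself. This is only a sketch: the path bioruby/lib relative to the current directory is an assumption about where your checkout lives, and I haven’t tested it against every Ruby version.

```ruby
# Assumed location of the checkout's lib directory, relative to where
# you start Ruby; adjust it to wherever your fork actually lives.
lib_dir = File.expand_path('bioruby/lib', Dir.pwd)

# Put it at the front of the load path, so that relative names such as
# 'bio/appl/stride/report' are looked up there first.
$LOAD_PATH.unshift(lib_dir) unless $LOAD_PATH.include?(lib_dir)

puts "load path now starts with: #{$LOAD_PATH.first}"
```

With that in place, a plain require 'bio' should be resolved against your fork rather than any installed BioRuby gem.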
6. Write some tests
Tests are good practice and are required for your code to be accepted into projects such as BioRuby. Again keeping things simple, here’s a unit test which checks that the total solvent-accessible area calculated for 1BG2 is the expected value. First, create the unit test file, following the layout of the existing unit tests:
mkdir -p ../test/unit/bio/appl/stride
touch ../test/unit/bio/appl/stride/test_report.rb
Then, using the existing unit tests as a guide, edit the file to look like this:
#
# test/unit/bio/appl/stride/test_report.rb - Unit test for Bio::Stride::Report
#
# Copyright:: Copyright (C) 2011 Neil Saunders <firstname.lastname@example.org>
# License::   The Ruby License
#
# $Id:$
#

# loading helper routine for testing bioruby
require 'pathname'
load Pathname.new(File.join(File.dirname(__FILE__), ['..'] * 4,
                            'bioruby_test_helper.rb')).cleanpath.to_s

# libraries needed for the tests
require 'test/unit'
require 'bio/appl/stride/report'

module Bio
  class TestStrideReport < Test::Unit::TestCase

    # read the sample STRIDE output and build a report object from it
    def setup
      data = Pathname.new(File.join(BioRubyTestDataPath, 'stride')).cleanpath.to_s
      report = File.open(File.join(data, 'stride.out')).read
      @obj = Bio::Stride::Report.new(report)
    end

    def test_tot_surf_area
      assert_equal(15731.6, @obj.tot_surf_area)
    end
  end
end
Running that, again assuming that we’re still in bioruby/lib, should give something like this result:
ruby ../test/unit/bio/appl/stride/test_report.rb
# Loaded suite ../test/unit/bio/appl/stride/test_report
# Started
# ..
# Finished in 0.007696 seconds.
# 1 test, 1 assertion, 0 failures, 0 errors
7. Push to your GitHub repository
Terrific – we’ve created our own Bio::Stride::Report class and implemented a method. Time to commit and push our good work back to GitHub:
git add .
git commit -a -m "Initial commit of Bio::Stride::Report class"
git push origin master
8. Add more methods, tests and documentation
At this stage, your BioRuby fork is publicly available for you and anyone else to play around with. However, if you think your code is useful enough for consideration by the main BioRuby project, there’s some more work to do.
First, add more methods. Ideally, we’d like our Stride class to do everything that the Bioperl module can do.
Second, all of those methods require tests and your code in general needs to be of sufficient quality for use by other people.
Third – plenty of documentation, in the form of comments and RDoc markup.
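As a rough idea of what comments with RDoc markup look like, here is a small sketch. AreaAccumulator is a hypothetical class invented for this example, and the comment layout is generic RDoc; README_DEV is the place to check for the exact style that BioRuby expects.

```ruby
# Hypothetical helper class, used only to illustrate RDoc comment markup.
#
# == Example
#
#   acc = AreaAccumulator.new
#   acc.add(104.7)
#   acc.total
class AreaAccumulator
  # Running total of the areas added so far (Float).
  attr_reader :total

  # Creates an accumulator with a total of 0.0.
  def initialize
    @total = 0.0
  end

  # Adds one residue's area to the total.
  #
  # +area+:: a Numeric, or a String that +Float()+ can convert
  #
  # Returns the updated total.
  def add(area)
    @total += Float(area)
  end
end

acc = AreaAccumulator.new
acc.add('1.5')
acc.add(2)
puts acc.total
```

In RDoc, == starts a heading, indented lines are shown verbatim as code, +word+ is set in a code font and label:: creates a labelled list entry.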
9. Check for compliance with BioRuby guidelines
That link again: README_DEV, in the root of the BioRuby repository. The document contains information regarding the “typical coding style” used by BioRuby modules, as well as hints for providing good documentation and examples.
10. Send a pull request
When your code is ready for consideration, you can send a pull request to the BioRuby team. This is as simple as clicking the “Pull Request” button on your GitHub BioRuby fork page. Not something that I’ve done as yet but presumably, they’ll let you know whether they like it or not.
I went from having little confidence in my ability to modify BioRuby to a working class, albeit a simple one, in 2 days. This is testament in large part, I think, to the power of Git + GitHub as a development toolkit, as well as the ease of programming in Ruby.
What next? I hope, given time, to continue developing and eventually pluck up the courage to click that “Pull Request” button. If you like, you can follow along at my BioRuby fork.