Archive for ‘computing’

March 27, 2013

Git for bioinformaticians at the Bioinformatics FOAM meeting

Last week, I attended the annual Computational and Simulation Sciences and eResearch Conference, hosted by CSIRO in Melbourne. The meeting includes a workshop that we call Bioinformatics FOAM (Focus On Analytical Methods). This year it was run over 2.5 days (up from the previous 1.5 by popular request); one day for internal CSIRO stuff and the rest open to external participants.

I had the pleasure of giving a brief presentation on the use of Git in bioinformatics. Nothing startling; aimed squarely at bioinformaticians who may have heard of version control in general and Git in particular but who are yet to employ either. I’m excited because for once I am free to share, resulting in my first upload to Slideshare in almost 4.5 years. You can view it here, or at the Australian Bioinformatics Network Slideshare, or in the embed below.

See the slides…

April 24, 2012

Redmine + Gitolite integration

I’m a big fan of both Redmine, the project management web application, and Git, the distributed version control system.

Recently, I learned that it’s possible to integrate Git into Redmine so that Git repositories for a project can be created via the Redmine web interface. This is done using plugins which connect Redmine with Git hosting software: either gitosis or, more recently, gitolite.

Unfortunately, this is a deeply confusing process for novices like myself. There are multiple forks of the plugins, long threads in the Redmine forums that discuss various hacks/tweaks to make things work, and no single authoritative source of documentation. After much experimentation, this is what worked for me. I can’t guarantee success for you.

Read the rest…

May 18, 2011

How to: create a partial UCSC genome MySQL database

File under: simple, but a useful reminder

UCSC Genome Bioinformatics is one of the go-to locations for genomic data. They are also kind enough to provide access to their MySQL database server:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A

However, users are given fair warning to “avoid excessive or heavy queries that may impact the server performance.” It’s not clear what constitutes excessive or heavy but if you’re in any doubt, it’s easy to create your own databases locally. It’s also easy to create only the tables that you require, as and when you need them.

As an example, here’s how you could create only the ensGene table for the latest hg19 database. Here, USER and PASSWD represent a local MySQL user and password with full privileges:

# create database
mysql -u USER -pPASSWD -e 'create database hg19'
# obtain table schema
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ensGene.sql
# create table
mysql -u USER -pPASSWD hg19 < ensGene.sql
# obtain and import table data
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ensGene.txt.gz
gunzip ensGene.txt.gz
mysqlimport -u USER -pPASSWD --local hg19 ensGene.txt

It’s very easy to automate this kind of process using shell scripts. All you need to know is the base URL for the data, http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ and that there are two files with the same prefix per table: one for the schema (*.sql) and one with the data (*.txt.gz).
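
As a rough sketch of that automation, the steps above can be wrapped in a couple of shell functions. The function names and the DB_USER / DB_PASS variables are my own placeholders, not anything UCSC provides, and the sketch assumes that the local database already exists and that wget and the MySQL client tools are installed.

```shell
# Base URL pattern as described above; everything after it is built per table.
BASE_URL="http://hgdownload.cse.ucsc.edu/goldenPath"

# Print the schema (*.sql) and data (*.txt.gz) URLs for one table, one per line.
ucsc_table_urls() {
    local db="$1" table="$2"
    printf '%s/%s/database/%s.sql\n'    "$BASE_URL" "$db" "$table"
    printf '%s/%s/database/%s.txt.gz\n' "$BASE_URL" "$db" "$table"
}

# Download both files for one table, then create and populate the table.
# DB_USER and DB_PASS are hypothetical variables holding local credentials.
load_ucsc_table() {
    local db="$1" table="$2" url
    for url in $(ucsc_table_urls "$db" "$table"); do
        wget -q "$url" || return 1
    done
    mysql -u "$DB_USER" -p"$DB_PASS" "$db" < "$table.sql" || return 1
    gunzip -f "$table.txt.gz" || return 1
    mysqlimport -u "$DB_USER" -p"$DB_PASS" --local "$db" "$table.txt"
}

# Example (not run here): load two tables into a local hg19 database.
# for t in ensGene refGene; do load_ucsc_table hg19 "$t"; done
```

Keeping the URL construction in its own function means you can check what would be downloaded before touching the network or the database.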

April 19, 2011

Why can’t PubMed or academic journals get the basics right?

A recent question at BioStar asked “Is the NAR database list available in a computer readable format?” The short answer is “no” and Pierre has done some excellent preliminary work to address the issue.

I’ve been working on a database and web application to check the associated URLs but quite frankly, this is tedious, a waste of everyone’s time and could be entirely avoided if the publishing industry did a better job. All that’s required is that either NAR or PubMed provide structured data – XML, Medline format, I don’t care what – containing a field that looks something like this:

URL    http://a.valid.url.goes.here

That way, we could all avoid writing regular expressions to detect URLs in abstracts. No wait – to detect broken URLs in abstracts. You would not believe how many of them look like this:

URL    http://www.amaze.ulb. ac.be/
                            ^

Someone helpfully informed me via Twitter that this is “often a result of typesetting.” Thanks for that.
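
For what it’s worth, this particular kind of breakage (a URL that stops at a dot and resumes after a space) can be flagged with a one-line pattern. This is only a heuristic sketch: the function name and the regular expression are my own, and it will miss other kinds of damage and may occasionally flag a URL that legitimately ends a sentence.

```shell
# Print input lines containing a URL that ends in a dot, followed by
# whitespace and then something that looks like the rest of a hostname
# (lowercase letters followed by "." or "/"), e.g. "http://www.amaze.ulb. ac.be/".
find_split_urls() {
    grep -E 'https?://[^[:space:]]*\.[[:space:]]+[[:lower:]]{2,}[./]'
}

# Example: pipe abstracts (one per line) through the filter.
# find_split_urls < abstracts.txt
```

A structured URL field would, of course, make this unnecessary.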

April 15, 2011

R 2.12 to 2.13 package upgrade

If you:

  • use Linux
  • have just upgraded your R installation from 2.12 to 2.13
  • installed some/all of your packages in your home area (e.g. ~/R/i486-pc-linux-gnu-library/2.12) and…
  • …are wondering why R can’t see them any more

just do this:

# at a shell prompt
cp -r ~/R/i486-pc-linux-gnu-library/2.12 ~/R/i486-pc-linux-gnu-library/2.13
# in R console
update.packages(checkBuilt=TRUE, ask=FALSE)
# back to the shell
rm -rf ~/R/i486-pc-linux-gnu-library/2.12

update: corrected a typo; of course you need “cp -r”
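
One thing to watch for: if the 2.13 directory already exists (say, because you have already installed a package under the new version), cp -r will place the old directory inside it rather than alongside it. A small guard avoids that. This is a sketch with a function name of my own choosing; it only wraps the copy step, so the update.packages() call in R is still needed afterwards.

```shell
# Copy a per-user R library to the directory for the new R version,
# refusing to run if the source is missing or the target already exists.
copy_r_library() {
    local old="$1" new="$2"
    if [ ! -d "$old" ]; then
        echo "no such library: $old" >&2
        return 1
    fi
    if [ -e "$new" ]; then
        echo "target already exists: $new" >&2
        return 1
    fi
    cp -r "$old" "$new"
}

# Example (not run here):
# copy_r_library ~/R/i486-pc-linux-gnu-library/2.12 ~/R/i486-pc-linux-gnu-library/2.13
```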

April 8, 2011

Fixing aberrant files using R and the shell: a case study

Once in a while, you embark on what looks like a simple computational procedure only to encounter frustration very early on. “I can’t even read my file into R!” you cry.

Step back, take a deep breath and take note of what the software is trying to tell you. Most times, you’ve just missed something very straightforward. Here’s an example.

Update: this post is not about how best to perform the task; it’s about how to cope with frustration. Please stop sending me your solutions :-)

March 4, 2011

Ruby Version Manager: the absolute basics

Confession: I’m still using Ruby version 1.8.7, from the Ubuntu 10.10 repository. Why? I have a lot of working code, I don’t know how well it works using Ruby 1.9 and I’m worried that migration will break things and make me miserable.

Various people have pointed me to RVM – Ruby Version Manager. As the name suggests, it allows you to manage multiple Ruby versions on your machine. Today, I needed to test an application written for Ruby 1.9.2, so I used RVM for the first time. Here are the absolute basics, for anyone who just wants to test some code using Ruby 1.9, without messing up their existing system:

# install rvm
bash < <( curl http://rvm.beginrescueend.com/releases/rvm-install-head )
# add it to your .bashrc
echo '[[ -s "$HOME/.rvm/scripts/rvm" ]] && source "$HOME/.rvm/scripts/rvm"' >> ~/.bashrc
# add tab completion to your .bashrc
echo '[[ -r $rvm_path/scripts/completion ]] && . $rvm_path/scripts/completion' >> ~/.bashrc
# source the script; check installation
source ~/.rvm/scripts/rvm
rvm -v
# returns (e.g.) rvm 1.2.8 by Wayne E. Seguin (wayneeseguin@gmail.com) [http://rvm.beginrescueend.com/]
# install ruby 1.9.2
rvm install 1.9.2
# use ruby 1.9.2
rvm use 1.9.2
ruby -v
# returns (e.g.) ruby 1.9.2p180 (2011-02-18 revision 30909) [i686-linux]
# do stuff as normal (install gems etc.)
# when you're done, switch back to system ruby
rvm system
ruby -v
# returns (e.g.) ruby 1.8.7 (2010-06-23 patchlevel 299) [i686-linux]

That’s it. The key thing is that RVM sets up Ruby in $HOME/.rvm so for example, when using version 1.9.2 under RVM, gem install GEMNAME will install to $HOME/.rvm/gems/ruby-1.9.2-p180/gems/. Your system files are untouched.
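
If you want to reassure yourself about which Ruby is active, a quick check is whether the ruby on your PATH lives under $HOME/.rvm. The helper below is a minimal sketch; its name is hypothetical and it assumes the default per-user install location described above.

```shell
# Succeed if the given path is inside RVM's directory under $HOME.
under_rvm() {
    case "$1" in
        "$HOME/.rvm/"*) return 0 ;;
        *)              return 1 ;;
    esac
}

# Example (not run here): report which ruby is first on the PATH.
# if under_rvm "$(command -v ruby)"; then echo "RVM ruby"; else echo "system ruby"; fi
```

After rvm system, the same check should report the system Ruby again.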

March 1, 2011

The RStudio IDE: first impressions are positive

Integrated development environments (IDEs) are software development tools, providing an interface that enables you to write, debug, run and view the output of your code.

Whether you need an IDE or find them useful depends very much on your own preferences and style of working. In my own case for example, I’ve tried both Eclipse and NetBeans, but I find them bloated and rather “overkill”. On the other hand, my LaTeX productivity shot up when I started to use Kile.

Most of my coding involves either Ruby or R, written using Emacs. For Ruby (including Rails), I use a bundled set of plugins named my_emacs_for_rails, which includes the Emacs Code Browser (ECB). For R, I occasionally use Emacs Speaks Statistics (ESS), but I’m just as likely to run code from a terminal or use the R console.

RStudio, released yesterday, is a new open-source IDE for R. It’s getting a lot of attention at R-bloggers and it’s easy to see why: this is open-source software development done right.

Read the rest…

February 14, 2011

Getting “stuff” into MongoDB

One of the aspects I like most about MongoDB is the “store first, ask questions later” approach. No need to worry about table design, column types or constant migrations as design changes. Provided that your data are in some kind of hash-like structure, you just drop them in.

Ruby is particularly useful for this task, since it has many gems which can parse common formats into a hash. Here are 3 quick examples with relevance to bioinformatics.

Read the rest…

February 13, 2011

Dumped on by data scientists

A story in The Chronicle of Higher Education reminded me that I’ve been meaning to write about “data science” for some time.

The headline to the story:

“Dumped On by Data: Scientists Say a Deluge Is Drowning Research”

Rather amusingly, this is abbreviated in the URL to “Dumped-On-by-Data-Scientists”; a nice example of how the same words, broken in the wrong place, can lead to a completely different meaning.

Anyway, to the point. The term “data scientist” – a good thing, or not?

I’m throwing this one out there because I spent much of 2010 (a) reading articles that used the term and (b) trying to decide whether I like it or not – and I still can’t decide.

Arguments for:

  • It’s an attention-grabber, designed to make us think about the tools and skills required to analyse “big data” in the same way that “NoSQL” is designed to make us think about alternative database solutions

Arguments against:

  • The “data” part is redundant, since all scientists deal with data
  • It belittles the job title of “scientist”; the term might be construed as dismissive of the education, training and skills required to do “boring old school science” as opposed to “new, flashy sexy data science”
  • Many (most?) “data scientists” do business intelligence, not science; crunching Twitter posts to help formulate a better product marketing strategy is not the same as addressing a genuine scientific problem

At the heart of the issue, I feel, is a different approach to data. In “data science” we start with everything, give it a shake and see if answers to our questions fall out. In “real science” we start with a specific question, generate data designed to answer that question and see what falls out. Perhaps they are just different philosophies and mindsets. Perhaps each can learn from the other.

I guess with one “for” and three “against” I’ve decided that I don’t like the term “data scientist”, but I can’t quite shake the feeling that it has some use. What do you think?
