Category Archives: computing

On the road: CSS and eResearch Conference 2014

Next week I’ll be in Melbourne for one of my favourite meetings, the annual Computational and Simulation Sciences and eResearch Conference.

The main reason for my visit is the Bioinformatics FOAM workshop. Day 1 (March 27) is not advertised since it is an internal CSIRO day, but I’ll be presenting a talk titled “SQL, noSQL or no database at all? Are databases still a core skill?“. Day 2 (March 28) is open to all and I’ll be talking about “Learning from complete strangers: social networking for bioinformaticians“.

I imagine these and other talks will appear on Slideshare soon, at both my account and that of the Australian Bioinformatics Network.

I’m also excited to see that Victoria Stodden is presenting a keynote at the main CSS meeting (PDF) on “Reproducibility in Computational Science: Opportunities and Challenges”.

Hope to see some of you there.

Quilt plots. Like heat maps, only…heat maps

Stephen tweets:

A "quilt plot"

A “quilt plot”


Quilt plots. Sounds interesting. The link points to a short article in PLoS ONE, containing a table and a figure. Here is Figure 1.

If you looked at that and thought “Hey, that’s a heat map!”, you are correct. That is a heat map. Let’s be quite clear about that. It’s a heat map.

So, how do the authors justify publishing a method for drawing heat maps and then calling them “quilt plots”?
Read the rest…

Career advice: switching to computational research

Laboratory work, of the “wet” kind, not working out for you? Or perhaps you just need new challenges. Think you have some aptitude with data analysis, computers, mathematics, statistics? Maybe a switch to computational biology is what you need.

That’s the topic of the Nature Careers feature “Computing: Out of the hood“. With thoughts and advice from (on Twitter) @caseybergman, @sarahmhird, @kcranstn, @PavelTomancak, @ctitusbrown and myself.

I enjoyed talking with Roberta and she did a good job of capturing our thoughts for the article. One of these days, I might even write here about my own journey in more detail.

Git for bioinformaticians at the Bioinformatics FOAM meeting

Last week, I attended the annual Computational and Simulation Sciences and eResearch Conference, hosted by CSIRO in Melbourne. The meeting includes a workshop that we call Bioinformatics FOAM (Focus On Analytical Methods). This year it was run over 2.5 days (up from the previous 1.5 by popular request); one day for internal CSIRO stuff and the rest open to external participants.

I had the pleasure of giving a brief presentation on the use of Git in bioinformatics. Nothing startling; aimed squarely at bioinformaticians who may have heard of version control in general and Git in particular but who are yet to employ either. I’m excited because for once I am free to share, resulting in my first upload to Slideshare in almost 4.5 years. You can view it here, or at the Australian Bioinformatics Network Slideshare, or in the embed below.

See the slides…

Redmine + Gitolite integration

I’m a big fan of both Redmine, the project management web application and Git, the distributed version control system.

Recently, I learned that it’s possible to integrate Git into Redmine so that git repositories for a project can be created via the Redmine web interface. This is done using plugins which connect Redmine with git hosting software: either gitosis or more recently, gitolite.

Unfortunately, this is a deeply-confusing process for novices like myself. There are multiple forks of the plugins, long threads in the Redmine forums that discuss various hacks/tweaks to make things work and no one authoritative source of documentation. After much experimentation, this is what worked for me. I can’t guarantee success for you.

Read the rest…

How to: create a partial UCSC genome MySQL database

File under: simple, but a useful reminder

UCSC Genome Bioinformatics is one of the go-to locations for genomic data. They are also kind enough to provide access to their MySQL database server:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A

However, users are given fair warning to “avoid excessive or heavy queries that may impact the server performance.” It’s not clear what constitutes excessive or heavy but if you’re in any doubt, it’s easy to create your own databases locally. It’s also easy to create only the tables that you require, as and when you need them.
As an example, here’s how you could create only the ensGene table for the latest hg19 database. Here, USER and PASSWD represent a local MySQL user and password with full privileges:

# create database
mysql -u USER -pPASSWD -e 'create database hg19'
# obtain table schema
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ensGene.sql
# create table
mysql -u USER -pPASSWD hg19 < ensGene.sql
# obtain and import table data
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ensGene.txt.gz
gunzip ensGene.txt.gz
mysqlimport -u USER -pPASS --local hg19 ensGene.txt

It’s very easy to automate this kind of process using shell scripts. All you need to know is the base URL for the data, http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ and that there are two files with the same prefix per table: one for the schema (*.sql) and one with the data (*.txt.gz).

Why can’t PubMed or academic journals get the basics right?

A recent question at BioStar asked “Is the NAR database list available in a computer readable format?” The short answer is “no” and Pierre has done some excellent preliminary work to address the issue.

I’ve been working on a database and web application to check the associated URLs but quite frankly, this is tedious, a waste of everyone’s time and could be entirely avoided if the publishing industry did a better job. All that’s required is that either NAR or PubMed provide structured data – XML, Medline format, I don’t care what – containing a field that looks something like this:

URL    http://a.valid.url.goes.here

That way, we could all avoid writing regular expressions to detect URLs in abstracts. No wait – to detect broken URLs in abstracts. You would not believe how many of them look like this:

URL    http://www.amaze.ulb. ac.be/
                            ^

Someone helpfully informed me via Twitter that this is “often a result of typesetting.” Thanks for that.

R 2.12 to 2.13 package upgrade

If you:

  • use Linux
  • have just upgraded your R installation from 2.12 to 2.13
  • installed some/all of your packages in your home area (e.g. ~/R/i486-pc-linux-gnu-library/2.12) and…
  • …are wondering why R can’t see them any more

just do this:

# at a shell prompt
cp -r ~/R/i486-pc-linux-gnu-library/2.12 ~/R/i486-pc-linux-gnu-library/2.13
# in R console
update.packages(checkBuilt=TRUE, ask=FALSE)
# back to the shell
rm -rf ~/R/i486-pc-linux-gnu-library/2.12

update: corrected a typo; of course you need “cp -r”

Fixing aberrant files using R and the shell: a case study

Once in a while, you embark on what looks like a simple computational procedure only to encounter frustration very early on. “I can’t even read my file into R!” you cry.

Step back, take a deep breath and take note of what the software is trying to tell you. Most times, you’ve just missed something very straightforward. Here’s an example.

Update: this post is not about how best to perform the task; it’s about how to cope with frustration. Please stop sending me your solutions :-)
Continue reading