More on ontologies

Propeller Twist has some interesting thoughts on ontology and how best to standardise it.

I’m slowly coming around to the idea that this is a very important issue.  To me, it seems the problem is not how we describe objects (which are finite), but how we describe the relationships between them (which is very dependent on how individuals use words).  I also feel that current ontologies attempt to do too much and be all things to all people, whereas the aim should be to keep things as simple and tight as possible.  In other words, there are too many verbs of the “interacts with, binds to etc.” nature.  I don’t need to know that p53 IS_RARELY SWALLOWED_BY parrots, for instance.

Surely all we need to do is sit down and make a big list of nouns (objects in biological systems) and verbs (things that they do), then stick to it.
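The idea can be sketched in a few lines of code – a toy controlled vocabulary, with every term invented purely for illustration:

```python
# A toy controlled vocabulary: a statement is only valid if both the
# objects (nouns) and the relationship (verb) come from the agreed lists.
# All terms here are invented for illustration.

NOUNS = {"p53", "MDM2", "ribosome", "tRNA"}
VERBS = {"BINDS", "PHOSPHORYLATES", "TRANSCRIBES"}

def valid_statement(subject, verb, obj):
    """A triple is acceptable only if every term is on the list."""
    return subject in NOUNS and verb in VERBS and obj in NOUNS

print(valid_statement("MDM2", "BINDS", "p53"))                      # True
print(valid_statement("p53", "IS_RARELY_SWALLOWED_BY", "parrots"))  # False
```

Agreeing on the lists is of course the hard part – but once agreed, enforcement is trivial.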

Methods versus discovery

One of my pet issues in bioinformatics is that of methodology versus discovery.  In other words – is your new, cute and clever piece of software any use if biologists are not using it to discover interesting things?

It’s a complex problem.  On the one hand, without software development we don’t have new tools with which to make discoveries.  On the other, if the focus is entirely on software development, programmers are free to publish endless short papers without practical application, and biologists fail to see the point of bioinformatics and so become ever more ignorant of, and disenfranchised from, the process of computational biology.

Ultimately I suppose, there can be a forum for both aspects.  I think we are seeing this in the way that journals are developing.  Bioinformatics, for instance, seems to be very much a forum for methods and algorithms these days, whereas journals such as Genome Research seem more focused on the application of tools to biological discovery.

Still, I worry that your “average” biologist, turning to a hard-core bioinformatics journal, is likely to ask “what use is all this?”

Database responsibility

We’ve had some recent discussion at various bioinformatics blogs regarding web frontends to biological databases.  If you know some SQL and CGI, setting up a dynamic database is an easy and fun thing to do.  Perhaps because it’s so easy, everyone and their dog seems to be doing it.  I’m guilty myself (published) and guilty again (unpublished).

I have a few criteria for a web-based database, particularly if it’s published in a journal:

– it should contain interesting and useful biological information

– it should offer original and user-friendly functionality not found elsewhere

– it should work properly

I know that servers go down and we all have hardware, software and network problems occasionally, but if you publish, your resource should be accessible and functional.

SPdb, a signal peptide database, seems to have some teething troubles (500 errors and so on).  I like the concept though – and did manage to find data for a secreted archaeal protein, something that interests me.  It’s not clear to me yet whether their data is all experimental or includes predictions (in which case, archaea would be…).

RISSC – the 16S-23S rDNA spacer database – is a bad example.  Potentially useful information, but so far as I can tell, totally non-functional.  The “Microsoft OLE” errors used to be a giveaway, but now it just does – nothing.  At all.

RefSeq vs. GenBank

I’ve noticed occasional posts on the Bioperl lists regarding RefSeq, but have never paid much attention.  After all, RefSeq genbank (the file) is the same as GenBank genbank (the file), right?

As ever, the answer is “sort of but not quite”.  Here’s how a standard GenBank entry such as AY627381 might list the genes in an rRNA operon:
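A sketch of such a feature table – coordinates and qualifiers are invented for illustration, not taken from the actual record:

```
     source          1..4800
     rRNA            1..1513
                     /product="16S ribosomal RNA"
     misc_feature    1514..1860
                     /note="16S-23S internal transcribed spacer"
     tRNA            1600..1675
                     /product="tRNA-Ala"
     rRNA            1861..4800
                     /product="23S ribosomal RNA"
```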


In this case, we’re looking at primary tags for the whole sequence, 16S, 16S-23S ITS, a tRNA and the 23S.

Now, here’s an example of a similar region from a ‘newer style’ RefSeq record in genbank format, typically a genome record such as NC_000909:
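Again a sketch, with invented coordinates and locus tags, showing the shape such a record takes:

```
     gene            1..2900
                     /locus_tag="r23S_01"
     rRNA            1..2900
                     /product="23S ribosomal RNA"
     gene            2950..3025
                     /locus_tag="trna_01"
     tRNA            2950..3025
                     /product="tRNA-Ala"
     gene            3030..3105
                     /locus_tag="trna_02"
     tRNA            3030..3105
                     /product="tRNA-Cys"
     gene            3200..4720
                     /locus_tag="r16S_01"
     rRNA            3200..4720
                     /product="16S ribosomal RNA"
```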


Here we have 23S, 2 tRNA genes and a 16S.  Spot the difference?  All genes now have 2 primary tags – gene/tRNA or gene/rRNA (and also gene/CDS for proteins).

What this means is: if I am reading a file and looking for features based on primary tag and separated by N genes, I need to skip N features for old-style GenBank and N*2 features for new-style RefSeq.  If I don’t know in advance which type I’m dealing with (because I just grabbed a load of records), or even how consistent this is within a type, I have problems.
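A minimal sketch of the adjustment, assuming a parser has already reduced each record to its list of primary tags:

```python
# Decide how many features correspond to one gene: new-style RefSeq pairs
# every rRNA/tRNA/CDS feature with a 'gene' feature, so each gene
# occupies two primary tags instead of one.

def tags_per_gene(primary_tags):
    return 2 if "gene" in primary_tags else 1

old_style = ["rRNA", "tRNA", "rRNA"]                          # 16S, tRNA, 23S
new_style = ["gene", "rRNA", "gene", "tRNA", "gene", "rRNA"]

n = 3  # two features of interest are separated by 3 genes
print(n * tags_per_gene(old_style))  # 3 features to skip
print(n * tags_per_gene(new_style))  # 6 features to skip
```

Of course, this heuristic assumes consistency within a record, which (as above) may be optimistic.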

Just another example of the daily problems caused by inconsistent primary data formats.

Perils of Debian unstable

So this is not exactly a bioinformatics post – but it’s been a while, so consider this a “day to day problems of working with computers” post, if you will.

Several of my machines are perilous experimental beasts – Debian unstable, CVS everything else (Bioperl and so on).  I know – I deserve everything I get.  In general though, I run into very few problems, although the daily updates are a bit tedious.

So today I’m running a Perl script which includes a system call to a program named procheck, and suddenly I see:

  unknown colorls variable `su'.

I realise that the Perl is OK – so what’s going on?  A little research reveals that the LS_COLORS environment variable, used by ‘ls’ to colour filenames, is the culprit – it seems the Debian unstable coreutils and fileutils packages have been updated recently, and the dircolors database now includes this illegal option, ‘su’.

Recompiling procheck fails because the compilation script runs under csh, which is where the LS_COLORS problem lies.  So we edit the script, procomp.scr, to use ‘#!/bin/bash’ and it compiles fine.  As an aside – why are structural biologists so keen on csh?  Just wondered.  Then in our Perl script we add:

  $ENV{'LS_COLORS'} = '';

and procheck runs happily once more.  $ENV is a handy thing to know, by the way.
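For comparison – not from the original fix, just the same idea in another language – the equivalent environment scrubbing in Python:

```python
import os

# Blank out LS_COLORS, as with $ENV in Perl, so that any child process
# (started via os.system or the subprocess module) no longer sees the
# broken 'su' entry from the dircolors database.
os.environ["LS_COLORS"] = ""

print(os.environ["LS_COLORS"] == "")  # True
```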

Gnuplot tips

I like gnuplot – it’s great for a quick visualisation of some data or for scripting a large number of quick and dirty plots.  I suspect that I’m not alone in that (a) I wish that I had more facility with it and (b) partly because of (a), I tend towards more pointy-clicky tools such as OpenOffice for publication-quality plots.

Anyway, this site has some great gnuplot tips.  My favourite – how do you label data points using gnuplot?  You require a separate file of labels (which I’d say is a con, not a pro), but a Perl one-liner can easily generate this for you.  Assuming that your data matrix, “data.dat”, looks like this:

  A B C D
a 1 2 3 4
b 5 6 7 8
c 1 2 3 4
d 5 6 7 8

and you want to plot columns 2:3 – that is (1,2), (5,6) and so on, using column 1 (a, b, c, d) as labels then:

perl -ane 'print "set label \"$F[0]\" at $F[1],$F[2]\n"' data.dat > label.plt

does the trick.  Edit out the first line of label.plt, then in gnuplot:

load "label.plt"
plot "data.dat" using 2:3

This will centre the data labels on each point so you may need to fiddle with the exact positioning.
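If one-liners aren’t your thing, the same label file can be generated with a short script – here a Python sketch, assuming the data layout above – which also skips the header row, so there’s nothing to edit out afterwards:

```python
# Turn a whitespace-separated data matrix into gnuplot 'set label'
# commands: column 1 is the label, columns 2 and 3 give the x,y position.
# The header row is skipped, unlike the Perl one-liner above.

def make_labels(lines):
    out = []
    for row in lines[1:]:                    # [1:] drops the header row
        f = row.split()
        out.append('set label "%s" at %s,%s' % (f[0], f[1], f[2]))
    return out

data = ["  A B C D", "a 1 2 3 4", "b 5 6 7 8"]
for cmd in make_labels(data):
    print(cmd)
# set label "a" at 1,2
# set label "b" at 5,6
```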

BIND database funding

Pedro highlights some big news – the BIND database has run out of funding.

We live in an age where free, public access to biological data is accepted as the norm, and that is a good thing.  However, few of us give much thought to the costs and infrastructure required to maintain these services.  There are many issues here, not least of which is data provision from non-professional and potentially ephemeral sources – such as postdocs at universities.  The Web is a funny place – once you find a favourite site you tend to assume that it has always existed and will continue to do so, but 20 years ago there was no NCBI – who knows what we’ll have in 20 years’ time?

Our black box mentality means that we think of the Web and websites as entities in their own right.  Let’s not forget that behind every site there’s just software and hardware – and for it all to vanish in an instant, all that needs to happen is for someone to pull the plug.

OOo Statistics

I’ve recently been trying to get my head around some multivariate statistical methods, specifically canonical discriminant analysis (often referred to as CDA or MDA, or qda/lda if using R).

Googling around, I came across OOoMacros, a collection of macros for OpenOffice.  There you can find a package called OOo Statistics which includes several multivariate methods including CDA (Manova), principal components and correspondence analysis.

Just unzip, open the .sxc file and hit the install button.  On my Debian unstable, no joy using OpenOffice 2.0, but it works fine with 1.1.4.  Very slow – it is OO Basic after all – but at least I can get my data in and start analysing straight away, rather than reading R documentation for a week and being none the wiser.