Monthly Archives: September 2008

It would be too easy to rant and rave about this

Zotero is a marvellous, active open-source project, providing a Firefox extension that captures and formats bibliographic information from web pages.

Thomson Reuters describe themselves as “the world’s leading source of intelligent information for businesses and professionals.” Whatever. They specialise in closed-source, proprietary solutions, which to my simple mind is at odds with a role as an information source.

Via FriendFeed from Rafael Sidi’s blog, I learn that Thomson Reuters are suing George Mason University, developers of Zotero, for “violating its license agreement and destroying the EndNote customer base”.

Here’s my simple, black-and-white view of the world. The greatest achievement of the internet is the potential to set information free. There are free-thinking, forward-looking organisations like GMU who see this potential and act upon it. There are also organisations who see only threats to their corporate interests. Publishing corporations no longer control the flow of information to consumers and some of them seem to be struggling to accept this, adapt and move on.

As I say, too easy to rant and rave. If you’d like to do so in the comments, feel free.

Genomic analysis of Pseudoalteromonas tunicata

Some years ago, I provided advice and a little analysis for a group at UNSW studying marine bacteria. It’s nice to see that they remembered me:

Thomas, T., Evans, F.F., Schleheck, D., Mai-Prochnow, A., Burke, C., Penesyan, A., Dalisay, D.S., Stelzer-Braid, S., Saunders, N., Johnson, J., Ferriera, S., Kjelleberg, S. and Egan, S. (2008).
Analysis of the Pseudoalteromonas tunicata Genome Reveals Properties of a Surface-Associated Life Style in the Marine Environment.
PLoS ONE 3:e3252.

If correlating genomic features with microbial physiology is your thing, go and check it out. The article is open access, for your pleasure – as are five of my last six efforts, I just noticed.

Not as many structures as you might think

In the midst of preparing a talk for next Monday. It occurred to me that perhaps we don’t see more protein structure-based prediction in bioinformatics because – there aren’t enough structures.

pdbstats


Sure, the PDB has grown a lot in the past 5 years or so and 53 103 structures (as of now) looks impressive. However, if you’re interested in protein-protein interaction, you want at least 2 chains, which more or less halves the dataset. If you want two different protein chains, you lose almost another 75%. Let’s specify a reasonable resolution cut-off for X-ray diffraction data and there go ~ 3 000 entries. We probably don’t want multiple, similar proteins, so let’s remove redundancy at 90% sequence identity. We’re left with about 2% of the original PDB, which might be usable for looking at interactions.
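For anyone who wants to follow the arithmetic, here is a rough sketch of that funnel in Python. It uses only the approximate fractions quoted above ("halves", "almost another 75%", "~ 3 000", "about 2%"), so the intermediate counts are back-of-envelope estimates, not figures taken from the PDB. The function name and step labels are made up for illustration.

```python
def pdb_funnel(total: int = 53_103) -> list[tuple[str, float]]:
    """Estimate how many PDB entries survive each filter.

    Every fraction below is the rough figure quoted in the text,
    so the results are approximations, not real PDB query counts.
    """
    two_chains = total * 0.5          # at least 2 chains: roughly halves the set
    different = two_chains * 0.25     # two different chains: lose ~75% of those
    resolved = different - 3_000      # resolution cut-off: lose ~3 000 entries
    nonredundant = total * 0.02       # after the 90% identity filter: ~2% of PDB
    return [
        ("all PDB entries", float(total)),
        ("at least 2 chains", two_chains),
        ("2 different protein chains", different),
        ("acceptable X-ray resolution", resolved),
        ("non-redundant at 90% identity", nonredundant),
    ]


# Print each step with its estimated count and share of the whole PDB.
for label, count in pdb_funnel():
    print(f"{label:<34}{count:>8.0f}{count / 53_103:>8.1%}")
```

On these rough numbers, the last step alone discards well over half of what survived the resolution filter.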

No wonder that most bioinformatics focuses on sequences and high-throughput interaction data.

On parsing

Parsing, the act of ripping through a file, pulling out the relevant parts and doing something useful with them, is an integral part of bioinformatics. It can be a dull procedure. It can also be challenging, requiring creativity and imagination. Frequently as a bioinformatician, you will generate output from an unfamiliar program, or a colleague will bring you a file that you haven’t encountered. Your task is to figure out how the file is structured, which regular expressions are required to parse it, what kind of output to produce and, most importantly, how to handle those rogue files which don’t obey the rules.
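To make that concrete, here is a minimal sketch in Python. The file format is entirely hypothetical (whitespace-separated identifier, length and score, with "#" comment lines); the point is only the shape of the task: one regular expression describing a well-formed line, structured output, and rogue lines set aside for inspection instead of crashing the run or vanishing silently.

```python
import re
from typing import Iterable

# Hypothetical record layout: <id> <integer length> <numeric score>
RECORD = re.compile(
    r"^(?P<id>\S+)\s+(?P<length>\d+)\s+(?P<score>-?\d+(?:\.\d+)?)\s*$"
)


def parse_records(lines: Iterable[str]) -> tuple[list[dict], list[tuple[int, str]]]:
    """Return (parsed records, rejected lines with their line numbers)."""
    records: list[dict] = []
    rejects: list[tuple[int, str]] = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip blank lines and comments without treating them as errors.
        if not line.strip() or line.startswith("#"):
            continue
        match = RECORD.match(line)
        if match is None:
            # A rogue line: keep it, with its position, for later inspection.
            rejects.append((number, line))
            continue
        records.append(
            {
                "id": match["id"],
                "length": int(match["length"]),
                "score": float(match["score"]),
            }
        )
    return records, rejects


sample = [
    "# id length score",
    "protA 321 0.87",
    "",
    "protB 154 -1.5",
    "protC unknown 0.42",  # does not obey the rules
]
records, rejects = parse_records(sample)
print(records)
print(rejects)
```

Whether to skip, log or abort on a bad line depends on the job; collecting rejects is just one reasonable default.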

Here are my top ten (language-agnostic) parsing tips, focusing only on non-XML text files.
Read the rest…

Science in the petabyte era

Just a brief note: the title of this post is taken from the cover of today’s Nature. It contains several very good feature articles on the challenges of dealing with petabyte-scale (and larger) datasets, grouped under the heading “Big data”.

Nature contents Sep 4 2008.
Nature News Big Data special.

By far the best of the articles is The future of biocuration: it offers practical recommendations, as opposed to the “gee whizz, what a lot of data” approach. Not least of which: “curators, researchers, academic institutions and funding agencies should, in the next ten years, increase the visibility and support of scientific curation as a professional career.”
Almost as good are Wikiomics, which tackles the lack-of-participation issue, and Welcome to the petacentre, in which Boing Boing’s Cory Doctorow explores, amongst other places, the Sanger Institute data centre.

So far as I can tell from the Nature News link, these articles are freely available.

Data capture versus data archiving

The commonest complaint that I hear whenever electronic lab notebooks (ELNs) or laboratory information management systems (LIMS) are discussed is that they double the workload. People who work in labs enjoy the convenience of their paper notebooks. They perform an action or a process occurs – they write a note. A machine generates a photo – they tear it off and paste it in. Transferring that information to a digital archive is a pain: they have to sit down at a computer with their lab book, scan and upload images, enter text into form fields and so on.

I sympathise, absolutely. At present, data capture and data archiving are, for most people, disconnected processes. Their only comfort is that smart people are working on these problems. One day, laboratory equipment will emit data in machine-readable format directly to digital archives, lab members will carry PDA-like devices and note-taking as we know it will become a relic of the past. That day is some way off, but it will come.

As to why they should invest time in archiving data – just answer these questions:

  • Is your paper notebook searchable?
  • Can other people use old records from your paper notebook to do anything practical?
  • For that matter, can you?
  • Imagine that you have just moved to a new lab and none of your predecessors, now moved on and not contactable, left any record of their activity – how would you feel?

These are questions for individuals but also, I feel, for a training system (academia) that encourages individual prowess over community spirit.