The scriptome

Can’t recall if I’ve mentioned the scriptome project before. I know we’ve covered it at Nodalpoint, where I might have been unintentionally rude about it.

There’s a nice write-up of the project over at Perl.com, very much in the Perl.com style. The author points out that Scriptome is not just a collection of data-munging tools; it’s an attempt to make biologists think about how much time they waste doing repetitive tasks that could be easily automated if they bothered to learn some simple skills. From the introduction:

Have you ever renamed 768 files? Merged the content from 96 files into a spreadsheet? Filtered 100 lines out of a 20,000-line file?
Have you ever done these things by hand?
Disciples of laziness–one of the three Perl programmer’s virtues–know that you should never repeat anything five times, let alone 768. It dismayed me to learn that biologists do this kind of thing all the time.

Experimental biologists increasingly face large sets of large files in often-incompatible formats, which they need to filter, reformat, merge, and otherwise munge. Biologists who can’t write Perl (most of them) often end up editing large files by hand. When they have the same problem a week later, they do the same thing again–or they just give up.

Don’t even get me started on people who use Word for sequence analysis.

7 thoughts on “The scriptome”

  1. Matthias Steffens

    Neil, I’d be pretty amused by your post but unfortunately I’m a biological scientist (even an “ecologist” which somehow equals “muddy science” or “always too many variables”) and I know what you’re talking about. :- / Most of my colleagues (even the young ones) are reluctant to learn any new method that would greatly speed up their task if it “feels” too techy. Besides your examples, search & replace methods that use the most simple regular expressions (such as “.+”) are among the techy things that a “true biologist” seems to avoid under all circumstances. Heck, even MS Word has some kind of (odd but nonetheless helpful) wildcard search… Being faced with this reluctance (or is it fear, ignorance?) can be very frustrating. I wonder whether this has to do with education history; a physicist surely has more training with this sort of stuff. –M

  2. Ynse

    Neil, if you can give me an example of an editor which can highlight and colour columns (imagine alignment editing) and will export the file into a world-readable format (let’s say, html) I will be more than happy to stop using MSWord today. For the majority of protein sequences dedicated tools for analysis will be fine, but unfortunately for sequence that is large and repeated Word is the end not the beginning.

  3. nsaunders Post author

    example of an editor which can highlight and colour columns

    Well off the top of my head there’s boxshade, EMBOSS prettyplot, GeneDoc and I think even ClustalX. Rather than play “name the software” though, why don’t people try some sensible Google queries such as “multiple alignment display presentation software” and see what’s out there.

    What you’re talking about is document formatting – not sequence analysis/manipulation. This is precisely the problem that we’re talking about: unwillingness or inability to think outside of what users think is “normal” (i.e. Microsoft), which leads to using inappropriate tools for the job. Word is a word processor (allegedly). It’s for writing and formatting documents. If you want to mess with alignments, you need “mess with alignments software.”

    It’s also news to me that Word can reliably align columns or generate valid HTML.

  4. Ynse

    As I said – all the tools for “presentation” would be fine as long as you play with _multiple_ sequence alignment. Presentation is not an issue – the analysis (or let’s call it playing) of _single_ sequence is. Word doesn’t even qualify as a proper tool for writing and formatting documents but I use it for two reasons: no other tool allows for “playing” with the _single_ sequence as well as Word and I am forced to use it by publishers.

    What I do is a manual alignment – I insert gaps and tap enter trying to align internal repeats within the sequence. And Word can reliably align columns (at least when I’m using Courier fonts) and color the whole column (an issue if you have 20 internal repeats and want to do it letter by letter). HTML is of course a mess.

    The point I wanted to make is that although you are right about the majority of needs and users, for some specialized needs there’s not much beyond the “word processor”. This includes the recent “IDE”s for protein sequence analysis (standalone tools integrating several services, like STRAP) – none of them has a good sequence editing module.

  5. nsaunders Post author

    OK – we’re talking about lots of different tasks here. Editing, manipulation, analysis, presentation. I agree that there’s a gap in the market for a good sequence editor.

    The point of the original article is that many people are wasting time by not learning some simple scripting skills and this leads to inappropriate or just plain silly use of software. For instance, there are people who would concatenate 100 sequences by copying/pasting them into Word, whereas with a very little Linux knowledge they would know that “cat *.fa > all.fa” is the way to go. There are people who would analyse a set of sequences by submitting them one by one to a web server, copying/pasting the results to Word and editing. Whereas if they knew how, they could run the software locally and write a little Perl to submit each sequence and parse the output in an automated fashion. And so on. The issue is, why do people continue to do things badly when they know that better alternatives exist? “I don’t have time to learn Perl” is an oft-heard excuse, yet apparently some people have time to waste days, weeks or months copying/pasting.

    “Time spent now is time saved later”. Every time someone takes 3 hours to do a task manually which could take a few seconds, they waste 3 hours. They could probably learn the scripting that they need for the task in 3 hours. Seems like a no-brainer to me.

  6. Amir Karger

    Hi. Thanks for linking to the Scriptome.

    As it happens, not only the Scriptome, but a large part of my job in bioinformatics support is “an attempt to make biologists think about how much time they waste doing repetitive tasks”. You’re right that a large part of the problem is biologists’ not realizing that tools for automation even exist.

    I have to take issue with your “if they bothered to learn some simple skills”, though. We actually built the Scriptome after I taught a slew of 3-hour Perl classes, classes which resulted in approximately zero new Perl users. Variable/function context, DWIMery, regexes… there’s too many side effects and syntax eccentricities that can bite newbies. (Damian Conway even wrote a paper about it, as we quoted in our article.) Not to mention the meta-concepts of how you take a problem and create a program as the solution. How much of your *first* programming language had you learned after just 3 hours? I think I was up to “20 GOTO 10” at that point.

    “Time spent now is time saved later” is a great point. But a lot of the problem is psychological: what is the *perceived* benefit against the *perceived* cost? Perceived benefit of learning programming? Might make find/replace a little faster. Perceived cost? As hard as learning Japanese.

    Which is why the Scriptome’s #1 goal is lowering the barriers (real and psychological) to starting to use the tools. A biologist can use the Scriptome to do something useful in 15 minutes (less if you or I sit next to them), possibly with no installation required.

    If it makes you happier, you can view the Scriptome as a stealth method of showing biologists the power of automation in general and Perl one-liners in specific. If we get over the psychological barrier, then eventually they’ll realize that they should (and can!) learn Perl. It even happened to our first user, who’s now a professor showing her undergrads the power of automation.

    I’ve got lots more of this, but I’ll spare you. Feel free to email me with discussion or criticism if you don’t want to post a response.

  7. nsaunders Post author

    I like the idea of scriptome as stealth method. For sure the best way to win people over is by practical demonstration.

    I’ve got lots more but I’ll spare you too. I spent years working with biologists who just would not come to the party and it infuriated me to the point where – well, I left and got a new job. When you’re excited by the possibilities, it’s hard to understand people who aren’t, no matter how hard you try to demonstrate the benefits to them.
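
As a rough illustration of the automation discussed in comment 5 (concatenating many sequence files, then running the same step on each file instead of doing it by hand), here is a minimal shell sketch. It assumes bash, FASTA files with a .fa extension sitting in one directory, and records whose header lines start with ">". The function names (concat_fasta, count_seqs, for_each_fasta) and the demo data are hypothetical, made up for this example; they are not part of the Scriptome.

```shell
#!/usr/bin/env bash
# Sketch only: function names and demo data are illustrative assumptions.

# Concatenate every .fa file in a directory into a single output file.
concat_fasta() {
    local dir="$1" out="$2"
    cat "$dir"/*.fa > "$out"
}

# Count the sequences in a FASTA file (each record starts with ">").
count_seqs() {
    grep -c '^>' "$1"
}

# Run the given command once per .fa file, passing the file as last argument.
for_each_fasta() {
    local dir="$1"; shift
    local f
    for f in "$dir"/*.fa; do
        "$@" "$f"
    done
}

# Demo: three tiny made-up FASTA files in a temporary directory.
demo_dir=$(mktemp -d)
printf '>seq1\nACGT\n' > "$demo_dir/a.fa"
printf '>seq2\nGGCC\n' > "$demo_dir/b.fa"
printf '>seq3\nTTAA\n' > "$demo_dir/c.fa"

# Merge them into a file that lives outside the input directory.
merged=$(mktemp)
concat_fasta "$demo_dir" "$merged"

count_seqs "$merged"                   # records in the merged file
for_each_fasta "$demo_dir" count_seqs  # one count per input file
```

The merged file is written outside the input directory on purpose: with “cat *.fa > all.fa”, a second run in the same directory would match all.fa as one of its own inputs.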
