Category Archives: computing

Ruby Version Manager: the absolute basics

Confession: I’m still using Ruby version 1.8.7, from the Ubuntu 10.10 repository. Why? I have a lot of working code, I don’t know how well it works under Ruby 1.9, and I’m worried that migration will break things and make me miserable.

Various people have pointed me to RVM – Ruby Version Manager. As the name suggests, it allows you to manage multiple Ruby versions on your machine. Today, I needed to test an application written for Ruby 1.9.2, so I used RVM for the first time. Here are the absolute basics, for anyone who just wants to test some code using Ruby 1.9, without messing up their existing system:

# install rvm
bash < <( curl http://rvm.beginrescueend.com/releases/rvm-install-head )
# add it to your .bashrc
echo '[[ -s "$HOME/.rvm/scripts/rvm" ]] && source "$HOME/.rvm/scripts/rvm"' >> ~/.bashrc
# add tab completion to your .bashrc
echo '[[ -r $rvm_path/scripts/completion ]] && . $rvm_path/scripts/completion' >> ~/.bashrc
# source the script; check installation
source ~/.rvm/scripts/rvm
rvm -v
# returns (e.g.) rvm 1.2.8 by Wayne E. Seguin (wayneeseguin@gmail.com) [http://rvm.beginrescueend.com/]
# install ruby 1.9.2
rvm install 1.9.2
# use ruby 1.9.2
rvm use 1.9.2
ruby -v
# returns (e.g.) ruby 1.9.2p180 (2011-02-18 revision 30909) [i686-linux]
# do stuff as normal (install gems etc.)
# when you're done, switch back to system ruby
rvm system
ruby -v
# returns (e.g.) ruby 1.8.7 (2010-06-23 patchlevel 299) [i686-linux]

That’s it. The key thing is that RVM sets up Ruby in $HOME/.rvm, so, for example, when using version 1.9.2 under RVM, gem install GEMNAME will install to $HOME/.rvm/gems/ruby-1.9.2-p180/gems/. Your system files are untouched.
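To see that isolation for yourself, a quick check along these lines should do (the gem and the paths shown are only illustrative):

# with RVM's 1.9.2 selected, gems are installed under $HOME/.rvm
rvm use 1.9.2
gem install nokogiri
gem environment gemdir
# returns (e.g.) /home/you/.rvm/gems/ruby-1.9.2-p180
# switch back to the system ruby and the system gem path is unchanged
rvm system
gem environment gemdir
# returns (e.g.) /usr/lib/ruby/gems/1.8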

The RStudio IDE: first impressions are positive

Integrated development environments (IDEs) are software development tools that provide an interface for writing, debugging, running and viewing the output of your code.

Whether you need an IDE, or find one useful, depends very much on your own preferences and style of working. In my own case, for example, I’ve tried both Eclipse and NetBeans, but I find them bloated and rather “overkill”. On the other hand, my LaTeX productivity shot up when I started to use Kile.

Most of my coding involves either Ruby or R, written using Emacs. For Ruby (including Rails), I use a bundled set of plugins named my_emacs_for_rails, which includes the Emacs Code Browser (ECB). For R, I occasionally use Emacs Speaks Statistics (ESS), but I’m just as likely to run code from a terminal or use the R console.

RStudio, released yesterday, is a new open-source IDE for R. It’s getting a lot of attention at R-bloggers and it’s easy to see why: this is open-source software development done right.
Read the rest…

Getting “stuff” into MongoDB

One of the aspects I like most about MongoDB is the “store first, ask questions later” approach. No need to worry about table design, column types or constant migrations as the design changes. Provided that your data are in some kind of hash-like structure, you just drop them in.

Ruby is particularly useful for this task, since it has many gems which can parse common formats into a hash. Here are 3 quick examples with relevance to bioinformatics.
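As an aside, for simple flat files you don’t even need a driver: the mongoimport tool that ships with MongoDB will load JSON or tab-delimited data straight into a collection. A rough sketch, with database, collection and file names invented for illustration:

# one JSON document per line goes straight into db "bio", collection "features"
mongoimport --db bio --collection features --file features.json
# a tab-delimited file, using the header line for field names
mongoimport --db bio --collection annotations --type tsv --file annotations.tsv --headerline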
Read the rest…

Dumped on by data scientists

A story in The Chronicle of Higher Education reminded me that I’ve been meaning to write about “data science” for some time.

The headline to the story:

“Dumped On by Data: Scientists Say a Deluge Is Drowning Research”

Rather amusingly, this is abbreviated in the URL to “Dumped-On-by-Data-Scientists”; a nice example of how the same words, broken in the wrong place, can lead to a completely different meaning.

Anyway, to the point. The term “data scientist” – a good thing, or not?
I’m throwing this one out there because I spent much of 2010 (a) reading articles that used the term and (b) trying to decide whether I like it or not – and I still can’t decide.

Arguments for:

  • It’s an attention-grabber, designed to make us think about the tools and skills required to analyse “big data” in the same way that “NoSQL” is designed to make us think about alternative database solutions

Arguments against:

  • The “data” part is redundant, since all scientists deal with data
  • It belittles the job title of “scientist”; the term might be construed as dismissive of the education, training and skills required to do “boring old school science” as opposed to “new, flashy sexy data science”
  • Many (most?) “data scientists” do business intelligence, not science; crunching Twitter posts to help formulate a better product marketing strategy is not the same as addressing a genuine scientific problem

At the heart of the issue, I feel, is a different approach to data. In “data science” we start with everything, give it a shake and see if answers to our questions fall out. In “real science” we start with a specific question, generate data designed to answer that question and see what falls out. Perhaps they are just different philosophies and mindsets. Perhaps each can learn from the other.

I guess with one “for” and three “against” I’ve decided that I don’t like the term “data scientist”, but I can’t quite shake the feeling that it has some use. What do you think?

Algorithms running day and night

Warning: contains murky, somewhat unstructured thoughts on large-scale biological data analysis

Picture this. It’s based on a true story: names and details altered.

Alice, a biomedical researcher, performs an experiment to determine how gene expression in cells from a particular tissue is altered when the cells are exposed to an organic compound, substance Y. She collates a list of the most differentially-expressed genes and notes, in passing, that the expression of Gene X is much lower in the presence of substance Y.

Bob, a bioinformatician in the same organisation but in a different city to Alice, is analysing a public dataset. This experiment looks at gene expression in the same tissue but under different conditions: normal compared with a disease state, Z Syndrome. He also notes that Gene X appears in his list – its expression is much higher in the diseased tissue.

Alice and Bob attend the annual meeting of their organisation, where they compare notes and realise the potential significance of substance Y in suppressing the expression of Gene X and so perhaps relieving the symptoms of Z Syndrome. On hearing this, the head of the organisation, Charlie, marvels at the serendipitous nature of the discovery. Surely, he muses, given the amount of publicly-available experimental data, there must be a way to automate this kind of discovery by somehow “cross-correlating” everything with everything else until patterns emerge. What we need, states Charlie, is:

Algorithms running day and night, crunching all of that data

What’s Charlie missing?
Read the rest…

APIs have let me down part 2/2: FriendFeed

In part 1, I described some frustrations arising out of a work project, using the ArrayExpress API. I find that one way to deal mentally with these situations is to spend some time on a fun project, using similar programming techniques. A potential downside of this approach is that if your fun project goes bad, you’re really frustrated. That’s when it’s time to abandon the digital world, go outside and enjoy nature.

Here then, is why I decided to build another small project around FriendFeed, how its failure has led me to question the value of FriendFeed for the first time and why my time as a FriendFeed user might be up.
Read the rest…

APIs have let me down part 1/2: ArrayExpress

The API – Application Programming Interface – is, in principle, a wonderful thing. You make a request to a server using a URL and back come lovely, structured data, ready to parse and analyse. We’ve begun to demand that all online data sources offer an API and lament the fact that so few online biological databases do so.
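In the ideal case the whole transaction is as trivial as this sketch (the URL and parameters are invented for illustration; they are not a real endpoint):

# request structured data from a hypothetical REST API and save it for parsing
curl -s "http://api.example.org/experiments?species=Homo+sapiens&format=xml" > experiments.xml
# quick sanity check that something sensible came back
head -5 experiments.xml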

Better, though, to have no API at all than one which is poorly implemented and leads to frustration? I’m beginning to think so, after recent experiences on both a work project and one of my “fun side projects”. Let’s start with the work project, an attempt to mine a subset of the ArrayExpress microarray database.
Read the rest…

Does your LinkedIn Map say anything useful?

LinkedIn, the “professional” career-oriented social network, is one of those places on the Web where I maintain a profile for visibility. I’m yet to gain any practical value whatsoever from it. That said, I know plenty of people who do find it useful – mostly, it seems, those living near the north-east or west coast of the USA.

[Image: My LinkedIn Network]


LinkedIn have something of a reputation for innovation – see LinkedIn Labs, their small demonstration products, for example. The latest of these is named InMaps. It’s been popping up on blogs and Twitter for several days. Essentially, it creates a graph of your LinkedIn network, applies some community detection algorithm to cluster the members and displays the results as a pretty, interactive graphic that you can share.

What seems to have captured the imagination is that the graphs indicate communities that are instantly recognisable to the user. There’s mine on the right (click for full-size version). It’s not a large, complex or especially interesting network but when I “eyeballed” it, I was immediately able to classify the three sub-graphs:

  • Orange – mostly people with whom I have worked or currently work, plus a few “random” contacts: note that this group is hardly interconnected at all
  • Green – people who work in bioinformatics or computational biology, particularly genomics: two major hubs connect me with this group
  • Blue – the largest, densest network is composed largely of what I’d call the “BioGang”: people that I interact with on Twitter and FriendFeed, many of whom I haven’t met in person

This confirms what I’ve long suspected: I prefer to network with smart strangers rather than with my immediate peers and colleagues. Or as Bill Joy said, “no matter who you are, most of the smartest people work for someone else.” I’ve seen this misquoted as “where you are”, which makes more sense to me.

A timely reminder to use strong passwords

You may have read about a security breach at Gawker Media, the company behind several websites including Lifehacker.

The server files have been posted at various locations around the web, so I thought I’d take a look. Finding your own email address and decrypted password in a file obtained online is a sobering experience, I can tell you. Fortunately, it was not a password that I use elsewhere, so no damage done. It was, however, a ridiculously “soft” password (all digits, if you must know).

Of course, my thoughts soon turned to data analysis. A quick and dirty bash one-liner reveals the top 10 passwords…
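The sort of thing I mean, sketched on the assumption that the dump holds one colon-separated record per line with the plain-text password in the third field (the real file layout may differ):

# count each distinct password, sort by frequency and keep the top 10
cut -d: -f3 gawker.txt | sort | uniq -c | sort -rn | head -10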
Read the rest…

Can a journal make a difference? Let’s find out.

Academic journals. Frankly, I’m not a big fan of any of them. There are too many. They cost too much. Much of what they publish is inconsequential, read by practically no-one, or just downright incorrect. Much of the rest is badly-written and boring. The people who publish them have an over-inflated sense of their own importance. They’re hidden behind paywalls. And governed by ludicrous metrics. The system by which articles are accepted or rejected is arcane and ridiculous. I mean, I could go on…

No, what really troubles me about journals is that they only tell a very small part of the story – the flashy, attention-grabbing part called “results”. We learn from high school onwards that a methods section should be sufficient for anyone to reproduce the results. This is one of the great lies of science. Go read any journal in your field and give it a try. It’s even the case in computation, an area which you might think less prone to the problems of reproducing wet-lab science (“the Milli-Q must have been off”).

We have this wonderful thing called the Web now. The Web doesn’t have a page limit, so you can describe things in as much detail as you wish. Better still, you can just post your methods and data there in full, for all to see, download and reproduce to their hearts’ content. You’d like some credit for doing that though, right?

So if you do research – any kind of research – that involves computation, your code is open-source, reusable, well-documented and robust (think: tests) and you want to share it with the world, head over to a new journal called BMC Open Research Computation, which is now open for submissions. Your friendly team of enlightened editors awaits.

More information at Science in the Open and Saaien Tist. Full disclosure: I’m on the editorial board of this journal and was invited to write a launch post.