APIs have let me down part 1/2: ArrayExpress

The API – Application Programming Interface – is, in principle, a wonderful thing. You make a request to a server using a URL and back come lovely, structured data, ready to parse and analyse. We’ve begun to demand that all online data sources offer an API and lament the fact that so few online biological databases do so.

Better though, to have no API at all than one which is poorly implemented and leads to frustration? I’m beginning to think so, after recent experiences on both a work project and one of my “fun side projects”. Let’s start with the work project, an attempt to mine a subset of the ArrayExpress microarray database.

1. Introduction
ArrayExpress is an online database of microarray experiments, organised by both gene (the expression atlas) and experiment (the experiment archive). As biological websites go it’s quite impressive, featuring well-implemented search and some nice graphical summaries of the results.

Sometimes though, what you want is a specific subset of the data on your own machine, for statistical analysis. In my case, I was interested in data pertaining to colorectal cancer. I also wanted to implement some visualisations of the data using tools such as Sinatra, jQuery UI and Highcharts. The ArrayExpress website uses some JavaScript for charts, table sorting and loading data using AJAX but, in my opinion, it’s somewhat clunky and could use a design overhaul.

2. The API
Unusually, for a biological database, ArrayExpress provides an API, returning results in either JSON or XML format. This is terrific – way ahead of most other online resources for biology – and for small, simple queries, works very well. The documentation leaves something to be desired: outdated in places and somewhat “rambling” but with patience, study and experimentation, you can generally figure out what you want to do.

3. Retrieving gene data
Fetching data in JSON format for human genes identified in experiments pertaining to colorectal cancer is relatively straightforward, using this URL:


That fetches the first 100 results and it’s quite easy to write a loop, using the key totalResults, to retrieve the rest as described in the documentation. EFO_0000365 is an ontology term for colorectal neoplasms; ontology support is another nice feature of ArrayExpress.
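
As a rough sketch of that loop (not the code I actually ran): the block below stands in for the real HTTP request, and both the "results" key and the offset/page-size style of paging are assumptions made for illustration. Only totalResults comes from the API output described above.

```ruby
# Sketch only: collect every page of results, using totalResults to decide
# when to stop. The block is a hypothetical stand-in for the real HTTP call:
# it receives an offset and a page size and must return the parsed JSON for
# that page as a Hash. The "results" key is an assumption.
def fetch_all(page_size = 100)
  results = []
  offset = 0
  loop do
    page = yield(offset, page_size)
    batch = page["results"] || []
    results.concat(batch)
    offset += page_size
    # Stop once everything has been seen, or if the server sends an empty page.
    break if batch.empty? || offset >= page["totalResults"].to_i
  end
  results
end

# Usage with a fake in-memory "server" holding 250 genes.
genes = (1..250).map { |i| { "name" => "gene#{i}" } }
all_genes = fetch_all(100) do |offset, size|
  { "totalResults" => genes.size, "results" => genes[offset, size] || [] }
end
puts all_genes.size
```

Passing the request in as a block keeps the paging logic separate from whatever HTTP client and query parameters the real API needs.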

The JSON contains a lot of gene-specific information: mostly names, accessions and IDs for different databases. The key expressions => experiments looks something like this:

			"experiments" : [
				{
					"accession" : "E-GEOD-15960",
					"pvalue" : 0.000006433492,
					"expression" : "UP"
				},
				{
					"accession" : "E-AFMX-5",
					"pvalue" : 0.017802352,
					"expression" : "UP"
				},
				{
					"accession" : "E-TABM-145",
					"pvalue" : 0.004811193,
					"expression" : "UP"
				},
				{
					"accession" : "E-MTAB-62",
					"pvalue" : 0.000009164466,
					"expression" : "UP"
				},
				{
					"accession" : "E-MTAB-62",
					"pvalue" : 0.000005601625,
					"expression" : "UP"
				}
			]

This is rather confusing. You can see that the same gene may appear multiple times in the same experiment (in this case, E-MTAB-62), with different p-values. These arise from comparing within different experimental factors – which are not defined in the JSON output.
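
One way to make the duplicates visible is to group the entries by accession. This is a minimal sketch using the five entries shown above; the helper name pvalues_by_accession is made up for illustration.

```ruby
# Sketch: group one gene's "experiments" entries by accession, so that
# repeated hits within the same experiment sit side by side.
def pvalues_by_accession(experiments)
  grouped = Hash.new { |hash, key| hash[key] = [] }
  experiments.each { |e| grouped[e["accession"]] << e["pvalue"] }
  grouped
end

# The five entries from the JSON fragment above.
experiments = [
  { "accession" => "E-GEOD-15960", "pvalue" => 0.000006433492, "expression" => "UP" },
  { "accession" => "E-AFMX-5",     "pvalue" => 0.017802352,    "expression" => "UP" },
  { "accession" => "E-TABM-145",   "pvalue" => 0.004811193,    "expression" => "UP" },
  { "accession" => "E-MTAB-62",    "pvalue" => 0.000009164466, "expression" => "UP" },
  { "accession" => "E-MTAB-62",    "pvalue" => 0.000005601625, "expression" => "UP" }
]

pvalues_by_accession(experiments).each do |accession, pvalues|
  note = pvalues.size > 1 ? " (more than one entry)" : ""
  puts "#{accession}: #{pvalues.join(', ')}#{note}"
end
```

Grouping shows which experiments report a gene more than once, but it cannot say which experimental factor each p-value belongs to, since that information is not in the JSON.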

Nevertheless, we’ve made a start and can create a simple web application. I won’t go into the details here: instead, a couple of screenshots showing how jQuery UI can create auto-suggest for gene names and a tabbed view for gene annotations.


Gene name auto-suggest


Gene annotation tabbed view

4. Retrieving experiment data
Things start to get messy when we retrieve experimental data. My MongoDB database contains entries for 14 681 genes, but these arise from only 11 experiments. In principle, the data for each gene from each experiment can be retrieved using the API by combining each experiment-gene ID pair like so:


Parse, save to database, create Sinatra view with a tab for each experimental factor and the corresponding gene/probe values. Too easy, right?
In practice, this proves almost impossible due to errors and other deficiencies in the API. In no particular order these include:

  • Timeout errors
    Timeouts are relatively common, but intermittent:

    /usr/lib/ruby/1.8/timeout.rb:64:in `rbuf_fill': execution expired (Timeout::Error)

    Yet mysteriously the very next query, spaced to occur 5 seconds later – no problem at all. On a possibly related note, access to the EBI FTP server is incredibly slow. I don’t know if that relates to my location.

  • JSON parsing errors
    Occasionally, errors like this one appear:

    /usr/lib/ruby/1.8/json/pure/parser.rb:105:in `parse': source ',{"error":"Exception' not in JSON! (JSON::ParserError)

    This implies that the server did not return a JSON string. Why? I don’t know.

  • Other miscellaneous, random and bizarre errors
    Once, to date, a long message implying that some crucial Java component in the Apache Tomcat stack was missing. Yet the very next query, 5 seconds later – no problem.

  • JSON data differing from that at the website
    Occasionally, an error is returned stating that there is no record for the given experiment-gene pair. This seems odd, given that the gene was associated with the experiment in the earlier gene query.

    Commonly, there are discrepancies between the experiment as viewed at the ArrayExpress website and the JSON content. On the right, results from the website for gene MIMAT0000097, experiment E-TABM-184.

    Below, the (complete) JSON returned for the same experiment-gene combination:

    {
    	"_id" : "E-TABM-184/MIMAT0000097",
    	"experimentInfo" : {
    		"accession" : "E-TABM-184",
    		"description" : "microRNA profiling of 191 human cancer samples identifies ultraconserved regions encoding ncRNAs are altered in human leukemias and carcinomas.",
    		"pubmedId" : 17785203
    	},
    	"experimentDesign" : {
    		"experimentalFactors" : [ ]
    	},
    	"experimentOrganisms" : [
    		"Mus musculus",
    		"Homo sapiens",
    		"Arabidopsis thaliana"
    	]
    }

    Experiment-gene view at ArrayExpress website

  • Insufficient granularity in available queries
    Some experiments are large, involving thousands of samples, and return several megabytes of data, even for one gene query. This can be a problem, not least because of the 4 MB document limit in MongoDB.
    Ideally, given a large, complex key-value (hash-like) structure, one would like to be able to specify which parts of the hash to return by key. This is currently not possible: the list of qualifying terms for an experiment query is limited to experimental factors, keywords or gene identifiers.
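
Some of these problems can be softened, though not solved, on the client side. The sketch below retries a request after a timeout or an unparseable response, then keeps only the wanted top-level keys before saving. The block is a hypothetical stand-in for the real HTTP call, and fetch_json and keep_keys are names made up for illustration. Note that trimming keys locally reduces the size of the stored document but not the size of the download.

```ruby
require "json"
require "timeout"

# Sketch only: retry a request that fails intermittently. The block is a
# hypothetical stand-in for the real HTTP call and must return the raw
# response body as a String. Timeouts and bodies that are not valid JSON
# are retried; the last failure is re-raised once attempts run out.
def fetch_json(attempts = 3, delay = 5)
  tries = 0
  begin
    tries += 1
    JSON.parse(yield)
  rescue Timeout::Error, JSON::ParserError => e
    raise if tries >= attempts
    warn "attempt #{tries} failed (#{e.class}), retrying in #{delay}s"
    sleep delay
    retry
  end
end

# Keep only the top-level keys of interest: a client-side substitute for
# the server-side filtering that the API does not offer.
def keep_keys(document, wanted)
  document.select { |key, _| wanted.include?(key) }
end

# Usage with canned responses: the first is broken, the second is fine.
responses = [
  ',{"error":"Exception',
  '{"_id":"E-TABM-184/MIMAT0000097","experimentOrganisms":["Homo sapiens"]}'
]
doc = fetch_json(3, 0) { responses.shift }
puts keep_keys(doc, ["_id"]).to_json
```

A fixed delay matches the 5-second spacing I was already using; whether a longer or increasing delay would help is something I have not tested.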

Fetching large amounts of experimental data from ArrayExpress using the API is, at present, impracticable. This appears to be due in part to performance and design issues with the API, but it also raises questions about whether RESTful APIs are appropriate for complex data where individual items may be tens or hundreds of megabytes. Perhaps it’s time for a return to simple download from the FTP site – where issues of speed and file documentation would themselves warrant another blog post…

On the bright side, the EBI has a good support team. If you happen to be a member of it and you’re reading this, don’t comment – I’ll be calling you! This post is, in fact, an attempt to document my problems before doing just that.

2 thoughts on “APIs have let me down part 1/2: ArrayExpress”

  1. GXA Team


    We’re looking into the issues with our APIs with great detail – and we agree with you – there are many concerns to be addressed here. Basically what we did is expose our internal APIs for external consumption, and this, of course, engenders other problems. We’re working on sorting this out – your feedback is invaluable. Apologies for any inconvenience caused!

    We’ll try to get better.


    The Atlas Team

    1. nsaunders Post author

      Thanks for your very gracious comment; I know that the preferred way to highlight these issues is to submit bug reports, rather than ranting in blog posts. Good APIs are very difficult to deploy. I look forward to future developments.
