Samples per series/dataset in the NCBI GEO database

Andrew asks:

I want to get an NCBI GEO report showing the number of samples per series or data set. Short of downloading all of GEO, anyone know how to do this? Is there a table of just metadata hidden somewhere?

At work, we joke that GEO is the only database where data goes in, but it won’t come out. However, there is an alternative: the GEOmetadb package, available from Bioconductor.

The R code first, then some explanation:

# install GEOmetadb
install.packages("BiocManager")  # biocLite() has been retired; current Bioconductor releases use BiocManager
BiocManager::install("GEOmetadb")
library(GEOmetadb)

# connect to database
getSQLiteFile()
con <- dbConnect(SQLite(), "GEOmetadb.sqlite")

# count samples per GDS
gds.count <- dbGetQuery(con, "select gds,sample_count from gds")
gds.count[1:5,]
# first 5 results
     gds sample_count
1   GDS5            5
2   GDS6           29
3  GDS10           28
4  GDS12            8
5  GDS15            6
# count samples per GSE
gse <- dbGetQuery(con, "select series_id from gsm")
gse.count <- as.data.frame(table(gse$series_id))
gse.count[1:10,]
# first 10 results
                Var1 Freq
1               GSE1   38
2              GSE10    4
3             GSE100    4
4           GSE10000   29
5           GSE10001   12
6           GSE10002    8
7           GSE10003    4
8  GSE10004,GSE10114    3
9           GSE10005   48
10          GSE10006   75

We install and load GEOmetadb (lines 2-4 of the listing above), then download and unpack the SQLite database with getSQLiteFile() (line 7). This generates the file ~/GEOmetadb.sqlite, which is currently a little over 1 GB.

Next, we connect to the database via RSQLite (line 8). The gds table contains the GDS dataset accession and sample count, so extracting that information takes a single query (line 11).
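If you want to try the connect-and-query pattern before committing to the full download, a minimal sketch along these lines should work. It assumes the DBI and RSQLite packages are installed, and it uses an in-memory database with a toy gds table (toy.gds is a made-up stand-in holding the five rows printed above, not the real GEOmetadb table).

```r
library(DBI)
library(RSQLite)

# In-memory database: nothing is downloaded or written to disk
con <- dbConnect(SQLite(), ":memory:")

# Toy stand-in for the gds table (values copied from the output above)
toy.gds <- data.frame(
  gds = c("GDS5", "GDS6", "GDS10", "GDS12", "GDS15"),
  sample_count = c(5L, 29L, 28L, 8L, 6L),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "gds", toy.gds)

# Same query shape as in the post, sorted so the largest data sets come first
top.gds <- dbGetQuery(
  con,
  "select gds, sample_count from gds order by sample_count desc limit 3"
)
top.gds

# Close the connection when finished
dbDisconnect(con)
```

The same "order by ... limit" clause can be pasted into the query against the real database to list the largest data sets.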

GSE series are a little different. The gsm table contains GSM sample accession and GSE series accession (in the series_id field). We can count up the samples per series using table(), on line 22. However, this generates some odd-looking results, such as:

                   Var1 Freq
15    GSE10011,GSE10026   45
14652  GSE9973,GSE10026    9
14654  GSE9975,GSE10026   36
14656  GSE9977,GSE10026   24

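Fear not. In this case, GSE10026 is a super-series composed of the series GSE10011 (45 samples), GSE9973 (9 samples), GSE9975 (36 samples) and GSE9977 (24 samples), for a total of 114 samples. Each of those samples belongs both to its own series and to the super-series, which is why its series_id field lists two accessions.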
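If you would rather have one row per individual accession, one option is to split series_id on the comma before counting. The sketch below uses only base R and a toy vector that mimics the four rows above plus one ordinary series; the names series_id and per.series are illustrative, and on the real data you would pass gse$series_id instead.

```r
# Toy stand-in for gse$series_id (counts taken from the rows shown above)
series_id <- c(
  rep("GSE10011,GSE10026", 45),
  rep("GSE9973,GSE10026", 9),
  rep("GSE9975,GSE10026", 36),
  rep("GSE9977,GSE10026", 24),
  rep("GSE10", 4)
)

# Split each comma-separated value, then count every accession on its own
per.series <- as.data.frame(
  table(unlist(strsplit(series_id, ",", fixed = TRUE))),
  stringsAsFactors = FALSE
)
names(per.series) <- c("gse", "samples")

# The super-series now gets the summed count: 45 + 9 + 36 + 24 = 114
per.series[per.series$gse == "GSE10026", "samples"]
```

Note that counted this way, a sample that belongs to both a series and its super-series contributes to both rows, so the column total is larger than the number of distinct samples.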

3 thoughts on “Samples per series/dataset in the NCBI GEO database”

  1. Alex Ishkin

    Thank you Neil! That’s an interesting approach, and it seems to be one more piece in a putative pipeline for automated processing of GEO data (I’m pretty fed up with searching the web for relevant data sets, picking those with raw data, downloading CELs… and finally describing what it is all about). The most prominent shortcoming, though, is still there: the people who wrote GEOmetadb never tried to cope with the infamous mess in GEO sample annotations. What is the advantage of having sample metadata at hand in R if you sometimes have to dig into the original paper to understand what the experimental design was and what is compared to what? I usually need some differential expression analysis (at both the gene and gene-set levels), and working out the experimental design of some GSE series is a real pain.

    1. nsaunders Post author

      I can only say that I feel your pain. “Standards” at GEO appears to mean “we allow an arbitrary number of fields, filled (or not) with arbitrary values.” The GDS datasets are a little better – they often have a covariate named something like “disease.state” – but a lot of the interesting stuff is in series that haven’t made it to datasets.

      1. Alex Ishkin

        The GDS part also lacks a solid underlying ontology of experimental factors. Moreover, it seems that GEO as a whole is now growing faster than the GDS subset, so we’ll always have to deal with series. BTW, this can be shown with GEOmetadb, with something like a plot of GSE and GDS counts over time.
