Clustering GEO samples by title (briefly) revisited

So, we had a brief discussion regarding my previous post and clearly the statement:

The longest key for which values exist classifies your titles

does not hold true for all cases. Not that I ever said that it did! I remind you that this blog is a place for the half-formed ideas that spill out of my head, not an instruction manual.

Let’s look, for example, at GSE19318. This GEO series comprises 2 platforms: one for dog (10 samples) and one for humans (1 sample), with these sample titles:

['Dog-tumor-81705', 'Dog-tumor-78709', 'Dog-tumor-88012', 'Dog-tumor-8888302', 'Dog-tumor-209439', 'Dog-tumor-212227', 'Dog-tumor-48', 'Dog-tumor-125', 'Dog-tumor-394', 'Dog-tumor-896', 'Human-tumor']

Run that through the Ruby code in my last post and we get:

{"Dog-tumor-48"=>["Dog-tumor-48"], "Dog-tumor-81"=>["Dog-tumor-81705"], "Dog-tumor-39"=>["Dog-tumor-394"], "Dog-tumor-20"=>["Dog-tumor-209439"], "Dog-tumor-21"=>["Dog-tumor-212227"], "Dog-tumor-88"=>["Dog-tumor-88012", "Dog-tumor-8888302"], "Dog-tumor-89"=>["Dog-tumor-896"], "Dog-tumor-12"=>["Dog-tumor-125"], "Dog-tumor-78"=>["Dog-tumor-78709"]}

Whoa. That went badly wrong! However, it’s easy to see why. With only one human sample, value.length is not more than one, so that sample disappears altogether. For the dog samples, the longest key is not the key that contains all samples, due to the title naming scheme.

We might try instead to maximize the value length – that is, the array value which contains the most samples:

  # longest value.length
  count = 0
  hash.each_pair do |key,value|
    count = value.length if count < value.length
  # delete unwanted keys
  hash.delete_if { |key,value| value.length != count }
  return hash

Which will give us a choice of “dog sample keys”, but still drops the human sample:

["Dog-tumo", "Do", "Dog-tumor-", "D", "Dog-tum", "Dog-t", "Dog", "Dog-tu", "Dog-tumor", "Dog-"]

Other things we might try:

  • Partition sample titles by platform before trying to partition samples by title
  • Don’t delete any hash keys based on key/value length; just present all options to the user
  • Decide that sample partitioning by title is a poor idea and try a different approach

As ever, life would be much easier if GEO samples were titled or described in some logical, parse-able fashion.

“Thinking algorithmically”: clustering GEO samples by title

Today’s challenge. Take a look at this array, which contains the “title” field for the 6 samples from GSE1323, a series in the GEO microarray database:


Humans are very good at classification. Almost instantly, you’ll see that there are 2 classes, “SW-480” and “SW-620”, each with 3 samples. How can we write a program to do the same job?
Read the rest…

Samples per series/dataset in the NCBI GEO database

Andrew asks:

I want to get an NCBI GEO report showing the number of samples per series or data set. Short of downloading all of GEO, anyone know how to do this? Is there a table of just metadata hidden somewhere?

At work, we joke that GEO is the only database where data goes in, but it won’t come out. However, there is an alternative: the GEOmetadb package, available from Bioconductor.

The R code first, then some explanation:

# install GEOmetadb

# connect to database
con <- dbConnect(SQLite(), "GEOmetadb.sqlite")

# count samples per GDS
gds.count <- dbGetQuery(con, "select gds,sample_count from gds")
# first 5 results
     gds sample_count
1   GDS5            5
2   GDS6           29
3  GDS10           28
4  GDS12            8
5  GDS15            6
# count samples per GSE
gse <- dbGetQuery(con, "select series_id from gsm")
gse.count <-$series_id))
# first 10 results
                Var1 Freq
1               GSE1   38
2              GSE10    4
3             GSE100    4
4           GSE10000   29
5           GSE10001   12
6           GSE10002    8
7           GSE10003    4
8  GSE10004,GSE10114    3
9           GSE10005   48
10          GSE10006   75

We install GEOmetadb (lines 2-4), then download and unpack the SQLite database (line 7). This generates the file ~/GEOmetadb.sqlite, which is currently a little over 1 GB.

Next, we connect to the database via RSQLite (lines 7-8). The gds table contains GDS dataset accession and sample count, so extracting that information is very easy (line 11).

GSE series are a little different. The gsm table contains GSM sample accession and GSE series accession (in the series_id field). We can count up the samples per series using table(), on line 22. However, this generates some odd-looking results, such as:

          Var1          Freq
15    GSE10011,GSE10026 45
14652 GSE9973,GSE10026   9
14654 GSE9975,GSE10026  36
14656 GSE9977,GSE10026  24

Fear not. In this case, GSE10026 is a super-series comprised from the series GSE10011 (45 samples), GSE9973 (9 samples), GSE9975 (36 samples) and GSE9977 (24 samples), total = 114 samples.