## “Thinking algorithmically”: clustering GEO samples by title

Today’s challenge. Take a look at this array, which contains the “title” field for the 6 samples from GSE1323, a series in the GEO microarray database:

```['SW-480-1','SW-480-2','SW-480-3','SW-620-1','SW-620-2','SW-620-3']
```

Humans are very good at classification. Almost instantly, you’ll see that there are 2 classes, “SW-480″ and “SW-620″, each with 3 samples. How can we write a program to do the same job?

I’m sure that for those with formal training in computer science and algorithms, this is pretty trivial. The rest of us have to figure it out from first principles. Here’s what I did, in words:

```# Imagine that you have 2 titles:  "abc1" and "abc2"
# Take the first character - call it the key, call the remaining characters values
# this gives "a" => ["bc1", "bc2"]
# Take the first 2 characters and do the same thing
# "ab" => ["c1", "c2"]
# Repeat until you run out of characters
# "abc1" => [], "abc2" => []
# The longest key for which values exist classifies your titles
# "abc" => ["1", "2"]
```

Here’s a Ruby implementation, using the sample titles from GSE1323.

```def cluster_titles(array)
hash = {}
array.each do |title|
0.upto(title.length - 1) do |i|
(hash[title[0..i]] ||= []) << title
end
end
# longest key where value.length > 1
count = 0
hash.each_pair do |key,value|
count = key.length if count < key.length and value.length > 1
end
# delete unwanted keys
hash.delete_if { |key,value| key.length != count }
return hash
end

titles = ['SW-480-1','SW-480-2','SW-480-3','SW-620-1','SW-620-2','SW-620-3']
puts cluster_titles(titles).inspect
```

Result:

```{
"SW-480-"=>["SW-480-1", "SW-480-2", "SW-480-3"],
"SW-620-"=>["SW-620-1", "SW-620-2", "SW-620-3"]
}
```

### 4 Comments to ““Thinking algorithmically”: clustering GEO samples by title”

1. Of course, this approach does not work in all cases. There are GSE series that contain more than one platform and sample titles that do not partition correctly (such as GSE19318).

2. I suppose that parsing sample characteristics would beneficial in more cases (not always, though).

Your approach is a bit risky – imagine samples called “SW-480-10″
“SW-480-11″
“SW-480-12″

“SW-480-20″
“SW-480-21″
“SW-480-22″

Your pseudocode supposes that these samples are going to be divided into 2 groups – and it seems to be wrong.
In the common case, you’ll either have to write a separate parser for each convention of sample title composition or set a design as an independent data structure manually (kind of phenoData matrix in R).
So, unfortunately, for GEO it’s nothing more as an example of algorithmic thinking.

• Yes, as I pointed out in the first comment, this approach does not work very well for other series – actually, many other series.

Another approach is not to delete any keys and simply present the user with each set of key/value pairs, then let them choose. To be honest though, partitioning by title just doesn’t work well.

3. We have a system that auto-clusters GSMs for a GSE into biologically coherent groups (e.g. “treated” vs. “untreated”) and auto-tags them with ontology terms (I will publish this whenever I can find the time). I can tell you that you are going to need much more than the titles to get anything usable. Sometimes I swear there are cryptographic messages hidden in those GSM annotations! The samples will cluster in complex hierarchies, so we do some traversal of the clustering structure to see where the groups are.