Today’s challenge. Take a look at this array, which contains the “title” field for the 6 samples from GSE1323, a series in the GEO microarray database:
['SW-480-1','SW-480-2','SW-480-3','SW-620-1','SW-620-2','SW-620-3']
Humans are very good at classification. Almost instantly, you’ll see that there are 2 classes, “SW-480” and “SW-620”, each with 3 samples. How can we write a program to do the same job?
I’m sure that for those with formal training in computer science and algorithms, this is pretty trivial. The rest of us have to figure it out from first principles. Here’s what I did, in words:
# Imagine that you have 2 titles: "abc1" and "abc2"
# Take the first character - call it the key, call the remaining characters values
#   this gives "a" => ["bc1", "bc2"]
# Take the first 2 characters and do the same thing
#   "ab" => ["c1", "c2"]
# Repeat until you run out of characters
#   "abc1" => [], "abc2" => []
# The longest key for which values exist classifies your titles
#   "abc" => ["1", "2"]
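To make the steps above concrete, here is a minimal sketch that builds only the key/values table for "abc1" and "abc2". The helper name `prefix_table` is hypothetical, and reading "values exist" as "the list of remaining characters is not empty" is my assumption.

```ruby
# Hypothetical helper (name is an assumption): maps every leading
# substring of each title to the characters that follow it.
def prefix_table(titles)
  table = {}
  titles.each do |title|
    1.upto(title.length) do |len|
      key  = title[0, len]     # first len characters
      rest = title[len..-1]    # remaining characters ("" at the end)
      values = (table[key] ||= [])
      values << rest unless rest.empty?
    end
  end
  table
end

table = prefix_table(%w[abc1 abc2])
table.each { |key, rest| puts "#{key.inspect} => #{rest.inspect}" }

# Longest key that still has values (assumption: "values exist" = non-empty)
longest = table.select { |_key, rest| rest.any? }.keys.max_by(&:length)
puts "classifier: #{longest.inspect}"
```

This prints the five table rows in the order they are created ("a", "ab", "abc", "abc1", "abc2") and then picks "abc" as the classifying key.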
Here’s a Ruby implementation, using the sample titles from GSE1323.
def cluster_titles(array)
  hash = {}
  array.each do |title|
    # file the title under each of its leading substrings
    0.upto(title.length - 1) do |i|
      (hash[title[0..i]] ||= []) << title
    end
  end
  # longest key where value.length > 1
  count = 0
  hash.each_pair do |key, value|
    count = key.length if count < key.length && value.length > 1
  end
  # delete unwanted keys
  hash.delete_if { |key, value| key.length != count }
  hash
end

titles = ['SW-480-1','SW-480-2','SW-480-3','SW-620-1','SW-620-2','SW-620-3']
puts cluster_titles(titles).inspect
Result:
{
"SW-480-"=>["SW-480-1", "SW-480-2", "SW-480-3"],
"SW-620-"=>["SW-620-1", "SW-620-2", "SW-620-3"]
}
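As a usage sketch, the result can be turned back into the "2 classes, each with 3 samples" answer from the start of the post. The method is repeated here only so the snippet runs on its own. The summary step, including the regular expression that strips a trailing separator, is my own assumption rather than part of the original approach. The last call is a quick check of how the method behaves when the classes have prefixes of different lengths.

```ruby
# Same method as above, repeated so this sketch is self-contained.
def cluster_titles(array)
  hash = {}
  array.each do |title|
    0.upto(title.length - 1) do |i|
      (hash[title[0..i]] ||= []) << title
    end
  end
  count = 0
  hash.each_pair do |key, value|
    count = key.length if count < key.length && value.length > 1
  end
  hash.delete_if { |key, _value| key.length != count }
  hash
end

titles = %w[SW-480-1 SW-480-2 SW-480-3 SW-620-1 SW-620-2 SW-620-3]
clusters = cluster_titles(titles)

# Hypothetical summary step: drop trailing "-", "_" or whitespace
# from each key to recover a class name, and count its members.
summary = clusters.map do |prefix, members|
  [prefix.sub(/[-_\s]+\z/, ''), members.length]
end
puts "#{summary.length} classes"
summary.each { |name, size| puts "#{name}: #{size} samples" }

# Classes whose prefixes differ in length: only keys of the single
# longest shared-prefix length are kept.
mixed = cluster_titles(%w[A-1 A-2 BB-1 BB-2])
puts mixed.inspect
```

For the GSE1323 titles this reports 2 classes, SW-480 and SW-620, with 3 samples each. In the mixed case the "BB-" group survives, but "A-1" and "A-2" end up as two single-member entries, because only keys of length 3 are kept. Titles like that would need a different rule for choosing the key length.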


