So, we had a brief discussion regarding my previous post and clearly the statement:
“The longest key for which values exist classifies your titles”
does not hold true for all cases. Not that I ever said that it did! I remind you that this blog is a place for the half-formed ideas that spill out of my head, not an instruction manual.
Let’s look, for example, at GSE19318. This GEO series comprises two platforms: one for dog (10 samples) and one for human (1 sample), with these sample titles:
['Dog-tumor-81705', 'Dog-tumor-78709', 'Dog-tumor-88012', 'Dog-tumor-8888302', 'Dog-tumor-209439', 'Dog-tumor-212227', 'Dog-tumor-48', 'Dog-tumor-125', 'Dog-tumor-394', 'Dog-tumor-896', 'Human-tumor']
Run that through the Ruby code in my last post and we get:
{"Dog-tumor-48"=>["Dog-tumor-48"], "Dog-tumor-81"=>["Dog-tumor-81705"], "Dog-tumor-39"=>["Dog-tumor-394"], "Dog-tumor-20"=>["Dog-tumor-209439"], "Dog-tumor-21"=>["Dog-tumor-212227"], "Dog-tumor-88"=>["Dog-tumor-88012", "Dog-tumor-8888302"], "Dog-tumor-89"=>["Dog-tumor-896"], "Dog-tumor-12"=>["Dog-tumor-125"], "Dog-tumor-78"=>["Dog-tumor-78709"]}
Whoa. That went badly wrong! However, it’s easy to see why. With only one human sample, value.length is not more than one, so that sample disappears altogether. For the dog samples, the longest key is not the key that contains all samples: every title ends in a different numeric ID, so the longest keys are prefixes that match only one or two titles each.
We might try instead to maximize the value length – that is, the array value which contains the most samples:
# find the size of the largest value array
count = 0
hash.each_pair do |key, value|
  count = value.length if count < value.length
end
# delete every key whose value array is smaller than that
hash.delete_if { |key, value| value.length != count }
return hash
That gives us a choice of “dog sample keys”, but still drops the human sample:
["Dog-tumo", "Do", "Dog-tumor-", "D", "Dog-tum", "Dog-t", "Dog", "Dog-tu", "Dog-tumor", "Dog-"]
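To reproduce that result in isolation, here is a minimal, self-contained sketch. The `prefix_hash` helper is an assumption about how the hash gets built (every leading substring of a title becomes a key); it is not the code from the previous post, and both method names are made up for illustration.

```ruby
# Hypothetical helper: map every leading substring of each title
# to the list of titles that start with it.
def prefix_hash(titles)
  hash = Hash.new { |h, k| h[k] = [] }
  titles.each do |title|
    (1..title.length).each { |n| hash[title[0, n]] << title }
  end
  hash
end

# Keep only the keys whose value array is the largest one.
def largest_groups(hash)
  count = hash.values.map(&:length).max
  hash.select { |_key, value| value.length == count }
end

titles = ['Dog-tumor-81705', 'Dog-tumor-78709', 'Dog-tumor-88012',
          'Dog-tumor-8888302', 'Dog-tumor-209439', 'Dog-tumor-212227',
          'Dog-tumor-48', 'Dog-tumor-125', 'Dog-tumor-394',
          'Dog-tumor-896', 'Human-tumor']

groups = largest_groups(prefix_hash(titles))
# Sort by key length so the candidates read from shortest to longest.
puts groups.keys.sort_by(&:length).inspect
```

Under that assumption, the surviving keys are the ten prefixes from "D" up to "Dog-tumor-", each of which matches all ten dog titles. No prefix of "Human-tumor" matches more than one title, so none of them survives.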
Other things we might try:
- Partition sample titles by platform before trying to partition samples by title
- Don’t delete any hash keys based on key/value length; just present all options to the user
- Decide that sample partitioning by title is a poor idea and try a different approach
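The first of those options might look something like the sketch below. It groups titles by platform first, then takes the longest prefix shared by every title within each platform, which is a plain longest-common-prefix calculation rather than the hash approach above. The platform labels are placeholders, not the real GEO platform accessions for GSE19318, and `common_prefix` is a made-up helper name.

```ruby
# Hypothetical input: title => platform label (placeholder labels).
samples = {
  'Dog-tumor-81705'   => 'platform_dog',
  'Dog-tumor-78709'   => 'platform_dog',
  'Dog-tumor-88012'   => 'platform_dog',
  'Dog-tumor-8888302' => 'platform_dog',
  'Dog-tumor-209439'  => 'platform_dog',
  'Dog-tumor-212227'  => 'platform_dog',
  'Dog-tumor-48'      => 'platform_dog',
  'Dog-tumor-125'     => 'platform_dog',
  'Dog-tumor-394'     => 'platform_dog',
  'Dog-tumor-896'     => 'platform_dog',
  'Human-tumor'       => 'platform_human'
}

# Longest leading substring shared by every title in the list.
# The prefix common to the lexicographically smallest and largest
# titles is common to all titles in between.
def common_prefix(titles)
  first, last = titles.minmax
  n = 0
  n += 1 while n < first.length && first[n] == last[n]
  first[0, n]
end

# Partition by platform first, then classify titles within each platform.
by_platform = samples.keys.group_by { |title| samples[title] }
classes = by_platform.transform_values { |titles| common_prefix(titles) }
puts classes.inspect
```

With these inputs the dog platform is classified as "Dog-tumor-" and the lone human sample keeps its full title, "Human-tumor", so nothing is dropped. Of course, this only helps when platform and sample type happen to line up, as they do here.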
As ever, life would be much easier if GEO samples were titled or described in some logical, parse-able fashion.