Clustering GEO samples by title (briefly) revisited

So, we had a brief discussion regarding my previous post and clearly the statement:

The longest key for which values exist classifies your titles

does not hold true for all cases. Not that I ever said that it did! I remind you that this blog is a place for the half-formed ideas that spill out of my head, not an instruction manual.

Let’s look, for example, at GSE19318. This GEO series comprises 2 platforms: one for dog (10 samples) and one for humans (1 sample), with these sample titles:

['Dog-tumor-81705', 'Dog-tumor-78709', 'Dog-tumor-88012', 'Dog-tumor-8888302', 'Dog-tumor-209439', 'Dog-tumor-212227', 'Dog-tumor-48', 'Dog-tumor-125', 'Dog-tumor-394', 'Dog-tumor-896', 'Human-tumor']

Run that through the Ruby code in my last post and we get:

{"Dog-tumor-48"=>["Dog-tumor-48"], "Dog-tumor-81"=>["Dog-tumor-81705"], "Dog-tumor-39"=>["Dog-tumor-394"], "Dog-tumor-20"=>["Dog-tumor-209439"], "Dog-tumor-21"=>["Dog-tumor-212227"], "Dog-tumor-88"=>["Dog-tumor-88012", "Dog-tumor-8888302"], "Dog-tumor-89"=>["Dog-tumor-896"], "Dog-tumor-12"=>["Dog-tumor-125"], "Dog-tumor-78"=>["Dog-tumor-78709"]}

Whoa. That went badly wrong! However, it’s easy to see why. With only one human sample, value.length is not more than one, so that sample disappears altogether. For the dog samples, the longest key is not the key that contains all samples, due to the title naming scheme.

We might try instead to maximize the value length – that is, the array value which contains the most samples:

  # longest value.length
  count = 0
  hash.each_pair do |key,value|
    count = value.length if count < value.length
  end
  # delete unwanted keys
  hash.delete_if { |key,value| value.length != count }
  return hash

Which will give us a choice of “dog sample keys”, but still drops the human sample:

["Dog-tumo", "Do", "Dog-tumor-", "D", "Dog-tum", "Dog-t", "Dog", "Dog-tu", "Dog-tumor", "Dog-"]

Other things we might try:

  • Partition sample titles by platform before trying to partition samples by title
  • Don’t delete any hash keys based on key/value length; just present all options to the user
  • Decide that sample partitioning by title is a poor idea and try a different approach

As ever, life would be much easier if GEO samples were titled or described in some logical, parse-able fashion.

3 thoughts on “Clustering GEO samples by title (briefly) revisited

  1. Alex Ishkin

    I’ve got one more idea for this problem. Sample names often follow the pattern -. For example, your dog tumor data set is clearly of this type (sample name consists of group identifier – “Dog tumor” and some number, separated by dash). We can suppose that dash and space are used commonly as such separators in sample names (I can’t imagine a guy who would use tab character in sample name). So, we check whether sample names contain dashes or spaces and then try to obtain proper partitioning by cutting out substrings from the end of sample names.
    Let’s look at GSE13918. If we split the sample name using dash as separator and then throw away the last element of resulting vector, we’ll directly get proper sample groups, termed “Dog-tumor” and “Human”.
    I’d try to write down some code for this, but text processing in R is a real hassle, so the code would look rather ugly.

    1. Alex Ishkin

      Oh sorry, I didn’t know that any text in angle brackets was considered as HTML tag. In my previous comment I meant “Sample names often follow the pattern [group_identifier]-[sample number]”

Comments are closed.