Problematic cell lines: now in a real database

Back in July, I was complaining about the latest abuse of the word “database” by biologists: the “PDF as database.”

This led to some very productive discussion using PubMed Commons and I’m happy to report that misidentified and contaminated cell lines are now included in the NCBI BioSample database.

As the news release notes, rather alarmingly:

This problem is so common it is thought that thousands of misleading and potentially erroneous papers have been published using cell lines that are incorrectly identified

So it would be useful if there were a direct link between the BioSample record for a cell line and PubMed records in which it was used…

Unfortunately, this does not appear to be the case – or at least, I have not discovered a filter or field for “cell line” in a PubMed search. The alternative is to search by cell line name. Here, I use the BioRuby implementation of Entrez Utilities to count the PubMed records containing each name:

#!/usr/bin/ruby

require 'bio'

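# contact email sent along with Entrez Utilities requests
Bio::NCBI.default_email = "me@me.com"
ncbi   = Bio::NCBI::REST.new

# IDs of BioSample records flagged as misidentified cell lines
search = ncbi.esearch("cell line status misidentified[Attribute]", {"db" => "biosample", "retmax" => 500})

search.each do |id|
	# fetch the full text report for this BioSample record
	record = ncbi.efetch(id, {"report" => "full", "db" => "biosample", "mode" => "text"})
	# find the line holding the /cell line="..." attribute
	line = record.split("\n").find {|e| /\/cell line="(.*?)"/ =~ e }
	if line =~ /cell line="(.*?)"/
		# count PubMed records matching the bare cell line name
		pubmed = ncbi.esearch_count($1, {"db" => "pubmed"})
		puts "#{$1}\t#{pubmed}"
	end
end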

which is all well and good until:

YMB-1-E	2
YMB-1	8
PSV811	5
VM-CUB-III	1
MA-111	2
MA-104	227
ECTC	35
UM-UC-3-GFP	1
PCI-22B	2
PCI-22A	0
ME-WEI	22852

the issue being that “Wei” occurs commonly in author names (and some journal names).

My PubMed skills are normally pretty good but I cannot figure out why “ME-WEI” is matching “Wei”, or how to exclude author/journal names, or for that matter why “ME-WEI[All]” returns no results when “ME-WEI” returns 22 852. Must be Monday.

5 thoughts on “Problematic cell lines: now in a real database”

  1. The first time you use =~ it is in pattern =~ name and the second time it is in name =~ pattern. The latter is the proper order. That might be part of it.

    • Thanks, but that is not the problem. The Ruby code is for parsing the cell line name from the BioSample output and it works fine. My problem was defining the PubMed search terms.

    • Thanks! And thanks also to those who suggested fields on Twitter.

      So I guess that ME-WEI does not appear in abstracts. Not so surprising, since it would be unusual to mention a cell line by name in the abstract unless that was the main focus of the article. And presumably if there’s no “field match”, it just defaults to a “fuzzy match.”

  2. I should add that many cell line names are simple combinations of letters such as “RB”, so it is simply not possible to retrieve specific articles using text search. Hence the need for database links between BioSample and PubMed.
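
The field-restricted search mentioned in the comments above could be sketched roughly as follows. This is a minimal illustration, not the code actually used: the helper name pubmed_term is hypothetical, and it assumes that quoting the name and appending a PubMed field tag such as [Title/Abstract] is enough to keep the query from being matched against author or journal names.

```ruby
# Hypothetical helper (not part of BioRuby): wrap a cell line name in double
# quotes and append a PubMed search field tag, e.g. "ME-WEI"[Title/Abstract].
# Assumption: the quoted, tagged term is searched only in the named field.
def pubmed_term(name, field = "Title/Abstract")
  cleaned = name.delete('"')   # drop embedded quotes so the term stays well-formed
  %("#{cleaned}"[#{field}])
end

puts pubmed_term("ME-WEI")                # "ME-WEI"[Title/Abstract]
puts pubmed_term("MA-104", "Text Word")   # "MA-104"[Text Word]

# Usage with the script above (requires network access, so not run here):
#   pubmed = ncbi.esearch_count(pubmed_term($1), {"db" => "pubmed"})
```

A count of zero from such a query would be consistent with the name simply not appearing in titles or abstracts. As noted above, names like “RB” would remain ambiguous even with a field tag.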
