Back in July, I was complaining about the latest abuse of the word “database” by biologists: the “PDF as database.”
This led to some very productive discussion using PubMed Commons and I’m happy to report that misidentified and contaminated cell lines are now included in the NCBI BioSample database.
As the news release notes, rather alarmingly:
This problem is so common it is thought that thousands of misleading and potentially erroneous papers have been published using cell lines that are incorrectly identified
So it would be useful if there were a direct link between the BioSample record for a cell line and PubMed records in which it was used…
Unfortunately, this does not appear to be the case – or at least, I have not discovered a filter or field for “cell line” in a PubMed search. Instead, we can search by cell line name and count the PubMed records that contain it, here using the BioRuby implementation of Entrez Utilities:
#!/usr/bin/ruby
require 'bio'

Bio::NCBI.default_email = "me@me.com"
ncbi = Bio::NCBI::REST.new

search = ncbi.esearch("cell line status misidentified[Attribute]", {"db" => "biosample", "retmax" => 500})
search.each do |id|
  record = ncbi.efetch(id, {"report" => "full", "db" => "biosample", "mode" => "text"})
  line = record.split("\n").find {|e| /\/cell line="(.*?)"/ =~ e }
  if line =~ /cell line="(.*?)"/
    pubmed = ncbi.esearch_count($1, {"db" => "pubmed"})
    puts "#{$1}\t#{pubmed}"
  end
end
which is all well and good until:
YMB-1-E	2
YMB-1	8
PSV811	5
VM-CUB-III	1
MA-111	2
MA-104	227
ECTC	35
UM-UC-3-GFP	1
PCI-22B	2
PCI-22A	0
ME-WEI	22852
the issue being that “Wei” occurs commonly in author names (and some journal names).
My PubMed skills are normally pretty good but I cannot figure out why “ME-WEI” is matching “Wei”, or how to exclude author/journal names, or for that matter why “ME-WEI[All]” returns no results when “ME-WEI” returns 22 852. Must be Monday.
The first time you use =~ it is in pattern =~ name and the second time it is in name =~ pattern. The latter is the proper order. That might be part of it.
Thanks, but that is not the problem. The Ruby code is for parsing the cell line name from the BioSample output and it works fine. My problem was defining the PubMed search terms.
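For anyone else wondering about the operand order: as far as I know, Ruby defines `=~` on both `Regexp` and `String`, so either order returns the match position and sets `$1`. A minimal sketch, where the sample line is only a guess at the BioSample text format, inferred from the regex in the script:

```ruby
# Hypothetical line in the style the script's regex expects (an assumption,
# not copied from a real BioSample record).
sample  = '    /cell line="ME-WEI"'
pattern = /cell line="(.*?)"/

# Regexp on the left, as in the `find` block of the script.
left_index   = (pattern =~ sample)
left_capture = $1

# String on the left, as in the `if` test of the script.
right_index   = (sample =~ pattern)
right_capture = $1

puts left_index == right_index      # same match position either way
puts left_capture == right_capture  # same captured cell line name
```

So the parsing step behaves the same whichever way round the operands are written.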
You need to add [TIAB] after your search term to force PubMed to only report the entries that contain the string in the TItle or ABstract.
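As a rough sketch of that suggestion, the search term can be built before it is passed to PubMed. The helper name `tiab_query` is made up for illustration, and wrapping the name in double quotes to request a phrase search is my own addition rather than part of the comment:

```ruby
# Hypothetical helper: restrict a cell line name to the Title/Abstract field.
# Embedded double quotes are removed so the phrase quoting stays balanced.
def tiab_query(cell_line)
  %("#{cell_line.delete('"')}"[TIAB])
end

puts tiab_query("ME-WEI")
```

In the script above, the result would presumably replace the bare `$1` in the count call, i.e. `ncbi.esearch_count(tiab_query($1), {"db" => "pubmed"})`. I have not checked how the counts change for every cell line.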
Thanks! And thanks also to those who suggested fields on Twitter.
So I guess that ME-WEI does not appear in abstracts. Not so surprising, since it would be unusual to mention a cell line by name in the abstract unless that was the main focus of the article. And presumably if there’s no “field match”, it just defaults to a “fuzzy match.”
I should add that many cell line names are simple combinations of letters such as “RB”, so it is simply not possible to retrieve specific articles using text search. Hence the need for database links between BioSample and PubMed.