While we’re on the topic of mistaking Archaea for Bacteria, here’s an issue with the NCBI FTP site that has long annoyed me and one workaround. Warning: I threw this together minutes ago and it’s not fully tested.
What the world needs is: lists of Entrez database fields
You know the problem. You want to qualify your NCBI/Entrez database search term using a field. For example: “autism[TIAB]”, to search PubMed for the word autism in either Title or Abstract. Problem – you can’t find a list of fields specific to that database.
Now you can. Follow the links in this public Dropbox file to see, for each Entrez database, a CSV file containing the name, full name and description of its fields.
Code to generate the files is listed below. This may or may not be the first in an occasional, irregular “what the world needs” series.
#!/usr/bin/ruby
require 'rubygems'
require 'bio'
require 'hpricot'
require 'open-uri'

Bio::NCBI.default_email = "me@me.com"
ncbi = Bio::NCBI::REST.new

# einfo with no arguments lists the Entrez database names
ncbi.einfo.each do |db|
  puts "Processing #{db}..."
  File.open("#{db}.txt", "w") do |f|
    # fetch the EInfo XML for this database
    doc = Hpricot(open("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=#{db}"))
    (doc/'//fieldlist/field').each do |field|
      name = (field/'/name').inner_html
      fullname = (field/'/fullname').inner_html
      description = (field/'description').inner_html
      # values are written unquoted, so a comma inside a value adds a column
      f.write("#{name},#{fullname},#{description}\n")
    end
  end
end
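The script above joins values with bare commas, so any description that contains a comma would shift the columns. Here is a minimal sketch of one way around that, using Ruby's standard csv and rexml libraries rather than Hpricot. The sample XML is hand-written, not real EInfo output, and assumes a FieldList/Field layout with Name, FullName and Description children; the helper name fields_to_csv is hypothetical.

```ruby
require 'csv'
require 'rexml/document'

# Hand-written sample (not real NCBI output). Assumption: EInfo returns
# FieldList/Field elements with Name, FullName and Description children.
SAMPLE_XML = <<~XML
  <eInfoResult>
    <DbInfo>
      <FieldList>
        <Field>
          <Name>TIAB</Name>
          <FullName>Title/Abstract</FullName>
          <Description>Words in title or abstract</Description>
        </Field>
        <Field>
          <Name>DEMO</Name>
          <FullName>Demo field</FullName>
          <Description>Made-up entry, with a comma in it</Description>
        </Field>
      </FieldList>
    </DbInfo>
  </eInfoResult>
XML

# Hypothetical helper: turn EInfo-style XML into a CSV string.
# CSV.generate quotes any value that contains a comma.
def fields_to_csv(xml)
  doc = REXML::Document.new(xml)
  CSV.generate do |csv|
    csv << %w[name fullname description]
    doc.elements.each('//FieldList/Field') do |field|
      csv << %w[Name FullName Description].map { |tag| field.elements[tag].text }
    end
  end
end

puts fields_to_csv(SAMPLE_XML)
```

To use it against the live service, the string returned by the einfo.fcgi request would take the place of SAMPLE_XML; that part is untested here.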
Samples per series/dataset in the NCBI GEO database
I want to get an NCBI GEO report showing the number of samples per series or data set. Short of downloading all of GEO, anyone know how to do this? Is there a table of just metadata hidden somewhere?
At work, we joke that GEO is the only database where data goes in, but it won’t come out. However, there is an alternative: the GEOmetadb package, available from Bioconductor.
The R code first, then some explanation:
# install GEOmetadb
source("http://bioconductor.org/biocLite.R")
biocLite("GEOmetadb")
library(GEOmetadb)

# connect to database
getSQLiteFile()
con <- dbConnect(SQLite(), "GEOmetadb.sqlite")

# count samples per GDS
gds.count <- dbGetQuery(con, "select gds,sample_count from gds")
gds.count[1:5,]
# first 5 results
    gds sample_count
1  GDS5            5
2  GDS6           29
3 GDS10           28
4 GDS12            8
5 GDS15            6

# count samples per GSE
gse <- dbGetQuery(con, "select series_id from gsm")
gse.count <- as.data.frame(table(gse$series_id))
gse.count[1:10,]
# first 10 results
                Var1 Freq
 1              GSE1   38
 2             GSE10    4
 3            GSE100    4
 4          GSE10000   29
 5          GSE10001   12
 6          GSE10002    8
 7          GSE10003    4
 8 GSE10004,GSE10114    3
 9          GSE10005   48
10          GSE10006   75
We install and load GEOmetadb, then download and unpack the SQLite database with getSQLiteFile(). This generates the file ~/GEOmetadb.sqlite, which is currently a little over 1 GB.
Next, we connect to the database via RSQLite, using dbConnect(). The gds table contains GDS dataset accession and sample count, so extracting that information is very easy: a single select passed to dbGetQuery().
GSE series are a little different. The gsm table contains GSM sample accession and GSE series accession (in the series_id field). We can count up the samples per series by calling table() on the series_id column. However, this generates some odd-looking results, such as:
                   Var1 Freq
   15 GSE10011,GSE10026   45
14652  GSE9973,GSE10026    9
14654  GSE9975,GSE10026   36
14656  GSE9977,GSE10026   24
Fear not. In this case, GSE10026 is a super-series made up of the series GSE10011 (45 samples), GSE9973 (9 samples), GSE9975 (36 samples) and GSE9977 (24 samples), total = 114 samples.
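The arithmetic for the super-series can be checked with a short sketch. It is written in Ruby, to match the earlier script, rather than R. It assumes the comma-separated series_id values have already been counted as in the table() output above, and credits each count to every accession named in the key. The helper name samples_per_series is hypothetical.

```ruby
# Counts copied from the table() output shown above. Each key is a
# series_id value, which may name more than one GSE accession.
freq = {
  'GSE10011,GSE10026' => 45,
  'GSE9973,GSE10026'  => 9,
  'GSE9975,GSE10026'  => 36,
  'GSE9977,GSE10026'  => 24
}

# Hypothetical helper: add each count to every accession in its key.
def samples_per_series(freq)
  totals = Hash.new(0)
  freq.each do |series_ids, count|
    series_ids.split(',').each { |gse| totals[gse.strip] += count }
  end
  totals
end

totals = samples_per_series(freq)
totals.each { |gse, n| puts "#{gse}\t#{n}" }
```

With these four rows, GSE10026 comes to 114 samples, matching the total above, while each sub-series keeps its own count.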
APIs: I wish the life sciences would learn from social networks
I was prompted by a thread on the apparent decline of FriendFeed to look for evidence of declining participation in my networks.
Read the rest…
Querying NCBI Entrez database fields using Ruby
Here’s a problem. You’d like to construct a complex query at NCBI Entrez using various fields. Example:
“9606”[Taxonomy ID]
to limit your search to Homo sapiens. Except – you don’t know which fields are available for the database that you want to query.
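Once you do know the field name, attaching it to a term and sending it to Entrez is mostly string handling. Below is a minimal sketch that builds the term shown above and an ESearch URL for it, without making any network request. The helper names fielded_term and esearch_url are hypothetical, and the choice of the gene database is an assumption for illustration only; whether a given field is valid depends on the database.

```ruby
require 'uri'

# Hypothetical helper: quote a value and attach an Entrez field tag.
def fielded_term(value, field)
  %("#{value}"[#{field}])
end

# Hypothetical helper: build an ESearch URL for a database and query term.
# encode_www_form escapes the quotes, brackets and spaces in the term.
def esearch_url(db, term)
  query = URI.encode_www_form(db: db, term: term)
  "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?#{query}"
end

term = fielded_term('9606', 'Taxonomy ID')
url = esearch_url('gene', term) # 'gene' is an assumed example database

puts term
puts url
```

Several fielded terms can be combined by joining them with AND or OR before passing the result to esearch_url.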
Read the rest…
15 year-old error results in improved performance?
Here’s an interesting letter in the current issue of Nature Biotechnology (subscription only):
In the course of analyzing the evolution of the Blocks database [2], we noticed errors in the software source code used to create the initial BLOSUM family of matrices [...] The result of these errors is that the BLOSUM matrices—BLOSUM62, BLOSUM50, etc.—are quite different from the matrices that should have been calculated using the algorithm described by Henikoff and Henikoff. Obviously, minor errors in research, and particularly in software source code, are quite common. This case is noteworthy for three reasons: first, the BLOSUM matrices are ubiquitous in computational biology; second, these errors have gone unnoticed for 15 years; and third, the ‘incorrect’ matrices perform better than the ‘intended’ matrices.
Are they right? Does it matter?


