Tag Archives: ncbi

Web scraping using Mechanize: PMID to PMCID/NIHMSID

Web services are great. Pass them a URL. Structured data comes back. Parse it, analyse it, visualise it. Done.

Web scraping – interacting programmatically with a web page – is not so great. It requires more code and when the web page changes, the code breaks. However, in the absence of a web service, scraping is better than nothing. It can even be rather satisfying. Early in my bioinformatics career the realisation that code, rather than humans, can automate the process of submitting forms and reading the results was quite a revelation.

In this post: how to interact with a web page at the NCBI using the Mechanize library.


What the world needs is: lists of Entrez database fields

You know the problem. You want to qualify your NCBI/Entrez database search term using a field. For example: “autism[TIAB]”, to search PubMed for the word autism in either Title or Abstract. Problem – you can’t find a list of fields specific to that database.

Now you can. Follow the links in this public Dropbox file to see, for each Entrez database, a CSV file containing the name, full name and description of its fields.

Code to generate the files is listed below. This may or may not be the first in an occasional, irregular “what the world needs” series.

require 'rubygems'
require 'bio'
require 'hpricot'
require 'open-uri'

Bio::NCBI.default_email = "me@me.com"
ncbi = Bio::NCBI::REST.new

# one output file per Entrez database
ncbi.einfo.each do |db|
  puts "Processing #{db}..."
  File.open("#{db}.txt", "w") do |f|
    # fetch the EInfo XML for this database
    doc = Hpricot(open("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=#{db}"))
    (doc/'//fieldlist/field').each do |field|
      name = (field/'/name').inner_html
      fullname = (field/'/fullname').inner_html
      description = (field/'description').inner_html
      # assumed completion: one comma-separated line per field
      f.puts "#{name},#{fullname},#{description}"
    end
  end
end
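The field extraction can also be sketched using only libraries that ship with Ruby. This is a minimal illustration, not the code used to build the Dropbox files: it assumes the EInfo response wraps each field in a Field element with Name, FullName and Description children, and it parses a made-up sample string rather than calling NCBI. The names SAMPLE_EINFO and einfo_fields_to_csv are hypothetical.

```ruby
require 'rexml/document'
require 'csv'

# Hypothetical, trimmed-down EInfo response. The descriptions are invented
# for illustration; a real response lists many more fields.
SAMPLE_EINFO = <<~XML
  <eInfoResult>
    <DbInfo>
      <DbName>pubmed</DbName>
      <FieldList>
        <Field>
          <Name>TIAB</Name>
          <FullName>Title/Abstract</FullName>
          <Description>Words from the title or abstract</Description>
        </Field>
        <Field>
          <Name>AUTH</Name>
          <FullName>Author</FullName>
          <Description>Author names, as indexed</Description>
        </Field>
      </FieldList>
    </DbInfo>
  </eInfoResult>
XML

# Turn the Field elements of an EInfo document into CSV text,
# one row per field: name, full name, description.
def einfo_fields_to_csv(xml)
  doc = REXML::Document.new(xml)
  CSV.generate do |csv|
    csv << %w[name fullname description]
    doc.elements.each('//FieldList/Field') do |field|
      # a missing child element becomes an empty string
      csv << %w[Name FullName Description].map { |tag| field.elements[tag]&.text.to_s }
    end
  end
end

puts einfo_fields_to_csv(SAMPLE_EINFO)
```

Going through the CSV library rather than joining with commas means a description that itself contains a comma is quoted instead of spilling into an extra column.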

Samples per series/dataset in the NCBI GEO database

Andrew asks:

I want to get an NCBI GEO report showing the number of samples per series or data set. Short of downloading all of GEO, anyone know how to do this? Is there a table of just metadata hidden somewhere?

At work, we joke that GEO is the only database where data goes in, but it won’t come out. However, there is an alternative: the GEOmetadb package, available from Bioconductor.

The R code first, then some explanation:

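# install GEOmetadb (assuming the BiocManager installer; run once)
# BiocManager::install("GEOmetadb")
library(GEOmetadb)

# download and unpack the SQLite database (run once)
getSQLiteFile()

# connect to database
con <- dbConnect(SQLite(), "GEOmetadb.sqlite")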

# count samples per GDS
gds.count <- dbGetQuery(con, "select gds,sample_count from gds")
# first 5 results
     gds sample_count
1   GDS5            5
2   GDS6           29
3  GDS10           28
4  GDS12            8
5  GDS15            6
# count samples per GSE
gse <- dbGetQuery(con, "select series_id from gsm")
gse.count <- as.data.frame(table(gse$series_id))
# first 10 results
                Var1 Freq
1               GSE1   38
2              GSE10    4
3             GSE100    4
4           GSE10000   29
5           GSE10001   12
6           GSE10002    8
7           GSE10003    4
8  GSE10004,GSE10114    3
9           GSE10005   48
10          GSE10006   75

We install GEOmetadb, then download and unpack the SQLite database. This generates the file ~/GEOmetadb.sqlite, which is currently a little over 1 GB.

Next, we connect to the database via RSQLite. The gds table contains the GDS dataset accession and sample count, so extracting that information takes a single query.

GSE series are a little different. The gsm table contains the GSM sample accession and the GSE series accession (in the series_id field). We can count up the samples per series using table(). However, this generates some odd-looking results, such as:

                   Var1 Freq
15    GSE10011,GSE10026   45
14652  GSE9973,GSE10026    9
14654  GSE9975,GSE10026   36
14656  GSE9977,GSE10026   24

Fear not. In this case, GSE10026 is a super-series composed of the series GSE10011 (45 samples), GSE9973 (9 samples), GSE9975 (36 samples) and GSE9977 (24 samples), for a total of 114 samples.
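If you would rather have one count per accession, the combined rows can be split apart. Here is a rough sketch in Ruby, assuming that every sample in a combined row belongs to each accession listed in it; the counts are copied from the output above, and the names SERIES_FREQ and samples_per_series are made up for the example.

```ruby
# Sample counts keyed by the series_id string, copied from the rows above.
SERIES_FREQ = {
  'GSE10011,GSE10026' => 45,
  'GSE9973,GSE10026'  => 9,
  'GSE9975,GSE10026'  => 36,
  'GSE9977,GSE10026'  => 24
}.freeze

# Credit every comma-separated accession in a row with that row's samples.
def samples_per_series(freq)
  totals = Hash.new(0)
  freq.each do |series_ids, count|
    series_ids.split(',').each { |gse| totals[gse.strip] += count }
  end
  totals
end

totals = samples_per_series(SERIES_FREQ)
totals.sort_by { |gse, _| gse }.each { |gse, n| puts "#{gse}\t#{n}" }
```

With these four rows, the super-series GSE10026 is credited with 45 + 9 + 36 + 24 = 114 samples, and each sub-series keeps its own count.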

15-year-old error results in improved performance?

Here’s an interesting letter in the current issue of Nature Biotechnology (subscription only):

In the course of analyzing the evolution of the Blocks database [2], we noticed errors in the software source code used to create the initial BLOSUM family of matrices [...] The result of these errors is that the BLOSUM matrices—BLOSUM62, BLOSUM50, etc.—are quite different from the matrices that should have been calculated using the algorithm described by Henikoff and Henikoff. Obviously, minor errors in research, and particularly in software source code, are quite common. This case is noteworthy for three reasons: first, the BLOSUM matrices are ubiquitous in computational biology; second, these errors have gone unnoticed for 15 years; and third, the ‘incorrect’ matrices perform better than the ‘intended’ matrices.

Are they right? Does it matter?