Virus hosts from NCBI Taxonomy web pages

A Biostars question asks whether the information about virus host on web pages like this one can be retrieved using Entrez Utilities.

Pretty sure that the answer is no, unfortunately. Sometimes there’s no option but to scrape the web page, in the knowledge that this approach may break at any time. Here’s some very rough and ready Ruby code without tests or user input checks. It takes the taxonomy UID and returns the host, if there is one. No guarantees now or in the future!

#!/usr/bin/ruby

require 'nokogiri'
require 'open-uri'

def get_host(uid)
	url   = "http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&lvl=3&lin=f&keep=1&srchmode=1&unlock&id=" + uid.to_s
	doc   = Nokogiri::HTML.parse(open(url).read)
	data  = doc.xpath("//td").collect { |x| x.inner_html.split("<br>") }.flatten
	data.each do |e|
		puts $1 if e =~ /Host:\s+<\/em>(.*?)$/
	end
end

get_host(ARGV[0])

Save as taxhost.rb and supply the UID as first argument. Note: I chose 12345 off the top of my head, imagining that it was unlikely to be a virus and would make a good negative test. Turns out to be a phage!

$ ruby taxhost.rb 12249
plants
$ ruby taxhost.rb 12721
vertebrates
$ ruby taxhost.rb 11709
vertebrates| human
$ ruby taxhost.rb 12345
bacteria