Virus hosts from NCBI Taxonomy web pages

A Biostars question asks whether the information about virus host on web pages like this one can be retrieved using Entrez Utilities.

Pretty sure that the answer is no, unfortunately. Sometimes there’s no option but to scrape the web page, in the knowledge that this approach may break at any time. Here’s some very rough and ready Ruby code without tests or user input checks. It takes the taxonomy UID and returns the host, if there is one. No guarantees now or in the future!


require 'nokogiri'
require 'open-uri'

def get_host(uid)
	url   = "" + uid.to_s
	doc   = Nokogiri::HTML.parse(open(url).read)
	data  = doc.xpath("//td").collect { |x| x.inner_html.split("<br>") }.flatten
	data.each do |e|
		puts $1 if e =~ /Host:\s+<\/em>(.*?)$/


Save as taxhost.rb and supply the UID as first argument. Note: I chose 12345 off the top of my head, imagining that it was unlikely to be a virus and would make a good negative test. Turns out to be a phage!

$ ruby taxhost.rb 12249
$ ruby taxhost.rb 12721
$ ruby taxhost.rb 11709
vertebrates| human
$ ruby taxhost.rb 12345