Virus hosts from NCBI taxonomy: now at GitHub

After my previous post on extracting virus hosts from NCBI Taxonomy web pages, Pierre wrote:

An excellent idea and here’s my first attempt.

Here’s a count of hosts. By the way, NCBI, it’s spelled “environment”.

cut -f4 virus_host.tsv | sort | uniq -c

   1301 
    283 algae
    114 archaea
   4509 bacteria
      8 diatom
     51 enviroment
    267 fungi
      1 fungi| plants| invertebrates
      4 human
    761 invertebrates
    181 invertebrates| plants
      7 invertebrates| vertebrates
   3979 plants
    102 protozoa
   6834 vertebrates
 115052 vertebrates| human
     43 vertebrates| human  stool
    225 vertebrates| invertebrates
    656 vertebrates| invertebrates| human
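
For anyone who would rather stay in Ruby, here is a rough sketch of the same tally. The only thing taken from the post is that the host sits in the fourth tab-separated column; the function name count_hosts and the sample rows are made up for illustration, and the sort order may differ slightly from what sort gives under your locale.

```ruby
# Tally the values in the fourth tab-separated column of each line,
# roughly what `cut -f4 | sort | uniq -c` does.
def count_hosts(lines)
  counts = Hash.new(0)
  lines.each do |line|
    # -1 keeps trailing empty fields, so a blank host is counted as ""
    fields = line.chomp.split("\t", -1)
    counts[fields[3].to_s] += 1
  end
  counts.sort.to_h
end

# Hypothetical rows: only the position of the host column is taken from the post.
sample = [
  "1\ta\tb\tplants\n",
  "2\ta\tb\tbacteria\n",
  "3\ta\tb\tplants\n",
  "4\ta\tb\t\n"
]

count_hosts(sample).each { |host, n| printf("%7d %s\n", n, host) }
```

To run it on the real file, pass File.foreach("virus_host.tsv") instead of the sample array.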

Virus hosts from NCBI Taxonomy web pages

A Biostars question asks whether the information about virus host on web pages like this one can be retrieved using Entrez Utilities.

Pretty sure that the answer is no, unfortunately. Sometimes there’s no option but to scrape the web page, in the knowledge that this approach may break at any time. Here’s some very rough and ready Ruby code without tests or user input checks. It takes the taxonomy UID and returns the host, if there is one. No guarantees now or in the future!

#!/usr/bin/ruby

require 'nokogiri'
require 'open-uri'

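# Print the host(s) listed on the NCBI Taxonomy page for a taxonomy UID.
def get_host(uid)
	url   = "http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&lvl=3&lin=f&keep=1&srchmode=1&unlock&id=" + uid.to_s
	# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
	doc   = Nokogiri::HTML.parse(URI.open(url).read)
	# split each table cell on <br> so that every field ends up in its own string
	data  = doc.xpath("//td").collect { |x| x.inner_html.split("<br>") }.flatten
	data.each do |e|
		puts $1 if e =~ /Host:\s+<\/em>(.*?)$/
	end
end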

get_host(ARGV[0])

Save as taxhost.rb and supply the UID as the first argument. Note: I chose 12345 off the top of my head, imagining that it was unlikely to be a virus and would make a good negative test. Turns out to be a phage!

$ ruby taxhost.rb 12249
plants
$ ruby taxhost.rb 12721
vertebrates
$ ruby taxhost.rb 11709
vertebrates| human
$ ruby taxhost.rb 12345
bacteria
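
Since the script has no tests, one possible way to check the matching step without hitting NCBI is to pull it out into its own function and feed it a string. This is only a sketch: extract_hosts is a made-up name, plain string splitting stands in for Nokogiri so that no gems are needed, and the sample fragment is invented to fit the regular expression rather than copied from a real taxonomy page.

```ruby
# Pull host strings out of a chunk of HTML using the same pattern as taxhost.rb.
# Plain string splitting replaces Nokogiri here, so this is not the script's exact logic.
def extract_hosts(html)
  html.split(/<br\s*\/?>/i).filter_map do |chunk|
    # String#[] with a capture index returns the captured text, or nil if no match
    chunk[/Host:\s+<\/em>(.*?)$/, 1]
  end
end

# Hypothetical fragment, shaped only to match the regex above.
sample = "<td><em>Rank: </em>species<br><em>Host: </em>vertebrates| human<br></td>"

puts extract_hosts(sample)
```

With the sample above this prints vertebrates| human; a string with no Host: field gives an empty array, which is the negative test that 12345 failed to be.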

Exploring the NCBI taxonomy database using Entrez Direct

I’ve been meaning to write about Entrez Direct, henceforth called edirect, for some time. This tweet provided me with an excuse:

This post is not strictly the answer to that question. Instead we’ll ask: which parent IDs of records for insects in the NCBI Taxonomy database have the most species IDs?

Oops: taxonomy #fail

My journey from bench scientist to bioinformatician began with archaeal genomes. So I was somewhat startled to read “The catalytic mechanism for aerobic formation of methane by bacteria”, in which we learn about the “ocean-dwelling bacterium Nitrosopumilus maritimus”.

So was Jonathan Eisen of course and you should go and read why. Every top hit in a Web search for that organism tells us that Nitrosopumilus maritimus is an archaeon.

Looking forward to a rapid correction and apology from Nature.

Title edited from “phylogeny” to “taxonomy” at the insistence of @BioinfoTools ;)

How many monotypic genera?

During all the recent discussion around Neandertals and modern humans, it’s often pointed out that Homo sapiens is the sole extant representative of the genus Homo. I began to wonder “how unusual is this?” in a FriendFeed comment thread. What resources exist that could help us to answer this question?

Genera that contain only one species are termed monotypic. Wikipedia even has a category page for this topic but their lists are limited, since Wikipedia is not a comprehensive taxonomy resource.

Taxonomy is not my specialty but once in a while, I enjoy challenging myself with unfamiliar resources and data types. I figured initially that we could get some way towards an answer using BioSQL and the NCBI taxonomy database. As it turned out I was completely wrong, but it was an interesting educational exercise. I turned instead to a “real” taxonomy resource, the Integrated Taxonomic Information System, or ITIS.