Exploring the NCBI taxonomy database using Entrez Direct

I’ve been meaning to write about Entrez Direct, henceforth called edirect, for some time. This tweet provided me with an excuse:

This post is not strictly the answer to that question. Instead we’ll ask: which parent IDs of records for insects in the NCBI Taxonomy database have the most species IDs?

1. Download insect records in XML format

We should really do this via the command line, but it’s just as easy to search at the website for taxonomy ID 50557 (corresponding to Insecta), then choose “Send to -> File -> Format -> XML”. You’ll need a good network connection since the output file, taxonomy_result.xml, is currently ~ 1.15 GB and contains 238 649 records.
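For the command-line route, a minimal sketch might look like the following. It assumes edirect is installed and that the query txid50557[Subtree] selects the same records as the website search, which I have not verified against the record count above. The function name fetch_insects is made up for illustration, and the pipeline is wrapped in a function so that nothing is downloaded until it is called.

```shell
#!/bin/sh
# Sketch only: assumes the edirect tools esearch and efetch are on the PATH.
# fetch_insects is a hypothetical helper name, not part of edirect.
fetch_insects() {
    # txid50557[Subtree] is assumed to match Insecta and everything beneath it
    esearch -db taxonomy -query "txid50557[Subtree]" |
        efetch -format xml > taxonomy_result.xml
}

# Uncomment to start the (large) download:
# fetch_insects
```

Defining the function without calling it means the script can be sourced and inspected first, which seems sensible for a download of more than 1 GB.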

2. Extract using xtract
My first exploratory attempt to use xtract on this file resulted in an error:

Substitution loop at /home/sau103/bin/xtract line 1730, <STDIN> line 1

Some Google searching indicated that this might be a large file size issue. So I installed xml_split and split the large XML file into smaller files of ~ 10 MB size:

# Ubuntu-like systems, obviously
sudo apt-get install xml-twig-tools
xml_split -s 10MB taxonomy_result.xml

This generated files named taxonomy_result-NNN.xml, where NNN ranges from 00 to 115. The “00” file is a master index file that we don’t need, so I renamed that one:

mv taxonomy_result-00.xml taxonomy_result-00.xml.bk

and then wrote a small shell script, taxa.sh, to process the remaining files using xtract:

#!/bin/sh

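# Print the name of the file being processed
echo "$1"
# For each Taxon record, extract the first ParentTaxId, TaxId, ScientificName and Rank
xtract -Group Taxon -sfx "\n" -tab "" -ret "" -first ParentTaxId,TaxId,ScientificName,Rank < "$1" > "$1.tsv"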

and then ran it like this:

find ./ -name "taxonomy_result-*.xml" -exec sh taxa.sh {} \;
# Concatenate under a name that the *.tsv glob does not match, then rename
cat *.tsv > taxa.txt; mv taxa.txt taxa.tsv

That generates 115 files with the suffix .tsv, then concatenates them into one file. It looks like this:

head taxa.tsv 

511139	1637461	Neocnemodon vitripennis	species
511139	1637460	Neocnemodon larusi	species
511139	1637458	Neocnemodon brevidens	species
58550	1636980	Tetriginae	subfamily
58550	1636979	Scelimeninae	subfamily
58550	1636978	Metrodorinae	subfamily
58550	1636974	Cladonotinae	subfamily
58550	1636918	Batrachideinae	subfamily
1369076	1636613	Ctimene basistriga	species
224226	1633492	Merodon rufus	species
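Before counting anything, it may be worth checking that every row has four tab-separated fields and seeing which ranks are present. The sketch below does that with awk on a few sample rows copied from the output above; the temporary directory and the file name sample.tsv are just for illustration, and in practice you would point awk at taxa.tsv.

```shell
#!/bin/sh
# Sample rows copied from the head of taxa.tsv (tab-separated, four columns).
workdir=$(mktemp -d)
printf '%s\t%s\t%s\t%s\n' \
    511139 1637461 'Neocnemodon vitripennis' species \
    511139 1637460 'Neocnemodon larusi' species \
    58550 1636980 Tetriginae subfamily \
    58550 1636979 Scelimeninae subfamily \
    224226 1633492 'Merodon rufus' species > "$workdir/sample.tsv"

# Tally rows per rank (column 4) and count rows without exactly four fields.
rank_counts=$(awk -F'\t' '
    NF != 4 { bad++ }
    { n[$4]++ }
    END {
        for (r in n) printf "%d\t%s\n", n[r], r
        printf "%d\tmalformed\n", bad
    }' "$workdir/sample.tsv" | sort -k1,1nr)
printf '%s\n' "$rank_counts"
```

A non-zero “malformed” count would suggest that a name contains a tab or that a field was missing in some record.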

3. Counting species
For those lines where Rank = species, count the occurrences of each ParentTaxId (column 1). We’ll take the top 20:

grep -P "\tspecies$" taxa.tsv | cut -f1 | sort | uniq -c | sort -nrk1 | head -20 > taxa20.txt
cat taxa20.txt

  41520 500585
  13672 156408
   8374 265461
   4384 593869
   1980 473556
   1524 718704
    897 173037
    722 55939
    578 712022
    507 705566
    474 309606
    466 13390
    445 704302
    425 7403
    371 329961
    366 657549
    343 32390
    330 190769
    310 129993
    297 211548
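The -P option is specific to GNU grep, so on other systems the same count can be done with awk alone. This is a sketch on sample rows taken from the head of taxa.tsv; the temporary directory and sample.tsv are illustrative names, and the output is formatted to resemble that of uniq -c.

```shell
#!/bin/sh
# Sample rows copied from taxa.tsv: ParentTaxId, TaxId, ScientificName, Rank.
workdir=$(mktemp -d)
printf '%s\t%s\t%s\t%s\n' \
    511139 1637461 'Neocnemodon vitripennis' species \
    511139 1637460 'Neocnemodon larusi' species \
    511139 1637458 'Neocnemodon brevidens' species \
    58550 1636980 Tetriginae subfamily \
    58550 1636979 Scelimeninae subfamily \
    1369076 1636613 'Ctimene basistriga' species \
    224226 1633492 'Merodon rufus' species > "$workdir/sample.tsv"

# Keep rows whose rank is exactly "species", count them per parent ID,
# then sort by count (descending) and keep the top 20.
top_parents=$(awk -F'\t' '
    $4 == "species" { n[$1]++ }
    END { for (p in n) printf "%7d %s\n", n[p], p }
    ' "$workdir/sample.tsv" | sort -k1,1nr | head -20)
printf '%s\n' "$top_parents"
```

Matching on the fourth field rather than on the end of the line also avoids picking up any other field that happens to end in “species”.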

4. Retrieve information for parent IDs
Now we can take the top 20 ParentTaxIDs, paste them into a comma-separated string and use that as an argument to efetch:

efetch -db taxonomy -id "$(awk '{print $2}' taxa20.txt | paste -s -d, -)" -format xml | \
xtract -pattern TaxaSet -Group Taxon -sfx "\n" -tab "" -ret "" -first \
 ParentTaxId,TaxId,ScientificName,Rank > taxa20fetch.txt
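The ID string can be built and inspected on its own before anything is sent to efetch. Here is a small sketch using the first three lines of taxa20.txt; the temporary directory is only for illustration. The explicit “-” operand tells paste to read standard input, which some non-GNU versions of paste require.

```shell
#!/bin/sh
# First three lines of taxa20.txt, in the "count ParentTaxId" layout from uniq -c.
workdir=$(mktemp -d)
printf '%s\n' '  41520 500585' '  13672 156408' '   8374 265461' > "$workdir/taxa20.txt"

# Take the second whitespace-separated field and join the lines with commas.
ids=$(awk '{print $2}' "$workdir/taxa20.txt" | paste -s -d, -)
echo "$ids"   # prints 500585,156408,265461
```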

We can view the names, counts and ParentTaxIds with a little paste and cut. This relies on efetch returning the records in the same order as the IDs were supplied. The final result:

paste taxa20fetch.txt taxa20.txt | cut -f3,5

unclassified Lepidoptera	  41520 500585
unclassified Hymenoptera	  13672 156408
unclassified Diptera	   8374 265461
unclassified Coleoptera	   4384 593869
unclassified Trichoptera	   1980 473556
unclassified Hemiptera	   1524 718704
unclassified Ichneumonidae	    897 173037
Aleiodes	    722 55939
unclassified Microgastrinae	    578 712022
unclassified Noctuidae	    507 705566
unclassified Ephemeroptera	    474 309606
Camponotus	    466 13390
Eois	    445 704302
Apanteles	    425 7403
unclassified Cecidomyiidae	    371 329961
Trigonopterus	    366 657549
Cotesia	    343 32390
Pheidole	    330 190769
Eupristina	    310 129993
unclassified Orthoptera	    297 211548
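Pasting the two files side by side depends on their lines being in the same order. If that is in doubt, the names can be matched to the counts by ID instead. The following is a sketch using three of the records above, deliberately listed in a different order; the “NA” values are placeholders for fields that the join does not use, not real data, and the temporary directory is illustrative.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Stand-in for taxa20fetch.txt: ParentTaxId, TaxId, ScientificName, Rank.
# NA marks placeholder values that are not needed for the join.
printf '%s\t%s\t%s\t%s\n' \
    NA 55939 Aleiodes genus \
    NA 500585 'unclassified Lepidoptera' NA \
    NA 156408 'unclassified Hymenoptera' NA > "$workdir/taxa20fetch.txt"

# Stand-in for taxa20.txt: count and ParentTaxId as written by uniq -c.
printf '%s\n' '  41520 500585' '  13672 156408' '    722 55939' > "$workdir/taxa20.txt"

# First file: remember the name for each TaxId.
# Second file: split on whitespace and look the name up by ID.
joined=$(awk -F'\t' '
    NR == FNR { name[$2] = $3; next }
    {
        split($0, f, " ")
        label = (f[2] in name) ? name[f[2]] : "(not returned)"
        print label "\t" f[1] "\t" f[2]
    }' "$workdir/taxa20fetch.txt" "$workdir/taxa20.txt")
printf '%s\n' "$joined"
```

The output keeps the order of taxa20.txt, so the ranking by count is preserved whatever order the fetched records arrive in.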

So the winner in terms of most species records per parent is “unclassified moths”. For Rank = genus, the winner is Aleiodes, which NCBI calls “wasps &c.”

Remember: the NCBI Taxonomy database is really a linked resource for sequence records, not strictly speaking a taxonomy resource. In their own words from the front page: “a curated classification and nomenclature for all of the organisms in the public sequence databases. This currently represents about 10% of the described species of life on the planet.”

2 thoughts on “Exploring the NCBI taxonomy database using Entrez Direct”

  1. Jonathan Badger

    I don’t know if it is the case for insects, but I know that for microbes, the NCBI taxonomy is pretty outdated in places, with many groups that are no longer considered to be supported by molecular phylogeny. The SILVA taxonomy seems to be more up to date and is what people generally use, for example, in microbiome studies.
