I’ve been meaning to write about Entrez Direct, henceforth called edirect, for some time. This tweet provided me with an excuse:
This post is not strictly the answer to that question. Instead we’ll ask: which parent IDs of records for insects in the NCBI Taxonomy database have the most species IDs?
1. Download insect records in XML format
We should really do this via the command line, but it’s just as easy to search at the website for taxonomy ID 50557 (corresponding to Insecta), then choose “Send to -> File -> Format -> XML”. You’ll need a good network connection since the output file, taxonomy_result.xml, is currently ~ 1.15 GB and contains 238 649 records.
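For the command-line route, a minimal sketch might look like the following. It assumes edirect's esearch and efetch are installed and on the PATH, and that the query txid50557[Subtree] selects the same records as the website search described above; whether it returns exactly the same 238 649 records has not been checked. The function names subtree_query and fetch_subtree_xml are made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: build an Entrez query for a taxon plus all of its descendants.
subtree_query() {
    printf 'txid%s[Subtree]' "$1"
}

# Hypothetical helper: save every taxonomy record under taxon $1 as XML in file $2.
# Assumes edirect (esearch, efetch) is installed and the network is reachable.
fetch_subtree_xml() {
    esearch -db taxonomy -query "$(subtree_query "$1")" |
        efetch -format xml > "$2"
}

# Show the query that would be sent for Insecta (nothing is downloaded here).
subtree_query 50557; echo
```

To actually download, something like `fetch_subtree_xml 50557 taxonomy_result.xml` would be the call, with the same caveat about file size and network connection.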
2. Extract using xtract
My first exploratory attempt to use xtract on this file resulted in an error:
Substitution loop at /home/sau103/bin/xtract line 1730, <STDIN> line 1
Some Google searching indicated that this might be a large file size issue. So I installed xml_split and split the large XML file into smaller files of ~ 10 MB size:
# Ubuntu-like systems, obviously
sudo apt-get install xml-twig-tools
xml_split -s 10MB taxonomy_result.xml
This generated files taxonomy_result-NNN.xml, where NNN ranges from 00 – 115. The “00” file is a master index file that we don’t need, so I renamed that one:
mv taxonomy_result-00.xml taxonomy_result-00.xml.bk
and then wrote a small shell script, taxa.sh, to process the remaining files using xtract:
#!/bin/sh
echo $1
cat $1 | xtract -Group Taxon -sfx "\n" -tab "" -ret "" -first ParentTaxId,TaxId,ScientificName,Rank > $1.tsv
and then ran it like this:
find ./ -name "taxonomy_result-*.xml" -exec sh taxa.sh {} \;
cat *.tsv > taxa.txt; mv taxa.txt taxa.tsv
That generates 115 files with the suffix .tsv, then concatenates them into one file. It looks like this:
head taxa.tsv
511139	1637461	Neocnemodon vitripennis	species
511139	1637460	Neocnemodon larusi	species
511139	1637458	Neocnemodon brevidens	species
58550	1636980	Tetriginae	subfamily
58550	1636979	Scelimeninae	subfamily
58550	1636978	Metrodorinae	subfamily
58550	1636974	Cladonotinae	subfamily
58550	1636918	Batrachideinae	subfamily
1369076	1636613	Ctimene basistriga	species
224226	1633492	Merodon rufus	species
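Since the next step relies on column 1 being the ParentTaxId and the last column being the Rank, it may be worth checking that every row really has four tab-separated fields. The sketch below is one way to do that with awk; check_columns is a made-up helper name, and the three sample rows are copied from the output above.

```shell
#!/bin/sh
# Hypothetical helper: report lines that do not have exactly 4 tab-separated
# fields (ParentTaxId, TaxId, ScientificName, Rank).
# Exits non-zero if any such line is found.
check_columns() {
    awk -F '\t' '
        BEGIN { bad = 0 }
        NF != 4 { printf "line %d has %d fields\n", NR, NF; bad = 1 }
        END { exit bad }
    ' "$1"
}

# Three rows copied from the head of taxa.tsv, written to a scratch file.
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\n' \
    511139 1637461 'Neocnemodon vitripennis' species \
    58550  1636980 'Tetriginae'              subfamily \
    224226 1633492 'Merodon rufus'           species > "$sample"

check_columns "$sample" && echo "all rows have 4 columns"
rm -f "$sample"
```

On the real data this would be run as `check_columns taxa.tsv`.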
3. Counting species
For those lines where the Rank = species, sum the occurrence of each ParentTaxId (column 1). We’ll take the top 20:
grep -P "\tspecies$" taxa.tsv | cut -f1 | sort | uniq -c | sort -nrk1 | head -20 > taxa20.txt
cat taxa20.txt
  41520 500585
  13672 156408
   8374 265461
   4384 593869
   1980 473556
   1524 718704
    897 173037
    722 55939
    578 712022
    507 705566
    474 309606
    466 13390
    445 704302
    425 7403
    371 329961
    366 657549
    343 32390
    330 190769
    310 129993
    297 211548
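The same count can be sketched as a single awk pass, which avoids the GNU-specific grep -P. This is an alternative, not the pipeline used above: count_species_parents is a made-up name, the output is tab-separated rather than in uniq -c format, and the sample rows are the ten lines shown earlier from the head of taxa.tsv.

```shell
#!/bin/sh
# Hypothetical helper: for rows whose Rank (column 4) is "species",
# count how often each ParentTaxId (column 1) occurs.
# Prints "count<TAB>ParentTaxId", highest count first, ties by ascending ID.
count_species_parents() {
    awk -F '\t' '
        $4 == "species" { n[$1]++ }
        END { for (p in n) printf "%d\t%s\n", n[p], p }
    ' "$1" | sort -k1,1nr -k2,2n
}

# The ten rows from the head of taxa.tsv, written to a scratch file.
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\n' \
    511139  1637461 'Neocnemodon vitripennis' species \
    511139  1637460 'Neocnemodon larusi'      species \
    511139  1637458 'Neocnemodon brevidens'   species \
    58550   1636980 'Tetriginae'              subfamily \
    58550   1636979 'Scelimeninae'            subfamily \
    58550   1636978 'Metrodorinae'            subfamily \
    58550   1636974 'Cladonotinae'            subfamily \
    58550   1636918 'Batrachideinae'          subfamily \
    1369076 1636613 'Ctimene basistriga'      species \
    224226  1633492 'Merodon rufus'           species > "$sample"

count_species_parents "$sample"
# prints (tab-separated):
# 3	511139
# 1	224226
# 1	1369076
rm -f "$sample"
```

On the full file, `count_species_parents taxa.tsv | head -20` would give the top 20.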
4. Retrieve information for parent IDs
Now we can take the top 20 ParentTaxIDs, paste them into a comma-separated string and use that as an argument to efetch:
efetch -db taxonomy -id `awk '{print $2}' < taxa20.txt | paste -s -d,` -format xml | \
xtract -pattern TaxaSet -Group Taxon -sfx "\n" -tab "" -ret "" -first \
ParentTaxId,TaxId,ScientificName,Rank > taxa20fetch.txt
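The part inside the backticks is easy to try on its own. Here is a small sketch using the first three lines of taxa20.txt; ids_from_counts is a made-up helper name, and the explicit "-" after paste is there because some non-GNU versions of paste need it to read standard input.

```shell
#!/bin/sh
# Hypothetical helper: join the second column of a "uniq -c" style file
# into one comma-separated string.
ids_from_counts() {
    awk '{ print $2 }' "$1" | paste -s -d, -
}

# First three lines of taxa20.txt, recreated in a scratch file.
sample=$(mktemp)
printf '%7d %s\n' 41520 500585 13672 156408 8374 265461 > "$sample"

ids_from_counts "$sample"
# prints 500585,156408,265461
rm -f "$sample"
```

The result could then be passed along as `-id "$(ids_from_counts taxa20.txt)"` in place of the backtick expression.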
We can view the names, counts and ParentTaxIds with a little paste and cut. The final result:
paste taxa20fetch.txt taxa20.txt | cut -f3,5
unclassified Lepidoptera	  41520 500585
unclassified Hymenoptera	  13672 156408
unclassified Diptera	   8374 265461
unclassified Coleoptera	   4384 593869
unclassified Trichoptera	   1980 473556
unclassified Hemiptera	   1524 718704
unclassified Ichneumonidae	    897 173037
Aleiodes	    722 55939
unclassified Microgastrinae	    578 712022
unclassified Noctuidae	    507 705566
unclassified Ephemeroptera	    474 309606
Camponotus	    466 13390
Eois	    445 704302
Apanteles	    425 7403
unclassified Cecidomyiidae	    371 329961
Trigonopterus	    366 657549
Cotesia	    343 32390
Pheidole	    330 190769
Eupristina	    310 129993
unclassified Orthoptera	    297 211548
So the winner in terms of most species records per parent is “unclassified moths”. For Rank = genus, the winner is Aleiodes, which NCBI calls “wasps &c.”
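One caveat: paste lines the two files up row by row, which only gives the right pairing if efetch returns records in the same order as the IDs were supplied. It is not obvious that this is guaranteed, so here is a hedged sketch that matches on TaxId instead. The helper name join_counts_to_names is made up, and in the sample data the ParentTaxId and Rank columns are filled with the placeholder NA because their real values are not shown in this post.

```shell
#!/bin/sh
# Hypothetical helper: attach names to counts by matching on TaxId.
# $1: counts file (count and TaxId, whitespace-separated, as in taxa20.txt)
# $2: tab-separated fetch output (ParentTaxId, TaxId, ScientificName, Rank)
join_counts_to_names() {
    awk -F '\t' '
        NR == FNR { name[$2] = $3; next }   # fetch file: remember TaxId -> name
        {
            split($0, f, " ")               # counts file: f[1] = count, f[2] = TaxId
            label = (f[2] in name) ? name[f[2]] : "(not returned)"
            printf "%s\t%s\t%s\n", label, f[1], f[2]
        }
    ' "$2" "$1"
}

counts=$(mktemp)
fetched=$(mktemp)
printf '%7d %s\n' 41520 500585 13672 156408 722 55939 > "$counts"
# Deliberately in a different order from the counts; NA marks placeholder columns.
printf '%s\t%s\t%s\t%s\n' \
    NA 55939  'Aleiodes'                 NA \
    NA 500585 'unclassified Lepidoptera' NA \
    NA 156408 'unclassified Hymenoptera' NA > "$fetched"

join_counts_to_names "$counts" "$fetched"
# prints (tab-separated):
# unclassified Lepidoptera	41520	500585
# unclassified Hymenoptera	13672	156408
# Aleiodes	722	55939
rm -f "$counts" "$fetched"
```

With the real files this would be `join_counts_to_names taxa20.txt taxa20fetch.txt`.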
Remember: the NCBI Taxonomy database is really a linked resource for sequence records, not strictly speaking a taxonomy resource. In their own words from the front page: “a curated classification and nomenclature for all of the organisms in the public sequence databases. This currently represents about 10% of the described species of life on the planet.”
I don’t know if it is the case for insects, but I know that for microbes, the NCBI taxonomy is pretty outdated in places, with many groups that are no longer considered to be supported by molecular phylogeny. The SILVA taxonomy seems to be more up to date and is what people generally use (for example) for microbiome studies.