During all the recent discussion around Neandertals and modern humans, it’s often pointed out that Homo sapiens is the sole extant representative of the genus Homo. I began to wonder “how unusual is this?” in a FriendFeed comment thread. What resources exist that could help us to answer this question?
Genera that contain only one species are termed monotypic. Wikipedia even has a category page for this topic, but its lists are limited, since Wikipedia is not a comprehensive taxonomy resource.
Taxonomy is not my specialty but once in a while, I enjoy challenging myself with unfamiliar resources and data types. I figured initially that we could get some way towards an answer using BioSQL and the NCBI taxonomy database. As it turned out I was completely wrong, but it was an interesting educational exercise. I turned instead to a “real” taxonomy resource, the Integrated Taxonomic Information System, or ITIS.
First, I set up the ITIS database:
    # fetch and unpack
    wget http://www.itis.gov/downloads/itisMySQL012710_v3.TAR.gz
    tar zxvf itisMySQL012710_v3.TAR.gz
    # problem: there are 2 versions of the SQL setup file;
    # copy the top-level one into the data directory
    cp dropcreateloaditis.sql itisMySQL020210/
    cd itisMySQL020210
    # and load into MySQL
    mysql -u root -p --enable-local-infile < dropcreateloaditis.sql
A couple of minor issues here. First, a note to ITIS: if your tarball name contains TAR in upper case, Linux tab-completion doesn’t work. Second, confusingly, unpacking the tarball generates two files named dropcreateloaditis.sql: one inside the directory itisMySQL020210 and another one directory level up. The former does not work properly; the latter does, which is why it is copied over the former before loading.
OK, a brand new database with an unfamiliar schema. Some poking around in the MySQL console shows 24 tables. To make a long story short, the table taxon_unit_types contains a field named rank_id, which shows that “species” have a rank_id value of 220. The table taxonomic_units contains lots of fields, including rank_id and a field called unit_name1 which, for species records, appears to indicate the genus. There’s also a field in taxonomic_units called name_usage, which takes values of “invalid”, “valid”, “accepted” or “not accepted”. I assume that it’s best to stick with “valid” or “accepted”.
So, to count species per genus, we can try something like this:
    SELECT unit_name1, COUNT(*) AS species
    FROM taxonomic_units
    WHERE rank_id = 220
      AND name_usage IN ('valid', 'accepted')
    GROUP BY unit_name1
    ORDER BY species DESC
    INTO OUTFILE '/tmp/itis.txt';
Here are the first few lines of the resulting output file:
    head /tmp/itis.txt
    Lasioglossum	1740
    Megachile	1522
    Andrena	1495
    Camponotus	965
    Hylaeus	709
    Nomada	701
    Rhyacophila	647
    Perdita	631
    Pheidole	549
    Chimarra	537
A quick cross-check using a few genus names at the ITIS website seems to confirm that we are counting species per genus correctly. So, how many genera did we retrieve and how many have only one species?
    # total records
    wc -l /tmp/itis.txt
    41723 /tmp/itis.txt
    # records with number 1 in second column
    grep -P "\t1$" /tmp/itis.txt | wc -l
    16786
    # one of those is Homo, right?
    grep -P "^Homo\t" /tmp/itis.txt
    Homo	1
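The two counts can also be pulled out in one pass, along with the percentage. What follows is only a minimal sketch: the function name monotypic_summary is my own invention, and it assumes the same tab-separated "genus, count" layout as the MySQL output file. The toy input uses made-up genus names and counts, not ITIS data.

```shell
# Hypothetical helper (not part of ITIS or MySQL): summarise a
# tab-separated "genus<TAB>species count" file in a single awk pass.
monotypic_summary() {
    awk -F'\t' '
        $2 == 1 { mono++ }    # genus with exactly one species
        END {
            if (NR == 0) { print "no records"; exit 1 }
            printf "%d of %d genera are monotypic (%.1f%%)\n", mono, NR, 100 * mono / NR
        }
    ' "$1"
}

# Toy input with made-up names and counts, just to show the format
sample=$(mktemp)
printf 'Alpha\t3\nBeta\t1\nGamma\t1\nDelta\t12\n' > "$sample"
monotypic_summary "$sample"
rm -f "$sample"
```

Run against the real file, as in monotypic_summary /tmp/itis.txt, this should report the same two counts as the wc and grep commands, provided every line in the file has exactly two tab-separated fields.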
It seems then that around 40% (16786 of 41723) of genera with valid or accepted species, as retrieved from ITIS, contain only one species, assuming that I have not made an error in my SQL query. This raises some questions. Does this mean that humans are not particularly unusual in being the sole extant representative of Homo? How complete a resource is ITIS? 40% seems high: are there really so many monotypic genera, or is it more likely that many genera contain as-yet undescribed species?
I venture back onto safe ground and leave these questions to the experts.