How to: bulk retrieval of archaeal genome sequences from the NCBI FTP site

While we’re on the topic of mistaking Archaea for Bacteria, here’s an issue with the NCBI FTP site that has long annoyed me and one workaround. Warning: I threw this together minutes ago and it’s not fully tested.

Update, July 7 2014: NCBI have changed things, so the code in this post no longer works.

Let’s cut to the chase. In the NCBI FTP site, archaeal genome data is stored along with bacterial genomes in a single directory named Bacteria.

Aside from the fact that this is taxonomically incorrect, it makes bulk retrieval of archaeal data rather difficult. For example, I know that Methanococcoides burtonii is an archaeon and if I want to download its protein-coding genes (files ending with .ffn), I can do:


What if I want all available ffn files for Archaea? Somehow, I need to know which of the organism names in the Bacteria directory correspond to Archaea. So here is one approach.

1. Download the summary file prokaryotes.txt

Leaving aside issues with the term prokaryote, the NCBI FTP site contains a useful tab-delimited file which summarises current bacterial and archaeal genomes.


2. Extract the Archaea

We could use shell tools (awk, grep, cut and so on) but I’m going to go with R. First, read in the file and examine the contents of the Group column:

p <- read.table("prokaryotes.txt", header = TRUE, sep = "\t", quote = "", comment.char = "")
table(p$Group)

#                    Actinobacteria                         Aquificae 
#                              2040                                18 
#                   Armatimonadetes      Bacteroidetes/Chlorobi group 
#                                 1                               618 
#                       Caldiserica  Chlamydiae/Verrucomicrobia group 
#                                 1                               232 
#                       Chloroflexi                    Chrysiogenetes 
#                                29                                 2 
#                     Crenarchaeota                     Cyanobacteria 
#                                80                               226 
#                   Deferribacteres               Deinococcus-Thermus 
#                                 6                                54 
#                       Dictyoglomi                     Elusimicrobia 
#                                 2                                 2 
#             environmental samples                     Euryarchaeota 
#                                 1                               283 
# Fibrobacteres/Acidobacteria group                        Firmicutes 
#                                18                              5163 
#                      Fusobacteria                  Gemmatimonadetes 
#                                70                                 1 
#                      Korarchaeota                     Nanoarchaeota 
#                                 1                                 1 
#                       Nitrospinae                       Nitrospirae 
#                                 5                                10 
#                    Planctomycetes                    Proteobacteria 
#                                28                             10793 
#                      Spirochaetes                     Synergistetes 
#                               461                                18 
#                       Tenericutes                    Thaumarchaeota 
#                               191                                17 
#             Thermodesulfobacteria                       Thermotogae 
#                                 6                                42 
#              unclassified Archaea             unclassified Bacteria 
#                                 1                                19

Only the archaeal groups contain the string “archae”, so we can extract them using grep(). The match ignores case so that “unclassified Archaea” is not missed:

a <- p[grep("archae", p$Group, ignore.case = TRUE), ]
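
For anyone who would rather stay in the shell, here is a rough awk sketch of the same filter. It assumes the file has a header row containing a column named exactly Group, and it finds that column by name rather than by position. The function name filter_archaea and the toy rows are made up for illustration; they are not real NCBI records.

```shell
# Print the header plus every row whose "Group" column contains "archae",
# ignoring case (so "unclassified Archaea" is kept too).
filter_archaea() {
  awk -F'\t' '
    NR == 1 {
      # locate the Group column from the header row
      for (i = 1; i <= NF; i++) if ($i == "Group") col = i
      print
      next
    }
    col && tolower($col) ~ /archae/
  ' "$1"
}

# Toy input: made-up rows, not real NCBI records
sample=$(mktemp)
printf '#Organism Name\tBioProject ID\tGroup\n' > "$sample"
printf 'Organism one\t1\tEuryarchaeota\n' >> "$sample"
printf 'Organism two\t2\tFirmicutes\n' >> "$sample"
printf 'Organism three\t3\tunclassified Archaea\n' >> "$sample"

filter_archaea "$sample"
```

On the real file this would be called as filter_archaea prokaryotes.txt, provided the header really does use the name Group.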

3. Construct FTP URLs

Now: each organism in the FTP site has its own directory of the form:

where ORGN and NNNN correspond to the columns #Organism Name and BioProject ID, respectively, in the file prokaryotes.txt. This was not always the case and may not be so in the future but right now, it is. So in R, we can create a new column for the URL. We could stay in R, but let’s write out the FTP URLs to a file. Using write.table() for a single column is not really appropriate, but it works.

a$url <- paste("", 
               gsub(" ", "_", a$X.Organism.Name), 
               "_uid", a$BioProject.ID, sep = "")
write.table(a$url, file = "archaea_urls.txt", col.names = FALSE, row.names = FALSE, quote = FALSE)
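
The same URL construction can be sketched in shell. This is only an illustration of the ORGN_uidNNNN pattern described above: the helper make_url is hypothetical, the base URL is a placeholder rather than the real NCBI path, and the organism name and ID are made up. It also assumes the base and the directory name are joined with a single slash.

```shell
# make_url BASE NAME ID
# Replace spaces in the organism name with underscores, then append
# "_uid" and the BioProject ID, mirroring the paste()/gsub() call in R.
make_url() {
  name=$(printf '%s' "$2" | tr ' ' '_')
  printf '%s/%s_uid%s\n' "$1" "$name" "$3"
}

BASE="ftp://ftp.example.org/genomes"   # hypothetical placeholder, not the NCBI path

make_url "$BASE" "Example organism A1" 1234
# prints ftp://ftp.example.org/genomes/Example_organism_A1_uid1234
```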

That gives us (first 5 lines):

4. Download

Now, to get those ffn files it’s as simple as sending each line in the file of URLs to wget:

while IFS= read -r LINE; do wget "$LINE/*.ffn"; done < archaea_urls.txt

Some of those URLs will not exist. That’s OK, wget will just report “No such directory” and the shell script will continue on its merry way. I get 172 files, which seems “about right”.
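
If you want a record of which URLs failed rather than scrolling back through wget's output, something like the sketch below might help. It assumes wget exits with a non-zero status when a directory is missing, which I have not checked for every failure mode. The function name fetch_all and the log file failed_urls.txt are my own inventions, and the URLs in the dry run are made up.

```shell
# fetch_all FILE [CMD...]
# Run CMD (default: wget) on "<url>/*.ffn" for each line of FILE.
# A failing URL is appended to failed_urls.txt and the loop carries on.
fetch_all() {
  list=$1
  shift
  [ "$#" -gt 0 ] || set -- wget
  : > failed_urls.txt
  while IFS= read -r url; do
    [ -n "$url" ] || continue        # skip blank lines
    "$@" "$url/*.ffn" < /dev/null || printf '%s\n' "$url" >> failed_urls.txt
  done < "$list"
}

# Dry run on two made-up URLs: "echo" stands in for wget, so nothing is downloaded
urls=$(mktemp)
printf '%s\n' 'ftp://ftp.example.org/genomes/Organism_one_uid1' \
              'ftp://ftp.example.org/genomes/Organism_two_uid2' > "$urls"
fetch_all "$urls" echo
```

For the real thing, fetch_all archaea_urls.txt would call wget on each line, and failed_urls.txt would list the directories that could not be fetched.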

A lot of day-to-day bioinformatics involves these types of “how do I even get the data” tasks. Life would be much easier and research quicker if the NCBI would just put the Archaea in their own directory.

One thought on “How to: bulk retrieval of archaeal genome sequences from the NCBI FTP site”

  1. Nice post. I generally do this kind of thing in awk and then loop over the URLs using xargs:

    awk -F"\t" '$5~/archae/{gsub(/ /,"_",$1);print ""$1"_uid"$4"/*.ffn"}' prokaryotes.txt | xargs -i wget "{}"

    Adding -P 2 to the xargs command will make it run 2 downloads in parallel.
