While we’re on the topic of mistaking Archaea for Bacteria, here’s an issue with the NCBI FTP site that has long annoyed me and one workaround. Warning: I threw this together minutes ago and it’s not fully tested.
Update July 7 2014: NCBI have changed things so code in this post no longer works
Let’s cut to the chase. In the NCBI FTP site, archaeal genome data is stored along with bacterial genomes in a single directory named Bacteria.
Aside from the fact that this is taxonomically incorrect, it makes bulk retrieval of archaeal data rather difficult. For example, I know that Methanococcoides burtonii is an archaeon and if I want to download its protein-coding genes (files ending with .ffn), I can do:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcoides_burtonii_DSM_6242_uid58023/*.ffn
What if I want all available ffn files for Archaea? Somehow, I need to know which of the organism names in the Bacteria directory correspond to Archaea. So here is one approach.
1. Download the summary file prokaryotes.txt
Leaving aside issues with the term prokaryote, the NCBI FTP site contains a useful tab-delimited file which summarises current bacterial and archaeal genomes.
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt
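Before parsing, it can help to confirm which header fields hold the organism name, BioProject ID and Group, since the column layout of prokaryotes.txt is not guaranteed to stay fixed. Here's a quick shell sketch; note that the header line used below is a mock stand-in, not the real file's layout — run the same pipeline on the first line of the downloaded file instead.

```shell
# Number the fields of a tab-delimited header line so each column's
# position is easy to read off. The header here is a MOCK example;
# substitute "head -1 prokaryotes.txt" for the printf in practice.
printf '#Organism Name\tTaxID\tBioProject ID\tGroup\n' |
  tr '\t' '\n' |
  cat -n
```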
2. Extract the Archaea
We could use shell tools (awk, grep, cut and so on) but I’m going to go with R. First, read in the file and examine the contents of the Group column:
p <- read.table("prokaryotes.txt", header = T, sep = "\t", quote = "", comment.char = "")
table(p$Group)

#                    Actinobacteria                          Aquificae
#                              2040                                 18
#                   Armatimonadetes       Bacteroidetes/Chlorobi group
#                                 1                                618
#                       Caldiserica   Chlamydiae/Verrucomicrobia group
#                                 1                                232
#                       Chloroflexi                     Chrysiogenetes
#                                29                                  2
#                     Crenarchaeota                      Cyanobacteria
#                                80                                226
#                   Deferribacteres                Deinococcus-Thermus
#                                 6                                 54
#                       Dictyoglomi                      Elusimicrobia
#                                 2                                  2
#             environmental samples                      Euryarchaeota
#                                 1                                283
# Fibrobacteres/Acidobacteria group                         Firmicutes
#                                18                               5163
#                      Fusobacteria                   Gemmatimonadetes
#                                70                                  1
#                      Korarchaeota                      Nanoarchaeota
#                                 1                                  1
#                       Nitrospinae                        Nitrospirae
#                                 5                                 10
#                    Planctomycetes                     Proteobacteria
#                                28                              10793
#                      Spirochaetes                      Synergistetes
#                               461                                 18
#                       Tenericutes                     Thaumarchaeota
#                               191                                 17
#             Thermodesulfobacteria                        Thermotogae
#                                 6                                 42
#              unclassified Archaea              unclassified Bacteria
#                                 1                                 19
Only archaeal groups contain the string "archae", so we can extract them using grep(). One catch: "unclassified Archaea" has a capital A, so the match should be case-insensitive or that one genome will be missed:

a <- p[grep("archae", p$Group, ignore.case = TRUE), ]
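For a rough cross-check outside R, the same filter can be sketched in the shell with awk. The two input rows below are fabricated examples, and the assumption that Group is the fifth tab-delimited field should be verified against the real file's header:

```shell
# Keep rows whose (assumed) 5th field, Group, contains "archae",
# matching case-insensitively via tolower(). The rows in mock.txt are
# made-up examples for illustration only.
printf 'Methanococcoides burtonii\t-\t-\t58023\tEuryarchaeota\n'  > mock.txt
printf 'Escherichia coli\t-\t-\t57779\tProteobacteria\n'         >> mock.txt
awk -F'\t' 'tolower($5) ~ /archae/' mock.txt
```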
3. Construct FTP URLs
Each organism in the FTP site has its own directory of the form:
ftp.ncbi.nlm.nih.gov/genomes/Bacteria/ORGN_uidNNNN
where ORGN and NNNN correspond to the columns #Organism Name and BioProject ID, respectively, in prokaryotes.txt. This was not always the case and may not be in the future, but right now it is. In R, we can create a new column holding the URL for each organism, then write the URLs out to a file. Using write.table() for a single column is not really appropriate, but it works.
a$url <- paste("ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/", gsub(" ", "_", a$X.Organism.Name), "_uid", a$BioProject.ID, sep = "")
write.table(a$url, file = "archaea_urls.txt", col.names = F, row.names = F, quote = F)
That gives us (first 5 lines):
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C5_uid15999
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_S2_uid58035
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C5_uid58741
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C7_uid58847
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C6_uid58947
...
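For comparison, the same URL construction can be sketched directly in the shell; the organism name and uid below are taken from the first URL in the listing, and spaces in the name become underscores just as gsub() does in R:

```shell
# Build one FTP URL of the form .../Bacteria/ORGN_uidNNNN, replacing
# spaces in the organism name with underscores.
orgn="Methanococcus maripaludis C5"
uid=15999
url="ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/$(echo "$orgn" | tr ' ' '_')_uid${uid}"
echo "$url"
# ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Methanococcus_maripaludis_C5_uid15999
```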
4. Download
Now, to get those ffn files it’s as simple as sending each line in the file of URLs to wget:
while read -r URL; do wget "$URL/*.ffn"; done < archaea_urls.txt
Some of those URLs will not exist. That’s OK, wget will just report “No such directory” and the shell script will continue on its merry way. I get 172 files, which seems “about right”.
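A quick way to check the tally once the downloads finish; the exact count will drift over time as genomes are added to (or removed from) the FTP site:

```shell
# Count the .ffn files in the current directory; 2>/dev/null silences
# the error from ls when no files have arrived yet.
ls -1 *.ffn 2>/dev/null | wc -l
```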
A lot of day-to-day bioinformatics involves these types of “how do I even get the data” tasks. Life would be much easier and research quicker if the NCBI would just put the Archaea in their own directory.
Nice post. I generally do this kind of thing in awk and then loop over the URLs using xargs:
awk -F"\t" '$5~/archae/{gsub(/ /,"_",$1);print "ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/"$1"_uid"$4"/*.ffn"}' prokaryotes.txt | xargs -i wget "{}"
Adding -P 2 to the xargs command will make it run 2 downloads in parallel.