My own “404 not found”: making amends using GitHub

This post is an apology and an attempt to make amends for contributing to the decay of online bioinformatics resources. It’s also, I think, a nice example of why reproducible research can be difficult.

Come back in time with me 10 years, to 2004.

1. Introduction

In 2004, I was working on archaeal genomes. One day in the lab, we were wondering why there are no known archaeal pathogens of humans. As a first step, we wondered if there was a way to look for nucleotide sequences from human-associated Archaea.

You’re thinking: “why not just sequence the human microbiome?” Yes, that is the 2014 solution. This, however, was 2004 when next-generation sequencing and metagenomics were just emerging. So we had a different idea: screen the EST database for transcripts that are annotated as human but which more closely resemble archaeal sequences. Some of those would, hopefully, be derived from human-associated Archaea.

I came up with a strategy, described in the next section, and generated some not-especially-promising results. Back in 2004, the ability to create a MySQL database with a PHP front-end was considered an impressive skill in bioinformatics, so we set up a web server and wrote a paper called An online database for the detection of novel archaeal sequences in human ESTs. Today, this is so trivial that I doubt it would even be sent out for review. I still laugh when I think of a comment from one of the reviewers: that the work was of major interest due to its potential use in fighting bio-terrorism.

[Screenshot: what the database URL returns today – possibly even worse than a 404 not found]

And then I left the group. Click on the URL in the Availability section of that paper and after a long wait, you’ll see something like this, at right.

I apologise. In 2004, I gave little if any thought to the problem of maintaining online resources and my attitude, like many, was that the academic publication was the only goal, the endpoint. How very wrong I was.

2. The strategy

Let’s look in more detail at the process for screening the EST database.

You might ask: why not simply BLAST search ESTs (query) versus nt (subject) and parse the output for cases where the top hit was to an archaeal sequence? Two main reasons.

First – infrastructure. We did not have access to parallel BLAST on a large cluster, which would be required given the size of both query and subject.

Second – efficiency. The vast majority of EST sequences in the human subset of dbEST really are derived from human transcripts. It makes more sense to perform an initial screen for those sequences that might not be human, then follow up those results using BLAST.

So here’s the screening strategy that I devised.

  1. Grab all known coding sequences from archaeal genomes to use as a search database.
  2. Use BLAT to search human EST sequences against the archaeal sequences. BLAT speeds up the process: the input files can be split into chunks that fit into the available memory in which BLAT runs, and the tab-delimited output files are easily recombined afterwards.
  3. Extract the much smaller subset of EST sequences that had a hit in step 2 and use those as queries in a BLAST search versus the large non-redundant nt nucleotide database.
  4. Parse the BLAST report and assign species to the top hit for each EST query.

3. The 2014 solution

I’ve long wanted not only to update the results and make them available, but also to automate the procedure and try to make it reproducible. My current solution to automation is a Ruby Rakefile. It works for me on my current system, meaning that I can at least run the pipeline at regular intervals with minimal effort or input on my part.

Whether anyone else could is another question. For reference: the server on which I work has 24 6-core CPUs, 96 GB RAM, plenty of storage (terabytes) and runs Ubuntu Linux. Other requirements should become apparent as we examine the contents of the Rakefile. It should be clear that this is a far-from-perfect work in progress.

My solution to availability is, naturally, a GitHub repository.

Feel free to skip the rest of this section if technical discussion of Rakefiles is not your thing.

Checking for prerequisites

The rake check task looks for required programs in $PATH, using the useful find_executable method from Ruby's mkmf library, and gives a hint about where to find them if they are not installed.

desc "Check for prerequisites"
task :check do
	require 'mkmf' # provides find_executable
	# check for various tools in PATH
	# we need seq, GNU parallel, wget, fastasplitn, blat, blastn, blastdbcmd
	cmds = {
		"seq" => "seq is part of coreutils",
		"parallel" => "GNU Parallel is at",
		"wget" => "GNU Wget is at",
		"fastasplitn" => "fastasplitn can be compiled from the src/ directory",
		"blat" => "BLAT is at",
		"blastn" => "The latest BLAST+ is at",
		"blastdbcmd" => "The latest BLAST+ is at"
	}
	cmds.each_pair do |k, v|
		f = find_executable k
		if f.nil?
			puts "#{k} not found in path; #{v}"
		end
	end
end

3.1 Fetch archaeal sequences

The Rakefile uses two namespaces (not shown in the following code snippets): db for database-related tasks and search for BLAT/BLAST search-related tasks. Further namespaces are used for the type of database: archaea, est or nt. So we fetch archaeal sequences using rake db:archaea:fetch. This code uses a file from the NCBI FTP server to determine the FTP URL for archaeal genome sequence data.
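A minimal sketch of that namespace layout (my reconstruction; the full Rakefile is in the repository, and task bodies are omitted here) would look like this:

```ruby
require 'rake'
include Rake::DSL

# Assumed skeleton of the Rakefile's namespace layout; task bodies omitted
namespace :db do
  namespace :archaea do
    task :fetch   # rake db:archaea:fetch
    task :concat  # rake db:archaea:concat
  end
  namespace :est do
    task :fetch
    task :split
    task :hitdump
  end
  namespace :nt do
    task :fetch
  end
end

namespace :search do
  namespace :blat do
    task :run
    task :parse
  end
  namespace :blast do
    task :run
    task :parse
  end
end
```

Nesting namespaces in this way is what produces the colon-separated task names used throughout the rest of this post.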

		desc "Fetch archaea genome ffn files"
		task :fetch do
			require 'net/ftp'
			# ncbiftp ("") is defined at the top of the Rakefile (not shown)
			Dir.chdir(File.expand_path("../../../db/archaea", __FILE__))
			ftp =
			ftp.gettextfile("genomes/GENOME_REPORTS/prokaryotes.txt", nil) do |line|
				line   = line.chomp
				fields = line.split("\t")
				if fields[4] =~ /[Aa]rchae/ && fields[23] != "-"
					puts fields[23]
					system("wget -N ftp://#{ncbiftp}/genomes/ASSEMBLY_BACTERIA/#{fields[23]}/*.ffn*")
				end
			end
			ftp.close
		end

3.2 Process archaeal sequences

The fetch task retrieves two kinds of files: NC_*.ffn files from annotated genomes, and compressed NZ*.tgz archives from incompletely-annotated genomes, which extract to *.ffn files. The task rake db:archaea:concat combines them all into one FASTA file.

		desc "Uncompress archaea *.scaffold.ffn.tgz files; combine with *.ffn files"
		task :concat do
			Dir.chdir(File.expand_path("../../../db/archaea", __FILE__))
			puts "Extracting tgz files to NZ.ffn..."
			system("find . -name '*.tgz' -exec tar zxfO {} > NZ.ffn \\;")
			puts "Combining NZ.ffn + NC*.ffn files..."
			system("cat NZ.ffn NC_0*.ffn > archaea.ffn")
			puts "Finished; files are in #{Dir.pwd}"
		end

3.3 Fetch nt BLAST database

Fetching the other databases is more straightforward. First, the nt BLAST database, using rake db:nt:fetch. You can see that this is currently not very reproducible, since it makes some assumptions about the number and names of files to download.

		desc "Fetch and uncompress nt BLAST db"
		task :fetch do
			require 'net/ftp'
			ftp =
			ftp.chdir("blast/db")
			nt = ftp.list("nt.*.tar.gz")
			nt = { |f| f.split(" ").last }
			Dir.chdir(File.expand_path("../../../db/nt", __FILE__))
			system("seq 0 9 | parallel -j10 wget -N ftp://#{ncbiftp}/blast/db/nt.0{}.tar.gz")
			system("seq 10 #{nt.count - 1} | parallel -j#{nt.count - 10} wget -N ftp://#{ncbiftp}/blast/db/nt.{}.tar.gz")
			# now need to uncompress
			system("find . -name 'nt.*.tar.gz' -exec tar zxvf {} \\;")
			puts "Finished; files are in #{Dir.pwd}"
		end

3.4 Fetch human EST BLAST database

In similar fashion to the previous task, rake db:est:fetch downloads and extracts the human EST BLAST database. Again there are assumptions about number and names of files.

		desc "Fetch and uncompress human EST BLAST db"
		task :fetch do
			require 'net/ftp'
			ftp =
			ftp.chdir("blast/db")
			est = ftp.list("est_human*.tar.gz")
			est = { |f| f.split(" ").last }
			Dir.chdir(File.expand_path("../../../db/est", __FILE__))
			system("seq 0 #{est.count - 1} | parallel -j#{est.count} wget -N ftp://#{ncbiftp}/blast/db/est_human.0{}.tar.gz")
			# now need to uncompress
			system("find . -name 'est_human.*.tar.gz' -exec tar zxvf {} \\;")
			puts "Finished; files are in #{Dir.pwd}"
		end

3.5 Dump human ESTs to multiple FASTA files

We can dump the human EST BLAST database to FASTA, splitting the output into multiple files in order to speed up the BLAT search later on. There are lots of tools for splitting FASTA files; I found that the best for this pipeline is fastasplitn. The task is rake db:est:split.

Once again there is some regrettable hard-coding here for the number of output files (20). It would be better to write code which could determine an optimal number of output files.
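As a sketch of what that could look like (the function name, the memory figure and the safety factor below are my assumptions, not part of the pipeline):

```ruby
# Derive the number of FASTA chunks from the size of the dump and the
# memory available to each BLAT process (all values here are assumptions)
def optimal_split_count(fasta_bytes, mem_per_process_bytes, safety = 4)
  # leave headroom: BLAT holds the chunk plus its index structures in RAM
  chunks = (fasta_bytes.to_f / (mem_per_process_bytes / safety)).ceil
  [chunks, 1].max
end

# e.g. a 40 GB dump with 8 GB available per BLAT process
optimal_split_count(40 * 2**30, 8 * 2**30)  # => 20
```

Something along these lines would replace the hard-coded 20 with a value derived from the machine the pipeline actually runs on.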

		desc "Dump human EST BLAST db to fasta, splitting for BLAT"
		task :split do
			# dump fasta from the BLAST db and split into 20 files
			Dir.chdir(File.expand_path("../../../db/est", __FILE__))
			ENV['SPLITFRAGTEMPLATE'] = "est_human%3.3d"
			system("blastdbcmd -db est_human -entry all -line_length 60 | fastasplitn - 20")
			puts "Finished; files are in #{Dir.pwd}"
		end

3.6 BLAT search human ESTs versus archaeal sequences

We’re ready to BLAT, using rake search:blat:run. This generates 20 output PSL files.

		desc "Run BLAT of human EST fasta files (query) v. archaea.ffn (subject)"
		task :run do
			Dir.chdir(File.expand_path("../../../", __FILE__))
			system("parallel -j10 'var=$(printf '%03d' {}); blat db/archaea/archaea.ffn db/est/est_human$var -ooc=data/11.ooc data/psl/est_human$var.psl' ::: $(seq 1 10)")
			system("parallel -j10 'var=$(printf '%03d' {}); blat db/archaea/archaea.ffn db/est/est_human$var -ooc=data/11.ooc data/psl/est_human$var.psl' ::: $(seq 11 20)")
			puts "Finished; output PSL files are in #{Dir.pwd}/data/psl"
		end

3.7 Parse BLAT reports

Next, we want to retrieve unique EST query GIs from the BLAT reports. The task for that is rake search:blat:parse and the GIs are written to a file, one per line.

		desc "Parse PSL files, extract list of unique human EST GI, write to file"
		task :parse do
			Dir.chdir(File.expand_path("../../../data", __FILE__))
			system("cat psl/*.psl | cut -f10 | grep ^gi | cut -d '|' -f2 | sort -u > est_hits_gi.txt")
			puts "Finished; GIs in file #{Dir.pwd}/est_hits_gi.txt"
		end

3.8 Dump EST hits to FASTA file

Now we can use the list of GIs from the previous step to dump those ESTs with similarity to archaeal sequences into a FASTA file, using rake db:est:hitdump.

		desc "Dump GIs in data/est_hits_gi.txt to fasta"
		task :hitdump do
			Dir.chdir(File.expand_path("../../..", __FILE__))
			system("blastdbcmd -db db/est/est_human -entry_batch data/est_hits_gi.txt -line_length 60 -out data/est_hits.fa")
			puts "Finished; sequences are in #{Dir.pwd}/data/est_hits.fa"
		end

3.9 BLAST search EST hits versus nt database

We’re getting close. Now we BLAST the archaeal-like ESTs against the nt database using rake search:blast:run. The output format is tab-delimited and includes taxon IDs, important for the next step. On my system BLAST crashes if more than 4 threads are specified, hence the hard-coded limit.

		desc "BLAST search human EST BLAT hits to archaea v. nt database"
		task :run do
			Dir.chdir(File.expand_path("../../../", __FILE__))
			format = "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids"
			system("blastn -db db/nt/nt -query data/est_hits.fa -out data/est_v_nt.tsv -max_target_seqs 1 -outfmt '#{format}' -num_threads 4")
			puts "Finished; output in #{Dir.pwd}/data/est_v_nt.tsv"
		end

3.10 Add species information to BLAST report

The final task, rake search:blast:parse, uses the BioRuby implementation of NCBI EUtils/Efetch and some XML parsing to determine species and kingdom from the taxon IDs and add them to the BLAST output, in a new CSV file.

		desc "Parse EST v nt BLAST output; use taxid to retrieve species and kingdom"
		task :parse do
			require 'bio'
			require 'nokogiri'
			Bio::NCBI.default_email = ""
			taxids = {} # cache: taxid => [species, kingdom]
			Dir.chdir(File.expand_path("../../../", __FILE__))
  "data/est_v_nt.csv", "w") do |f|
				File.readlines("data/est_v_nt.tsv").each do |line|
					line   = line.chomp
					fields = line.split("\t").map(&:strip)
					if taxids.has_key? fields[12]
						fields = fields + taxids[fields[12]]
					else
						xml   = Bio::NCBI::REST::EFetch.taxonomy(fields[12], "xml")
						doc   = Nokogiri::XML(xml)
						names = doc.xpath("//Taxon/ScientificName").map { |child| child.inner_text }
						taxids[fields[12]] = [names[0], names[2]]
						fields = fields + taxids[fields[12]]
					end
					f.write(fields.join(",") + "\n")
				end
			end
			puts "Finished; output in #{Dir.pwd}/data/est_v_nt.csv"
		end

4. The results

Let’s count up the kingdoms in the output file:

cut -f15 -d "," data/est_v_nt.csv | sort | uniq -c
     14 Archaea
   1565 artificial sequences
    262 Bacteria
     60 dsDNA viruses
  14128 Eukaryota

So most of the hits are to eukaryotic sequences; of those 11 229 “really are” human. What about those 14 hits to Archaea?

grep Archaea data/est_v_nt.csv | cut -f1,2,14,15 -d ","
gi|12397306|gb|BF990981.1|,gi|1145360|gb|U39025.1|MAU39025,Methanococcus aeolicus,Archaea
gi|1441578|gb|N88376.1|,gi|393188278|gb|CP003685.1|,Pyrococcus furiosus COM1,Archaea
gi|1570770|dbj|C16063.1|,gi|393188278|gb|CP003685.1|,Pyrococcus furiosus COM1,Archaea
gi|1570774|dbj|C16067.1|,gi|393188278|gb|CP003685.1|,Pyrococcus furiosus COM1,Archaea
gi|30285459|gb|CB990939.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|30285481|gb|CB990961.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|30285514|gb|CB990994.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|30285523|gb|CB991003.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|30285523|gb|CB991003.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|30285523|gb|CB991003.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|30286509|gb|CB991989.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|30288113|gb|CB993593.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|30291474|gb|CB997050.1|,gi|288541968|gb|CP001719.1|,Methanobrevibacter ruminantium M1,Archaea
gi|8613918|gb|BE151197.1|,gi|393188278|gb|CP003685.1|,Pyrococcus furiosus COM1,Archaea

An improvement on our original publication, which yielded only one weak candidate. We’re still seeing hits to Pyrococcus furiosus, likely a result of contamination with Pfu polymerase DNA. The other hits are to methanogen genomes sequenced after 2004, although intriguingly, the tissue from which these ESTs were derived is pre-eclamptic placenta; see for example GI 30285459. Methanogens in placenta? Or on the fingers of less-than-hygienic researchers? Spurious hits due to low-complexity sequence? “More research required”.


I’m now in a position to run this analysis at regular intervals – probably monthly – and push the results to GitHub. Watch this space for any interesting developments.

Making this kind of pipeline available and reproducible by others is less easy. If you have access to a machine with all the prerequisites, clone the repository, replace the symlinks to database directories with real directories, change to the directory code/ruby and run the rake tasks – well, you might be able to do the same thing. But you’d probably rather sequence the human microbiome.

One thought on “My own ‘404 not found’: making amends using GitHub”

  1. Jonathan Badger

    We’re still seeing hits to Pyrococcus furiosus, likely a result of contamination with Pfu polymerase DNA

    Or alternatively, that’s a transcript from some really hot shit (near 100C)…

    But seriously, it’s great that you are revisiting old results, even if like you say there are more modern ways of addressing the issue. But the idea of rerunning old results using new databases to find new things is a great idea that doesn’t get done nearly often enough.
