Virus hosts from NCBI taxonomy: now at Github

After my previous post on extracting virus hosts from NCBI Taxonomy web pages, Pierre wrote:

An excellent idea and here’s my first attempt.

Here’s a count of hosts. By the way NCBI, it’s environment.

cut -f4 virus_host.tsv | sort | uniq -c

   1301 
    283 algae
    114 archaea
   4509 bacteria
      8 diatom
     51 enviroment
    267 fungi
      1 fungi| plants| invertebrates
      4 human
    761 invertebrates
    181 invertebrates| plants
      7 invertebrates| vertebrates
   3979 plants
    102 protozoa
   6834 vertebrates
 115052 vertebrates| human
     43 vertebrates| human  stool
    225 vertebrates| invertebrates
    656 vertebrates| invertebrates| human

Virus hosts from NCBI Taxonomy web pages

A Biostars question asks whether the information about virus host on web pages like this one can be retrieved using Entrez Utilities.

Pretty sure that the answer is no, unfortunately. Sometimes there’s no option but to scrape the web page, in the knowledge that this approach may break at any time. Here’s some very rough and ready Ruby code without tests or user input checks. It takes the taxonomy UID and returns the host, if there is one. No guarantees now or in the future!

#!/usr/bin/ruby

require 'nokogiri'
require 'open-uri'

def get_host(uid)
	url   = "http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&lvl=3&lin=f&keep=1&srchmode=1&unlock&id=" + uid.to_s
	doc   = Nokogiri::HTML.parse(open(url).read)
	data  = doc.xpath("//td").collect { |x| x.inner_html.split("<br>") }.flatten
	data.each do |e|
		puts $1 if e =~ /Host:\s+<\/em>(.*?)$/
	end
end

get_host(ARGV[0])

Save as taxhost.rb and supply the UID as first argument. Note: I chose 12345 off the top of my head, imagining that it was unlikely to be a virus and would make a good negative test. Turns out to be a phage!

$ ruby taxhost.rb 12249
plants
$ ruby taxhost.rb 12721
vertebrates
$ ruby taxhost.rb 11709
vertebrates| human
$ ruby taxhost.rb 12345
bacteria

Analysis of gene expression timecourse data using maSigPro

ANXA11 expression in human smooth muscle aortic cells post-ILb1 exposure

ANXA11 expression in human smooth muscle aortic cells post-ILb1 exposure

About a year ago, I did a little work on a very interesting project which was trying to identify blood-based biomarkers for the early detection of stroke. The data included gene expression measurements using microarrays at various time points after the onset of ischemia (reduced blood supply). I had not worked with timecourse data before, so I went looking for methods and found a Bioconductor package, maSigPro, which did exactly what I was looking for. In combination with ggplot2, it generated some very attractive and informative plots of gene expression over time.

I was very impressed by maSigPro and meant to get around to writing a short guide showing how to use it. So I did finally, using RMarkdown to create the document and here it is. The document also illustrates how to retrieve datasets from GEO using GEOquery and annotate microarray probesets using biomaRt. Hopefully it’s useful to some of you.

I’ll probably do more of this in the future, since publishing RMarkdown to RPubs is far easier than copying, pasting and formatting at WordPress.

Searching for the Steamer retroelement in the ocean metagenome

Location of BLAST (tblastn) hits Mya arenaria GagPol (AIE48224.1) vs GOS contigs

Location of BLAST (tblastn) hits Mya arenaria GagPol (AIE48224.1) vs GOS contigs

Last week, I was listening to episode 337 of the podcast This Week in Virology. It concerned a retrovirus-like sequence element named Steamer, which is associated with a transmissible leukaemia in soft shell clams.

At one point the host and guests discussed the idea of searching for Steamer-like sequences in the data from ocean metagenomics projects, such as the Global Ocean Sampling expedition. Sounds like fun. So I made an initial attempt, using R/ggplot2 to visualise the results.

To make a long story short: the initial BLAST results are not super-convincing, the visualisation could use some work (click image, right, for larger version) and the code/data are all public at Github, summarised in this report. It made for a fun, relatively-quick side project.

Some basics of biomaRt

One of the commonest bioinformatics questions, at Biostars and elsewhere, takes the form: “I have a list of identifiers (X); I want to relate them to a second set of identifiers (Y)”. HGNC gene symbols to Ensembl Gene IDs, for example.

When this occurs I have been known to tweet “the answer is BioMart” (there are often other solutions too) and I’ve written a couple of blog posts about the R package biomaRt in the past. However, I’ve realised that we need to take a step back and ask some basic questions that new users might have. How do I find what marts and datasets are available? How do I know what attributes and filters to use? How do I specify different genome build versions?
Continue reading

R 3.1 -> 3.2 upgrade notes

My machines upgraded from R version 3.1.3 to version 3.2.0 last week, which means that existing code suddenly cannot find packages and so fails. Some notes to myself, possibly useful to others, for what to do when this happens. Relevant to Ubuntu-based systems (I use Linux Mint).

1. Update packages

cp ~/R/x86_64-pc-linux-gnu-library/3.1 ~/R/x86_64-pc-linux-gnu-library/3.2
update.packages(checkBuilt=TRUE, ask=FALSE)

1.1. rJava issues
My rJava installation failed because code was trying to compile against jni.h which was not present on my system. Solution:

sudo apt-get install openjdk-7-jdk
sudo R CMD javareconf

and then in R:

install.packages("rJava")

2. Update Bioconductor
Bioconductor is also upgraded so requires more than a package update. Probably need a new R session for this one.

remove.packages("BiocInstaller")
source("http://bioconductor.org/biocLite.R")
biocLite()

2.1. ChemmineR
My Bioconductor Chemminer update failed because package gridExtra was absent:

install.packages("gridExtra")
biocLite("ChemmineR")

3. General issues
When R is installed on Linux Mint, some packages are installed by default in /usr/lib/R/library. When performing updates as a non-root user, you’ll see messages telling you that this location is not writable and asking if you want to use your own library location. If you reply “yes”, you’ll have packages in both system and user locations. It’s probably better to say “no” and let the Ubuntu package management system handle the package upgrades…although when I tried that, the entire upgrade process halted…

And now we are all done so (careful!):

rm -rf ~/R/x86_64-pc-linux-gnu-library/3.1

Project Tycho, ggplot2 and the shameless stealing of blog ideas

Last week, Mick Watson posted a terrific article on using R to recreate the visualizations in this WSJ article on the impact of vaccination. Someone beat me to the obvious joke.

Someone also beat me to the standard response whenever base R graphics are used.

And despite devoting much of Friday morning to it, I was beaten to publication of a version using ggplot2.

Why then would I even bother to write this post. Well, because I did things a little differently; diversity of opinion and illustration of alternative approaches are good. And because on the internet, it’s quite acceptable to appropriate great ideas from other people when you lack any inspiration yourself. And because I devoted much of Friday morning to it.

Here then is my “exploration of what Mick did already, only using ggplot2 like Ben did already.”
Continue reading

Exploring the NCBI taxonomy database using Entrez Direct

I’ve been meaning to write about Entrez Direct, henceforth called edirect, for some time. This tweet provided me with an excuse:

This post is not strictly the answer to that question. Instead we’ll ask: which parent IDs of records for insects in the NCBI Taxonomy database have the most species IDs?
Continue reading

Configuring the R BatchJobs package for Torque batch queues

I was asked recently to look at some R code which performs “embarrassingly parallel” computations (the same function, multiple times, different parameters) and see whether I could modify it to run on one of our high-performance computing clusters. The machine has 63 virtual compute nodes and uses the TORQUE batch queue system to allocate nodes to compute jobs.

First stop: the CRAN Task View High-Performance and Parallel Computing with R. Two promising packages there: BatchJobs and BatchExperiments. Their documentation is quite extensive with useful examples, but I found it a little disjointed and confusing. What I wanted was a simple, step-by-step guide to setting up for a first-time user. So here is my attempt. As always, it’s for “Linux-like” systems.
Continue reading

Please read “It’s time to reboot bioinformatics education”

I guess I’ve been around bioinformatics for the best part of 15 years. In that time, I’ve seen almost no improvement in the way biologists handle and use data. If anything I’ve seen a decline, perhaps because the data have become larger and more complex with no improvement in the skills base.

It strikes me when I read questions at Biostars that the problem faced by many students and researchers is deeper than “not knowing what to do.” It’s having no idea how to figure out what they need to know in order to do what they want to do. In essence, this is about how to get people into a problem-solving mindset so as they’re aware, for example that:

  • it’s extremely unlikely that you are the first person to encounter this problem
  • it’s likely that the solution is documented somewhere
  • effective search will lead you to a solution even if you don’t fully understand it at first
  • the tool(s) that you know are not necessarily the right ones for the job (and Excel is never the right tool for the job)
  • implementing the solution may require that you (shudder) learn new skills
  • time spent on those skills now is almost certainly time saved later because…
  • …with a very little self-education in programming, tasks that took hours or days can be automated and take seconds or minutes

It’s good (and bad) to know that these issues are not confined to Australian researchers: here is It’s time to reboot bioinformatics education by Todd Harris. It is excellent and you should go and read it as soon as possible.