Twitter Coverage of the ISMB/ECCB Conference 2017

Search all the hashtags

ISMB (Intelligent Systems for Molecular Biology – which sounds rather old-fashioned now, doesn’t it?) is the largest conference for bioinformatics and computational biology. It is held annually and, when in Europe, jointly with the European Conference on Computational Biology (ECCB).

I’ve had the good fortune to attend twice: in Brisbane 2003 (very enjoyable early in my bioinformatics career, but unfortunately the seed for the “no more southern hemisphere meetings” decision), and in Toronto 2008. The latter was notable for its online coverage and, for me, the pleasure of finally meeting in person many members of the online bioinformatics community.

The 2017 meeting (and its satellite meetings) was covered quite extensively on Twitter. My search using a variety of hashtags based on “ismb”, “eccb”, “17” and “2017” retrieved 9052 tweets, which form the basis of this summary at RPubs. Code and raw data can be found at Github.
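As a rough illustration of what “a variety of hashtags” might look like, here is a minimal sketch that combines conference stems with year suffixes and joins them into one search string. It is written in Ruby rather than the R used for the actual report. The stems (including the joint form “ismbeccb”), the method name and the OR-joined query format are all assumptions for illustration, not the search that retrieved the tweets.

```ruby
# Hypothetical helper: build candidate hashtags from conference stems and year suffixes.
def candidate_hashtags(stems, years)
  stems.product(years).map { |stem, year| "##{stem}#{year}" }
end

# Assumed stems; the joint form "ismbeccb" is a guess at what attendees might use.
STEMS = %w[ismb eccb ismbeccb].freeze
YEARS = %w[17 2017].freeze

hashtags = candidate_hashtags(STEMS, YEARS)
query    = hashtags.join(" OR ") # assumed query syntax for a search client

puts hashtags
puts query
```

Generating the combinations rather than typing them out makes it harder to forget a variant such as the two-digit year.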

Usually I just let these reports speak for themselves but in this case, I thought it was worth noting a few points:

Twitter Coverage of the Bioinformatics Open Source Conference 2017

July 21-22 saw the 18th incarnation of the Bioinformatics Open Source Conference, which generally precedes the ISMB meeting. I had the great pleasure of attending BOSC way back in 2003 and delivering a short presentation on Bioperl. I knew almost nothing in those days, but everyone was very kind and appreciative.

My trusty R code for Twitter conference hashtags pulled out 3268 tweets and without further ado here is:

The ISMB/ECCB meeting wraps today and analysis of Twitter coverage for that meeting will appear here in due course.

Visualising Twitter coverage of recent bioinformatics conferences

Back in February, I wrote some R code to analyse tweets covering the 2017 Lorne Genome conference. It worked pretty well. So I reused the code for two recent bioinformatics meetings held in Sydney: the Sydney Bioinformatics Research Symposium and the VIZBI 2017 meeting.

So without further ado, here are the reports in markdown format, which display quite nicely when pushed to Github:

and you can dig around in the repository for the Rmarkdown, HTML and image files, if you like.

Update: also available as published reports at RPubs:

Data corruption using Excel: 12+ years and counting

Why, it seems like only 12 years since we read Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics.

And can it really be 4 years since we reviewed the topic of gene name corruption in Gene name errors and Excel: lessons not learned?

Well, here we are again in 2016 with Gene name errors are widespread in the scientific literature. This study examined 35 175 supplementary Excel data files from 3 597 published articles. Simple yet clever, isn’t it? I bet you wish you’d thought of doing that. I do. The conclusion: about 20% of the articles have associated data files in which gene names have been corrupted by Excel.
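To give a flavour of how such a scan might work, here is a minimal Ruby sketch that flags values which look like gene symbols converted to dates, such as SEPT2 becoming “2-Sep” or MARCH1 becoming “1-Mar”. It covers only the day-month form, and the method names are made up for illustration. The screening used in the study itself may well differ.

```ruby
# Month abbreviations Excel uses when it turns a gene symbol into a date.
MONTHS = %w[Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec].freeze

# Matches values such as "2-Sep" or "1-Mar" (day, hyphen, month abbreviation).
DATE_LIKE = /\A\d{1,2}-(?:#{MONTHS.join('|')})\z/i

# True if the value looks like a date-converted gene symbol.
def date_like?(value)
  DATE_LIKE.match?(value.to_s.strip)
end

# Hypothetical helper: keep only the suspicious entries from a gene name column.
def suspect_names(names)
  names.select { |name| date_like?(name) }
end

# Illustrative column: two corrupted symbols among intact ones.
column = ["TP53", "2-Sep", "BRCA1", "1-Mar", "DEC1"]
puts suspect_names(column).inspect
```

Note that an intact symbol such as DEC1 is not flagged, because the pattern requires the digits to come first.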

What if there is no tomorrow? There wasn’t one today.

We tell you not to use Excel. You counter with a host of reasons why you have to use Excel. None of them are good reasons. I don’t know what else to say. Except to reiterate that probably 80% or more of the data analyst’s time is spent on data cleaning and a good proportion of the dirt arises from avoidable errors.

Virus hosts from NCBI taxonomy: now at Github

After my previous post on extracting virus hosts from NCBI Taxonomy web pages, Pierre wrote:

An excellent idea and here’s my first attempt.

Here’s a count of hosts. By the way NCBI, it’s environment.

$ cut -f4 virus_host.tsv | sort | uniq -c

    283 algae
    114 archaea
   4509 bacteria
      8 diatom
     51 enviroment
    267 fungi
      1 fungi| plants| invertebrates
      4 human
    761 invertebrates
    181 invertebrates| plants
      7 invertebrates| vertebrates
   3979 plants
    102 protozoa
   6834 vertebrates
 115052 vertebrates| human
     43 vertebrates| human  stool
    225 vertebrates| invertebrates
    656 vertebrates| invertebrates| human
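The counts above treat each combination, such as “vertebrates| human”, as its own category. If you would rather count each host separately, something like the following Ruby sketch might do. It assumes, as the cut command suggests, a tab-separated file with the host in the fourth column, and it assumes that “|” separates multiple hosts. The method name and the sample rows are made up for illustration.

```ruby
# Count individual hosts from tab-separated lines, splitting multi-host fields on "|".
def host_counts(lines)
  lines.map { |line| line.chomp.split("\t")[3] } # 4th column, as in cut -f4
       .compact                                  # drop rows with no 4th column
       .flat_map { |field| field.split("|").map(&:strip) }
       .tally
end

# Made-up rows; the first three columns are placeholders, not the real file layout.
sample = [
  "1\tvirus A\tx\tbacteria\n",
  "2\tvirus B\tx\tvertebrates| human\n",
  "3\tvirus C\tx\tvertebrates\n"
]

host_counts(sample).sort.each { |host, n| puts format("%7d %s", n, host) }
```

For the real file, host_counts(File.foreach("virus_host.tsv")) should work in the same way, since File.foreach without a block returns an enumerator of lines.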

Virus hosts from NCBI Taxonomy web pages

A Biostars question asks whether the information about virus host on web pages like this one can be retrieved using Entrez Utilities.

Pretty sure that the answer is no, unfortunately. Sometimes there’s no option but to scrape the web page, in the knowledge that this approach may break at any time. Here’s some very rough and ready Ruby code without tests or user input checks. It takes the taxonomy UID and returns the host, if there is one. No guarantees now or in the future!


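require 'nokogiri'
require 'open-uri'

# Print the host recorded on the NCBI Taxonomy page for a taxonomy UID, if any.
def get_host(uid)
	url   = "" + uid.to_s   # taxonomy page URL + UID
	doc   = Nokogiri::HTML.parse(URI.open(url).read)
	# split each table cell on <br> so that every field sits in its own string
	data  = doc.xpath("//td").collect { |x| x.inner_html.split("<br>") }.flatten
	data.each do |e|
		puts $1 if e =~ /Host:\s+<\/em>(.*?)$/
	end
end

get_host(ARGV[0])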

Save as taxhost.rb and supply the UID as first argument. Note: I chose 12345 off the top of my head, imagining that it was unlikely to be a virus and would make a good negative test. Turns out to be a phage!

$ ruby taxhost.rb 12249
$ ruby taxhost.rb 12721
$ ruby taxhost.rb 11709
vertebrates| human
$ ruby taxhost.rb 12345
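Since the script has no tests or input checks, one possible first step is to separate the pattern matching from the page fetching, so that the fragile part can be exercised offline. The sketch below reuses the regular expression from get_host. The method names are hypothetical, the sample HTML is a made-up fragment shaped to fit the pattern (the real page markup may differ), and the assumption that a UID is a positive integer is mine.

```ruby
# Same pattern as in get_host; captures the text that follows "Host: </em>".
HOST_PATTERN = /Host:\s+<\/em>(.*?)$/

# Hypothetical input check: taxonomy UIDs are assumed to be positive integers.
def valid_uid?(uid)
  uid.to_s.match?(/\A[1-9]\d*\z/)
end

# Return the host string from one table cell's inner HTML, or nil if there is none.
def extract_host(cell_html)
  cell_html.split("<br>").each do |fragment|
    match = HOST_PATTERN.match(fragment)
    return match[1].strip if match
  end
  nil
end

# Made-up fragment for illustration only.
sample = "<em>Rank: </em>species<br><em>Host: </em>vertebrates| human<br>"
p extract_host(sample)
p extract_host("<em>Rank: </em>species")
p valid_uid?("11709")
p valid_uid?("abc")
```

Returning nil rather than printing nothing lets the caller decide how to report a taxon without a host.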

Analysis of gene expression timecourse data using maSigPro

ANXA11 expression in human smooth muscle aortic cells post-ILb1 exposure

About a year ago, I did a little work on a very interesting project which was trying to identify blood-based biomarkers for the early detection of stroke. The data included gene expression measurements using microarrays at various time points after the onset of ischemia (reduced blood supply). I had not worked with timecourse data before, so I went looking for methods and found a Bioconductor package, maSigPro, which did exactly what I was looking for. In combination with ggplot2, it generated some very attractive and informative plots of gene expression over time.

I was very impressed by maSigPro and meant to get around to writing a short guide showing how to use it. So I did finally, using RMarkdown to create the document and here it is. The document also illustrates how to retrieve datasets from GEO using GEOquery and annotate microarray probesets using biomaRt. Hopefully it’s useful to some of you.

I’ll probably do more of this in the future, since publishing RMarkdown to RPubs is far easier than copying, pasting and formatting at WordPress.

Searching for the Steamer retroelement in the ocean metagenome

Location of BLAST (tblastn) hits Mya arenaria GagPol (AIE48224.1) vs GOS contigs

Last week, I was listening to episode 337 of the podcast This Week in Virology. It concerned a retrovirus-like sequence element named Steamer, which is associated with a transmissible leukaemia in soft shell clams.

At one point the host and guests discussed the idea of searching for Steamer-like sequences in the data from ocean metagenomics projects, such as the Global Ocean Sampling expedition. Sounds like fun. So I made an initial attempt, using R/ggplot2 to visualise the results.

To make a long story short: the initial BLAST results are not super-convincing, the visualisation could use some work (click image, right, for larger version) and the code/data are all public at Github, summarised in this report. It made for a fun, relatively quick side project.

Some basics of biomaRt

One of the commonest bioinformatics questions, at Biostars and elsewhere, takes the form: “I have a list of identifiers (X); I want to relate them to a second set of identifiers (Y)”. HGNC gene symbols to Ensembl Gene IDs, for example.

When this occurs I have been known to tweet “the answer is BioMart” (there are often other solutions too) and I’ve written a couple of blog posts about the R package biomaRt in the past. However, I’ve realised that we need to take a step back and ask some basic questions that new users might have. How do I find what marts and datasets are available? How do I know what attributes and filters to use? How do I specify different genome build versions?