Location of BLAST (tblastn) hits Mya arenaria GagPol (AIE48224.1) vs GOS contigs
Last week, I was listening to episode 337 of the podcast This Week in Virology. It concerned a retrovirus-like sequence element named Steamer, which is associated with a transmissible leukaemia in soft-shell clams.
At one point the host and guests discussed the idea of searching for Steamer-like sequences in the data from ocean metagenomics projects, such as the Global Ocean Sampling expedition. Sounds like fun. So I made an initial attempt, using R/ggplot2 to visualise the results.
To make a long story short: the initial BLAST results are not super-convincing, the visualisation could use some work and the code/data are all public at GitHub, summarised in this report. It made for a fun, relatively quick side project.
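For readers who want to try something similar, here is a minimal sketch in Python of how tblastn hits can be summarised by where they land on the query protein. It is not the code from the report, which used R/ggplot2. It assumes the search was saved in BLAST's 12-column tabular format (`-outfmt 6`). The contig names, the rows in `EXAMPLE` and the e-value cutoff are made up for illustration.

```python
import csv
import io
from collections import Counter

# Default column order of BLAST tabular output (-outfmt 6).
BLAST6_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_blast6(handle):
    """Yield one dict per hit from tab-separated BLAST output."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST6_FIELDS, row))
        for key in ("length", "mismatch", "gapopen",
                    "qstart", "qend", "sstart", "send"):
            hit[key] = int(hit[key])
        for key in ("pident", "evalue", "bitscore"):
            hit[key] = float(hit[key])
        # In tblastn output, subject (nucleotide) coordinates run
        # backwards when the hit is on the reverse strand.
        hit["strand"] = "+" if hit["sstart"] <= hit["send"] else "-"
        yield hit


def query_coverage(hits, max_evalue=1e-5):
    """Count how many hits cover each position of the query protein.

    max_evalue is an arbitrary illustrative cutoff, not a recommendation.
    """
    depth = Counter()
    for hit in hits:
        if hit["evalue"] <= max_evalue:
            for pos in range(hit["qstart"], hit["qend"] + 1):
                depth[pos] += 1
    return depth


# Hypothetical rows: same query, three invented contigs.
EXAMPLE = (
    "AIE48224.1\tcontig_A\t35.0\t100\t60\t2\t10\t109\t300\t1\t1e-10\t60.0\n"
    "AIE48224.1\tcontig_B\t30.0\t50\t33\t1\t100\t149\t1\t150\t1e-6\t45.0\n"
    "AIE48224.1\tcontig_C\t25.0\t40\t30\t0\t500\t539\t1\t120\t0.5\t25.0\n"
)

hits = list(parse_blast6(io.StringIO(EXAMPLE)))
depth = query_coverage(hits)
```

The `depth` counter maps each protein position to the number of hits that cover it, which is the kind of table a coverage-style plot can be drawn from.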
I’ve had a half-formed but not very interesting blog post in my head for some months now. It’s about a conversation I had with a PhD student, around 10 years ago, after she went to a bioinformatics talk titled “Excel is not a database”, and how she laughed because I’d been telling her that “for years already”. That’s basically the post so, as I say, it’s not that interesting, except as an illustration that we’ve been talking about this stuff for a long time (and little has changed).
HEp-2 or not HEp2?
Anyway, we have something better. I was exploring PubMed Commons, which is becoming a very good resource. The top-featured comment looks very interesting (see image, right).
Intrigued, I went to investigate the Database of Cross-contaminated or Misidentified Cell Lines, hovered over the download link and saw that it’s – wait for it – a PDF. I’ll say that again. The “database” is a PDF.
The sad thing is that this looks like very useful, interesting information which I’m sure would be used widely if it were presented in an appropriate (open) format and better publicised. Please, biological science, stop embarrassing yourself. If you don’t know how to do data properly, talk to someone who does.
File under “interesting articles that I don’t have time to write about at length.”
- Archaea and Fungi of the Human Gut Microbiome: Correlations with Diet and Bacterial Residents
Long ago, before metagenomics and NGS, I did a little work on detection of Archaea in human microbiomes. There’s a blog post in the pipeline about that but until then, enjoy this article in PLoS ONE.
- Mutational heterogeneity in cancer and the search for new cancer-associated genes
This article is getting a lot of attention on Twitter this week. Brief summary: cancer cells are really messed up in all sorts of ways, most of which are not causal with respect to the cancer. Anyone who has ever looked at microarray data knows that it’s not uncommon for 50% or more of genes to show differential expression in a cancer/normal comparison, so this is hardly a new concept. I think we need to move away from ever-more detailed characterizations of the ways in which cancer cells are “messed up.” We know that they are and that doesn’t provide much insight, in my opinion.
- The vast majority of statistical analysis is not performed by statisticians
Interesting post by Jeff Leek, summarized very well by its title. It points out that many more people are now interested in data analysis, many of them are not trained professionally as statisticians (I’m in this category myself) and we need to recognize and plan for that.
Bonus post doing the rounds of social media: Using Metadata to Find Paul Revere. Social network analysis, 18th-century style. Amusing, informative and topical.
June 23, 2004. BMC Bioinformatics publishes “Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics”. We roll our eyes. Do people really do that? Is it really worthy of publication? However, we admit that if it happens then it’s good that people know about it.
October 17, 2012. A colleague on our internal Yammer network writes:
Read the rest…
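The gene-name errors from the 2004 paper are easy to screen for before a file goes anywhere near a spreadsheet. What follows is a minimal sketch in Python, assuming two simple patterns: symbols that look like a month plus a day (SEPT2, MARCH1, DEC1) and identifiers that look like scientific notation (RIKEN clone IDs such as 2310009E13). The function name `excel_risk` and both regular expressions are illustrative heuristics, not a reproduction of Excel's actual conversion rules.

```python
import re

# Month-like prefixes. This is a rough heuristic (an assumption),
# not Excel's real date parser.
MONTH_PREFIXES = (
    "JAN", "FEB", "MAR", "MARCH", "APR", "APRIL", "MAY", "JUN", "JUNE",
    "JUL", "JULY", "AUG", "SEP", "SEPT", "OCT", "NOV", "DEC",
)

# A month-like prefix, an optional hyphen, then one or two digits.
DATE_LIKE = re.compile(
    r"^(?:%s)-?\d{1,2}$" % "|".join(MONTH_PREFIXES), re.IGNORECASE
)

# Digits, the letter E, then digits: reads like scientific notation.
SCI_LIKE = re.compile(r"^\d+E\d+$", re.IGNORECASE)


def excel_risk(identifier):
    """Return 'date', 'number' or None for a gene identifier."""
    text = identifier.strip()
    if DATE_LIKE.match(text):
        return "date"
    if SCI_LIKE.match(text):
        return "number"
    return None


symbols = ["SEPT2", "MARCH1", "DEC1", "2310009E13", "TP53"]
flagged = {s: excel_risk(s) for s in symbols if excel_risk(s)}
```

In this example `flagged` contains every identifier except TP53. A check like this only warns about risky names. It cannot recover the original value once a spreadsheet has already converted it.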
23andme have been blogging for a while, but activity has recently picked up. Their blog is entitled “The spittoon” (tagline: more than you’ve come to expectorate…nice one). A recent post, bluntly headed “Why science can’t share”, points us to this NYT article by a cancer biostatistician on the difficulties in accessing raw biomedical data.
Update: the NYT article was free when I posted this, but now requires login. Ah, the irony…
The 23andme post is filed, quite appropriately, under “big questions”. A blog worth keeping an eye on.