Before we start: yes, we’ve been here before. There was the Biostars question “Calculating Time From Submission To Publication / Degree Of Burden In Submitting A Paper.” That gave rise to Pierre’s excellent blog post and code + data on Figshare.
So why are we here again? 1. It’s been a couple of years. 2. This is the R (+ Ruby) version. 3. It’s always worth highlighting how the poor state of publicly-available data prevents us from doing what we’d like to do. In this case the interesting question “which bioinformatics journal should I submit to for rapid publication?” becomes “here’s an incomplete analysis using questionable data regarding publication dates.”
Let’s get it out of the way then.
File this one under “has troubled me (and others) for some years now, let’s try to resolve it.”
Let’s use the excellent R/rentrez package to search PubMed for articles that were retracted in 2013.
es <- entrez_search("pubmed", "\"Retracted Publication\"[PTYP] 2013[PDAT]", usehistory = "y")
#  117
117 articles. Now let’s fetch the records in XML format.
xml <- entrez_fetch("pubmed", WebEnv = es$WebEnv, query_key = es$QueryKey,
rettype = "xml", retmax = es$count)
Next question: which XML element specifies the “Date of publication” (PDAT)?
Sometimes, several strands of thought come together in one place. For me right now, it’s the Wikipedia page “Ebola virus epidemic in West Africa”, which got me thinking about the perennial topic of “data wrangling”, how best to provide public data and why I can’t shake my irritation with the term “data science”. Not to mention Ebola, of course.
I imagine that a lot of people with an interest in biological data are following this story and thinking “how can I visualise the numbers for myself?” Maybe you’d like to reproduce the plots in the Timeline section of that Wikipedia entry. Surprise: the raw numbers are not that easy to obtain.
2014-09-26 note: when Wikipedia pages change, as this one has, code breaks, as this code has; updates maintained at Github
I must have a minor reputation as a critic of Excel in bioinformatics, since strangers are now sending contributions to my work email address. Thanks, you know who you are!
When asked why I didn’t mask this email address, I replied “the authors didn’t”
This week: Online Survival Analysis Software to Assess the Prognostic Value of Biomarkers Using Transcriptomic Data in Non-Small-Cell Lung Cancer
. Scroll on down to supporting Table S1
and right there on the page, staring you in the face is a rather unusual-looking microarray probeset ID.
I wonder if we should start collecting notable examples in one place?
To be fair, this is more human error than an issue with Excel per se, but I’m going to argue that using Excel promotes sloppy data management errors by making minds lazy :)
I’ve been complaining about this for years. They fixed it. The NCBI have reorganised their genomes FTP site and finally, Archaea are not lumped in with Bacteria.
Archaea are still included in the ASSEMBLY_BACTERIA directory; hopefully that’s next on the list.
[*] to be fair, they’ve always recognised Archaea – just not in a form that makes downloads convenient
Ever wondered whether the “>” symbol can, or does, appear in FASTA sequence headers at positions other than the start of the line?
I have a recent copy of the NCBI non-redundant nucleotide (nt) BLAST database on my server, so let’s take a look. The files are in a directory which is also named nt:
# %i = sequence ID; %t = sequence title
blastdbcmd -db /data/db/nt/nt -entry all -outfmt '%i %t' | grep ">" > ~/Documents/nt.txt
wc -l ~/Documents/nt.txt
# => 54451 /home/sau103/Documents/nt.txt
# and how many sequences total in nt?
blastdbcmd -list /data/db/nt/ -list_outfmt '%n' | head -1
# => 23745273
Short answer – yes, about 0.23% of nt sequence descriptions contain the “>” character. Inspection of the output shows that it’s used in a number of ways. A few examples:
# as "brackets" (very common)
emb|V01351.1| Sea urchin fragment, 3' to the actin gene in <SPAC01>
gb|GU086899.1| Cotesia sp. jft03 voucher BIOUG<CAN>:CPWH-0042 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial
# to indicate mutated bases or amino acids
gb|M21581.1|SYNHUMUBA Synthetic human ubiquitin gene (Thr14->Cys), complete cds
dbj|AB047520.1| Homo sapiens gene for PER3, exon 1, 128+22(G->A) polymorphism
# in chemical nomenclature
gb|AF134414.1|AF134414 Homo sapiens B-specific alpha 1->3 galactosyltransferase (ABO) mRNA, ABO-*B101 allele, complete cds
# as "arrows"
gb|EU303182.1| Apoi virus note kitaoka-> canals ->NIMR nonstructural protein 5 (NS5) gene, partial cds
ref|XM_001734501.1| Entamoeba dispar SAW760 5'->3' exoribonuclease, putative EDI_265520 mRNA, complete cds
Something to keep in mind when writing code to process FASTA format.
6-way Venn banana
I thought nothing could top the classic “6-way Venn banana
“, featured in The banana (Musa acuminata) genome and the evolution of monocotyledonous plants
That is until I saw Figure 3 from Compact genome of the Antarctic midge is likely an adaptation to an extreme environment.
5-way Venn roadkill
What’s odd is that Figure 2 in the latter paper is a nice, clear R/ggplot2 creation, using facet_grid(), so someone knew what they were doing.
That aside, the Antarctic midge paper is an interesting read; go check it out.
This led to some amusing Twitter discussion which pointed me to *A New Rose : The First Simple Symmetric 11-Venn Diagram.
[*] +1 for referencing The Damned, if indeed that was the intention.
Let’s start by making one thing clear. Using coloured cells in Excel to encode different categories of data is wrong. Next time colleagues explain excitedly how “green equals normal and red = tumour”, you must explain that (1) they have sinned and (2) what they meant to do was add a column containing the words “normal” and “tumour”.
I almost hesitate to write this post but…we have to deal with the world as it is, not as we would like it to be. So in the interests of just getting the job done: here’s one way to deal with coloured cells in Excel, should someone send them your way.
“Take a look at the TP53 mutation database“, my colleague suggested. “OK then, I will”, I replied.
I present what follows as “a typical day in the life of a bioinformatician”.
I’ve had a half-formed, but not very interesting blog post in my head for some months now. It’s about a conversation I had with a PhD student, around 10 years ago, after she went to a bioinformatics talk titled “Excel is not a database” and how she laughed as I’d been telling her that “for years already”. That’s basically the post so as I say, not that interesting, except as an illustration that we’ve been talking about this stuff for a long time (and little has changed).
HEp-2 or not HEp2?
Anyway, we have something better. I was exploring PubMed Commons
, which is becoming a very good resource. The top-featured comment
looks very interesting (see image, right).
Intrigued, I went to investigate the Database of Cross-contaminated or Misidentified Cell Lines, hovered over the download link and saw that it’s – wait for it – a PDF. I’ll say that again. The “database” is a PDF.
The sad thing is that this looks like very useful, interesting information which I’m sure would be used widely if presented in an appropriate (open) format and better-publicised. Please, biological science, stop embarrassing yourself. If you don’t know how to do data properly, talk to someone who does.