It’s always nice when 12-month-old code runs without a hitch. Not sure why this did not become a GitHub repo the first time around, but now it is: my RMarkdown code to generate a report using data from the Nobel Prize API.
Now you too can generate a “gee, it’s all old white men” chart as seen in The Economist – Greying of the Nobel laureates, BBC News – Why are Nobel Prize winners getting older? and, no doubt, many other outlets every year, including my own report at RPubs, updated from 2015. As for myself, perhaps I should be offering my services to news outlets instead of publishing on blogs and obscure web platforms :)
I must have a minor reputation as a critic of Excel in bioinformatics, since strangers are now sending contributions to my work email address. Thanks, you know who you are!
When asked why I didn’t mask this email address, I replied: “the authors didn’t”.
This week: Online Survival Analysis Software to Assess the Prognostic Value of Biomarkers Using Transcriptomic Data in Non-Small-Cell Lung Cancer. Scroll on down to supporting Table S1 and right there on the page, staring you in the face, is a rather unusual-looking microarray probeset ID.
I wonder if we should start collecting notable examples in one place?
To be fair, this is more human error than an issue with Excel per se, but I’m going to argue that using Excel promotes sloppy data management by making minds lazy :)
I’ve been complaining about this for years. They fixed it. The NCBI have reorganised their genomes FTP site and finally, Archaea are not lumped in with Bacteria.
Archaea are still included in the ASSEMBLY_BACTERIA directory; hopefully that’s next on the list.
[*] to be fair, they’ve always recognised Archaea – just not in a form that makes downloads convenient
A DOI, this morning
When I arrive at work, the first task for the day is “check feeds”. If I’m lucky, in the “journal TOCs” category, there will be an abstract that looks interesting, like this one on the left (click for larger version).
Sometimes, the title is a direct link to the article at the journal website. Often though, the link is a Digital Object Identifier or DOI. Frequently, when the article is labelled as “advance access” or “early”, clicking on the DOI link leads to a page like the one below on the right.
In the grand scheme of things I suppose this rates as “minor annoyance”; it means that I have to visit the journal website and search for the article in question. The question is: why does this happen? I’m not familiar with the practical details of setting up a DOI, but I assume that the journal submits article URLs to the DOI system for processing. So who do I blame – journals, for making URLs public before the DOI is ready, or the DOI system, for not processing new URLs quickly enough?
There’s also the issue of whether terms like “advance access” have any meaning in the era of instant, online publishing but that’s for another day.
A couple of years ago, I noted that some journals were not making the process of commenting on articles especially easy. My latest experience suggests that little has changed.
Three blog posts have been sitting in my drafts folder for a year. Inspired by Andrew’s post on posts that never made it, I’d like to describe them briefly, before I hit “delete” and move on.
As I’m a biologist, rather than an inorganic chemist or a mineralogist, I don’t have much (well, any) need to look at crystal structures of simple inorganic compounds. Just as well…
…our story begins at Twitter, where David Bradley asks:
Anyone know where to find crystal structures of sodium hypochlorite and sodium bisulfate (cif files or similar) ? #science #crystal
Never thought about it, you say, but surely it can’t be very difficult. So you head to Google and try searches such as “inorganic crystal structure database”. There you unearth two main players: the Inorganic Crystal Structure Database (ICSD) and the Cambridge Structural Database (CSD). Both are private, requiring registration, login and in one case, installation of an X-client.
Coming from bioinformatics, where comparable resources such as the PDB are freely available via web interfaces, I find this utterly perplexing. Why do these research communities stand for it? Is anyone developing free, open alternatives?
This is a little odd – the tale of the publication that isn’t.
Update: the “missing article” surfaced in my RSS reader on Nov 1; here’s the link
I’m with Ogden Nash who said:
I love the baby giant panda,
I’d welcome one to my veranda
This week, I learned via Keith that Chinese scientists announced the completion of the giant panda genome. An impressive achievement, given that the project was announced in March this year, but what exactly has been completed? Has the genome been sequenced (that is, there are strings of A, C, G and T covering most chromosomes) or mapped (that is, the approximate chromosomal location of most genes has been determined)? The media seem unsure.
And so on. Here’s a Google News search with more hits.
So what has been achieved – sequencing or mapping? If the former, is it really complete (I doubt this) or draft – and if draft, what kind of quality? And where are the data? Nothing in the genome project section of NCBI as yet.
In the midst of preparing a talk for next Monday. It occurred to me that perhaps we don’t see more protein structure-based prediction in bioinformatics because – there aren’t enough structures.
Sure, the PDB has grown a lot in the past 5 years or so and 53 103 structures (as of now) looks impressive. However, if you’re interested in protein-protein interaction, you want at least 2 chains, which more or less halves the dataset. If you want two different protein chains, you lose almost another 75%. Let’s specify a reasonable minimum resolution for X-ray diffraction data and there go ~ 3 000 entries. We probably don’t want multiple, similar proteins, so let’s remove redundancy at 90% sequence identity. We’re left with about 2% of the original PDB, which might be usable for looking at interactions.
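For the curious, here is a rough back-of-envelope version of that funnel in Python. It is only a sketch: the fractions are my rounded readings of the figures above ("more or less halves" taken as 0.5, "almost another 75%" taken as keeping 0.25 of what remains, "~ 3 000" taken as exactly 3 000), and the function and variable names are made up for illustration. The redundancy step is not given as a fraction, so the code only works out what it would have to be to land on about 2%.

```python
# Back-of-envelope funnel for the PDB filtering described above.
# All fractions are approximations read off the prose, not measured values.

TOTAL_PDB = 53103  # structures in the PDB at the time of writing


def filter_funnel(total, steps):
    """Apply successive filters and return a list of (label, remaining)."""
    remaining = float(total)
    rows = [("all PDB entries", remaining)]
    for label, kind, value in steps:
        if kind == "keep_fraction":
            remaining *= value          # keep this share of what is left
        elif kind == "drop_count":
            remaining -= value          # remove an absolute number of entries
        else:
            raise ValueError(f"unknown step kind: {kind}")
        rows.append((label, remaining))
    return rows


STEPS = [
    ("at least 2 chains", "keep_fraction", 0.5),              # "more or less halves"
    ("two different protein chains", "keep_fraction", 0.25),  # "lose almost another 75%"
    ("minimum X-ray resolution", "drop_count", 3000),         # "there go ~ 3 000 entries"
]

rows = filter_funnel(TOTAL_PDB, STEPS)
before_redundancy = rows[-1][1]

# The text ends at "about 2% of the original PDB"; derive what the
# 90% identity redundancy filter would need to keep to get there.
final_estimate = 0.02 * TOTAL_PDB
implied_keep_fraction = final_estimate / before_redundancy

if __name__ == "__main__":
    for label, remaining in rows:
        print(f"{label:32s} {remaining:10.0f}")
    print(f"{'non-redundant at 90% identity':32s} {final_estimate:10.0f}")
    print(f"implied keep fraction for redundancy step: {implied_keep_fraction:.2f}")
```

Under those assumptions, the redundancy filter would have to discard well over half of what survives the first three steps, which is consistent with the PDB containing many near-identical entries.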
No wonder that most bioinformatics focuses on sequences and high-throughput interaction data.