Make prettier documents by reusing chunks in RMarkdown

No revelations here, just a little R tip for generating more readable documents.


Original with lots of code at the top

There are times when I want to show code in a document, but I don’t want it to be the first thing that people see. What I want to see first is the output from that code. In this silly example, I want the reader to focus their attention on the result of myFunction(), which is 49.

Academic Karma: a case study in how not to use open data

Update: in response to my feedback, auto-generated profiles without accounts are no longer displayed at Academic Karma. Well done and thanks to them for the rapid response.

A news story in Nature last year caused considerable mirth and consternation in my social networks by claiming that ResearchGate, a “Facebook for scientists”, is widely used and visited by scientists. Since this is true of nobody that we know, we can only assume that there is a whole “other” sub-network of scientists defined by both usage of ResearchGate and willingness to take Nature surveys seriously.

You might be forgiven, however, for assuming that I have a profile at ResearchGate because here it is. Except: it is not. That page was generated automatically by ResearchGate, using what they could glean about me from bits of public data on the Web. Since they have only discovered about one-third of my professional publications, it’s a gross misrepresentation of my achievements and activity. I could claim the profile, log in and improve the data, but I don’t want to expose myself and everyone I know to marketing spam until the end of time.

One issue with providing open data about yourself online is that you can’t predict how it might be used. Which brings me to Academic Karma.

Presentations online for Bioinformatics FOAM 2015

Off to Melbourne tomorrow for perhaps my favourite annual work event: the Bioinformatics FOAM (Focus on Analytical Methods) meeting, organised by CSIRO.

Unfortunately, though for good reasons, it’s an internal event this year, but I’m putting my presentations online. I’ll be speaking twice; the first talk, on Thursday, is called “Online bioinformatics forums: why do we keep asking the same questions?” It’s an informal, subjective survey of the questions that come up again and again at bioinformatics Q&A forums such as Biostars, and my attempt to understand why this is the case. Of course one simple answer might be selection bias: we don’t observe the users who came, found that their question already had an answer and so did not ask it again. I’ll also try to articulate my concern that many people view bioinformatics as a collection of recipe-style solutions to specific tasks, rather than a philosophy of how to do biological data analysis.

My second talk on Friday is called “Should I be dead? a very personal genomics.” It’s a more practical talk, outlining how I converted my own 23andMe raw data to VCF format, for use with the Ensembl Variant Effect Predictor. The question for the end – which I’ve left open – is this: as personal genomics becomes commonplace, we’re going to need simple but effective reporting tools that patients and their clinicians can use. What are those tools going to look like?
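
The conversion step can be sketched roughly as follows. This is a minimal illustration in Python, not the method used for the talk. It assumes the usual 23andMe raw data layout (tab-separated rsid, chromosome, position and genotype, with comment lines starting with `#`). It also assumes a hypothetical `ref_alleles` lookup that supplies the reference base for each position; the raw file does not contain reference alleles, so a real conversion would have to take them from a reference genome. The function names are made up for this example.

```python
def genotype_to_vcf(line, ref_alleles):
    """Turn one 23andMe raw-data line into a VCF data line, or None if skipped."""
    if not line.strip() or line.startswith("#"):
        return None  # blank line or comment/header line
    rsid, chrom, pos, genotype = line.rstrip("\r\n").split("\t")
    # ref_alleles is a hypothetical dict: (chromosome, position) -> reference base
    ref = ref_alleles.get((chrom, int(pos)))
    # Skip positions without a known reference base, no-calls ("--")
    # and indel codes (I/D); only plain A/C/G/T calls are handled here.
    if ref is None or len(genotype) not in (1, 2):
        return None
    if any(base not in "ACGT" for base in genotype):
        return None
    # ALT alleles are the distinct non-reference bases, in order of appearance.
    alts = []
    for base in genotype:
        if base != ref and base not in alts:
            alts.append(base)
    alleles = [ref] + alts
    # Unphased genotype as allele indices, e.g. "0/1"; haploid calls give "0" or "1".
    gt = "/".join(str(i) for i in sorted(alleles.index(b) for b in genotype))
    alt_field = ",".join(alts) if alts else "."
    return "\t".join([chrom, pos, rsid, ref, alt_field, ".", ".", ".", "GT", gt])


def convert(lines, ref_alleles, sample="SAMPLE"):
    """Return a list of VCF lines (minimal header plus one record per usable call)."""
    header = [
        "##fileformat=VCFv4.2",
        '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t" + sample,
    ]
    records = (genotype_to_vcf(line, ref_alleles) for line in lines)
    return header + [record for record in records if record is not None]


# Made-up example data, for illustration only.
raw = [
    "# rsid\tchromosome\tposition\tgenotype\n",
    "rs1\t1\t100\tAG\n",
    "rs2\t1\t200\t--\n",
    "rs3\tX\t300\tC\n",
]
ref = {("1", 100): "G", ("1", 200): "T", ("X", 300): "C"}
vcf_lines = convert(raw, ref)
print("\n".join(vcf_lines))
```

Records that match the reference are kept with `.` in the ALT column; whether a downstream tool wants those rows is a choice this sketch leaves open.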

Looking forward to spending some time in Melbourne and hopefully catching up with this awesome lady.

Better living through informatics: in search of koalas

In 2015, I’d like to write, think and do more about things that I care about. One of those things happens to be the koala. Now, this being a blog about bioinformatics and computational biology, I can’t just start writing about any old thing that takes my fancy…I guess. So in this post I’m going to stretch the definition to include ecological informatics and tell you the story of how I achieved a long-held ambition using one of my favourite online resources, The Atlas of Living Australia. And then we’ll wrap up with a quick survey of the (sorry) state of marsupial genomics.

Problematic cell lines: now in a real database

Back in July, I was complaining about the latest abuse of the word “database” by biologists: the “PDF as database.”

This led to some very productive discussion using PubMed Commons and I’m happy to report that misidentified and contaminated cell lines are now included in the NCBI BioSample database.

As the news release notes, rather alarmingly:

“This problem is so common it is thought that thousands of misleading and potentially erroneous papers have been published using cell lines that are incorrectly identified”

So it would be useful if there were a direct link between the BioSample record for a cell line and PubMed records in which it was used…

Measuring quality is hard

Four articles.

Exhibit A: the infamous “(insert statistical method here)”. Exhibit B: “just make up an elemental analysis”. Exhibit C: a methods paper in which a significant proportion of the text was copied verbatim from a previous article. Finally, exhibit D, which shall be forever known as the “crappy Gabor” paper.

Notice anything?

I think that altmetrics are a great initiative. So long as we’re clear that what’s being measured is attention, not quality.

Create your own gene IDs! No wait. Don’t.

Here’s a new way to abuse biological information: take a list of gene IDs and use them to create a completely fictitious, but very convincing set of microarray probeset IDs.

This one begins with a question at BioStars, concerning the conversion of Affymetrix probeset IDs to gene names. Being a “convert ID X to ID Y” question, the obvious answer is “try BioMart” and indeed the microarray platform ([MoGene-1_0-st] Affymetrix Mouse Gene 1.0 ST) is available in the Ensembl database.

However, things get weird when we examine some example probeset IDs: 73649_at, 17921_at, 18174_at. One of the answers to the question notes that these do not map to mouse.

The data are from GEO series GSE56257. The microarray platform is GPL17777. Description: “This is identical to GPL6246 but a custom cdf environment was used to extract data. The cdf can be found at the link below.”
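
A quick format check makes the suspicion concrete. The sketch below, in Python, only tests whether an ID is a run of digits followed by `_at` and returns the integer part. The idea that the integer is a gene ID with `_at` appended is an assumption taken from the title of this post and is not verified by the code. The helper name `candidate_gene_id` is hypothetical.

```python
import re

# Matches IDs made of digits followed by "_at", e.g. "73649_at".
DIGITS_AT = re.compile(r"^(\d+)_at$")


def candidate_gene_id(probeset_id):
    """Return the integer before "_at", or None if the ID has another shape."""
    match = DIGITS_AT.match(probeset_id)
    return int(match.group(1)) if match else None


example_ids = ["73649_at", "17921_at", "18174_at"]
candidates = [candidate_gene_id(pid) for pid in example_ids]
print(candidates)  # [73649, 17921, 18174]
```

The resulting integers could then be submitted to a lookup such as BioMart as gene IDs, not as probeset IDs, to see whether they map to mouse that way.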

Uh-oh. Alarm bells.

“Health Hack”: crossing the line between hackfest and unpaid labour

I’ve never attended a hackathon (hack day, hackfest or codefest). My impression of them is that there is, generally, a strong element of “working for the public good”: seeking to use code and data in new ways that maximise benefit and build communities.

Which is why I’m somewhat mystified by the projects on offer at the Sydney HealthHack. They read like tenders for consultants. Unpaid consultants.

The projects – a pedigree drawing tool, a workflow to process microscopy images, a statistical calculator and a mutation discovery pipeline – all describe problems that competent bioinformaticians could solve using existing tools in a relatively short time. For example, off the top of my head, ImageJ or CSIRO’s Workspace might be worth looking at for problem (2). The steps described in problem (4) – copy and paste between spreadsheets, manual inspection and manipulation of sequence data – should be depressingly familiar examples to many bioinformaticians. This project can be summarised simply as “you’re doing it wrong because you don’t know any better.”

The overall tone is “my research group requires this tool, but we’re unable to employ anyone to do it.” There is no sense of anything wider than the immediate needs of individual researchers. This does not seem, to me, what hackfest philosophy is all about.

This raises an issue that I think about a lot: how do we (the science community) best get the people with the expertise (in this case, bioinformaticians) to the people with the problems? In an ideal world the answer would be “everyone should employ at least one.” I wonder about the market (Australian or more generally) for paid consulting “biological data scientists”? We complain that we’re under-valued; well, perhaps it is we who are doing the valuation when we offer our skills for free.