April 4, 2013
Today marks the release of R 3.0.0. There will be plenty of commentary and useful information at sites such as R-bloggers (for example, Tal’s post).
Version 3.0.0 is great news for bioinformaticians, due to the introduction of long vectors. What does that mean? Well, several months ago, I was using the simpleaffy package from Bioconductor to normalize Affymetrix exon microarrays. I began as usual by reading the CEL files:
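library(simpleaffy)
# pattern is a regular expression, so escape the dots and anchor the suffix
f <- list.files(path = "data/affyexon", pattern = "\\.CEL\\.gz$", full.names = TRUE, recursive = TRUE)
cel <- ReadAffy(filenames = f)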
When this happened:
Error in read.affybatch(filenames = l$filenames, phenoData = l$phenoData, :
allocMatrix: too many elements specified
I had a relatively large number of samples (337), but figured a 64-bit machine with ~100 GB RAM should be able to cope. I was wrong: due to a hard-coded limit on vector length in R (2^31 - 1 elements), my matrix had become too large regardless of available memory. See this post and this StackOverflow question for the computational details.
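A rough back-of-the-envelope check shows how 337 samples can hit that limit. This is only a sketch: it assumes each CEL file comes from an array with 2560 x 2560 cells (an illustrative figure, since the array type is not stated here) and that all intensities end up in one cells-by-samples matrix. The variable names are made up for the example.

```r
# Assumption: one intensity per cell per sample, all held in a single matrix
n.samples       <- 337
cells.per.array <- 2560 * 2560   # assumed array dimensions, check your own CEL files

# Total number of matrix elements needed
n.elements <- n.samples * cells.per.array

# Before R 3.0.0, a vector (and so a matrix) could hold at most 2^31 - 1 elements
limit <- .Machine$integer.max

n.elements > limit               # TRUE means the matrix cannot be allocated

# Largest number of arrays that would still fit under the old limit
max.arrays <- floor(limit / cells.per.array)
max.arrays
```

Under these assumptions the problem is the element count, not the amount of RAM, which is why a bigger machine did not help.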
My solution at the time was to resort to Affymetrix Power Tools. Hopefully, the introduction of long vectors will make Bioconductor even more capable and useful.
Posted in bioinformatics, programming, R, statistics
February 26, 2013
I’m a big fan of ggplot2. Recently, I ran into a situation which called for a useful feature that I had not used previously: aes_string.
Read the rest…
Posted in programming, R, research diary, statistics
February 13, 2013
File under “I keep forgetting how to do this basic, frequently-required task, so I’m writing it down here.”
Let’s create a data frame which contains five variables, vars, named A – E, each of which appears twice, along with some measurements:
df.orig <- data.frame(vars = rep(LETTERS[1:5], 2), obs1 = c(1:10), obs2 = c(11:20))
df.orig
#    vars obs1 obs2
# 1     A    1   11
# 2     B    2   12
# 3     C    3   13
# 4     D    4   14
# 5     E    5   15
# 6     A    6   16
# 7     B    7   17
# 8     C    8   18
# 9     D    9   19
# 10    E   10   20
Now, let’s say we want only the rows that contain the maximum value of obs1 for each of A – E. In bioinformatics, for example, we might be interested in selecting the microarray probeset with the highest sample variance from multiple probesets per gene. The answer is obvious in this trivial example (rows 6 – 10), but one procedure looks like this:
Read the rest…
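The full procedure is in the complete post. As a minimal sketch of one way to do it in base R (not necessarily the approach used there), ave() can compute the maximum of obs1 within each value of vars, and a logical comparison then keeps the matching rows. The names obs1.max and df.sel are introduced here for illustration only.

```r
# Recreate the example data frame from the text
df.orig <- data.frame(vars = rep(LETTERS[1:5], 2), obs1 = 1:10, obs2 = 11:20)

# For every row, the maximum of obs1 within that row's group (A, B, ...)
obs1.max <- ave(df.orig$obs1, df.orig$vars, FUN = max)

# Keep only the rows where obs1 equals its group maximum
df.sel <- df.orig[df.orig$obs1 == obs1.max, ]
df.sel
```

Note that if two rows in a group tie for the maximum, this keeps both of them.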
Posted in programming, R, research diary, statistics
January 31, 2013
Retraction Watch reports a study of microarray data sharing. The article, published in Clinical Chemistry, is itself behind a paywall despite trumpeting the virtues of open data. So straight to the Open Access Irony Award group at CiteULike it goes.
I was not surprised to learn that the rate of public deposition of data is low, nor that most deposited data ignores standards and much of it is low quality. What did catch my eye, though, was a retraction notice for one of the articles from the study, in which the authors explain the reason for retraction.
Read the rest…
Posted in bioinformatics, publications, statistics
August 28, 2012
Updates from RStudio support:
(1) “Thanks for reporting and I was able to reproduce this as well. I’ve filed a bug and we’ll take a look.”
(2) “Taking a further look, this is actually a bug in the Markdown package and we’ve asked the maintainer (Jeffrey Horner) to look into it.”
As juejung points out in a comment on my previous post, applying custom CSS to R Markdown by sourcing the custom rendering function breaks the rendering of inline equations.
I’ve opened an issue with RStudio support and will update here with their response. In the meantime, one solution to this problem is:
- Do not create the files custom.css or style.R, as described yesterday
- Instead, just put the custom CSS at the top of your R Markdown file using style tags, as shown below
<style type="text/css">
table {
  max-width: 95%;
  border: 1px solid #ccc;
}
th {
  background-color: #000000;
  color: #ffffff;
}
td {
  background-color: #dcdcdc;
}
</style>
Posted in programming, R, research diary, statistics
August 27, 2012
People have been telling me for a while that the latest version of RStudio, the IDE for R, is a great way to generate reports. I finally got around to trying it out and for once, the hype is justified. Start with this excellent tutorial from Jeremy Anglim.
Briefly: the process is not so different to Sweave, except that (1) instead of embedding R code in LaTeX, we embed R code in a document written using R Markdown; (2) instead of Sweave, we use the knitr package; (3) the focus is on generating HTML documents for publishing to the Web (see e.g. RPubs), although knitr can also generate PDF documents, just like Sweave.
It took me a little while to figure out a couple of things. First, how best to generate HTML tables, ideally using the xtable package. Second, how to override the default RStudio/R Markdown style. I’ve documented those tasks in this post.
Read the rest…
Posted in programming, R, research diary, statistics
August 16, 2012
OK, let’s do this: some statistics and visualization of the tweets for ISMB 2012.
Read the rest…
Posted in bioinformatics, meetings, R, statistics
July 23, 2012
Update: as pointed out in the comments, the amusing error in this article has been “corrected” (or at least, “edited away”). Thanks for your interest.
Update: I note that this article is now “Highly Accessed” ;)
An integrative analysis of DNA methylation and RNA-Seq data for human heart, kidney and liver
BMC Systems Biology 2011, 5(Suppl 3):S4

(insert statistical method here). No, really.
With thanks to Simon J Greenhill and Dave Winter.
Posted in bioinformatics, publications, statistics
May 11, 2012
Finally, I get around to telling you that…
…on Friday 24th February, I took a day out from my regular job to attend a meeting on Open Source Drug Discovery for Malaria. I should state straight away that whilst drug discovery and chem(o)informatics are topics that I find very interesting, I have no professional experience or connections in either area. However, it was an opportunity to learn more, listen to some great speakers, think about what bioinformaticians might be able to bring to the table and of course, finally meet Mat Todd in person. Mat, if you don’t know, is one of the few people on the planet who really does science online, as opposed to talking about science online.
Here’s what I learned – with just a little analysis using R later in the post, hence the statistics/R category.
Read the rest…
Posted in meetings, open science, R, statistics
March 16, 2012
I use the biomaRt package from Bioconductor in almost every R session. So I thought I’d load the library and set up a mart instance in my ~/.Rprofile:
library(biomaRt)
mart.hs <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
On starting R, I was somewhat perplexed to see this error message:
Error in bmVersion(mart, verbose = verbose) :
could not find function "read.table"
Twitter to the rescue. @hadleywickham told me to load utils first and @vsbuffalo explained that normally, .Rprofile is read before the utils package is loaded. Seems rather odd to me; I’d have thought that biomaRt should load utils if required, but there you go.
So this works in ~/.Rprofile:
library(utils)
library(biomaRt)
mart.hs <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
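Since useMart() contacts a remote server, a slightly more defensive ~/.Rprofile might be worth considering. The version below is only a sketch under my own assumptions: that the mart is wanted only in interactive sessions, and that a missing package or failed connection should produce a message rather than an error at startup.

```r
# utils is not yet attached when .Rprofile is read, so load it explicitly
library(utils)

# Only set up the mart for interactive sessions where biomaRt is installed
if (interactive() && requireNamespace("biomaRt", quietly = TRUE)) {
  library(biomaRt)
  # useMart() needs a network connection; fall back to NULL if it fails
  mart.hs <- tryCatch(
    useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl"),
    error = function(e) {
      message("Could not set up mart.hs: ", conditionMessage(e))
      NULL
    }
  )
}
```

The trade-off is that mart.hs may be NULL or undefined after startup, so later code needs to check for it.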
Posted in programming, R, research diary, statistics