When dealing with data from high-throughput experimental platforms such as microarrays, it’s important to account for potential batch effects. A simple example: if you process all your normal tissue samples this week and your cancerous tissue samples next week, you’re in big trouble. Differences between cancer and normal are now confounded with processing time and you may as well start over with new microarrays.
Processing date is often a good surrogate for batch and it was once easy to extract dates from Affymetrix CEL files using Bioconductor. It seems that this is no longer the case.
Read the rest…
Today marks the release of R 3.0.0. There will be plenty of commentary and useful information at sites such as R-bloggers (for example, Tal’s post).
Version 3.0.0 is great news for bioinformaticians, due to the introduction of long vectors. What does that mean? Well, several months ago, I was using the simpleaffy package from Bioconductor to normalize Affymetrix exon microarrays. I began as usual by reading the CEL files:
f <- list.files(path = "data/affyexon", pattern = ".CEL.gz", full.names = T, recursive = T)
cel <- ReadAffy(filenames = f)
When this happened:
Error in read.affybatch(filenames = l$filenames, phenoData = l$phenoData, :
allocMatrix: too many elements specified
I had a relatively-large number of samples (337), but figured a 64-bit machine with ~ 100 GB RAM should be able to cope. I was wrong: due to a hard-coded limit to vector length in R, my matrix had become too large regardless of available memory. See this post and this StackOverflow question for the computational details.
My solution at the time was to resort to Affymetrix Power Tools. Hopefully, the introduction of the LONG vector will make Bioconductor even more capable and useful.