Archive for ‘R’

April 4, 2013

A brief note: R 3.0.0 and bioinformatics

Today marks the release of R 3.0.0. There will be plenty of commentary and useful information at sites such as R-bloggers (for example, Tal’s post).

Version 3.0.0 is great news for bioinformaticians, due to the introduction of long vectors. What does that mean? Well, several months ago, I was using the simpleaffy package from Bioconductor to normalize Affymetrix exon microarrays. I began as usual by reading the CEL files:

f <- list.files(path = "data/affyexon", pattern = ".CEL.gz", full.names = T, recursive = T)
cel <- ReadAffy(filenames = f)

When this happened:

Error in read.affybatch(filenames = l$filenames, phenoData = l$phenoData,  : 
  allocMatrix: too many elements specified

I had a relatively-large number of samples (337), but figured a 64-bit machine with ~ 100 GB RAM should be able to cope. I was wrong: due to a hard-coded limit to vector length in R, my matrix had become too large regardless of available memory. See this post and this StackOverflow question for the computational details.

My solution at the time was to resort to Affymetrix Power Tools. Hopefully, the introduction of the LONG vector will make Bioconductor even more capable and useful.

February 26, 2013

R/ggplot2 tip: aes_string

I’m a big fan of ggplot2. Recently, I ran into a situation which called for a useful feature that I had not used previously: aes_string.
Read the rest…

February 13, 2013

Basic R: rows that contain the maximum value of a variable

File under “I keep forgetting how to do this basic, frequently-required task, so I’m writing it down here.”

Let’s create a data frame which contains five variables, vars, named A – E, each of which appears twice, along with some measurements:

df.orig <- data.frame(vars = rep(LETTERS[1:5], 2), obs1 = c(1:10), obs2 = c(11:20))
df.orig
#    vars obs1 obs2
# 1     A    1   11
# 2     B    2   12
# 3     C    3   13
# 4     D    4   14
# 5     E    5   15
# 6     A    6   16
# 7     B    7   17
# 8     C    8   18
# 9     D    9   19
# 10    E   10   20

Now, let’s say we want only the rows that contain the maximum values of obs1 for A – E. In bioinformatics, for example, we might be interested in selecting the microarray probeset with the highest sample variance from multiple probesets per gene. The answer is obvious in this trivial example (6 – 10), but one procedure looks like this:
Read the rest…

August 28, 2012

Addendum to yesterday’s post on custom CSS and R Markdown


Updates from RStudio support:
(1) “Thanks for reporting and I was able to reproduce this as well. I’ve filed a bug and we’ll take a look.”
(2) Taking a further look, this is actually a bug in the Markdown package and we’ve asked the maintainer (Jeffrey Horner) to look into it.

As juejung points out in a comment on my previous post, applying custom CSS to R Markdown by sourcing the custom rendering function breaks the rendering of inline equations.

I’ve opened an issue with RStudio support and will update here with their response. In the meantime, one solution to this problem is:

  1. Do not create the files custom.css or style.R, as described yesterday
  2. Instead, just put the custom CSS at the top of your R Markdown file using style tags, as shown below
<style type="text/css">
table {
   max-width: 95%;
   border: 1px solid #ccc;
}

th {
  background-color: #000000;
  color: #ffffff;
}

td {
  background-color: #dcdcdc;
}
</style>
Tags: ,
August 27, 2012

Custom CSS for HTML generated using RStudio

People have been telling me for a while that the latest version of RStudio, the IDE for R, is a great way to generate reports. I finally got around to trying it out and for once, the hype is justified. Start with this excellent tutorial from Jeremy Anglim.

Briefly: the process is not so different to Sweave, except that (1) instead of embedding R code in LaTeX, we embed R code in a document written using R Markdown; (2) instead of Sweave, we use the knitr package; (3) the focus is on generating HTML documents for publishing to the Web (see e.g. RPubs), although knitr can also generate PDF documents, just like Sweave.

It took me a little while to figure out a couple of things. First, how best to generate HTML tables, ideally using the xtable package. Second, how to override the default RStudio/R Markdown style. I’ve documented those tasks in this post.
Read the rest…

Tags: , , ,
August 16, 2012

Twitter coverage of the ISMB 2012 meeting: some statistics

OK, let’s do this: some statistics and visualization of the tweets for ISMB 2012.

Read the rest…

Tags: ,
May 11, 2012

My day out at #osddmalaria

Finally, I get around to telling you that…
…on Friday 24th February, I took a day out from my regular job to attend a meeting on Open Source Drug Discovery for Malaria. I should state straight away that whilst drug discovery and chem(o)informatics are topics that I find very interesting, I have no professional experience or connections in either area. However, it was an opportunity to learn more, listen to some great speakers, think about what bioinformaticians might be able to bring to the table and of course, finally meet Mat Todd in person. Mat, if you don’t know, is one of the few people on the planet who really does science online, as opposed to talking about science online.

Here’s what I learned – with just a little analysis using R later in the post, hence the statistics/R category.
Read the rest…

March 16, 2012

R gotcha for the week

I use the biomaRt package from Bioconductor in almost every R session. So I thought I’d load the library and set up a mart instance in my ~/.Rprofile:

library(biomaRt)
mart.hs <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

On starting R, I was somewhat perplexed to see this error message:

Error in bmVersion(mart, verbose = verbose) : 
  could not find function "read.table"

Twitter to the rescue. @hadleywickham told me to load utils first and @vsbuffalo explained that normally, .Rprofile is read before the utils package is loaded. Seems rather odd to me; I’d have thought that biomaRt should load utils if required, but there you go.

So this works in ~/.Rprofile:

library(utils)
library(biomaRt)
mart.hs <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
March 14, 2012

Simple plots reveal interesting artifacts

I’ve recently been working with methylation data; specifically, from the Illumina Infinium HumanMethylation450 bead chip. It’s a rather complex array which uses two types of probes to determine the methylation state of DNA at ~ 485 000 sites in the genome.

The Bioconductor project has risen to the challenge with a (somewhat bewildering) variety of packages to analyse data from this chip. I’ve used lumi, methylumi and minfi, but will focus here on just the latter.

Update: a related post from Oliver Hofmann (@fiamh).
Read the rest…

December 2, 2011

A Friday round-up

Just a brief selection of items that caught my eye this week. Note that this is a Friday as opposed to Friday, lest you mistake this for a new, regular feature.

1. R/statistics

  • ggbio
  • A new Bioconductor package which builds on the excellent ggplot graphics library, for the visualization of biological data.

  • R development master class
  • Hadley Wickham recently presented this course on R package development for my organisation. I was on parental leave at the time, otherwise I would have attended for sure.

2. Bioinformatics in the media
DNA Sequencing Caught in Deluge of Data

I described this NYT article as a “surprisingly-good intro article“. Michael Eisen described it as “kind of silly“.

I think we’re both right. Michael’s perspective is that of an expert in high-throughput sequencing data; I’m just pleased to see an introduction to bioinformatics for non-specialists in a mainstream newspaper. And I note that they have corrected the figure caption which offended Michael.

As to the “deluge”: yes, there are other sciences that generate more data and yes, we probably don’t need to archive/analyse a lot of the raw data. However, I’d contend that the basic premise of the article is correct: we are sequencing faster than we can analyse. The solution, obviously, is more bioinformaticians.

Follow

Get every new post delivered to your Inbox.

Join 2,204 other followers