Posts tagged ‘microarray’

April 4, 2013

A brief note: R 3.0.0 and bioinformatics

Today marks the release of R 3.0.0. There will be plenty of commentary and useful information at sites such as R-bloggers (for example, Tal’s post).

Version 3.0.0 is great news for bioinformaticians, due to the introduction of long vectors. What does that mean? Well, several months ago, I was using the simpleaffy package from Bioconductor to normalize Affymetrix exon microarrays. I began as usual by reading the CEL files:

f <- list.files(path = "data/affyexon", pattern = ".CEL.gz", full.names = T, recursive = T)
cel <- ReadAffy(filenames = f)

When this happened:

Error in read.affybatch(filenames = l$filenames, phenoData = l$phenoData,  : 
  allocMatrix: too many elements specified

I had a relatively-large number of samples (337), but figured a 64-bit machine with ~ 100 GB RAM should be able to cope. I was wrong: due to a hard-coded limit to vector length in R, my matrix had become too large regardless of available memory. See this post and this StackOverflow question for the computational details.

My solution at the time was to resort to Affymetrix Power Tools. Hopefully, the introduction of the LONG vector will make Bioconductor even more capable and useful.

January 31, 2013

It’s #overlyhonestmethods come to life!

Retraction Watch reports a study of microarray data sharing. The article, published in Clinical Chemistry, is itself behind a paywall despite trumpeting the virtues of open data. So straight to the Open Access Irony Award group at CiteULike it goes.

I was not surprised to learn that the rate of public deposition of data is low, nor that most deposited data ignores standards and much of it is low quality. What did catch my eye though, was a retraction notice for one of the articles from the study, in which the authors explain the reason for retraction.
Read the rest…

March 14, 2012

Simple plots reveal interesting artifacts

I’ve recently been working with methylation data; specifically, from the Illumina Infinium HumanMethylation450 bead chip. It’s a rather complex array which uses two types of probes to determine the methylation state of DNA at ~ 485 000 sites in the genome.

The Bioconductor project has risen to the challenge with a (somewhat bewildering) variety of packages to analyse data from this chip. I’ve used lumi, methylumi and minfi, but will focus here on just the latter.

Update: a related post from Oliver Hofmann (@fiamh).
Read the rest…

January 27, 2011

APIs have let me down part 1/2: ArrayExpress

The API – Application Programming Interface – is, in principle, a wonderful thing. You make a request to a server using a URL and back come lovely, structured data, ready to parse and analyse. We’ve begun to demand that all online data sources offer an API and lament the fact that so few online biological databases do so.

Better though, to have no API at all than one which is poorly implemented and leads to frustration? I’m beginning to think so, after recent experiences on both a work project and one of my “fun side projects”. Let’s start with the work project, an attempt to mine a subset of the ArrayExpress microarray database.
Read the rest…

September 9, 2010

Trust no-one: errors and irreproducibility in public data

Just when I was beginning to despair at the state of publicly-available microarray data, someone sent me an article which…increased my despair.

The article is:

Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology (2009)
Keith A. Baggerly and Kevin R. Coombes
Ann. Appl. Stat. 3(4): 1309-1334

It escaped my attention last year, in part because “Annals of Applied Statistics” is not high on my journal radar. However, other bloggers did pick it up: see posts at Reproducible Research Ideas and The Endeavour.

In this article, the authors examine several papers in their words “purporting to use microarray-based signatures of drug sensitivity derived from cell lines to predict patient response.” They find that not only are the results difficult to reproduce but in several cases, they simply cannot be reproduced due to simple, avoidable errors. In the introduction, they note that:

…a recent survey [Ioannidis et al. (2009)] of 18 quantitative papers published in Nature Genetics in the past two years found reproducibility was not achievable even in principle for 10.

You can get an idea of how bad things are by skimming through the sub-headings in the article. Here’s a selection of them:
Read the rest…

August 30, 2010

GEO database: curation lagging behind submission?

geometadb

GSE and GDS records in GEOmetadb by date

I was reading an old post that describes GEOmetadb, a downloadable database containing metadata from the GEO database. We had a brief discussion in the comments about the growth in GSE records (user-submitted) versus GDS records (curated datasets) over time. Below, some quick and dirty R code to examine the issue, using the Bioconductor GEOmetadb package and ggplot2. Left, the resulting image – click for larger version.

Is the curation effort keeping up with user submissions? A little difficult to say, since GEOmetadb curation seems to have its own issues: (1) why do GDS records stop in 2008? (2) why do GDS (curated) records begin earlier than GSE (submitted) records?

library(GEOmetadb)
library(ggplot2)

# update database if required using getSQLiteFile()
# connect to database; assumed to be in user $HOME
con <- dbConnect(SQLite(), "~/GEOmetadb.sqlite")

# fetch "last updated" dates for GDS and GSE
gds <- dbGetQuery(con, "select update_date from gds")
gse <- dbGetQuery(con, "select last_update_date from gse")

# cumulative sums by date; no factor variables
gds.count <- as.data.frame(cumsum(table(gds)), stringsAsFactors = F)
gse.count <- as.data.frame(cumsum(table(gse)), stringsAsFactors = F)

# make GDS and GSE data frames comparable
colnames(gds.count) <- "count"
colnames(gse.count) <- "count"

# row names (dates) to real dates
gds.count$date <- as.POSIXct(rownames(gds.count))
gse.count$date <- as.POSIXct(rownames(gse.count))

# add type for plotting
gds.count$type <- "gds"
gse.count$type <- "gse"

# combine GDS and GSE data frames
gds.gse <- rbind(gds.count, gse.count)

# and plot records over time by type
png(filename = "geometadb.png", width = 800, height = 600)
print(ggplot(gds.gse, aes(date,count)) + geom_line(aes(color = type)))
dev.off()
June 7, 2010

biomaRt and GenomeGraphs: a worked example

As promised a few posts ago, another demonstration of the excellent biomaRt package, this time in conjunction with GenomeGraphs.

Here’s what we’re going to do:

  1. Grab some public microarray data
  2. Normalise and get a list of the most differentially-expressed probesets
  3. Use biomaRt to fetch the genes associated with those probesets
  4. Plot the data using GenomeGraphs

If you want to follow along on your own machine, it will need to be quite powerful. We’ll be processing exon arrays, which requires a 64-bit machine with at least 4 GB (and preferably, at least 8-12 GB) of RAM. As usual, I’m assuming some variant of Linux and that you’re comfortable at the command line.
Read the rest…

January 27, 2010

Clustering GEO samples by title (briefly) revisited

So, we had a brief discussion regarding my previous post and clearly the statement:

The longest key for which values exist classifies your titles

does not hold true for all cases. Not that I ever said that it did! I remind you that this blog is a place for the half-formed ideas that spill out of my head, not an instruction manual.

Let’s look, for example, at GSE19318. This GEO series comprises 2 platforms: one for dog (10 samples) and one for humans (1 sample), with these sample titles:

['Dog-tumor-81705', 'Dog-tumor-78709', 'Dog-tumor-88012', 'Dog-tumor-8888302', 'Dog-tumor-209439', 'Dog-tumor-212227', 'Dog-tumor-48', 'Dog-tumor-125', 'Dog-tumor-394', 'Dog-tumor-896', 'Human-tumor']

Run that through the Ruby code in my last post and we get:

{"Dog-tumor-48"=>["Dog-tumor-48"], "Dog-tumor-81"=>["Dog-tumor-81705"], "Dog-tumor-39"=>["Dog-tumor-394"], "Dog-tumor-20"=>["Dog-tumor-209439"], "Dog-tumor-21"=>["Dog-tumor-212227"], "Dog-tumor-88"=>["Dog-tumor-88012", "Dog-tumor-8888302"], "Dog-tumor-89"=>["Dog-tumor-896"], "Dog-tumor-12"=>["Dog-tumor-125"], "Dog-tumor-78"=>["Dog-tumor-78709"]}

Whoa. That went badly wrong! However, it’s easy to see why. With only one human sample, value.length is not more than one, so that sample disappears altogether. For the dog samples, the longest key is not the key that contains all samples, due to the title naming scheme.

We might try instead to maximize the value length – that is, the array value which contains the most samples:

  # longest value.length
  count = 0
  hash.each_pair do |key,value|
    count = value.length if count < value.length
  end
  # delete unwanted keys
  hash.delete_if { |key,value| value.length != count }
  return hash

Which will give us a choice of “dog sample keys”, but still drops the human sample:

["Dog-tumo", "Do", "Dog-tumor-", "D", "Dog-tum", "Dog-t", "Dog", "Dog-tu", "Dog-tumor", "Dog-"]

Other things we might try:

  • Partition sample titles by platform before trying to partition samples by title
  • Don’t delete any hash keys based on key/value length; just present all options to the user
  • Decide that sample partitioning by title is a poor idea and try a different approach

As ever, life would be much easier if GEO samples were titled or described in some logical, parse-able fashion.

January 24, 2010

“Thinking algorithmically”: clustering GEO samples by title

Today’s challenge. Take a look at this array, which contains the “title” field for the 6 samples from GSE1323, a series in the GEO microarray database:

['SW-480-1','SW-480-2','SW-480-3','SW-620-1','SW-620-2','SW-620-3']

Humans are very good at classification. Almost instantly, you’ll see that there are 2 classes, “SW-480″ and “SW-620″, each with 3 samples. How can we write a program to do the same job?
Read the rest…

January 8, 2010

Samples per series/dataset in the NCBI GEO database

Andrew asks:

I want to get an NCBI GEO report showing the number of samples per series or data set. Short of downloading all of GEO, anyone know how to do this? Is there a table of just metadata hidden somewhere?

At work, we joke that GEO is the only database where data goes in, but it won’t come out. However, there is an alternative: the GEOmetadb package, available from Bioconductor.

The R code first, then some explanation:

# install GEOmetadb
source("http://bioconductor.org/biocLite.R")
biocLite("GEOmetadb")
library(GEOmetadb)

# connect to database
getSQLiteFile()
con <- dbConnect(SQLite(), "GEOmetadb.sqlite")

# count samples per GDS
gds.count <- dbGetQuery(con, "select gds,sample_count from gds")
gds.count[1:5,]
# first 5 results
     gds sample_count
1   GDS5            5
2   GDS6           29
3  GDS10           28
4  GDS12            8
5  GDS15            6
# count samples per GSE
gse <- dbGetQuery(con, "select series_id from gsm")
gse.count <- as.data.frame(table(gse$series_id))
gse.count[1:10,]
# first 10 results
                Var1 Freq
1               GSE1   38
2              GSE10    4
3             GSE100    4
4           GSE10000   29
5           GSE10001   12
6           GSE10002    8
7           GSE10003    4
8  GSE10004,GSE10114    3
9           GSE10005   48
10          GSE10006   75

We install GEOmetadb (lines 2-4), then download and unpack the SQLite database (line 7). This generates the file ~/GEOmetadb.sqlite, which is currently a little over 1 GB.

Next, we connect to the database via RSQLite (lines 7-8). The gds table contains GDS dataset accession and sample count, so extracting that information is very easy (line 11).

GSE series are a little different. The gsm table contains GSM sample accession and GSE series accession (in the series_id field). We can count up the samples per series using table(), on line 22. However, this generates some odd-looking results, such as:

          Var1          Freq
15    GSE10011,GSE10026 45
14652 GSE9973,GSE10026   9
14654 GSE9975,GSE10026  36
14656 GSE9977,GSE10026  24

Fear not. In this case, GSE10026 is a super-series comprised from the series GSE10011 (45 samples), GSE9973 (9 samples), GSE9975 (36 samples) and GSE9977 (24 samples), total = 114 samples.

Follow

Get every new post delivered to your Inbox.

Join 2,204 other followers