Category Archives: statistics

Quilt plots. Like heat maps, only…heat maps

Stephen tweets:

A “quilt plot”


Quilt plots. Sounds interesting. The link points to a short article in PLoS ONE, containing a table and a figure. Here is Figure 1.

If you looked at that and thought “Hey, that’s a heat map!”, you are correct. That is a heat map. Let’s be quite clear about that. It’s a heat map.

So, how do the authors justify publishing a method for drawing heat maps and then calling them “quilt plots”?
Read the rest…

R: how not to use savehistory() and source()

Admitting to stupidity is part of the learning process. So in the interests of public education, here’s something stupid that I did today.

You’re working in the R console. Happy with your exploratory code, you decide to save it to a file.

savehistory(file = "myCode.R")

Then, you type something else, for example:

ls()
# more lines here

And then, decide that you should save again:

savehistory(file = "myCode.R")

You quit the console. Returning to it later, you recall that you saved your code and so can simply run source() to get back to the same point:

source("myCode.R")

Unfortunately, you forget that the sourced file now contains the savehistory() command. Result: your new history contains only the single source() line, so that is what gets saved back to the file, replacing all of your lovely code.

Possible solutions include:

  • Remember to edit the saved file, removing or commenting out any savehistory() lines
  • Generate a file name for savehistory() based on a timestamp so as not to overwrite each time
  • Suggested by Scott: include a prompt in the code before savehistory()
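The second option might be sketched as follows. The helper names history_file() and save_history_stamped() are hypothetical, not part of base R; only savehistory(), interactive(), format() and Sys.time() are real functions.

```r
# Hypothetical helper: build a file name that embeds a timestamp,
# so that each save goes to a new file instead of overwriting the last one.
history_file <- function(prefix = "myCode", time = Sys.time()) {
  paste0(prefix, "_", format(time, "%Y%m%d_%H%M%S"), ".R")
}

# Hypothetical wrapper: save the console history to a timestamped file.
# savehistory() only works in an interactive session, hence the guard.
save_history_stamped <- function(prefix = "myCode") {
  f <- history_file(prefix)
  if (interactive()) {
    savehistory(file = f)
  }
  invisible(f)
}

# A fixed time makes the naming scheme easy to see.
example_name <- history_file("myCode", as.POSIXct("2013-04-03 12:34:56", tz = "UTC"))
print(example_name)
```

Even if a saved file is sourced later and happens to contain a call to save_history_stamped(), the call writes to a fresh file name, so the earlier code survives.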

Bacteria and Alzheimer’s disease: I just need to know if ten patients are enough

You can guarantee that when scientists publish a study titled:

Determining the Presence of Periodontopathic Virulence Factors in Short-Term Postmortem Alzheimer’s Disease Brain Tissue

a newspaper will publish a story titled:

Poor dental health and gum disease may cause Alzheimer’s

Without access to the paper, it’s difficult to assess the evidence. I suggest you read Jonathan Eisen’s analysis of the abstract. Essentially, it makes two claims:

  • that cultured astrocytes (a type of brain cell) can adsorb and internalize lipopolysaccharide (LPS) from Porphyromonas gingivalis, a bacterium found in the mouth
  • that LPS was also detected in brain tissue from 4/10 Alzheimer’s disease (AD) cases, but not in tissue from 10 matched normal brains

Regardless of the biochemistry – which does not sound especially convincing to me[1] – how about the statistics?
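As a rough first look at the second claim, the counts can be put into a 2 x 2 table and passed to Fisher's exact test. This is a sketch using only the numbers quoted from the abstract (4 of 10 AD cases, 0 of 10 normal brains); the choice of test is an assumption made here.

```r
# Counts taken from the abstract: LPS detected in 4/10 AD and 0/10 normal brains.
lps <- matrix(
  c(4, 6,    # AD: detected, not detected
    0, 10),  # normal: detected, not detected
  nrow = 2,
  dimnames = list(LPS = c("detected", "not detected"),
                  group = c("AD", "normal"))
)

# Fisher's exact test suits small counts like these (two-sided by default).
result <- fisher.test(lps)
print(lps)
print(result$p.value)
```

The two-sided p-value comes out at about 0.087, which is above the conventional 0.05 threshold.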
Read the rest…

Microarrays, scan dates and Bioconductor: it shouldn’t be this difficult

When dealing with data from high-throughput experimental platforms such as microarrays, it’s important to account for potential batch effects. A simple example: if you process all your normal tissue samples this week and your cancerous tissue samples next week, you’re in big trouble. Differences between cancer and normal are now confounded with processing time and you may as well start over with new microarrays.
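A toy cross-tabulation makes the confounding visible. The sample sheet below is invented for illustration: eight samples, with one processing schedule that confounds tissue type with week and one that balances them.

```r
# Invented sample sheet: 4 normal and 4 cancer samples.
samples <- data.frame(
  tissue          = rep(c("normal", "cancer"), each = 4),
  confounded_week = rep(c("week1", "week2"), each = 4),   # all normals first
  balanced_week   = rep(c("week1", "week2"), times = 4)   # tissues mixed per week
)

# Confounded design: each tissue type appears in one week only (zero cells),
# so tissue and processing week cannot be separated.
confounded <- table(samples$tissue, samples$confounded_week)

# Balanced design: each tissue type is processed in both weeks.
balanced <- table(samples$tissue, samples$balanced_week)

print(confounded)
print(balanced)
```

Zero cells in the first table mean that any cancer/normal difference could equally be a week 1/week 2 difference.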

Processing date is often a good surrogate for batch and it was once easy to extract dates from Affymetrix CEL files using Bioconductor. It seems that this is no longer the case.
Read the rest…

Interestingly: the sentence adverbs of PubMed Central

Scientific writing – by which I mean journal articles – is a strange business, full of arcane rules and conventions with origins that no-one remembers but to which everyone adheres.

I’ve always been amused by one particular convention: the sentence adverb. Used with a comma to make a point at the start of a sentence, as in these examples:

Surprisingly, we find that the execution of karyokinesis and cytokinesis is timely…
Grossly, the tumor is well circumscribed with fibrous capsule…
Correspondingly, the short-term Smad7 gene expression is graded…

The example that always makes me smile is interestingly. “This is interesting. You may not have realised that. So I said interestingly, just to make it clear.”

With that in mind, let’s go looking for sentence adverbs in article abstracts.
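One crude way to start, sketched here as an assumption about how such a search could work: split the text into sentences and keep any capitalised word ending in "ly" that opens a sentence and is followed by a comma. The function name and the sample abstract are invented for illustration.

```r
# Hypothetical helper: extract sentence-initial "-ly" words followed by a comma.
find_sentence_adverbs <- function(text) {
  # Naive sentence split on terminal punctuation followed by whitespace.
  sentences <- unlist(strsplit(text, "[.!?][[:space:]]+"))
  # regmatches() returns only the sentences that actually match.
  hits <- regmatches(sentences, regexpr("^[A-Z][a-z]+ly,", sentences))
  sub(",$", "", hits)
}

# Invented abstract text, not taken from any real article.
abstract <- paste(
  "Interestingly, the effect was absent in controls.",
  "We repeated the assay twice.",
  "Surprisingly, the result held.",
  "The samples were stored quickly, then frozen."
)

adverbs <- find_sentence_adverbs(abstract)
print(table(adverbs))
```

The pattern is deliberately simple: it will also catch non-adverbs such as a sentence beginning "Italy," and will miss adverbs that do not end in "ly".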
Read the rest…

Snippets: guts, cancers, statistics

File under “interesting articles that I don’t have time to write about at length.”

  • Archaea and Fungi of the Human Gut Microbiome: Correlations with Diet and Bacterial Residents
    Long ago, before metagenomics and NGS, I did a little work on detection of Archaea in human microbiomes. There’s a blog post in the pipeline about that but until then, enjoy this article in PLoS ONE.

  • Mutational heterogeneity in cancer and the search for new cancer-associated genes
    This article is getting a lot of attention on Twitter this week. Brief summary: cancer cells are really messed up in all sorts of ways, most of which are not causal with respect to the cancer. Anyone who has ever looked at microarray data knows that it’s not uncommon for 50% or more of genes to show differential expression in a cancer/normal comparison, so this is hardly a new concept. I think we need to move away from ever-more detailed characterizations of the ways in which cancer cells are “messed up.” We know that they are and that doesn’t provide much insight, in my opinion.

  • The vast majority of statistical analysis is not performed by statisticians
    Interesting post by Jeff Leek, summarized very well by its title. It points out that many more people are now interested in data analysis, many of them are not trained professionally as statisticians (I’m in this category myself) and we need to recognize and plan for that.

Bonus post doing the rounds of social media: Using Metadata to Find Paul Revere. Social network analysis, 18th-century style. Amusing, informative and topical.

Using the Ensembl Variant Effect Predictor with your 23andme data

I subscribe to the Ensembl blog and found, in my feed reader this morning, a post which linked to the Variant Effect Predictor (VEP). The original blog post, strangely, has disappeared.

Not to worry. The VEP takes genotyping data in one of several formats, compares it with the Ensembl variation and core databases, and returns a summary of how the variants affect transcripts and regulatory regions. My first thought: can I apply this to my own 23andme data?
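As a sketch of one possible starting point (not necessarily the approach taken in the full post): a 23andme raw data file is tab-delimited with the columns rsid, chromosome, position and genotype, and the VEP can take a list of dbSNP identifiers as input. The rows below are invented, and the output file name in the final comment is hypothetical.

```r
# Invented rows in the 23andme raw-data layout; real files begin with "#" comments.
raw <- c(
  "# rsid\tchromosome\tposition\tgenotype",
  "rs0000001\t1\t10000\tAA",
  "rs0000002\t1\t20000\tAG",
  "i0000003\t1\t30000\t--"   # internal 23andme ID, not a dbSNP rsID
)

snps <- read.table(
  text = raw, sep = "\t", comment.char = "#",
  col.names = c("rsid", "chromosome", "position", "genotype"),
  colClasses = "character"
)

# Keep only dbSNP-style identifiers, one per line, as an identifier list.
rsids <- snps$rsid[grepl("^rs[0-9]+$", snps$rsid)]
print(rsids)

# For a real file: writeLines(rsids, "vep_input.txt")  # hypothetical file name
```

An identifier list discards the genotype calls, so it only reports the known consequences of each variant; a format that carries alleles would be needed to reflect an individual's actual genotype.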

Read the rest…

A brief note: R 3.0.0 and bioinformatics

Today marks the release of R 3.0.0. There will be plenty of commentary and useful information at sites such as R-bloggers (for example, Tal’s post).

Version 3.0.0 is great news for bioinformaticians, due to the introduction of long vectors. What does that mean? Well, several months ago, I was using the simpleaffy package from Bioconductor to normalize Affymetrix exon microarrays. I began as usual by reading the CEL files:

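library(simpleaffy)  # also attaches affy, which provides ReadAffy()
# pattern is a regular expression, so escape the dots and anchor the end
f <- list.files(path = "data/affyexon", pattern = "\\.CEL\\.gz$", full.names = TRUE, recursive = TRUE)
cel <- ReadAffy(filenames = f)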

When this happened:

Error in read.affybatch(filenames = l$filenames, phenoData = l$phenoData,  : 
  allocMatrix: too many elements specified

I had a relatively large number of samples (337), but figured a 64-bit machine with ~ 100 GB RAM should be able to cope. I was wrong: before version 3.0.0, R had a hard-coded limit of 2^31 - 1 elements per vector, so my matrix had become too large regardless of available memory. See this post and this StackOverflow question for the computational details.
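The arithmetic is easy to check. The sketch below assumes an exon array with 2560 x 2560 cells, a figure not stated in this post, and one matrix column per sample.

```r
# Assumption: 2560 x 2560 cells per array (not stated in the post).
cells_per_array <- 2560 * 2560
n_samples <- 337

# Total matrix elements, computed as doubles to avoid integer overflow.
n_elements <- cells_per_array * n_samples

# The old per-vector limit equals the largest R integer, 2^31 - 1.
old_limit <- .Machine$integer.max

exceeds_limit <- n_elements > old_limit
max_samples <- floor(old_limit / cells_per_array)

print(n_elements)
print(exceeds_limit)
print(max_samples)
```

Under that assumption, 337 samples need about 2.2 billion elements, and anything beyond 327 samples would have hit the limit.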

My solution at the time was to resort to Affymetrix Power Tools. Hopefully, the introduction of long vectors will make Bioconductor even more capable and useful.