Analysis of gene expression timecourse data using maSigPro

ANXA11 expression in human smooth muscle aortic cells post-ILb1 exposure

ANXA11 expression in human smooth muscle aortic cells post-ILb1 exposure

About a year ago, I did a little work on a very interesting project which was trying to identify blood-based biomarkers for the early detection of stroke. The data included gene expression measurements using microarrays at various time points after the onset of ischemia (reduced blood supply). I had not worked with timecourse data before, so I went looking for methods and found a Bioconductor package, maSigPro, which did exactly what I was looking for. In combination with ggplot2, it generated some very attractive and informative plots of gene expression over time.

I was very impressed by maSigPro and meant to get around to writing a short guide showing how to use it. So I did finally, using RMarkdown to create the document and here it is. The document also illustrates how to retrieve datasets from GEO using GEOquery and annotate microarray probesets using biomaRt. Hopefully it’s useful to some of you.

I’ll probably do more of this in the future, since publishing RMarkdown to RPubs is far easier than copying, pasting and formatting at WordPress.

Some basics of biomaRt

One of the commonest bioinformatics questions, at Biostars and elsewhere, takes the form: “I have a list of identifiers (X); I want to relate them to a second set of identifiers (Y)”. HGNC gene symbols to Ensembl Gene IDs, for example.

When this occurs I have been known to tweet “the answer is BioMart” (there are often other solutions too) and I’ve written a couple of blog posts about the R package biomaRt in the past. However, I’ve realised that we need to take a step back and ask some basic questions that new users might have. How do I find what marts and datasets are available? How do I know what attributes and filters to use? How do I specify different genome build versions?
Continue reading

Sequencing for relics from the Sanger era part 1: getting the raw data

Sequencing in the good old days

In another life, way back in the mists of time, I did a Ph.D. Part of my project was to sequence a bacterial gene which encoded an enzyme involved in nitrite metabolism. It took the best part of a year to obtain ~ 2 000 bp of DNA sequence: partly because I was rubbish at sequencing, but also because of the technology at the time. It was an elegant biochemical technique called the dideoxy chain termination method, or “Sanger sequencing” after its inventor. Sequence was visualized by exposing radioactively-labelled DNA to X-ray film, resulting in images like the one at left, from my thesis. Yes, that photograph is glued in place. The sequence was read manually, by placing the developed film on a light box, moving a ruler and writing down the bases.

By the time I started my first postdoc, technology had moved on a little. We still did Sanger sequencing but the radioactive label had been replaced with coloured dyes and a laser scanner, which allowed automated reading of the sequence. During my second postdoc, this same technology was being applied to the shotgun sequencing of complete bacterial genomes. Assembling the sequence reads into contigs was pretty straightforward: there were a few software packages around, but most people used a pipeline of Phred (to call base qualities), Phrap (to assemble the reads) and Consed (for manual editing and gap-filling strategy).

The last time I worked directly on a project with sequencing data was around 2005. Jump forward 5 years to the BioStar bioinformatics Q&A forum and you’ll find many questions related to sequencing. But not sequencing as I knew it. No, this is so-called next-generation sequencing, or NGS. Suddenly, I realised that I am no longer a sequencing expert. In fact:

I am a relic from the Sanger era

I resolved to improve this state of affairs. There is plenty of publicly-available NGS data, some of it relevant to my current work and my organisation is predicting a move away from microarrays and towards NGS in 2012. So I figured: what better way to educate myself than to blog about it as I go along?

This is part 1 of a 4-part series and in this installment, we’ll look at how to get hold of public NGS data.
Read the rest…

biomaRt and GenomeGraphs: a worked example

As promised a few posts ago, another demonstration of the excellent biomaRt package, this time in conjunction with GenomeGraphs.

Here’s what we’re going to do:

  1. Grab some public microarray data
  2. Normalise and get a list of the most differentially-expressed probesets
  3. Use biomaRt to fetch the genes associated with those probesets
  4. Plot the data using GenomeGraphs

If you want to follow along on your own machine, it will need to be quite powerful. We’ll be processing exon arrays, which requires a 64-bit machine with at least 4 GB (and preferably, at least 8-12 GB) of RAM. As usual, I’m assuming some variant of Linux and that you’re comfortable at the command line.
Read the rest…

Missing links: using SwissProt IDtracker in your code

The BioPerl Bio::DB::SwissProt module lets you fetch sequences from SwissProt by ID or AC and store them as sequence objects:

use Bio::DB::SwissProt;
my $sp = Bio::DB::SwissProt->new('-servertype' => 'expasy', 'hostlocation' => 'australia');
my $seq = $sp->get_Seq_by_id('myod1_pig');

If you obtained SwissProt identifiers from a database that hasn’t been updated for some time, you may find that the ID or AC has changed. For example at NLSdb, the ID from the example shown is given as “myod_pig”. In this case, BioPerl will throw an error like this:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: id does not exist
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/local/share/perl/5.8.8/Bio/Root/
STACK: Bio::DB::WebDBSeqI::get_Seq_by_id /usr/local/share/perl/5.8.8/Bio/DB/

SwissProt provides a web page named IDtracker to help you find new identifiers using old ones. Here’s how we can integrate the service into Perl.
Read the rest…