BioMart (and biomaRt)

I’ve been vaguely aware of BioMart for a few years. Inexplicably, I’ve only recently started to use it. It’s one of the most useful applications I’ve ever used.

The concept is simple. You have a set of identifiers that describe a biological object, such as a gene. These are called filters. They have values – for example, HGNC symbols. You want to retrieve other identifiers – attributes – for your objects.

You can use BioMart as a web application called MartView. However, R users should check out the biomaRt package, part of the Bioconductor suite. Here’s a couple of examples.

Example 1: fetch Ensembl gene identifiers given HGNC symbols
Let’s start with a simple example. You have a CSV file in which one of the fields is a HGNC symbol (with the column header “hgnc”) and you want to obtain Ensembl gene IDs.

# define biomart object
mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
# read in the file
genes <- read.csv("myfile.csv")
# query biomart
results <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"), filters = "hgnc_symbol", values = genes$hgnc, mart = mart)
# sample results
  ensembl_gene_id hgnc_symbol
1 ENSG00000082397     EPB41L3
2 ENSG00000168461       RAB31
3 ENSG00000176014       TUBB6
4 ENSG00000154734     ADAMTS1
5 ENSG00000197766         CFD
6 ENSG00000156284       CLDN8

You do need to know in advance that “ensembl_gene_id” and “hgnc_symbol” are valid attributes. You can get a list of all attributes for the current biomart object using “listAttributes(mart)”.

Example 2: fetch genes for microarray probesets
In this example, I assume that you have normalised some microarray samples using, for example, RMA in the affy package and used a method such as exprs() to generate a matrix of RMA values, where rows = probeset IDs and columns = sample names. We’d like to get the gene names for those probesets.

mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
# assume that we are using the human exon array from Affymetrix
# read in .CEL files and RMA normalise
data <- read.affy()
data@cdfName <- "exon.pmcdf"
data.rma <- rma(data)
data.ex <-
# The attribute for exon array probesets is named "affy_huex_1_0_st_v2"
affy <- "affy_huex_1_0_st_v2"
# Next line would take a very long time for all exon probesets!
# We would probably select a subset of data.ex first
genes <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", affy), filters = affy, values=c(rownames(data.ex)), mart = mart)
# Now match the array data probesets with the genes data frame
m <- match(rownames(data.ex), genes$affy_huex_1_0_st_v2)
# And append e.g. the HGNC symbol to the array data frame
data.ex$hgnc <- genes[m, "hgnc_symbol"]
# sample result
            Con1     Con2   Treat1   Treat2   hgnc
2315603 7.164521 7.107470 7.827158 7.307056 TTLL10
2315610 6.135751 6.259306 6.691880 6.532974 TTLL10
2315614 3.017279 4.602484 5.058326 5.349798 TTLL10
2315647 5.740181 5.373581 5.885912 5.756925   <NA>
2315691 6.389818 5.562760 6.853058 6.430730 SCNN1D
2315713 5.494848 6.243931 6.550043 6.336244 SCNN1D
2315720 6.422661 6.213908 6.447777 6.591330 SCNN1D
2315736 5.882034 6.250097 6.292414 6.311813   <NA>
2315741 5.314087 5.471424 5.762590 5.896435  PUSL1
2315768 2.278067 1.652001 2.430359 2.310668   <NA>
2315787 2.308838 1.912613 2.660703 2.377608 TAS1R3
2315793 4.339545 4.505362 4.974307 4.959468 TAS1R3

That’s your basic usage of biomaRt. In the next post: how to combine biomaRt with GenomeGraphs, to generate attractive plots of features and quantitative data in genomic context.

One thought on “BioMart (and biomaRt)

  1. Morgan Langille

    Yes, BioMart is a really nice application. The cool thing is that they have made the software very easy to set up for your own datasets. The only problem I have had with BioMart in the past is getting very large datasets (e.g. human genome data with some filters). Timeouts and errors often would leave the download half completed. However, this was a few years ago so maybe their system (at EBI) is more robust.

Comments are closed.