I’ve been vaguely aware of BioMart for a few years. Inexplicably, I’ve only recently started to use it. It’s one of the most useful applications I’ve ever used.
The concept is simple. You start with a set of identifiers for a biological object, such as a gene. The type of identifier is called the filter (for example, HGNC symbol) and the identifiers themselves are its values (for example, TUBB6). You want to retrieve other identifiers, called attributes, for the same objects.
You can use BioMart as a web application called MartView. However, R users should check out the biomaRt package, part of the Bioconductor suite. Here are a couple of examples.
Example 1: fetch Ensembl gene identifiers given HGNC symbols
Let’s start with a simple example. You have a CSV file in which one of the fields is an HGNC symbol (with the column header “hgnc”) and you want to obtain Ensembl gene IDs.
library(biomaRt)

# define biomart object
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

# read in the file
genes <- read.csv("myfile.csv")

# query biomart
results <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
                 filters = "hgnc_symbol",
                 values = genes$hgnc,
                 mart = mart)

# sample results
#   ensembl_gene_id hgnc_symbol
# 1 ENSG00000082397     EPB41L3
# 2 ENSG00000168461       RAB31
# 3 ENSG00000176014       TUBB6
# 4 ENSG00000154734     ADAMTS1
# 5 ENSG00000197766         CFD
# 6 ENSG00000156284       CLDN8
You do need to know in advance that “ensembl_gene_id” and “hgnc_symbol” are valid attributes. You can get a list of all attributes for the current biomart object using “listAttributes(mart)”.
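The attribute list is long, so it helps to search it rather than read it. Here is a minimal sketch, assuming that listAttributes() and listFilters() each return a data frame with “name” and “description” columns. The helper find_names is hypothetical (not part of biomaRt), and the toy data frame, including its descriptions, is made up so that the example runs without a network connection.

```r
# Hypothetical helper (not part of biomaRt): keep the rows whose name or
# description matches a pattern, ignoring case.
find_names <- function(df, pattern) {
  hit <- grepl(pattern, df$name, ignore.case = TRUE) |
    grepl(pattern, df$description, ignore.case = TRUE)
  df[hit, , drop = FALSE]
}

# Toy stand-in for the data frame returned by listAttributes(mart);
# the descriptions are invented for illustration.
attrs <- data.frame(
  name = c("ensembl_gene_id", "hgnc_symbol", "affy_huex_1_0_st_v2"),
  description = c("Ensembl gene ID", "HGNC symbol", "Affy exon array probeset"),
  stringsAsFactors = FALSE
)
find_names(attrs, "hgnc")

# With a live connection (needs network access):
# mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# find_names(listAttributes(mart), "hgnc")
# find_names(listFilters(mart), "affy")
```

The same search works for filters, which is handy because a filter and its matching attribute often share a name, as “hgnc_symbol” does in Example 1.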
Example 2: fetch genes for microarray probesets
In this example, I assume that you have normalised some microarray samples using, for example, RMA in the affy package and used a method such as exprs() to generate a matrix of RMA values, where rows = probeset IDs and columns = sample names. We’d like to get the gene names for those probesets.
library(simpleaffy)
library(biomaRt)

mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

# assume that we are using the human exon array from Affymetrix
# read in .CEL files and RMA normalise
data <- read.affy()
data@cdfName <- "exon.pmcdf"
data.rma <- rma(data)
data.ex <- as.data.frame(exprs(data.rma))

# The attribute for exon array probesets is named "affy_huex_1_0_st_v2"
affy <- "affy_huex_1_0_st_v2"

# Next line would take a very long time for all exon probesets!
# We would probably select a subset of data.ex first
genes <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", affy),
               filters = affy,
               values = rownames(data.ex),
               mart = mart)

# Now match the array data probesets with the genes data frame
# (match() returns only the first matching row for each probeset)
m <- match(rownames(data.ex), genes$affy_huex_1_0_st_v2)

# And append e.g. the HGNC symbol to the array data frame
data.ex$hgnc <- genes[m, "hgnc_symbol"]

# sample result
#             Con1     Con2   Treat1   Treat2   hgnc
# 2315603 7.164521 7.107470 7.827158 7.307056 TTLL10
# 2315610 6.135751 6.259306 6.691880 6.532974 TTLL10
# 2315614 3.017279 4.602484 5.058326 5.349798 TTLL10
# 2315647 5.740181 5.373581 5.885912 5.756925   <NA>
# 2315691 6.389818 5.562760 6.853058 6.430730 SCNN1D
# 2315713 5.494848 6.243931 6.550043 6.336244 SCNN1D
# 2315720 6.422661 6.213908 6.447777 6.591330 SCNN1D
# 2315736 5.882034 6.250097 6.292414 6.311813   <NA>
# 2315741 5.314087 5.471424 5.762590 5.896435  PUSL1
# 2315768 2.278067 1.652001 2.430359 2.310668   <NA>
# 2315787 2.308838 1.912613 2.660703 2.377608 TAS1R3
# 2315793 4.339545 4.505362 4.974307 4.959468 TAS1R3
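For a full exon array, one way to keep each request small is to send the probeset IDs in batches and stack the results. This is a rough sketch, not something biomaRt requires: batched_query and fake_query are hypothetical names, and the default batch size of 500 is an arbitrary guess. The stand-in query function lets the example run offline.

```r
# Hypothetical helper: split `values` into batches of `batch_size`, run
# `query_fun` on each batch, and stack the resulting data frames.
batched_query <- function(values, query_fun, batch_size = 500) {
  values <- unique(values)
  if (length(values) == 0) return(NULL)
  batches <- split(values, ceiling(seq_along(values) / batch_size))
  out <- do.call(rbind, lapply(unname(batches), query_fun))
  rownames(out) <- NULL
  out
}

# Offline stand-in for getBM(): returns one row per ID
fake_query <- function(ids) {
  data.frame(probeset = ids, n_char = nchar(ids), stringsAsFactors = FALSE)
}
res <- batched_query(c("2315603", "2315610", "2315614"), fake_query,
                     batch_size = 2)

# With biomaRt (needs network access), query_fun would wrap getBM():
# genes <- batched_query(rownames(data.ex), function(ids) {
#   getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", affy),
#         filters = affy, values = ids, mart = mart)
# })
```

Passing the query as a function keeps the batching logic separate from the network call, so a failed batch can be retried without repeating the whole download.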
Summary
That’s your basic usage of biomaRt. In the next post: how to combine biomaRt with GenomeGraphs, to generate attractive plots of features and quantitative data in genomic context.
Yes, BioMart is a really nice application. The cool thing is that they have made the software very easy to set up for your own datasets. The only problem I have had with BioMart in the past is retrieving very large datasets (e.g. human genome data with some filters). Timeouts and errors would often leave the download half completed. However, this was a few years ago, so maybe their system (at EBI) is more robust now.