GEO database: curation lagging behind submission?

geometadb

GSE and GDS records in GEOmetadb by date

I was reading an old post that describes GEOmetadb, a downloadable database containing metadata from the GEO database. We had a brief discussion in the comments about the growth in GSE records (user-submitted) versus GDS records (curated datasets) over time. Below, some quick and dirty R code to examine the issue, using the Bioconductor GEOmetadb package and ggplot2. Left, the resulting image – click for larger version.

Is the curation effort keeping up with user submissions? A little difficult to say, since GEOmetadb curation seems to have its own issues: (1) why do GDS records stop in 2008? (2) why do GDS (curated) records begin earlier than GSE (submitted) records?

library(GEOmetadb)
library(ggplot2)

# update database if required using getSQLiteFile()
# connect to database; assumed to be in user $HOME
con <- dbConnect(SQLite(), "~/GEOmetadb.sqlite")

# fetch "last updated" dates for GDS and GSE
gds <- dbGetQuery(con, "select update_date from gds")
gse <- dbGetQuery(con, "select last_update_date from gse")

# cumulative sums by date; no factor variables
gds.count <- as.data.frame(cumsum(table(gds)), stringsAsFactors = F)
gse.count <- as.data.frame(cumsum(table(gse)), stringsAsFactors = F)

# make GDS and GSE data frames comparable
colnames(gds.count) <- "count"
colnames(gse.count) <- "count"

# row names (dates) to real dates
gds.count$date <- as.POSIXct(rownames(gds.count))
gse.count$date <- as.POSIXct(rownames(gse.count))

# add type for plotting
gds.count$type <- "gds"
gse.count$type <- "gse"

# combine GDS and GSE data frames
gds.gse <- rbind(gds.count, gse.count)

# and plot records over time by type
png(filename = "geometadb.png", width = 800, height = 600)
print(ggplot(gds.gse, aes(date,count)) + geom_line(aes(color = type)))
dev.off()

2 thoughts on “GEO database: curation lagging behind submission?

  1. Joel Dudley

    I’m pretty certain that NCBI have given up on GDS and that no more experiments will be curated as GDS due to the growth rate of GEO. We download all of GEO every week and no new GDS for a long time. I do believe I heard a talk from one of the top GEO folks indicating that GDS creation has ceased. We’ve actually created a method to generate GDS automatically from any new GSE. Keep your eyes on the bioinformatics journals over the next few months. :-)

Comments are closed.