As so often happens these days, a brief post at FriendFeed got me thinking about data analysis. Entitled “So how many retractions are there every year, anyway?”, the post links to this article at Retraction Watch, which discusses ways to estimate the number of retractions and, in particular, a recent article in the Journal of Medical Ethics (subscription only, sorry) that addresses the issue.
As Christina pointed out in a comment at Retraction Watch, there are thousands of scientific journals, of which PubMed indexes only a fraction. However, PubMed is relatively easy to analyse using a little Ruby and R. So, here we go…
Read the rest…
…was the tongue-in-cheek title of an image that I posted to Twitpic this week. It shows the usage of the word “novel” in PubMed article titles over time. As someone correctly pointed out at FriendFeed, it needs to be corrected for total publications per year.
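The correction amounts to dividing each year's count of matching titles by the total number of records for that year. Here is a minimal Ruby sketch of that step, assuming the per-year counts are already held in two hashes. The helper name `normalise_counts` is hypothetical and the numbers are made-up placeholders, not real PubMed counts.

```ruby
# Hypothetical helper: express per-year matches as a percentage of all
# records published that year. Years with a missing or zero total are skipped.
def normalise_counts(matches, totals)
  matches.each_with_object({}) do |(year, n), out|
    total = totals[year]
    next if total.nil? || total.zero?
    out[year] = (100.0 * n / total).round(2)
  end
end

# Placeholder numbers for illustration only; NOT real PubMed counts.
novel_titles = { 1990 => 50, 2000 => 300, 2010 => 1200 }
all_records  = { 1990 => 10_000, 2000 => 20_000, 2010 => 40_000 }

# Print tab-separated year and percentage, ready to read into R.
normalise_counts(novel_titles, all_records).each do |year, pct|
  puts "#{year}\t#{pct}"
end
```

Plotting the percentage rather than the raw count separates a genuine rise in usage from the overall growth of the literature.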
It was inspired by a couple of items that caught my attention. First, a question at BioStar with the self-explanatory title Locations of plots of quantities of publicly available biological data. Second, an item at FriendFeed musing on the (over?) use of the word “insight” in scientific publications.
I’m sure that quite recently, I’ve read a letter to a journal which analysed the use of phrases such as “novel insights” in articles over time, but it’s currently eluding my search skills. So here’s my simple roll-your-own approach, using a little Ruby and R.
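The excerpt stops before the code, so the following is not the author's script. It is one way such per-year counts could be collected, sketched under assumptions: it uses the NCBI E-utilities `esearch` endpoint with `rettype=count`, the `[Title]` and `[dp]` (publication date) field tags, and hypothetical helper names `count_url`, `parse_count` and `title_count`.

```ruby
require "net/http"
require "uri"

# Base URL of the NCBI E-utilities esearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Build an esearch URL asking only for the number of PubMed records whose
# title contains `word` and whose publication date falls in `year`.
def count_url(word, year)
  query = URI.encode_www_form(
    db: "pubmed",
    term: "#{word}[Title] AND #{year}[dp]",
    rettype: "count"
  )
  URI("#{ESEARCH}?#{query}")
end

# Pull the integer out of the <Count> element of an esearch XML response;
# returns nil if no <Count> element is present.
def parse_count(xml)
  match = xml[/<Count>(\d+)<\/Count>/, 1]
  match && match.to_i
end

# Performs the HTTP request; kept separate so the helpers above can be
# tried offline.
def title_count(word, year)
  parse_count(Net::HTTP.get(count_url(word, year)))
end

# Example (makes a network request, so it is left commented out):
#   puts title_count("novel", 2010)
```

When looping over many years, pause between requests: NCBI asks clients without an API key to stay at or below about three requests per second.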
Read the rest…
I was reading an old post that describes GEOmetadb, a downloadable database containing metadata from the GEO database. We had a brief discussion in the comments about the growth in GSE records (user-submitted) versus GDS records (curated datasets) over time. Below is some quick and dirty R code to examine the issue, using the Bioconductor GEOmetadb package and ggplot2. The resulting image is shown on the left.
Is the curation effort keeping up with user submissions? A little difficult to say, since GEOmetadb curation seems to have its own issues: (1) why do GDS records stop in 2008? (2) why do GDS (curated) records begin earlier than GSE (submitted) records?
library(GEOmetadb)
library(ggplot2)

# update database if required using getSQLiteFile()
# connect to database; assumed to be in user $HOME
con <- dbConnect(SQLite(), "~/GEOmetadb.sqlite")

# fetch "last updated" dates for GDS and GSE
gds <- dbGetQuery(con, "select update_date from gds")
gse <- dbGetQuery(con, "select last_update_date from gse")

# cumulative sums by date; no factor variables
gds.count <- as.data.frame(cumsum(table(gds)), stringsAsFactors = FALSE)
gse.count <- as.data.frame(cumsum(table(gse)), stringsAsFactors = FALSE)

# make GDS and GSE data frames comparable
colnames(gds.count) <- "count"
colnames(gse.count) <- "count"

# row names (dates) to real dates
gds.count$date <- as.POSIXct(rownames(gds.count))
gse.count$date <- as.POSIXct(rownames(gse.count))

# add type for plotting
gds.count$type <- "gds"
gse.count$type <- "gse"

# combine GDS and GSE data frames
gds.gse <- rbind(gds.count, gse.count)

# and plot records over time by type
png(filename = "geometadb.png", width = 800, height = 600)
print(ggplot(gds.gse, aes(date, count)) + geom_line(aes(color = type)))
dev.off()