Popular topics at the BioStar Q&A site

Which topics are the most popular at the BioStar bioinformatics Q&A site?

One source of data is the tags used for questions. Tags are somewhat arbitrary of course, but fortunately BioStar has quite an active community, so “bad” tags are usually edited to improve them. Hint: if your question is “How to find SNPs”, then tagging it with “how, to, find, snps” won’t win you any admirers.

OK: we’re going to grab the tags then use a bunch of R packages (XML, wordcloud and ggplot2) to take a quick look.

1. Fetch the tags
Fortunately, I enjoy sufficient privileges at BioStar to obtain a dump of the database. It contains a file named “Tags.xml”, with this simple structure:

<Tags>
  <row>
    <Id>3</Id>
    <Name>bed</Name>
    <Count>20</Count>
    <UserId>2</UserId>
    <CreationDate>2009-09-30T14:55:00.167</CreationDate>
  </row>
  ...
</Tags>

A hint for people who write XML parsing documentation. Most of us just want to get the values from between the tags. Just tell us how to do that. OK?

Thanks to this StackOverflow thread, I discovered the incredibly useful xmlToDataFrame() function in the R XML package:

library(XML)
tags <- xmlToDataFrame("Tags.xml")
head(tags)
#   Id       Name Count UserId            CreationDate
# 1  3        bed    20      2 2009-09-30T14:55:00.167
# 2  4        gff    12      2 2009-09-30T14:55:00.167
# 3  5     galaxy    11      2 2009-09-30T15:09:43.417
# 4  6      yeast     5      3 2009-09-30T16:09:06.723
# 5  7      motif    19      3  2009-09-30T16:09:06.74
# 6  8 microarray    96      2 2009-09-30T16:44:22.677

Too easy. However, class(tags$Count) is “character”, which is not quite what we want. So let’s change that to numeric, then sort the data frame on Count, decreasing:

# as.character() first, in case Count was read in as a factor
tags$Count <- as.numeric(as.character(tags$Count))
tags <- tags[order(tags$Count, decreasing = TRUE), ]
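
To see why the conversion matters, here is a minimal sketch using the six counts printed by head(tags) above (the vector names are made up for illustration). Sorting the counts while they are still character strings compares them as text, so "5" lands ahead of "20":

```r
# Counts from the head(tags) output above, as character strings,
# which is how they arrive from the XML file.
count_chr <- c("20", "12", "11", "5", "19", "96")

# Sorted as text: strings are compared character by character,
# so "5" sorts ahead of "20".
sort(count_chr, decreasing = TRUE)

# Sorted as numbers: the order we actually want for a "top N" list.
count_num <- as.numeric(count_chr)
sort(count_num, decreasing = TRUE)
```

Run as is, the second result puts 96 first and 5 last, while the first leaves "5" stranded near the top.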

2. For those who like a “top N” plot
Next, we’ll grab the top 20 tags by Count. To plot them in decreasing order, we need to reorder the tag Name by Count. With thanks again to a StackOverflow thread.

library(ggplot2)
tags.20 <- head(tags, 20)
tags.20 <- transform(tags.20, Name = reorder(Name, Count))
# geom_col() plots Count as given; ggtitle() replaces the removed opts()
ggplot(tags.20) +
  geom_col(aes(Name, Count), fill = "coral") +
  coord_flip() + theme_bw() +
  ggtitle("Top 20 BioStar Tags")
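
The reorder() step is what makes the bars come out sorted: it rewrites the factor levels of Name so that they follow Count, and ggplot2 draws a discrete axis in level order. A small sketch using the six rows from head(tags) above (the object name tags_demo is made up for illustration):

```r
# The six rows printed by head(tags) earlier, rebuilt by hand.
tags_demo <- data.frame(
  Name  = factor(c("bed", "gff", "galaxy", "yeast", "motif", "microarray")),
  Count = c(20, 12, 11, 5, 19, 96)
)

# Default factor levels are alphabetical; that is the order ggplot2 would use.
levels(tags_demo$Name)

# reorder() sorts the levels by Count, smallest first. After coord_flip()
# the first level sits at the bottom, so the biggest tag ends up on top.
tags_demo <- transform(tags_demo, Name = reorder(Name, Count))
levels(tags_demo$Name)
```

Only the levels change; the rows of the data frame stay where they were.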

Figure: Top 20 BioStar tags.

3. For those who like word/tag clouds
Here, we look at tags which occur 10 or more times and display a maximum of 1000 tags in the cloud.

library(wordcloud)
library(RColorBrewer)

png(filename = "tags.png", width = 1024, height = 1024)
wordcloud(tags$Name, tags$Count, scale = c(8, .2), min.freq = 10,
          max.words = 1000, random.order = FALSE, rot.per = .15,
          colors = brewer.pal(8, "Dark2"))
dev.off()
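
If you want to check which tags will make it into the cloud before rendering it, the filtering can be approximated in base R. This is only a sketch: cloud_subset() is a hypothetical helper, not part of the wordcloud package, and it assumes that wordcloud() keeps the max.words most frequent tags and then drops any below min.freq. The demo data are the six rows from head(tags) above.

```r
# The six rows printed by head(tags) earlier, rebuilt by hand.
tags_demo <- data.frame(
  Name  = c("bed", "gff", "galaxy", "yeast", "motif", "microarray"),
  Count = c(20, 12, 11, 5, 19, 96),
  stringsAsFactors = FALSE
)

# Hypothetical helper: approximates the min.freq / max.words filtering.
# Assumption: the most frequent max.words tags are kept first, then
# anything with a count below min.freq is dropped.
cloud_subset <- function(df, min.freq = 10, max.words = 1000) {
  df <- df[order(df$Count, decreasing = TRUE), ]
  df <- head(df, max.words)
  df[df$Count >= min.freq, ]
}

# With the settings used in the post, only "yeast" (count 5) is dropped.
kept <- cloud_subset(tags_demo)
kept$Name
```

Calling nrow(cloud_subset(tags)) on the full data frame would give a rough idea of how crowded the image is going to be.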

Figure: BioStar tag cloud.

Conclusions? XML, ggplot2 and wordcloud are all great packages. And whilst so-called “next-generation-sequencing” might be all the rage, it’s good to see the old stalwarts of bioinformatics hanging in there: BLAST, alignment, phylogenetics, Python and Perl. It will be interesting to see how tags change over time.

3 thoughts on “Popular topics at the BioStar Q&A site”

Jason Ebaugh

August 23, 2011 at 22:23

That’s a cool analysis. This is indeed something that will be interesting to watch over time.

Chris Miller

August 23, 2011 at 23:59

FYI: you’ll see a similar circular tag cloud in the BioStar manuscript (which has been accepted at PLoS).
- nsaunders
  
  August 24, 2011 at 16:16
  
  Good to hear, great minds :) Look forward to reading it.


What You're Doing Is Rather Desperate

Notes from the life of a [data] scientist
