Archive for ‘statistics’

March 14, 2012

Simple plots reveal interesting artifacts

I’ve recently been working with methylation data; specifically, from the Illumina Infinium HumanMethylation450 bead chip. It’s a rather complex array which uses two types of probes to determine the methylation state of DNA at ~ 485 000 sites in the genome.

The Bioconductor project has risen to the challenge with a (somewhat bewildering) variety of packages to analyse data from this chip. I’ve used lumi, methylumi and minfi, but will focus here on just the last of these, minfi.

Update: a related post from Oliver Hofmann (@fiamh).

February 13, 2012

10 years on, same old same old

September 2, 2002

So what new skills will postdocs need to ensure that they don’t become science relics? The answer is math, statistics, and knowledge of a scripting language for computers.

– The Scientist, “Bioinformatics Knowledge Vital to Careers.” 16(17): 53.

February 8, 2012

But two other skills are increasingly necessary: expertise in computer-programming languages designed to aid manipulation of large data sets, such as R, Perl or Python, and the ability to use these languages to analyse large amounts of data quickly.

– Nature, “Biostatistics: Revealing analysis.” 482: 263–265.

January 27, 2012

Reproducible research: three links that made me think

I’m constantly amazed, bemused and troubled by how little published scientific research is genuinely reproducible, in that you or I (or even the original authors) could go back and check the results. Three examples from around the Web converged in my mind this week.

December 2, 2011

A Friday round-up

Just a brief selection of items that caught my eye this week. Note that this is a Friday round-up, not the Friday round-up, lest you mistake this for a new, regular feature.

1. R/statistics

  • ggbio: a new Bioconductor package which builds on the excellent ggplot2 graphics library, for the visualization of biological data.

  • R development master class: Hadley Wickham recently presented this course on R package development for my organisation. I was on parental leave at the time, otherwise I would have attended for sure.

2. Bioinformatics in the media

DNA Sequencing Caught in Deluge of Data

I described this NYT article as a “surprisingly-good intro article”. Michael Eisen described it as “kind of silly”.

I think we’re both right. Michael’s perspective is that of an expert in high-throughput sequencing data; I’m just pleased to see an introduction to bioinformatics for non-specialists in a mainstream newspaper. And I note that they have corrected the figure caption which offended Michael.

As to the “deluge”: yes, there are other sciences that generate more data and yes, we probably don’t need to archive/analyse a lot of the raw data. However, I’d contend that the basic premise of the article is correct: we are sequencing faster than we can analyse. The solution, obviously, is more bioinformaticians.

September 8, 2011

Interacting with bioinformatics webservers using R

In an ideal world, all bioinformatics tools would be made available via the Web as a web service with an API, as well as a standalone package to download for local use. This is rarely the case and sometimes, even where one or the other is available, factors such as cost come into play. So we resort to web scraping: writing code that interacts with whatever lies behind a web server, so as to submit queries, then retrieve and parse the results.

Normally, I’d use something like Ruby’s Mechanize library for this purpose. However, where the purpose is to retrieve delimited data for analysis using R, I figured it was time to try and achieve the entire process within R. So here’s how I used the RCurl and XML packages to interact with the WHAT IF server, which provides tools for the analysis of protein structure.
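To give a flavour of the "retrieve and parse" half of that process, here is a minimal sketch using only the XML package. It is not the code from the full post: the HTML string is a made-up stand-in for a server response (the real WHAT IF output will look different) and the column names are invented. In practice the page text would first be fetched over the network, for example with RCurl.

```r
library(XML)

# Hypothetical server response: tab-delimited results wrapped in a <pre> block.
# The layout and column names are assumptions for illustration only.
html <- "<html><body><pre>res\tvalue\nALA\t1.2\nGLY\t0.8\n</pre></body></html>"

# Parse the HTML held in a string rather than in a file or at a URL
doc <- htmlParse(html, asText = TRUE)

# Pull out the text content of every <pre> element
txt <- xpathSApply(doc, "//pre", xmlValue)

# Read the delimited text straight into a data frame
results <- read.delim(text = txt, stringsAsFactors = FALSE)
print(results)
```

Once the results are in a data frame, the rest of the analysis is ordinary R.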

August 23, 2011

Popular topics at the BioStar Q&A site

Which topics are the most popular at the BioStar bioinformatics Q&A site?

One source of data is the tags used for questions. Tags are somewhat arbitrary of course, but fortunately BioStar has quite an active community, so “bad” tags are usually edited to improve them. Hint: if your question is “How to find SNPs”, then tagging it with “how, to, find, snps” won’t win you any admirers.

OK: we’re going to grab the tags then use a bunch of R packages (XML, wordcloud and ggplot2) to take a quick look.
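The counting step needs nothing beyond base R. A minimal sketch, assuming the tags have already been scraped into a character vector with one element per tag use; the tags below are invented, not real BioStar data:

```r
# Hypothetical tags, one element per use of a tag on a question
tags <- c("snp", "blast", "r", "snp", "ngs", "r", "snp", "python", "ngs")

# Count each tag and sort so the most popular come first
tag.counts <- sort(table(tags), decreasing = TRUE)

# A data frame (columns: tags, Freq) is a convenient input for plotting
tag.df <- as.data.frame(tag.counts, stringsAsFactors = FALSE)
print(tag.df)
```

The tag names and their counts are then the kind of input that a word cloud or a ggplot2 bar chart needs.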


August 16, 2011

Monitoring PubMed retractions: updates

[Figure: PubMed cumulative retractions 1977-present]

There’s been a recent flurry of interest in retractions. See for example: Scientific Retractions: A Growth Industry?; summarised also by GenomeWeb in Take That Back; articles in the WSJ and the Pharmalot blog; and academic articles in the Journal of Medical Ethics and Infection & Immunity.

Several of these sources cite data from my humble web application, PMRetract. So now seems like a good time to mention that:

  • The application is still going strong and is updated regularly
  • I’ve added a few enhancements to the UI; you can follow development at GitHub
  • I’ve also added a long-overdue about page with some extra information, including the fact that I wrote it :)

Now I just need to fix up my Git repositories. Currently there’s one which pushes to GitHub and a second, with a copy of the Sinatra code for pushing to Heroku, which isn’t too smart.

August 1, 2011

ISMB coverage on Twitter? It’s possible there was…

Peter writes:

I wonder if part of the drop off is live bloggers moving to platforms like Twitter? I can tell you it seemed like there were almost as many tweets for one SIG (#bosc2011) as for the whole of #ISMB / #ECCB2011, and I personally didn’t post anything to FriendFeed but posted lots on Twitter.

Well, there’s a problem with using Twitter for analysis of conference coverage. Let’s try searching for ISMB-related tweets using the twitteR package:

library(twitteR)
# Ask for up to 1000 tweets matching "ismb"
ismb <- searchTwitter("ismb", n = 1000)
# Count how many tweets actually came back
length(ismb)
# [1] 30

[Figure: If we can't archive, how can anyone else?]

30? Are we using twitteR properly? Running the same search at the Twitter website gives roughly the same results, plus this unhelpful message.

I like Twitter – as a real-time communication tool. As a data archive? Forget it.

July 28, 2011

I can’t resist a word cloud: now using R!

[Figure: Top 1000 words in FriendFeed comments, ISMB 2008-2011]

The wordcloud package is word clouds for R with a difference: they look great.

Of course, having just analysed online coverage of the ISMB conference, I had to run all 6 906 comments from the 2008-2011 meetings through some code. If you followed along via the Sweave code, I went as far as generating the data frame of comments, ismb.comments, then pulled the comment text into a new data frame using:

data.frame(ismb.comments$body)

It was then simply a case of following along with the excellent example code from the post Word Cloud in R, over at One R Tip A Day, limiting myself to the 1000 most-used words. Watch out, the TermDocumentMatrix() function from the tm package uses quite a lot of memory.
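If memory is tight, the word counts can also be built without a term-document matrix. The following is a base R sketch of that alternative, not the tm-based code from the example post; the comments and the stop-word list are invented for illustration:

```r
# Hypothetical comment text standing in for ismb.comments$body
comments <- c("Great talk on genome assembly",
              "The genome browser talk was great",
              "Slides for the assembly talk?")

# Lower-case everything, then split on any run of non-letters
words <- unlist(strsplit(tolower(comments), "[^a-z]+"))

# Drop empty strings and a tiny, made-up stop-word list
words <- words[nchar(words) > 0]
words <- words[!words %in% c("the", "on", "for", "was")]

# Count the words, sort by frequency and keep at most the top 1000
word.freq <- sort(table(words), decreasing = TRUE)
top.words <- head(word.freq, 1000)
print(top.words)
```

The names and counts in top.words are the words and frequencies that a word cloud is drawn from.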

Result shown at right: click image for full-size version. I think that word in the centre says it all.

July 28, 2011

Analysis of ISMB coverage at FriendFeed: 2008 – 2011

ISMB/ECCB 2011 was held from July 15 to 19 this year and, as in previous years, FriendFeed was used to cover the meeting.

Last year, I wrote a post about how to use R to analyse the coverage. I was planning something similar for 2011 when I thought: we have 4 years of ISMB at FriendFeed now – why not look at all of them?

So I did. Read on for the details.
