By the time I started my first postdoc, technology had moved on a little. We still did Sanger sequencing but the radioactive label had been replaced with coloured dyes and a laser scanner, which allowed automated reading of the sequence. During my second postdoc, this same technology was being applied to the shotgun sequencing of complete bacterial genomes. Assembling the sequence reads into contigs was pretty straightforward: there were a few software packages around, but most people used a pipeline of Phred (to call base qualities), Phrap (to assemble the reads) and Consed (for manual editing and gap-filling strategy).
The last time I worked directly on a project with sequencing data was around 2005. Jump forward 5 years to the BioStar bioinformatics Q&A forum and you’ll find many questions related to sequencing. But not sequencing as I knew it. No, this is so-called next-generation sequencing, or NGS. Suddenly, I realised that I am no longer a sequencing expert. In fact:
I am a relic from the Sanger era
I resolved to improve this state of affairs. There is plenty of publicly-available NGS data, some of it relevant to my current work, and my organisation is predicting a move away from microarrays and towards NGS in 2012. So I figured: what better way to educate myself than to blog about it as I go along?
This is part 1 of a 4-part series and in this installment, we’ll look at how to get hold of public NGS data.
Read the rest…
Just a brief selection of items that caught my eye this week. Note that this is a Friday as opposed to Friday, lest you mistake this for a new, regular feature.
A new Bioconductor package which builds on the excellent ggplot graphics library, for the visualization of biological data.
- R development master class
Hadley Wickham recently presented this course on R package development for my organisation. I was on parental leave at the time, otherwise I would have attended for sure.
2. Bioinformatics in the media
DNA Sequencing Caught in Deluge of Data
I think we’re both right. Michael’s perspective is that of an expert in high-throughput sequencing data; I’m just pleased to see an introduction to bioinformatics for non-specialists in a mainstream newspaper. And I note that they have corrected the figure caption which offended Michael.
As to the “deluge”: yes, there are other sciences that generate more data and yes, we probably don’t need to archive/analyse a lot of the raw data. However, I’d contend that the basic premise of the article is correct: we are sequencing faster than we can analyse. The solution, obviously, is more bioinformaticians.
I’m the “biologist-turned-programmer” type of bioinformatician, which makes me a hacker, not a developer. Most of the day-to-day coding that I do goes something like this:
Colleague: Hey Neil, can you write me a script to read data from file X, do Y to it and output a table in file Z?
Me: Sure… (clickety-click, hackety-hack…) …there you go.
Colleague: Great! Thanks.
I’m a big fan of the Bio* projects and have used them for many years, beginning with Bioperl and more recently, BioRuby. And I’ve always wanted to contribute some code to them, but have never got around to doing so. This week, two thoughts popped into my head:
- How hard can it be?
- There isn’t much introductory documentation for would-be Bio* developers
The answer to the first question is: given some programming experience, not very hard at all. This blog post is my attempt to address the second thought, by writing a step-by-step guide to developing a simple class for the BioRuby library. When I say “beginner’s guide”, I’m referring to myself as much as anyone else.
Read the rest…
It’s what – 10 years or more? – since we began to wonder when web technologies such as RSS, wikis and social bookmarking sites would be widely adopted by most working scientists, to further their productivity.
The email that I received today which began “I’ve read 3 interesting papers” and included 1 .doc, 3 .docx and 4 .pdf files as attachments is indicative of the answer to this question, which is “not any time soon.”
I’ve given up trying to educate colleagues in best practices. Clearly, I’m the one with the problem, since this is completely normal, acceptable behaviour for practically everyone that I’ve ever worked with. Instead, I’m just waiting for them to retire (or die). I reckon most senior scientists (and they’re the ones running the show) are currently aged 45-55. So it’s going to be 10-20 years before things improve.
Until then, I’ll just have to keep deleting your emails. Sorry.
Ask anyone how much time has elapsed since September last year and they’ll probably start counting on their fingers: “October, November…” and tell you “just over 9 months.”
So, when faced as I was today with a data frame (named dates) like this:
pmid1     year1  month1  pmid2     year2  month2
21355427  2010   Dec     21542215  2011   Mar
21323727  2011   Feb     21521365  2011   Jun
21297532  2011   Feb     21336080  2011   Mar
21291296  2011   Apr     21591868  2011   Jun
...
How to add a 7th column, with the number of months between “year1/month1” and “year2/month2”?
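The arithmetic behind that 7th column is simple enough to sketch before looking at any particular solution: convert each month abbreviation to a number, then take twelve times the difference in years plus the difference in months. The snippet below is only an illustration in plain shell, not the approach taken in the full post; the function names `month_number` and `months_between` are made up for this example, and it counts whole calendar months, ignoring days.

```shell
# Convert a three-letter English month abbreviation (Jan..Dec) to 1..12.
# month_number is a hypothetical helper, named for this example only.
month_number() {
  case "$1" in
    Jan) echo 1 ;;  Feb) echo 2 ;;  Mar) echo 3 ;;  Apr) echo 4 ;;
    May) echo 5 ;;  Jun) echo 6 ;;  Jul) echo 7 ;;  Aug) echo 8 ;;
    Sep) echo 9 ;;  Oct) echo 10 ;; Nov) echo 11 ;; Dec) echo 12 ;;
    *) echo "unknown month: $1" >&2; return 1 ;;
  esac
}

# Calendar months from year1/month1 to year2/month2.
# Arguments: year1 month1 year2 month2 (months as abbreviations).
months_between() {
  mb_m1=$(month_number "$2") || return 1
  mb_m2=$(month_number "$4") || return 1
  echo $(( ($3 - $1) * 12 + (mb_m2 - mb_m1) ))
}

# First row of the table above: Dec 2010 to Mar 2011.
months_between 2010 Dec 2011 Mar
```

For the first row this gives (2011 - 2010) × 12 + (3 - 12) = 3. The same function could be applied to every row by reading the six whitespace-separated fields in a loop and passing fields 2, 3, 5 and 6 to `months_between`.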
Read the rest…
File under: simple, but a useful reminder
UCSC Genome Bioinformatics is one of the go-to locations for genomic data. They are also kind enough to provide access to their MySQL database server:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A
However, users are given fair warning to “avoid excessive or heavy queries that may impact the server performance.” It’s not clear what constitutes excessive or heavy but if you’re in any doubt, it’s easy to create your own databases locally. It’s also easy to create only the tables that you require, as and when you need them.
As an example, here’s how you could create only the ensGene table for the latest hg19 database. Here, USER and PASSWD represent a local MySQL user and password with full privileges:
# create database
mysql -u USER -pPASSWD -e 'create database hg19'
# obtain table schema
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ensGene.sql
# create table
mysql -u USER -pPASSWD hg19 < ensGene.sql
# obtain and import table data
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ensGene.txt.gz
gunzip ensGene.txt.gz
mysqlimport -u USER -pPASSWD --local hg19 ensGene.txt
It’s very easy to automate this kind of process using shell scripts. All you need to know is the base URL for the data, http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ and that there are two files with the same prefix per table: one for the schema (*.sql) and one with the data (*.txt.gz).
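As a rough sketch of what such a script might look like, the functions below wrap the steps above for an arbitrary database and table. The names `ucsc_table_urls` and `load_ucsc_table`, and the environment variables `MYSQL_USER` and `MYSQL_PASSWD`, are assumptions made for this example; it also assumes that the local database has already been created.

```shell
# Base URL, as given in the text above.
UCSC_BASE="http://hgdownload.cse.ucsc.edu/goldenPath"

# Print the schema URL and the data URL for one table.
# Hypothetical helper; arguments: database name, table name.
ucsc_table_urls() {
  echo "$UCSC_BASE/$1/database/$2.sql"
  echo "$UCSC_BASE/$1/database/$2.txt.gz"
}

# Download one table and load it into a local MySQL database.
# Hypothetical helper; assumes the database already exists and that
# MYSQL_USER and MYSQL_PASSWD hold local credentials with full privileges.
load_ucsc_table() {
  lt_db=$1
  lt_table=$2
  # fetch both files (schema and data)
  for lt_url in $(ucsc_table_urls "$lt_db" "$lt_table"); do
    wget -q "$lt_url" || return 1
  done
  # create the table from its schema
  mysql -u "$MYSQL_USER" -p"$MYSQL_PASSWD" "$lt_db" < "$lt_table.sql" || return 1
  # unpack and import the data
  gunzip -f "$lt_table.txt.gz" || return 1
  mysqlimport -u "$MYSQL_USER" -p"$MYSQL_PASSWD" --local "$lt_db" "$lt_table.txt"
}

# Usage (not run here); the table names are just examples:
#   for t in ensGene knownGene; do load_ucsc_table hg19 "$t"; done
```

Keeping the URL construction in its own function makes it easy to check what would be downloaded before touching the network or the database.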
First for 2011:
Proteomic and electron microscopy survey of large assemblies in macrophage cytoplasm.
Maco, B., Ross, I.L., Landsberg, M., Mouradov, D., Saunders, N.F.W., Hankamer, B. and Kobe, B. (2011)
Molecular & Cellular Proteomics, in press, doi:10.1074/mcp.M111.008763
This is an in-press article which is freely-available just now (although strangely, the supplemental data are not). I’m pleased to note that we also made the raw data available in Proteome Commons. In fact, it was a condition of publication.
Lots of hard work went into this one. My contribution was quite minor: some bioinformatic analysis and hacking away at PyMsXML to make it work with newer versions of vendor formats. With respect to PyMsXML, I’d like to thank Brad Chapman, who provided invaluable advice via BioStar.
Warning: contains murky, somewhat unstructured thoughts on large-scale biological data analysis
Picture this. It’s based on a true story: names and details altered.
Alice, a biomedical researcher, performs an experiment to determine how gene expression in cells from a particular tissue is altered when the cells are exposed to an organic compound, substance Y. She collates a list of the most differentially-expressed genes and notes, in passing, that the expression of Gene X is much lower in the presence of substance Y.
Bob, a bioinformatician in the same organisation but in a different city to Alice, is analysing a public dataset. This experiment looks at gene expression in the same tissue but under different conditions: normal compared with a disease state, Z Syndrome. He also notes that Gene X appears in his list – its expression is much higher in the diseased tissue.
Alice and Bob attend the annual meeting of their organisation, where they compare notes and realise the potential significance of substance Y in suppressing the expression of Gene X and so perhaps relieving the symptoms of Z syndrome. On hearing this the head of the organisation, Charlie, marvels at the serendipitous nature of the discovery. Surely, he muses, given the amount of publicly-available experimental data, there must be a way to automate this kind of discovery by somehow “cross-correlating” everything with everything else until patterns emerge. What we need, states Charlie, is:
Algorithms running day and night, crunching all of that data
What’s Charlie missing?
Read the rest…
The API – Application Programming Interface – is, in principle, a wonderful thing. You make a request to a server using a URL and back come lovely, structured data, ready to parse and analyse. We’ve begun to demand that all online data sources offer an API and lament the fact that so few online biological databases do so.
Is it better, though, to have no API at all than one which is poorly implemented and leads to frustration? I’m beginning to think so, after recent experiences on both a work project and one of my “fun side projects”. Let’s start with the work project, an attempt to mine a subset of the ArrayExpress microarray database.
Read the rest…