Looking for “>” in all the wrong places

Ever wondered whether the “>” symbol can, or does, appear in FASTA sequence headers at positions other than the start of the line?

I have a recent copy of the NCBI non-redundant nucleotide (nt) BLAST database on my server, so let’s take a look. The files are in a directory which is also named nt:

# %i = sequence ID; %t = sequence title
blastdbcmd -db /data/db/nt/nt -entry all -outfmt '%i %t' | grep ">" > ~/Documents/nt.txt

wc -l ~/Documents/nt.txt
# => 54451 /home/sau103/Documents/nt.txt

# and how many sequences total in nt?
blastdbcmd -list /data/db/nt/ -list_outfmt '%n' | head -1
# => 23745273

Short answer – yes, about 0.23% of nt sequence descriptions contain the “>” character. Inspection of the output shows that it’s used in a number of ways. A few examples:

# as "brackets" (very common)
emb|V01351.1| Sea urchin fragment, 3' to the actin gene in <SPAC01>
gb|GU086899.1| Cotesia sp. jft03 voucher BIOUG<CAN>:CPWH-0042 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial

# to indicate mutated bases or amino acids
gb|M21581.1|SYNHUMUBA Synthetic human ubiquitin gene (Thr14->Cys), complete cds
dbj|AB047520.1| Homo sapiens gene for PER3, exon 1, 128+22(G->A) polymorphism

# in chemical nomenclature
gb|AF134414.1|AF134414 Homo sapiens B-specific alpha 1->3 galactosyltransferase (ABO) mRNA, ABO-*B101 allele, complete cds

# as "arrows"
gb|EU303182.1| Apoi virus note kitaoka-> canals ->NIMR nonstructural protein 5 (NS5) gene, partial cds
ref|XM_001734501.1| Entamoeba dispar SAW760 5'->3' exoribonuclease, putative EDI_265520 mRNA, complete cds

Something to keep in mind when writing code to process FASTA format.

Venn figures go wrong

6-way Venn banana

6-way Venn banana

I thought nothing could top the classic “6-way Venn banana“, featured in The banana (Musa acuminata) genome and the evolution of monocotyledonous plants.

That is until I saw Figure 3 from Compact genome of the Antarctic midge is likely an adaptation to an extreme environment.

5-way Venn roadkill

5-way Venn roadkill

What’s odd is that Figure 2 in the latter paper is a nice, clear R/ggplot2 creation, using facet_grid(), so someone knew what they were doing.

That aside, the Antarctic midge paper is an interesting read; go check it out.

This led to some amusing Twitter discussion which pointed me to *A New Rose : The First Simple Symmetric 11-Venn Diagram.

[*] +1 for referencing The Damned, if indeed that was the intention.

When life gives you coloured cells, make categories

Let’s start by making one thing clear. Using coloured cells in Excel to encode different categories of data is wrong. Next time colleagues explain excitedly how “green equals normal and red = tumour”, you must explain that (1) they have sinned and (2) what they meant to do was add a column containing the words “normal” and “tumour”.

I almost hesitate to write this post but…we have to deal with the world as it is, not as we would like it to be. So in the interests of just getting the job done: here’s one way to deal with coloured cells in Excel, should someone send them your way.
Continue reading

A new low in “databases”: the PDF

I’ve had a half-formed, but not very interesting blog post in my head for some months now. It’s about a conversation I had with a PhD student, around 10 years ago, after she went to a bioinformatics talk titled “Excel is not a database” and how she laughed as I’d been telling her that “for years already”. That’s basically the post so as I say, not that interesting, except as an illustration that we’ve been talking about this stuff for a long time (and little has changed).

HEp-2 or not HEp2?

HEp-2 or not HEp2?

Anyway, we have something better. I was exploring PubMed Commons, which is becoming a very good resource. The top-featured comment looks very interesting (see image, right).

Intrigued, I went to investigate the Database of Cross-contaminated or Misidentified Cell Lines, hovered over the download link and saw that it’s – wait for it – a PDF. I’ll say that again. The “database” is a PDF.

The sad thing is that this looks like very useful, interesting information which I’m sure would be used widely if presented in an appropriate (open) format and better-publicised. Please, biological science, stop embarrassing yourself. If you don’t know how to do data properly, talk to someone who does.

Tool tip: dropbox-restore

I’m currently rather sleep-deprived and prone to doing stupid things. Like this, for example:

rsync -av ~/Dropbox /path/to/backup/directory/

where the directory /path/to/backup/directory already contains a much-older Dropbox directory. So when I set up a new machine, install Dropbox and copy the Dropbox directory back to its default location – hey! What happened to all my space? What are all these old files? Oh wait…I forgot to delete:

rsync -av --delete ~/Dropbox /path/to/backup/directory/

Now, files can be restored of course, but not when there are thousands of them and I don’t even know what’s old and new. What I want to do is restore the directories under ~/Dropbox to the state that they were in yesterday, before I stuffed up.

Luckily Chris Clark wrote dropbox-restore. It does exactly what it says on the tin. For example:

python restore.py /Camera\ Uploads 2014-07-22

Thanks Chris!

Converting a spreadsheet of SMILES: my first OSM contribution

I’ve long admired the work of the Open Source Malaria Project. Unfortunately time and “day job” constraints prevent me from being as involved as I’d like.

So: I was happy to make a small contribution recently in response to this request for help:

Read the rest…

utils4bioinformatics: all those “little snippets” in one place

Over the years, I’ve written a lot of small “utility scripts”. You know the kind of thing. Little code snippets that facilitate research, rather than generate research results. For example: just what are the fields that you can use to qualify Entrez database searches?

Typically, they end up languishing in long-forgotten Dropbox directories. Sometimes, the output gets shared as a public link. No longer! As of today, “little code snippets that do (hopefully) useful things” have a new home at Github.

Also as of today: there’s not much there right now, just the aforementioned Entrez database code and output. I’m not out to change the world here, just to do a little better.

This is why code written by scientists gets ugly

There’s a lot of discussion around why code written by self-taught “scientist programmers” rarely follows what a trained computer scientist would consider “best practice”. Here’s a recent post on the topic.

One answer: we begin with exploratory data analysis and never get around to cleaning it up.

An example. For some reason, a researcher (let’s call him “Bob”) becomes interested in a particular dataset in the GEO database. So Bob opens the R console and use the GEOquery package to grab the data:

Update: those of you commenting “should have used Python instead” have completely missed the point. Your comments are off-topic and will not be published. Doubly-so when you get snarky about it.

Read the rest…