
Looking for “>” in all the wrong places

Ever wondered whether the “>” symbol can, or does, appear in FASTA sequence headers at positions other than the start of the line?

I have a recent copy of the NCBI non-redundant nucleotide (nt) BLAST database on my server, so let’s take a look. The files are in a directory which is also named nt:

# %i = sequence ID; %t = sequence title
blastdbcmd -db /data/db/nt/nt -entry all -outfmt '%i %t' | grep ">" > ~/Documents/nt.txt

wc -l ~/Documents/nt.txt
# => 54451 /home/sau103/Documents/nt.txt

# and how many sequences total in nt?
blastdbcmd -list /data/db/nt/ -list_outfmt '%n' | head -1
# => 23745273
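
As a quick check on the proportion, the two counts above can be divided directly. This is only a small Python sketch of the arithmetic; the variable names are illustrative and not part of the original analysis:

```python
# Counts taken from the two commands above.
headers_with_gt = 54451      # lines in nt.txt, i.e. descriptions containing ">"
total_sequences = 23745273   # sequence count reported by blastdbcmd -list

# Proportion of descriptions containing ">", expressed as a percentage.
percent = 100 * headers_with_gt / total_sequences
print(f"{percent:.2f}%")
```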

Short answer – yes, about 0.23% of nt sequence descriptions contain the “>” character. Inspection of the output shows that it’s used in a number of ways. A few examples:

# as "brackets" (very common)
emb|V01351.1| Sea urchin fragment, 3' to the actin gene in <SPAC01>
gb|GU086899.1| Cotesia sp. jft03 voucher BIOUG<CAN>:CPWH-0042 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial

# to indicate mutated bases or amino acids
gb|M21581.1|SYNHUMUBA Synthetic human ubiquitin gene (Thr14->Cys), complete cds
dbj|AB047520.1| Homo sapiens gene for PER3, exon 1, 128+22(G->A) polymorphism

# in chemical nomenclature
gb|AF134414.1|AF134414 Homo sapiens B-specific alpha 1->3 galactosyltransferase (ABO) mRNA, ABO-*B101 allele, complete cds

# as "arrows"
gb|EU303182.1| Apoi virus note kitaoka-> canals ->NIMR nonstructural protein 5 (NS5) gene, partial cds
ref|XM_001734501.1| Entamoeba dispar SAW760 5'->3' exoribonuclease, putative EDI_265520 mRNA, complete cds

Something to keep in mind when writing code to process FASTA format.
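
In practice this means treating “>” as a record delimiter only when it is the first character of a line, rather than splitting the whole file on every “>”. A minimal Python sketch of that idea follows; the function name `read_fasta` and the short sequences are made up for illustration, and the headers are borrowed from the examples above:

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA stream.

    A new record starts only where '>' is the first character of a line,
    so any '>' later in the header is kept as ordinary text.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]  # strip only the leading '>'
            chunks = []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


# Headers from the examples above; the sequences are invented placeholders.
example = io.StringIO(
    ">gb|M21581.1|SYNHUMUBA Synthetic human ubiquitin gene (Thr14->Cys), complete cds\n"
    "ACGTACGT\n"
    "ACGT\n"
    ">ref|XM_001734501.1| Entamoeba dispar SAW760 5'->3' exoribonuclease\n"
    "TTGACA\n"
)
records = list(read_fasta(example))
for header, sequence in records:
    print(header, len(sequence))
```

By contrast, something like `text.split(">")` would cut both of these headers in two at the “->”.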

A lesson in “reading before you tweet”

So, I read the title:

Mining locus tags in PubMed Central to improve microbial gene annotation

and skimmed the abstract:

The scientific literature contains millions of microbial gene identifiers within the full text and tables, but these annotations rarely get incorporated into public sequence databases.

and thought, well OK, but wouldn’t it be better to incorporate annotations in the first place – when submitting to the public databases – rather than by this indirect method?

The point, of course, is to incorporate new findings from the literature into existing records, rather than to use the tool as a primary method of annotation. I do believe that public databases could do more to enforce data quality standards at deposition time, but that’s an entirely separate issue.

Big thanks to Michael Hoffman for a spirited Twitter discussion that put me straight.

ISMB 2012 on Twitter: here today, gone tomorrow

In previous years, when FriendFeed was used as the micro-blogging platform for the annual ISMB meeting, I’ve written a post describing some statistical analysis of the conference coverage. Here’s my post from last year.

This year, it appears that the majority of the conference coverage happened on Twitter, using the #ISMB hashtag. Here’s what happened on July 18th when I used the R package twitteR to retrieve ISMB-related tweets for July 13/14:

library(twitteR)
ismb1 <- searchTwitter("#ISMB", since = "2012-07-13", until = "2012-07-14")
length(ismb1)
# [1] 383

383 tweets. Here’s what happened when I ran the same query today:

library(twitteR)
ismb1 <- searchTwitter("#ISMB", since = "2012-07-13", until = "2012-07-14")
length(ismb1)
# [1] 0

Zero tweets. Indeed, run the same query via the Twitter web interface and you’ll see only a handful of tweets, along with the message “Older Tweet results for #ismb are unavailable.”

So far as Twitter is concerned, ISMB 2012 never happened. Or if it did, the data are buried away in a data centre, inaccessible to the likes of you and me. Did you ever hear anything more about that plan to archive every Tweet at the Library of Congress? Neither did I. I very much doubt that it’s going to happen.

I think Twitter is great – for broadcasting short pieces of information, such as useful URLs, in near real-time. For conference coverage which benefits from threaded conversation, longer comments and archiving, I think it’s rubbish.

On July 18 I did manage to retrieve 3162 Tweets for ISMB 2012, created between July 13 and July 17. I’ll write about them in a forthcoming post. All I’ll say for now is – lucky I was able to grab them when I did.
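
Since search results can disappear within days, it seems sensible to write them to disk as soon as they are retrieved. The sketch below is one way to do that in Python, not the method used for this post: the function names, the record fields and the sample tweets are all hypothetical, and no Twitter API call is made.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def archive_records(records, path):
    """Append records to a JSON Lines file, stamping each with the retrieval time."""
    retrieved_at = datetime.now(timezone.utc).isoformat()
    with open(path, "a", encoding="utf-8") as out:
        for record in records:
            row = dict(record, retrieved_at=retrieved_at)
            out.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(records)


def load_archive(path):
    """Read every archived record back from the JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Hypothetical records standing in for real search results.
sample = [
    {"id": "1", "text": "Keynote starting now #ISMB", "created": "2012-07-13"},
    {"id": "2", "text": "Great poster session #ISMB", "created": "2012-07-14"},
]
archive_path = os.path.join(tempfile.mkdtemp(), "ismb_tweets.jsonl")
written = archive_records(sample, archive_path)
restored = load_archive(archive_path)
print(written, len(restored))
```

Appending rather than overwriting means repeated runs during a conference accumulate in one file, although duplicates would still need removing by `id` afterwards.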

ISMB coverage on Twitter? It’s possible there was…

Peter writes:

I wonder if part of the drop off is live bloggers moving to platforms like Twitter? I can tell you it seemed like there were almost as many tweets for one SIG (#bosc2011) as for the whole of #ISMB / #ECCB2011, and I personally didn’t post anything to FriendFeed but posted lots on Twitter.

Well, there’s a problem with using Twitter for analysis of conference coverage. Let’s try searching for ISMB-related tweets using the twitteR package:

library(twitteR)
# n = maximum number of tweets to return
ismb <- searchTwitter("ismb", n = 1000)
length(ismb)
# [1] 30

[Image: oldertweets. Caption: If we can't archive, how can anyone else?]

30? Are we using twitteR properly? Running the same search at the Twitter website gives roughly the same results, plus this unhelpful message.

I like Twitter – as a real-time communication tool. As a data archive? Forget it.

Two great open science resources

The Twitter + FriendFeed combination is proving to be a very useful information stream; not just from other people but as a reminder of what I thought was worth sharing. Two links from there that I think deserve wider attention:

  • One Big Lab proposes that we become, well, one big lab – and has some ideas as to how that might work.
  • From the OWW wiki, an excellent article on Python in computational biology. It was presented at PyCon 2008 and is also a companion article to a paper in PLoS Computational Biology. Imagine if everyone described their methods in this detail.

Deepak has some commentary on what we’re now calling the “bio-twitterverse”.