Looking for “>” in all the wrong places

Ever wondered whether the “>” symbol can, or does, appear in FASTA sequence headers at positions other than the start of the line?

I have a recent copy of the NCBI non-redundant nucleotide (nt) BLAST database on my server, so let’s take a look. The files are in a directory which is also named nt:

# %i = sequence ID; %t = sequence title
blastdbcmd -db /data/db/nt/nt -entry all -outfmt '%i %t' | grep ">" > ~/Documents/nt.txt

wc -l ~/Documents/nt.txt
# => 54451 /home/sau103/Documents/nt.txt

# and how many sequences total in nt?
blastdbcmd -list /data/db/nt/ -list_outfmt '%n' | head -1
# => 23745273

Short answer – yes, about 0.23% of nt sequence descriptions contain the “>” character. Inspection of the output shows that it’s used in a number of ways. A few examples:

# as "brackets" (very common)
emb|V01351.1| Sea urchin fragment, 3' to the actin gene in <SPAC01>
gb|GU086899.1| Cotesia sp. jft03 voucher BIOUG<CAN>:CPWH-0042 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial

# to indicate mutated bases or amino acids
gb|M21581.1|SYNHUMUBA Synthetic human ubiquitin gene (Thr14->Cys), complete cds
dbj|AB047520.1| Homo sapiens gene for PER3, exon 1, 128+22(G->A) polymorphism

# in chemical nomenclature
gb|AF134414.1|AF134414 Homo sapiens B-specific alpha 1->3 galactosyltransferase (ABO) mRNA, ABO-*B101 allele, complete cds

# as "arrows"
gb|EU303182.1| Apoi virus note kitaoka-> canals ->NIMR nonstructural protein 5 (NS5) gene, partial cds
ref|XM_001734501.1| Entamoeba dispar SAW760 5'->3' exoribonuclease, putative EDI_265520 mRNA, complete cds

Something to keep in mind when writing code to process FASTA format.

3 thoughts on “Looking for “>” in all the wrong places

  1. Pingback: Looking for ">" in all the wrong p...

  2. John

    Hm… You’d think I would have known this after years working with fasta files… I should not use $/ = “>”; as my quick and dirty perl hack then… I’ll share this with my colleagues

Comments are closed.