Ever wondered whether the “>” symbol can, or does, appear in FASTA sequence headers at positions other than the start of the line?
I have a recent copy of the NCBI non-redundant nucleotide (nt) BLAST database on my server, so let’s take a look. The files are in a directory which is also named nt:
# %i = sequence ID; %t = sequence title
blastdbcmd -db /data/db/nt/nt -entry all -outfmt '%i %t' | grep ">" > ~/Documents/nt.txt
wc -l ~/Documents/nt.txt
# => 54451 /home/sau103/Documents/nt.txt

# and how many sequences total in nt?
blastdbcmd -list /data/db/nt/ -list_outfmt '%n' | head -1
# => 23745273
Short answer: yes. About 0.23% of nt sequence descriptions (54,451 of 23,745,273) contain the “>” character. Inspection of the output shows that it’s used in a number of ways. A few examples:
# as "brackets" (very common)
emb|V01351.1| Sea urchin fragment, 3' to the actin gene in <SPAC01>
gb|GU086899.1| Cotesia sp. jft03 voucher BIOUG<CAN>:CPWH-0042 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial

# to indicate mutated bases or amino acids
gb|M21581.1|SYNHUMUBA Synthetic human ubiquitin gene (Thr14->Cys), complete cds
dbj|AB047520.1| Homo sapiens gene for PER3, exon 1, 128+22(G->A) polymorphism

# in chemical nomenclature
gb|AF134414.1|AF134414 Homo sapiens B-specific alpha 1->3 galactosyltransferase (ABO) mRNA, ABO-*B101 allele, complete cds

# as "arrows"
gb|EU303182.1| Apoi virus note kitaoka-> canals ->NIMR nonstructural protein 5 (NS5) gene, partial cds
ref|XM_001734501.1| Entamoeba dispar SAW760 5'->3' exoribonuclease, putative EDI_265520 mRNA, complete cds
Something to keep in mind when writing code to process FASTA format: “>” only marks a new record when it is the first character of a line.
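To make that concrete, here is a minimal Python sketch of a FASTA reader that starts a new record only when a line begins with “>”. The function name `parse_fasta` is hypothetical, and the sequence lines in the example are made-up placeholders, not the real nt records; only the header text is borrowed from the examples above.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    A record starts only where ">" is the first character of a line;
    any ">" later in the header is kept as ordinary text.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]  # drop only the leading ">"
            chunks = []
        elif header is not None and line:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


# Headers taken from the examples above; sequences are placeholders.
example = [
    ">M21581.1 Synthetic human ubiquitin gene (Thr14->Cys), complete cds",
    "ATGCAGATCTTC",
    "GTGAAGACC",
    ">XM_001734501.1 Entamoeba dispar SAW760 5'->3' exoribonuclease",
    "ATGGGTGTT",
]

records = list(parse_fasta(example))

# For contrast: splitting the whole text on ">" also cuts inside the headers.
naive_pieces = "\n".join(example).split(">")
```

Here `records` holds two entries with their headers intact, whereas `naive_pieces` has more fragments than there are records because the “->” arrows are treated as delimiters too.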