What’s this, a useful method with practical application in bioinformatics? (Just kidding, guys.)
The article describes CD-HIT (so no bonus points for a catchy name). It’s the algorithm that RCSB PDB and UniProt use to cluster their sequences into non-redundant datasets. Fast, by all accounts, though I’m not sure that a dual Xeon with 4 GB of RAM is everyone’s workstation. Perhaps I need to be pushier with my boss. Anyway, the article is open access.