The scientific name, formed under the system of binomial nomenclature and sometimes called the Latin name of a species, has served the scientific world well since the system was introduced in the mid-18th century by the great Swedish botanist, physician, and zoologist Carl Linnaeus. Each scientific name consists of a genus name and a short species name and is unique to each organism. Its widespread use and stability have greatly facilitated almost every aspect of the biological sciences. It identifies the same organism all over the world, in all languages, without the need for translation.
Although the species name is normally short, the whole scientific name is often too long to write out each time it is used. It is now customary, once a scientific name has been written in full in an article, to abbreviate the genus name to its capitalized first letter followed by a period (e.g., E. coli) to save space. Even so, most scientific names are still too long to be used by many genomic analysis tools (e.g., PHYLIP, MEGA) or directly on trees or figures. The problem is compounded by the presence of subspecies and strain names, because together they can run to more than 70 characters (e.g., Salmonella enterica subsp. enterica serovar Schwarzengrund str. CVM19633). Since the genomic era began in 1995, when the genome of Haemophilus influenzae was first sequenced, thousands of genomes have been sequenced, and the number being sequenced is still growing exponentially. It is impractical to use either full or abbreviated scientific names in any large comparative genomic analysis. To alleviate this problem, many authors, including us, have started to use short acronyms in place of long scientific names in comparative genomic studies. But because there is no rule for how an acronym should be created, these acronyms are not universal, and authors are forced to include a large table or a long figure legend for looking up each acronym and its full name, which wastes the time and effort of both authors and readers. To solve this problem in the genomic era, a universal acronym system is needed now.
This universal acronym system generates a unique acronym for each organism in the NCBI Taxonomy database at the lowest taxonomic level. In this system, a typical acronym is created at the species level (rank of 'species') and consists of four letters: a capital letter, the initial of the genus name, followed by three lowercase letters, the first three letters of the species name (e.g., Ecol for Escherichia coli). If an acronym has already been used, a number is appended to designate a different species (e.g., Ecol1 for Enterococcus columbae). To differentiate the strains of a species, one or more capital letters are added to the end of the species-level acronym after an underscore (e.g., Ecol_A for Escherichia coli K-12). The capital letters are incremented automatically (A..Z, AA..ZZ, ...) to represent multiple strains of the same species. For a species with the undefined species name 'sp.', all four letters of the acronym come from the genus name and are written in capitals (e.g., ACIN for Acinetobacter sp. and ACIN1 for Acinetobacter sp. ADP1). For bacterial species whose names begin with 'Candidatus', the same rules are applied to the regular scientific name after 'Candidatus' is removed (e.g., Bpen2_A for Candidatus Blochmannia pennsylvanicus str. BPEN).
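The naming rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system's actual implementation: the names base_acronym, strain_suffixes, and AcronymRegistry are hypothetical, the registry is an in-memory stand-in for the acronym database, and collision numbers are assumed to be handed out in the order in which names are registered (the text does not specify the order).

```python
from itertools import count, product
from string import ascii_uppercase


def base_acronym(scientific_name):
    """Four-letter base acronym, before any collision number is added."""
    words = scientific_name.split()
    if words[0] == "Candidatus":      # rule: drop 'Candidatus' first
        words = words[1:]
    genus, species = words[0], words[1]
    if species == "sp.":              # undefined species: genus only, capitals
        return genus[:4].upper()
    return genus[0].upper() + species[:3].lower()


def strain_suffixes():
    """Yield A..Z, then AA..ZZ, then AAA.. and so on."""
    for length in count(1):
        for letters in product(ascii_uppercase, repeat=length):
            yield "".join(letters)


class AcronymRegistry:
    """Hypothetical in-memory stand-in for the acronym database."""

    def __init__(self):
        self._assigned = {}   # name (or (species, strain) pair) -> acronym
        self._used = set()    # every species-level acronym handed out so far
        self._suffixes = {}   # species acronym -> strain suffix generator

    def species_acronym(self, name):
        if name in self._assigned:
            return self._assigned[name]
        base = base_acronym(name)
        candidate, n = base, 0
        while candidate in self._used:   # collision: try base1, base2, ...
            n += 1
            candidate = f"{base}{n}"
        self._used.add(candidate)
        self._assigned[name] = candidate
        return candidate

    def strain_acronym(self, species_name, strain_name):
        key = (species_name, strain_name)
        if key in self._assigned:
            return self._assigned[key]
        species = self.species_acronym(species_name)
        if species not in self._suffixes:
            self._suffixes[species] = strain_suffixes()
        acronym = f"{species}_{next(self._suffixes[species])}"
        self._assigned[key] = acronym
        return acronym


registry = AcronymRegistry()
print(registry.species_acronym("Escherichia coli"))          # Ecol
print(registry.species_acronym("Enterococcus columbae"))     # Ecol1
print(registry.strain_acronym("Escherichia coli", "K-12"))   # Ecol_A
print(registry.species_acronym("Acinetobacter sp."))         # ACIN
print(registry.species_acronym("Acinetobacter sp. ADP1"))    # ACIN1
```

In this sketch, a name such as Candidatus Blochmannia pennsylvanicus yields the base Bpen; the number in an acronym like Bpen2 depends entirely on which other species with the same base were registered earlier.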
Although the rules for scientific names favor stability and uniqueness, in practice a single species may have several different scientific names (synonyms). Some authors prefer a synonym (name_class of 'synonym') over the scientific name (name_class of 'scientific name'). In such cases, only the scientific name is used to create the acronym, and all synonyms receive the same acronym as the scientific name. All acronyms are stored in a relational database along with all the data from the NCBI Taxonomy database. This allows the acronyms to be updated when scientific names change in a new release of the NCBI Taxonomy database. For example, when a species is transferred between genera, a new scientific name is used and the old name becomes a synonym. In such a case, a new acronym is created for the new scientific name. Similarly, if what were previously thought to be distinct species are demoted to a lower rank in a new release of the NCBI Taxonomy database, the old acronym is retained for the former species name and a new acronym is created for the new rank. All acronyms, old and new, are stored in the acronym database. As a result, multiple acronyms may be used for the same organism, but the same acronym will never be used for different species. The acronym database can be searched by acronym, scientific name, or synonym. It also provides direct hyperlinks to the NCBI TaxBrowser for more information. In addition to the acronym database, we created a web-based acronym conversion tool. This tool allows users to convert a list of scientific names to the corresponding universal acronyms.
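The name-to-acronym conversion, including the handling of synonyms, might look like the following sketch. It assumes name records arrive as (tax_id, name_txt, name_class) tuples and that each taxon has one current acronym; the function names, the placeholder tax IDs, the sample rows, and the case-insensitive matching are illustrative assumptions rather than details of the actual database or tool. Retained historical acronyms are left out for brevity.

```python
def build_lookup(name_rows, acronym_by_taxid):
    """Map every scientific name and synonym to its taxon's acronym."""
    lookup = {}
    for tax_id, name_txt, name_class in name_rows:
        if name_class not in ("scientific name", "synonym"):
            continue                      # ignore other name classes
        acronym = acronym_by_taxid.get(tax_id)
        if acronym is not None:
            lookup[name_txt.casefold()] = acronym
    return lookup


def convert(names, lookup):
    """Return one acronym per input name, or None when the name is unknown."""
    return [lookup.get(name.strip().casefold()) for name in names]


# Illustrative rows with placeholder tax IDs; not copied from an NCBI release.
rows = [
    (1, "Escherichia coli", "scientific name"),
    (1, "Bacillus coli", "synonym"),
    (2, "Enterococcus columbae", "scientific name"),
]
acronyms = {1: "Ecol", 2: "Ecol1"}
lookup = build_lookup(rows, acronyms)
print(convert(["Bacillus coli", "Enterococcus columbae", "Homo sapiens"], lookup))
# ['Ecol', 'Ecol1', None]
```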
The conversion tool also allows users to upload a sequence file in FASTA format downloaded from GenBank; it generates a new FASTA file in which the start of each description line is replaced by the corresponding acronym, so that the new file can be used directly in comparative genomic analyses. The tool can also handle misspelled scientific names and lets users enter scientific names that are not yet in the NCBI Taxonomy database.
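As a rough sketch of the FASTA relabeling step, the function below replaces the leading identifier of each description line with an acronym. Several things here are assumptions: the function name relabel_fasta, the idea of finding the organism by searching the description for a known name (longest name first, so that a strain wins over its species), the choice to leave unrecognized headers untouched, and the made-up accession strings in the example. The actual tool may identify organisms differently.

```python
def relabel_fasta(lines, acronyms):
    """Yield FASTA lines with each header's leading identifier replaced.

    `acronyms` maps organism names to acronyms (hypothetical input).
    """
    # Try longer names first so a strain name wins over its species name.
    names = sorted(acronyms, key=len, reverse=True)
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith(">"):
            yield line                    # sequence data passes through
            continue
        _identifier, _, description = line[1:].partition(" ")
        description = description.strip()
        match = next((n for n in names if n in description), None)
        if match is None:
            yield line                    # unknown organism: keep header as is
        else:
            yield f">{acronyms[match]} {description}"


# Made-up headers for illustration only.
records = [
    ">gi|1|ref|XX_000001| Escherichia coli K-12, complete genome",
    "ATGCATGC",
    ">gi|2|ref|XX_000002| Unlisted organism, complete genome",
    "GGCCGGCC",
]
acronyms = {"Escherichia coli": "Ecol", "Escherichia coli K-12": "Ecol_A"}
for out_line in relabel_fasta(records, acronyms):
    print(out_line)
# >Ecol_A Escherichia coli K-12, complete genome
# ATGCATGC
# >gi|2|ref|XX_000002| Unlisted organism, complete genome
# GGCCGGCC
```

Keeping the original description after the acronym is a design choice of this sketch; tools that accept only short labels would need the description dropped as well.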