Monday 18 July 2011

Working with Scientific Names

Dealing with scientific names is an important regular part of our work at GBIF. Scientific names are highly structured strings with a syntax governed by a nomenclatural code. Unfortunately there are different ones for botany, zoology, bacteria, virus and even cultivar names. When dealing with names we often do not know to which code or classification it belongs to, so we need to have a code agnostic representation as much as possible. GBIF came up with a structured representation which is a compromise focusing on the most common names, primarily the botanical and zoological names which are quite similar in its basic form.

The ParsedName class

Our ParsedName class provides us with the following core properties:
 genusOrAbove
infraGeneric
specificEpithet
rankMarker
infraSpecificEpithet
authorship
year
bracketAuthorship
bracketYear
These allow us to represent regular names properly. For example Agalinis purpurea var. borealis (Berg.) Peterson 1987 is represented as
 genusOrAbove=Agalinis
specificEpithet=purpurea
rankMarker=var.
infraSpecificEpithet=borealis
authorship=Peterson
year=1987
bracketAuthorship=Berg.
or the botanical section Maxillaria sect. Multiflorae Christenson as
 genusOrAbove=Maxillaria
infraGeneric=Multiflorae
rankMarker=sect.
authorship=Christenson
Especially in botany you often encounter names with authorships for both the species and some infraspecific rank or names citing more than one infraspecific rank. These names are not formed based on rules or recommendations from the respective codes and we ignore those superflous parts. For example Agalinis purpurea (L.) Briton var. borealis (Berg.) Peterson 1987 is represented exactly the same as Agalinis purpurea var. borealis above. In case of 4 parted names like Senecio fuchsii C.C.Gmel. subsp. fuchsii var. expansus (Boiss. & Heldr.) Hayek only the lowest infraspecific rank is preserved:
 genusOrAbove=Senecio
specificEpithet=fuchsii
rankMarker=var.
infraSpecificEpithet=expansus
authorship=Hayek
bracketAuthorship=Boiss. & Heldr.
Hybrid names are evil. They come in two flavors, named hybrids and hybrid formulas.
Named hybrids are not so bad and simply prefix a name part with the multiplication sign ×, the hybrid marker, or prefix the rank marker of infraspecific names with notho. Strictly this symbol is not part of the genus or epithet. To represent these notho taxa our ParsedName class contains a property called nothoRank that keeps the rank or part of the name that needs to be marked as with the hybrid sign. For example the named hybrid Pyrocrataegus ×willei L.L.Daniel is represented as
 genusOrAbove=Pyrocrataegus
specificEpithet=willei
authorship=L.L.Daniel
nothoRank=species
Hybrid formulas such as Agrostis stolonifera L. × Polypogon monspeliensis (L.) Desf., Asplenium rhizophyllum × ruta-muraria or Mentha aquatica L. × M. arvensis L. × M. spicata L. cannot be represented by our class. The hybrid formulas in theory can combine any number of names or name parts, so its hard to deal with them. Luckily they are not very common and we can afford to live with a complete string representation in those cases. Yet another "extension" to the botanical code are cultivar names, i.e. names for plants in horticulture. Cultivar names are regular botanical names followed by a cultivar name usually in english given in single quotes. For example Cryptomeria japonica 'Elegans'. To keep track of this we have an additional cultivar property, so that:
 genusOrAbove=Cryptomeria
specificEpithet=japonica
cultivar=Elegans
In taxonomic works you often have additional information in a name that details the taxonomic concept, the sec reference most often prefixed by sensu or sec. For example Achillea millefolium sec. Greuter 2009 or Achillea millefolium sensu latu. In nomenclatoral works one frequently encounters nomenclatoral notes about the name such as nom.illeg. or nomen nudum. Both these informations are hold in our ParsedName class, for example Solanum bifidum Vell. ex Dunal, nomen nudum becomes
 genusOrAbove=Solanum
specificEpithet=bifidum
authorship=Vell. ex Dunal
nomStatus=nomen nudum

Reconstructing name strings

The ParsedName class provides us with some common methods to build a name string. In many cases you dont want the complete name with all its details, so we offer some popular name string types out of the box and a flexible string builder that you can explicitly tell which parts you want to include. The most important methods are canonicalName(): builds the canonical name sensu strictu with nothing else but the three name parts at max (genus, species, infraspecific). No rank, hybrid markers or authorship information are included. fullName(): builds the full name with all details that exist. canonicalSpeciesName(): builds the canonical binomial in case of species or below, ignoring infraspecific information

The name parser

We decided at GBIF that sharing the complete name string is more reliable than trusting already parsed names. But parsing names by hand is a very tedious enterprise, so we needed to develop some parser that can handle the vast majority of all names that we encounter. After a short experimental phase with BNF and other grammars to automatically build a parser we decided to go back to start and start something based on good old regular expressions and plain java code. The parser has evolved now for nearly 2 years now and it might be the best unit tested class we have ever written at GBIF. It is interesting to take a look at the range of names we use for testing and also the test themseves to make sure its working as expected.

Parsing names

Using the NameParser in code is trivial. Once you create a parser instance all you need to do is call the parser.parse(String name) method to get your ParsedName object. As authorships are the hardest, ie most variable part of a name we have actually implemented two parsers internally. One that tries to parse the complete string and another fallback one that ignores authorships and only extracts the canonical name. The authorsParsed flag on a ParsedName instance tells you if the simpler fallback parser has been used. If a name cannot be parsed at all an UnparsableException is thrown. This is also the case for viral names and hybrid formulas, as the ParsedName class cannot treat these names. The exception itself actually has an enumerated property that you can use to know if the exception has been caused by a virus, hybrid or other name. As of today from 10.114.724 unique name strings that we have indexed only 116.000 names couldnt be parsed and these are mostly hybrid formulas.

Normalisation

Apart from the parse method the name parser also exposes a normalisation method to normalise any whitespace, commas, brackets and hybrid markers found in name strings. The parser uses this method internally before the actual parsing takes place. The string is trimmed, only single whitespace is allowed and spaces before commas are removed while it is enforced after a comma. Similar whitespace before opening brackets is added but removed inside. Instead of the proper multiplication sign for hybrids often a simple × followed by whitespace is used which is also replaced by this method.

Parsing Webservices

GBIF is offering a free webservice API to parse names using our name parser. We use JSON for the parsed results and it accepts single names as well as batches of names. For larger input data you have to use a POST request (GET requests are restricted in length), but for few names also a simple GET request with the names url encoded in the paramter "names" is accepted. Multiple names can be concatenated with the pipe | symbol. To parse the two names Symphoricarpos albus (L.) S.F.Blake cv. 'Turesson' and Pyrocrataegus willei ×libidi L.L.Daniel the parser service call looks like this: http://ecat-dev.gbif.org/ws/parser?names=Symphoricarpos%20albus%20(L.)%20S.F.Blake%20cv.%20'Turesson'|Stagonospora%20polyspora%20M.T.%20Lucas%20%26%20Sousa%20da%20Camara%201934 For manual usages we also provide a simple web client to this service that provides a form to enter names to be parsed and also accepts files with one name per line for upload. It is available as part of our tools collection at http://tools.gbif.org/nameparser/.

Source Code

All the code is part of a small java library that we call ecat-common. It is freely available under Apache 2 licensing as most of our GBIF work and you are invited to use our code at http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/, download the latest jar from our maven repository or include it in your maven dependencies like this:
 <repositories>
<repository>
<id>gbif-all</id>
<url>http://repository.gbif.org/content/groups/gbif</url>
</repository>
</repositories>
<dependencies>
<dependency>
<groupId>org.gbif</groupId>
<artifactId>ecat-common</artifactId>
<version>1.5.1-SNAPSHOT</version>
</dependency>
</dependencies>

No comments:

Post a Comment