Over the past years we have been alerted to various bad matches. Most of the reported issues refer to a false fuzzy match for a name missing in our backbone.
In order to improve the taxonomic classification of occurrence records, we are undertaking two activities. The first is to improve the algorithms we use to fuzzily match names, and the second is to improve the algorithms used to assemble the backbone taxonomy itself. Here I explain some of the work currently underway to tackle the former, which is visible on the test environment.
1. Name parsing of undetermined species
In occurrences we see many partly undetermined names such as Lucanus spec. These rank markers were erroneously treated as real species epithets, which, combined with fuzzy matching, produced poor results.
- Xysticus sp. used to wrongly match Xysticus spiethi while it now just matches the genus Xysticus.
- Triodia sp. used to match the family Poaceae while it now matches the genus Triodia.
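The idea behind this fix can be illustrated with a minimal Python sketch. It is not the parser used in production: the marker list and the function name `strip_indet_marker` are assumptions made for illustration, and only a trailing marker is handled.

```python
# Hypothetical set of markers that stand in for an undetermined species.
# The actual list used by the name parser is not given in this post.
INDET_MARKERS = {"sp", "spec", "spp"}


def strip_indet_marker(name: str) -> tuple[str, bool]:
    """Drop a trailing rank marker such as 'sp.' or 'spec.'.

    Returns the cleaned name and a flag telling whether a marker was removed,
    so a caller can match the remainder at genus level instead of treating
    the marker as a species epithet.
    """
    tokens = name.split()
    # Only the last token is checked; 'subsp.' in the middle of a name is untouched.
    if len(tokens) >= 2 and tokens[-1].rstrip(".").lower() in INDET_MARKERS:
        return " ".join(tokens[:-1]), True
    return name, False


if __name__ == "__main__":
    for example in ("Xysticus sp.", "Lucanus spec.", "Xysticus spiethi"):
        print(example, "->", strip_indet_marker(example))
```

With the marker removed, "Xysticus sp." is looked up as the plain genus name, so a fuzzy comparison against "Xysticus spiethi" never takes place.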
2. Damerau–Levenshtein distance algorithm
For scoring fuzzy matches we have so far applied the Jaro Winkler distance, which is often used for matching person names. It tends to allow rather fuzzy matches at the end of long strings. Some tolerance there is desirable for scientific names, but the allowed fuzziness was too large, so we decided to revert to the classical and more predictable Damerau–Levenshtein distance. This reduces false positive fuzzy matches considerably, even though we lose a few good matches at the same time.
- Xyris kralii Wand. used to match to Xyris harleyi but now just matches to the genus Xyris L. as the species is missing from our backbone.
- Zea mays subsp. parviglumis var. huehuet Iltis & Doebley used to match Zea mays var. hirta while it now just hits the species Zea mays L.
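To make the distance concrete, here is a small Python sketch of the restricted Damerau–Levenshtein distance (also known as optimal string alignment), which counts insertions, deletions, substitutions and transpositions of adjacent characters. It is an illustration only: the post does not say which variant or which thresholds the matching service uses, and the function name `osa_distance` is made up here.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between the first i chars of a and first j chars of b.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


if __name__ == "__main__":
    # Epithets from the Xyris example above.
    print(osa_distance("kralii", "harleyi"))
```

Because every edit costs the same regardless of its position in the string, the score does not become more lenient towards the end of long names, which is what makes the results easier to predict.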
The distinct verbatim classifications of 528 million records, 10.5 million in total, were passed through both the original and the new fuzzy matching algorithms. The results show that 428 thousand classifications (4%), representing 5,323,758 occurrence records, produce a different match. So far we have manually inspected a random subsample of the changed records, and we can hardly spot any regressions or wrong matches.
We have published the complete matching comparison as well as the subset of changed records at Zenodo as tab delimited files:
Dataset 1: All classification matches (10.5 million)
Dataset 2: Changed matches (428 thousand)
The schema of the files has three column families, each with the scientificName, GBIF taxonKey and the higher DwC classification terms for every match record: verbatim values are prefixed with v_, the old matching results carry an _old suffix, and the new matching results use plain terms (e.g. v_scientificName, scientificName_old, scientificName).
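For anyone wanting to inspect the files, the following Python sketch shows one way to group a record's columns back into these three families. It assumes the tab delimited file starts with a header row (if it does not, column names would have to be supplied explicitly), and the sample row uses placeholder counts and taxon keys invented purely for illustration.

```python
import csv
import io

# Occurrence and dataset counts belong to none of the three families.
COUNT_COLUMNS = {"numocc", "numdatasets"}


def split_families(row: dict) -> dict:
    """Group one record's columns into counts, verbatim, old and new values."""
    families = {"counts": {}, "verbatim": {}, "old": {}, "new": {}}
    for column, value in row.items():
        if column in COUNT_COLUMNS:
            families["counts"][column] = value
        elif column.startswith("v_"):
            families["verbatim"][column[2:]] = value   # strip the v_ prefix
        elif column.endswith("_old"):
            families["old"][column[:-4]] = value       # strip the _old suffix
        else:
            families["new"][column] = value
    return families


# Placeholder data: the keys 111 and 222 and the count 1 are made up.
SAMPLE = (
    "numocc\tv_scientificName\ttaxonKey_old\tscientificName_old\ttaxonKey\tscientificName\n"
    "1\tXysticus sp.\t111\tXysticus spiethi\t222\tXysticus\n"
)

if __name__ == "__main__":
    reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
    for record in reader:
        print(split_families(record))
```

To read a downloaded file instead of the sample, open it with `open(path, newline="", encoding="utf-8")` and pass the file object to `csv.DictReader` in the same way.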
We are glad to receive any feedback on further improvements or bad matching results we need to fix in the next iteration of work. Please get in touch with Markus Döring, firstname.lastname@example.org.
Create distinct occurrence names table
-- One row per distinct verbatim classification, with occurrence and dataset counts
CREATE TABLE markus.names AS
SELECT
  count(*) as numocc,
  count(distinct datasetKey) as numdatasets,
  v_scientificName, v_kingdom, v_phylum, v_class, v_order_ as v_order, v_family, v_genus, v_subgenus,
  v_specificEpithet, v_infraspecificEpithet, v_scientificNameAuthorship, v_taxonrank, v_higherClassification
FROM prod_b.occurrence_hdfs
GROUP BY
  v_scientificName, v_kingdom, v_phylum, v_class, v_order_, v_family, v_genus, v_subgenus,
  v_specificEpithet, v_infraspecificEpithet, v_scientificNameAuthorship, v_taxonrank, v_higherClassification
ORDER BY v_scientificName, numocc DESC;
Lookup taxonkey with both old & new lookup
CREATE TABLE markus.name_matches AS
SELECT
  n.numocc, n.numdatasets,
  n.v_scientificName, n.v_kingdom, n.v_phylum, n.v_class, n.v_order, n.v_family, n.v_genus, n.v_subgenus,
  n.v_specificEpithet, n.v_infraspecificEpithet, n.v_scientificNameAuthorship, n.v_taxonrank, n.v_higherClassification,
  -- old matching (PROD), stored with an _old suffix
  prod.taxonKey as taxonKey_old, prod.scientificName as scientificName_old, prod.rank as rank_old,
  prod.status as status_old, prod.matchType as matchType_old, prod.confidence as confidence_old,
  prod.kingdomKey as kingdomKey_old, prod.phylumKey as phylumKey_old, prod.classKey as classKey_old,
  prod.orderKey as orderKey_old, prod.familyKey as familyKey_old, prod.genusKey as genusKey_old,
  prod.speciesKey as speciesKey_old,
  prod.kingdom as kingdom_old, prod.phylum as phylum_old, prod.class_ as class_old, prod.order_ as order_old,
  prod.family as family_old, prod.genus as genus_old, prod.species as species_old,
  -- new matching (UAT), stored with plain terms
  uat.taxonKey as taxonKey, uat.scientificName as scientificName, uat.rank as rank,
  uat.status as status, uat.matchType as matchType, uat.confidence as confidence,
  uat.kingdomKey as kingdomKey, uat.phylumKey as phylumKey, uat.classKey as classKey,
  uat.orderKey as orderKey, uat.familyKey as familyKey, uat.genusKey as genusKey,
  uat.speciesKey as speciesKey,
  uat.kingdom as kingdom, uat.phylum as phylum, uat.class_ as class_, uat.order_ as order_,
  uat.family as family, uat.genus as genus, uat.species as species
FROM (
  SELECT
    numocc, numdatasets,
    v_scientificName, v_kingdom, v_phylum, v_class, v_order, v_family, v_genus, v_subgenus,
    v_specificEpithet, v_infraspecificEpithet, v_scientificNameAuthorship, v_taxonrank, v_higherClassification,
    match('PROD', v_kingdom, v_phylum, v_class, v_order, v_family, v_genus, v_scientificName, v_specificEpithet, v_infraspecificEpithet) prod,
    match('UAT', v_kingdom, v_phylum, v_class, v_order, v_family, v_genus, v_scientificName, v_specificEpithet, v_infraspecificEpithet) uat
  FROM markus.names
) n;
-- Tab delimited export of the classifications whose match changed
CREATE TABLE markus.matches_changed
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' NULL DEFINED AS ''
AS SELECT * FROM markus.name_matches WHERE taxonKey != taxonKey_old;
-- Tab delimited export of all classification matches
CREATE TABLE markus.matches_all
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' NULL DEFINED AS ''
AS SELECT * FROM markus.name_matches;