Monday, 30 March 2015

Improving the GBIF Backbone matching

In GBIF, occurrence records are matched to a taxon in a backbone taxonomy using the species match API. This is important to reduce spelling variations and to create consistent metrics and searches according to a single classification and synonymy.

Over the past years we have been alerted to various bad matches. Most of the reported issues refer to a false fuzzy match for a name missing in our backbone.

In order to improve the taxonomic classification of occurrence records, we are undertaking two activities. The first is to improve the algorithms we use to fuzzily match names, and the second will be to improve the algorithms used to assemble the backbone taxonomy itself. Here I explain some of the work currently underway to tackle the former, which is visible on the test environment.

1. Name parsing of undetermined species

Occurrence records contain many partly undetermined names such as Lucanus spec. These rank markers were erroneously treated as real species epithets, which, combined with fuzzy matching, produced poor results.

  • Xysticus sp. used to wrongly match Xysticus spiethi while it now just matches the genus Xysticus.
  • Triodia sp. used to match the family Poaceae while it now matches the genus Triodia.
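
The idea can be sketched in a few lines of Python. This is only an illustration, not the actual GBIF name parser: the set of rank markers in `RANK_MARKER` and the function name `strip_undetermined` are assumptions made for the example.

```python
import re

# Hypothetical set of markers that signal an undetermined species;
# the real parser may recognise a different set.
RANK_MARKER = re.compile(r"^(sp|spec|spp)\.?$", re.IGNORECASE)

def strip_undetermined(name: str) -> str:
    """Drop a trailing rank marker so that only the determined part is matched."""
    parts = name.split()
    if len(parts) >= 2 and RANK_MARKER.match(parts[-1]):
        parts = parts[:-1]
    return " ".join(parts)

print(strip_undetermined("Xysticus sp."))      # Xysticus
print(strip_undetermined("Lucanus spec."))     # Lucanus
print(strip_undetermined("Xysticus spiethi"))  # Xysticus spiethi
```

With the marker removed before matching, the fuzzy matcher never sees "sp." as a candidate epithet and so cannot stretch it into "spiethi".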

2. Damerau–Levenshtein distance algorithm

For scoring fuzzy matches we have so far applied the Jaro-Winkler distance, which is often used for matching person names. It tends to allow for rather fuzzy matches at the end of long strings. This is desirable for scientific names, but the allowed fuzziness was too big and we decided to revert to the classical and more predictable Damerau–Levenshtein distance. This reduces false positive fuzzy matches considerably, even though we lose a few good matches at the same time.
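
For readers unfamiliar with the metric, here is a minimal Python sketch of the restricted Damerau–Levenshtein distance (also called optimal string alignment). It is a textbook version for illustration and not necessarily the variant or implementation used in the GBIF matching code; the function name `osa_distance` is made up for the example.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance("Xysticus", "Xystiucs"))             # 1, one swapped pair
print(osa_distance("Puma concolor", "Puma concolour"))  # 1, one inserted letter
```

Every edit costs the same wherever it occurs in the string, which is what makes the score more predictable than Jaro-Winkler for long names.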


Matching results

The distinct, verbatim classifications of 528 million records were passed through the original and the new fuzzy matching algorithms - this included 10.5 million distinct classifications in total. The results show that 428 thousand classifications (4%), representing 5,323,758 occurrence records, produce a different match. So far we have taken a random subsample of the records which change, and manually inspected the results - we can hardly spot any regressions or wrong matches.

We have published the complete matching comparison as well as the subset of changed records at Zenodo as tab delimited files:

The schema of the files has 3 column families, each with the scientificName, GBIF taxonKey and the higher DwC classification terms for every matched record (verbatim prefixed with v_, old matching with an _old suffix and the new matching results with plain terms, e.g. v_scientificName, scientificName_old, scientificName).
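
As a rough illustration of how the three column families can be used, the following Python sketch lists the rows whose match changed. The column names follow the naming scheme described above, but their exact spelling in the published files is an assumption, and the sample rows and keys are invented placeholders rather than real GBIF data.

```python
import csv
import io

def changed_matches(tsv_text: str):
    """Yield (verbatim name, old match, new match) for rows whose taxonKey differs."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if row["taxonKey"] != row["taxonKey_old"]:
            yield row["v_scientificName"], row["scientificName_old"], row["scientificName"]

# Tiny made-up sample; the keys are placeholders, not real GBIF taxon keys.
sample = (
    "v_scientificName\ttaxonKey_old\tscientificName_old\ttaxonKey\tscientificName\n"
    "Xysticus sp.\t111\tXysticus spiethi\t222\tXysticus\n"
    "Puma concolor\t333\tPuma concolor\t333\tPuma concolor\n"
)
print(list(changed_matches(sample)))
```

For the real files you would pass an open file handle to `csv.DictReader` instead of the in-memory sample.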

We are glad to receive any feedback on further improvements or bad matching results we need to fix in the next iteration of work. Please get in touch with Markus Döring.


Create distinct occurrence names table

CREATE TABLE markus.names AS 
SELECT count(*) as numocc, count(distinct datasetKey) as numdatasets, v_scientificName, v_kingdom, v_phylum, v_class, v_order_ as v_order, v_family, v_genus, v_subgenus, v_specificEpithet, v_infraspecificEpithet, v_scientificNameAuthorship, v_taxonrank, v_higherClassification 
FROM prod_b.occurrence_hdfs 
GROUP BY v_scientificName, v_kingdom, v_phylum, v_class, v_order_, v_family, v_genus, v_subgenus, v_specificEpithet, v_infraspecificEpithet, v_scientificNameAuthorship, v_taxonrank, v_higherClassification 
ORDER BY v_scientificName, numocc DESC;

Lookup taxonkey with both old & new lookup

CREATE TABLE markus.name_matches AS
SELECT

  prod.taxonKey as taxonKey_old,
  prod.scientificName as scientificName_old,
  prod.rank as rank_old,
  prod.status as status_old,
  prod.matchType as matchType_old,
  prod.confidence as confidence_old,
  prod.kingdomKey as kingdomKey_old,
  prod.phylumKey as phylumKey_old,
  prod.classKey as classKey_old,
  prod.orderKey as orderKey_old,
  prod.familyKey as familyKey_old,
  prod.genusKey as genusKey_old,
  prod.speciesKey as speciesKey_old,
  prod.kingdom as kingdom_old,
  prod.phylum as phylum_old,
  prod.class_ as class_old,
  prod.order_ as order_old,
  prod.family as family_old,
  prod.genus as genus_old,
  prod.species as species_old,

  uat.taxonKey as taxonKey,
  uat.scientificName as scientificName,
  uat.rank as rank,
  uat.status as status,
  uat.matchType as matchType,
  uat.confidence as confidence,
  uat.kingdomKey as kingdomKey,
  uat.phylumKey as phylumKey,
  uat.classKey as classKey,
  uat.orderKey as orderKey,
  uat.familyKey as familyKey,
  uat.genusKey as genusKey,
  uat.speciesKey as speciesKey,
  uat.kingdom as kingdom,
  uat.phylum as phylum,
  uat.class_ as class_,
  uat.order_ as order_,
  uat.family as family,
  uat.genus as genus,
  uat.species as species

FROM (
  SELECT
    v_kingdom, v_phylum, v_class, v_order, v_family, v_genus, v_subgenus,
    match('PROD', v_kingdom, v_phylum, v_class, v_order, v_family, v_genus, v_scientificName, v_specificEpithet, v_infraspecificEpithet) prod, 
    match('UAT', v_kingdom, v_phylum, v_class, v_order, v_family, v_genus, v_scientificName, v_specificEpithet, v_infraspecificEpithet) uat
  FROM markus.names
) n;

Hive exports

CREATE TABLE markus.matches_changed AS
SELECT * FROM markus.name_matches
WHERE taxonKey != taxonKey_old;

CREATE TABLE markus.matches_all AS
SELECT * FROM markus.name_matches;

Friday, 27 March 2015

IPT v2.2 – Making data citable through DataCite

GBIF is pleased to release IPT 2.2, now capable of automatically connecting with either DataCite or EZID to assign DOIs to datasets. This new feature makes biodiversity data easier to access on the Web and facilitates tracking its re-use.

DataCite integration explained

DataCite specialises in assigning DOIs to datasets. It was established in 2009 with three fundamental goals(1):
  1. Establish easier access to research data on the Internet
  2. Increase acceptance of research data as citable contributions to the scholarly record
  3. Support research data archiving to permit results to be verified and re-purposed for future study

Wednesday, 26 November 2014

Upgrading our cluster from CDH4 to CDH5

A little over a year ago we wrote about upgrading from CDH3 to CDH4, and now the time has come to upgrade from CDH4 to CDH5. The short version: upgrading the cluster itself was easy, but getting our applications to work with the new classpaths, especially MapReduce v2 (YARN), was painful.

The Cluster

Our cluster has grown since the last upgrade (now 12 slaves and 3 masters), and we no longer had the luxury of splitting the machines to build a new cluster from scratch. So this was an in-place upgrade, using CDH Manager.

Upgrade CDH Manager

The first step was upgrading to CDH Manager 5.2 (from our existing 4.8). The Cloudera documentation is excellent so I don't need to repeat it here. What we did find was that the management service now requests significantly more RAM for its monitoring services (minimum "happy" config of 14GB), to the point where our existing masters were overwhelmed. As a stopgap we've added a 4th old machine to the "masters" group, used exclusively for the management service. In the longer term we'll replace the 4 masters with 3 new machines that have enough resources.

Upgrade Cluster Members

Again the Cloudera documentation is excellent but I'll just add a bit. The upgrade process will now ask if a Java JDK should be installed (an improvement over the old behaviour of just installing one anyway). That means we could finally remove the Oracle JDK 6 rpms that have been lying around on the machines. For some reason the Host Inspector still complains about OpenJDK 7 vs Oracle 7 but we've happily been running on OpenJDK 7 since early 2014, and so far so good with CDH5 as well. After the upgrade wizard finished we had to tweak memory settings throughout the cluster, including setting the "Memory Overcommit Validation Threshold" to 0.99, up from its (very conservative) default of 0.8. Cloudera has another nice blog post on figuring out memory settings for YARN. Additionally, Hue's configuration required some attention because after the upgrade it had forgotten where ZooKeeper and the HBase Thrift server were. All in all quite painless.

The Gotchas

Getting our software to work with CDH5 was definitely not painless. All of our problems stemmed from conflicting versions of jars, due either to changes in CDH dependencies, or to changes in how a user classpath is specified as having priority over that of YARN/HBase/Oozie. Additionally it took some time to wrap our heads around the new artifact packaging used by YARN and HBase. Note that we also use Maven for dependency management.

We're not alone in our suffering at the hands of mismatched Guava versions (e.g. HADOOP-10101, HDFS-7040), but suffer we did. We resorted to specifying version 14.0.1 in any of our code that touches Hadoop and, more importantly, HBase, and excluding any higher-version Guavas from our dependencies. This meant downgrading some actual code that was using Guava 15, but it was the easiest path to getting a working system.

We have many dependencies on Jackson 1.9 and 2+ throughout our code, so downgrading to match HBase's shipped 1.8.8 was not an option. It meant figuring out the classpath precedence rules described below, and solving the problems (like logging) that doing so introduced.

Logging in Java is a horrible mess, and with the number of intermingled projects required to make application software run on a Hadoop/HBase cluster it's no surprise that getting logging to work was brutal. We code to the SLF4J API and use Logback as our implementation of choice. The Hadoop world uses a mix of Java Commons Logging, java.util.logging, and log4j. We thought that meant we'd be clear if we used the same SLF4J API (1.7.5) and used the bridges (log4j-over-slf4j, jcl-over-slf4j, and jul-to-slf4j), which has worked for us up to now. <montage>Angry men smash things angrily over the course of days</montage> Turns out, there's a bug in the 1.7.5 implementation of log4j-over-slf4j, which blows up as we described over at YARN-2875. Short version - use 1.7.6+ in client code that attempts to use YARN and log4j-over-slf4j.

The crux of our problems was having our classpath loaded after the Hadoop classpath had been loaded, meaning old versions of our dependencies were loaded first. The new, surprisingly hard to find parameter that tells YARN to load your classpath first is "mapreduce.job.user.classpath.first". YARN also quizzically claims that the parameter is deprecated, but.. works for me.

Convincing Oozie to load our classpath involved another montage of angry faces. It uses the same parameter as YARN, but with a prefix, so what you want is "oozie.launcher.mapreduce.job.user.classpath.first". We had been loading the old parameter "mapreduce.task.classpath.user.precedence" in each action in the workflow using the <job-xml> tag to load the configs from a file called hive-default.xml. We then encountered two problems: 
  1. Note the name - we used hive-default.xml instead of hive-site.xml because of a bug in Oozie (discussed here and here). That was fixed in the CDH5.2 Oozie, but we didn't get the memo. Now the file is called hive-site.xml and contains our specific configs and is again being picked up. BUT:
  2. Adding oozie.launcher.mapreduce.job.user.classpath.first to hive-site.xml doesn't work! As we wrote up in Oozie bug OOZIE-2066 this parameter has to be specified for each action, at the action level, in the workflow.xml. Repeating the example workaround from the bug report:
<action name="run-test">
  <ok to="end" />
  <error to="kill" />
</action>

New Packaging Woes

We build our jars using a combination of jar-with-dependencies and the shade plugin, but in both cases it means all our dependencies are built in. The problems come when a downstream, transitive dependency loads a different (typically older) version of one of the jars we've bundled in our main jar. This happens a lot with the Hadoop and HBase artifacts, especially when it comes to MR1 and logging.

Example exclusions

hbase-server (needed to run MapReduce over HBase):

hbase-testing-util (needed to run mini clusters):


hadoop-client (removing logging):

Beyond just sorting conflicting dependencies, we also encountered a problem that presented as "No FileSystem for scheme: file". It turns out we had projects bringing in both hadoop-common and hadoop-hdfs, and so we were getting only one of the META-INF/services files in the final jar.  Thus we could not use the FileSystem to read local files (like jars for the class path) and also from HDFS.  The fix was to include the org.apache.hadoop.fs.FileSystem in our project explicitly:

Finally we had to stop the TableMapReduceUtil from bringing in its own dependent jars, which brought in yet more conflicting jars - this appears to be a change in the default behaviour, where dependent jars are now being brought in by default in the shorter versions of initTableMapper:


As you can see the client side of the upgrade was beset on all sides by the iniquities of jars, packaging and old dependencies. It seems strange that upgrading Guava is considered a no-no, major breaking change by these projects, yet discussions about removing HBaseTablePool are proceeding apace and will definitely break many projects (including any of ours that touch HBase). While we're ultimately pleased that everything now works, and looking forward to benefiting from the performance improvements and new features of CDH5, it wasn't a great trip. Hopefully our experience will help others migrate more smoothly.

Tuesday, 6 May 2014

Multimedia in GBIF

We are happy to announce another long awaited improvement to the GBIF portal. Our portal test environment now shows multimedia items and their metadata associated with occurrences. As of today we have nearly 700 thousand occurrences with multimedia indexed. Based on the Dublin Core type vocabulary we distinguish between images, videos and sound files. As has been requested by many people the media type is available as a new filter in the occurrence search and subsequently in downloads. For example you can now easily find all audio recordings of birds.

UAM:Mamm:11470 - Eumetopias jubatus - skull
If you follow through to the details page of any of those records you can see that sound files show up as simple links to the media file. We do the same for video files and currently do not have plans to embed any media player in our portal. This is different from images, which are shown in a dedicated gallery you might already have encountered on species pages. On the left you can see an example of a skull specimen with multiple images.

When requested for the first time, GBIF transiently caches the original images and processes them into various standard sizes and formats suitable for use in the portal.

Publishing multimedia metadata

GBIF indexes multimedia metadata published in different ways within the GBIF network. The options range from a simple URL given as an additional field in Darwin Core, through multiple items expressed as ABCD XML, to a dedicated multimedia extension in Darwin Core archives. The main difference between them is how expressive the metadata can be.

Simple Darwin Core

Melocactus intortus record in iNaturalist
Whenever we spot the term dwc:associatedMedia in XML or Darwin Core archives as part of a simple, flat occurrence record we try to extract URLs to media items. As the term officially allows for concatenated lists of URLs we try common delimiters such as the comma, semicolon or pipe symbol. An example of multiple, concatenated image URLs can be found in iNaturalist.
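
The delimiter handling can be sketched roughly as follows. This is an illustrative Python snippet rather than the actual GBIF interpreter: the function name is hypothetical and the example.org URLs are placeholders. Note that a naive split like this would also break a URL that itself contains a comma or semicolon.

```python
import re

# Delimiters named in the post: comma, semicolon and pipe.
DELIMITERS = re.compile(r"[,;|]")

def extract_media_urls(associated_media):
    """Split a dwc:associatedMedia value and keep the parts that look like URLs."""
    parts = (p.strip() for p in DELIMITERS.split(associated_media or ""))
    return [p for p in parts if p.startswith(("http://", "https://"))]

value = "http://example.org/photos/1.jpg | http://example.org/photos/1"
print(extract_media_urls(value))
```

Each URL returned would then be treated as its own media item.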

As you can see on the right every extracted link is regarded as a separate media item as there is no standard way to detect that 2 links refer to the same item. In the example above every image has a link to the actual image file and another one to the respective HTML page where its metadata is presented. There is also no way to specify additional metadata about a link. As a consequence all images based on dwc:associatedMedia do not have a title, license or any further information. The verbatim data for that record before we extract image links can be seen here:

Darwin Core archive multimedia extension

By having a dedicated extension for media items many media items per core occurrence record can be published in a structured way. This is the GBIF recommended way to publish multimedia as it gives you most control over your metadata. Note that the same extension can also be used to publish multimedia for species in checklist datasets. This extension, based entirely on existing Dublin Core terms, allows you to specify the following information about a media item, all of which will make it into the GBIF portal if provided:

  •  dc:type, the kind of media item based on the DCMI Type Vocabulary:  StillImage, MovingImage or Sound
  •  dc:format, MIME type of the multimedia object's format 
  •  dc:identifier, the public URL that identifies and locates the media file directly, not the html page it might be shown on
  •  dc:references, the URL of an html webpage that shows the media item or its metadata. It is recommended to provide this url even if a media file exists as it will be used for linking out
  •  dc:title, the media item's title
  •  dc:description, a textual description of the content of the media item
  •  dc:created, the date and time this media item was taken
  •  dc:creator, the person that took the image, recorded the video or sound
  •  dc:contributor, any contributor in addition to the creator that helped in recording the media item
  •  dc:publisher, the name of an entity responsible for making the image available
  •  dc:audience, a class or description for whom the image is intended or useful
  •  dc:source, a reference to the source the media item was derived or taken from. For example a book from which an image was scanned or the original provider of a photo/graphic, such as photography agencies
  •  dc:license, license for this media object. If possible declare it as CC0 to ensure greatest use
  •  dc:rightsHolder, the person or organization owning or managing rights over the media item

Access to Biological Collections Data

As usual we also provide a binding from the TDWG ABCD standard (versions 1.2 and 2.06) mostly used with the BioCASE software.

From ABCD 1.2 we extract media information based on the UnitDigitalImage subelements. In particular, we use the file URL (ImageURI), the description (Comment) and the license (TermsOfUse).

In ABCD 2.06 we use the unit MultiMediaObject subelements instead. Here there are distinct file and webpage URLs (FileURI, ProductURI), the description (Comment), the license (License/Text, TermsOfUseStatements) and also an indication of the MIME type (Format). The bird sound example from above comes in as ABCD 2.06 via the Animal Sound Archive dataset. You can see the original details of that ABCD record in its raw XML fragment. There are also fossil images available through ABCD.

Missing from both ABCD versions are elements for a media title, creator and creation date.

Media type interpretation

We derive the media type from either an explicitly given dc:type, the MIME type found in dc:format or the media file suffix. In the case of dwc:associatedMedia found in simple Darwin Core we can only rely on the file URL to interpret the kind of media item. If that URL points to an HTML page instead of an actual static media file with a well-known suffix, the media type remains unknown.
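
A minimal Python sketch of that fallback order, assuming the three DCMI type values listed earlier and using the standard library's suffix table in place of whatever lookup GBIF actually uses; the function names are hypothetical.

```python
import mimetypes
from urllib.parse import urlparse

DC_TYPES = {"StillImage", "MovingImage", "Sound"}
MIME_PREFIX = {"image": "StillImage", "video": "MovingImage", "audio": "Sound"}

def from_mime(mime):
    """Map a MIME type such as 'audio/mpeg' to a media type, or None."""
    if not mime:
        return None
    return MIME_PREFIX.get(mime.split("/")[0].strip().lower())

def media_type(dc_type=None, dc_format=None, url=None):
    """Try dc:type first, then dc:format, then the file suffix of the URL."""
    if dc_type in DC_TYPES:
        return dc_type
    guessed = mimetypes.guess_type(urlparse(url).path)[0] if url else None
    return from_mime(dc_format) or from_mime(guessed)

print(media_type(url="http://example.org/photos/1.jpg"))  # StillImage
print(media_type(dc_format="audio/mpeg"))                 # Sound
print(media_type(url="http://example.org/photos/1"))      # None
```

The last call shows the dwc:associatedMedia case described above: a link to a web page without a recognisable suffix leaves the type unknown.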

Production deployment

We hope you like this new feature and we are eager to get this out into production within the next few weeks. This is the first iteration of this work, and like all GBIF developments we welcome any feedback.

Wednesday, 23 April 2014

IPT v2.1 – Promoting the use of stable occurrenceIDs

GBIF is pleased to announce the release of the IPT 2.1 with the following key changes:
  • Stricter controls for the Darwin Core occurrenceID to improve stability of record level identifiers network wide
  • Ability to support Microsoft Excel spreadsheets natively
  • Japanese translation thanks to Dr. Yukiko Yamazaki from the National Institute of Genetics (NIG) in Japan
With this update, GBIF continues to refine and improve the IPT based on feedback from users, while carrying out a number of deliverables that are part of the GBIF Work Programme for 2014-16.

The most significant new feature that has been added in this release is the ability to validate that each record within an Occurrence dataset has a unique identifier. If any missing or duplicate identifiers are found, publishing fails, and the problem records are logged in the publication report.
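
Conceptually the check looks something like the Python sketch below. It is not the IPT's actual code (the IPT is a Java application); the function name and the dict-based record layout are assumptions made for the example.

```python
from collections import Counter

def validate_occurrence_ids(records):
    """Return (row numbers with a missing ID, sorted list of duplicated IDs)."""
    ids = [(r.get("occurrenceID") or "").strip() for r in records]
    missing = [row for row, value in enumerate(ids, start=1) if not value]
    counts = Counter(value for value in ids if value)
    duplicates = sorted(value for value, n in counts.items() if n > 1)
    return missing, duplicates

records = [
    {"occurrenceID": "urn:catalog:A:1"},
    {"occurrenceID": "urn:catalog:A:2"},
    {"occurrenceID": "urn:catalog:A:1"},
    {"occurrenceID": ""},
]
missing, duplicates = validate_occurrence_ids(records)
print(missing, duplicates)  # [4] ['urn:catalog:A:1']
```

In this sketch, publishing would be refused whenever either list is non-empty, and both lists would go into the publication report.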

This new feature will support data publishers who use the Darwin Core term occurrenceID to uniquely identify their occurrence records. The change is intended to make it easier to link to records as they propagate throughout the network, simplifying the mechanism to cross-reference databases and potentially helping to track use.

Previously, GBIF has asked publishers to use the three Darwin Core terms institutionCode, collectionCode, and catalogNumber to uniquely identify their occurrence records. This triplet-style identifier will continue to be accepted; however, it is notoriously unstable since the codes are prone to change and in many cases are meaningless for datasets originating from outside of the museum collections community. For this reason, GBIF is adopting the recommendations coming from the IPT user community and recommending the use of occurrenceID instead.

Best practices for creating an occurrenceID are that they (a) must be unique within the dataset, (b) should remain stable over time, and (c) should be globally unique wherever possible. By taking advantage of the IPT’s built-in identifier validation, publishers will automatically satisfy the first condition.

Ultimately, GBIF hopes that by transitioning to more widespread use of stable occurrenceIDs, the following goals can be realized:
  • GBIF can begin to resolve occurrence records using an occurrenceID. This resolution service could also help check whether identifiers are globally unique or not.
  • GBIF’s own occurrence identifiers will become inherently more stable as well.
  • GBIF can sustain more reliable cross-linkages to its records from other databases (e.g. GenBank).
  • Record-level citation can be made possible, enhancing attribution and the ability to track data usage.
  • It will be possible to consider tracking annotations and changes to a record over time.
If you’re a new or existing publisher, GBIF hopes you’ll agree these goals are worth working towards, and start using occurrenceIDs.

The IPT 2.1 also includes support for uploading Excel files as data sources.

Another enhancement is that the interface has been translated into Japanese. GBIF offers its sincere thanks to Dr. Yukiko Yamazaki from the National Institute of Genetics (NIG) in Japan for this extraordinary effort.

In the 11 months since version 2.0.5 was released, a total of 11 enhancements have been added, and 38 bugs have been squashed. So what else has been fixed?

If you like the IPT’s auto publishing feature, you will be happy to know the bug causing the temporary directory to grow until disk space was exhausted has now been fixed. Resources that are configured to auto publish, but fail to be published for whatever reason, are now easily identifiable within the resource tables as shown:

If you ever created a data source by connecting directly to a database like MySQL, you may have noticed an error that caused datasets to truncate unexpectedly upon encountering a row with bad data. Thanks to a patch from Paul Morris (Harvard University Herbaria) bad rows now get skipped and reported to the user without skipping subsequent rows of data.

As always we’d like to give special thanks to the other volunteers who contributed to making this version a reality:
On behalf of the GBIF development team, I can say that we’re really excited to get this new version out to everyone! Happy publishing.

Tuesday, 4 March 2014

Lots of columns with Hive and HBase

We're in the process of rolling out a long-awaited feature here at GBIF, namely the indexing of more fields from Darwin Core. Until the launch of our now HBase-backed occurrence store (in the fall of 2013) we couldn't index more than about 30 terms from Darwin Core because we were limited by our MySQL schema. Now that we have HBase we can add as many columns as we like!

Or so we thought.

Our occurrence download service gets a lot of use and naturally we want downloaders to have access to all of the newly indexed fields. Our downloads work as an Oozie workflow that executes a Hive query of an HDFS table (more details in this Cloudera blog). We use an HDFS table to significantly speed up the scan speed of the query - using an HBase backed Hive table takes something like 4-5x as long. But to generate that HDFS table we need to start from a Hive table that _is_ backed by HBase.

Here's an example of how to write a Hive table definition for an HBase-backed table:

CREATE EXTERNAL TABLE tiny_hive_example (
  key INT,
  kingdom STRING,
  kingdomkey INT
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#b,o:kingdom#s,o:kingdomKey#b")
TBLPROPERTIES (
  "hbase.table.name" = "tiny_hbase_table",
  "hbase.table.default.storage.type" = "binary"
);

But now that we have something like 600 columns to map to HBase, and that we've chosen to name our HBase columns just like the DwC Terms they represent (e.g. the basis of record term's column name is basisOfRecord) we have a very long "SERDEPROPERTIES" string in our Hive table definition. How long? Well, way more than the 4000 character limit of Hive. For our Hive metastore we use PostgreSQL and when Hive creates the SERDE_PARAMS table it gives the PARAM_VALUE column a datatype of VARCHAR(4000). Because 4k should be enough for anyone, right? Sigh.
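
To get a feel for the numbers, here is a small Python sketch that builds a mapping string in the same format as the example above and checks its length. The 600 term names are invented stand-ins, not the real Darwin Core terms, and the helper name is hypothetical.

```python
def columns_mapping(columns, family="o"):
    """Build an hbase.columns.mapping value: the row key first, then one entry per column.

    `columns` maps an HBase column name to its storage type ('s' string, 'b' binary).
    """
    entries = [":key#b"] + [f"{family}:{name}#{kind}" for name, kind in columns.items()]
    return ",".join(entries)

# Reproduces the mapping of the tiny example table.
print(columns_mapping({"kingdom": "s", "kingdomKey": "b"}))

# Invented stand-ins for roughly 600 Darwin Core term names.
wide = {f"someDarwinCoreTerm{i:03d}": "s" for i in range(600)}
mapping = columns_mapping(wide)
print(len(mapping), len(mapping) > 4000)
```

Even with short made-up names the string is several times longer than 4000 characters, so the metastore column cannot hold it.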

The solution:

alter table "SERDE_PARAMS" alter column "PARAM_VALUE" type text;

We did lots of testing to make sure the existing definitions didn't get nuked by this change, and can confirm that the Hive code is not checking that 4000 value either (in the source, the value is simply turned into a String). Our new super-wide downloads table works, and will be in production soon!

Monday, 28 October 2013

The new (real-time) GBIF Registry has gone live

For the last 4 years, GBIF has operated the GBRDS registry with its own web application. Previously, when a dataset was registered in the GBRDS registry (for example using an IPT), it did not become visible in the portal until after rollover took place, which could take several weeks.

In October, GBIF launched its new portal. During the launch we indicated that real-time data management would be starting up in November. We are excited to inform you that today we took the first step towards making this a reality, by enabling the live operation of the new GBIF registry.
What does this mean for you?
  • any dataset registered through GBIF (using an IPT, web services, or manually by liaison with the Secretariat) will be visible in the portal immediately because the portal and new registry are fully integrated 
  • the GBRDS web application is no longer visible, since the new portal displays all the appropriate information
  • the GBRDS sandbox registry web application is no longer visible, but a new registry sandbox has been set up to provide for IPT installations installed in test mode
Please note that the new registry API supports the same web service API that the GBRDS previously did, so existing tools and services built on the GBRDS API (such as the IPT) will continue to work uninterrupted. 
As you may have noticed, occurrence data crawling has been temporarily suspended since the middle of September to prepare for launching real-time data management. We aim to resume occurrence data crawling in the first week of November, meaning that updates to the index will be visible immediately afterwards.
On behalf of the GBIF development team, I thank you for your patience during this transition time, and hope you are looking forward to real-time data management as much as we are.