Wednesday, 23 April 2014

IPT v2.1 – Promoting the use of stable occurrenceIDs

GBIF is pleased to announce the release of the IPT 2.1 with the following key changes:
  • Stricter controls for the Darwin Core occurrenceID to improve stability of record level identifiers network wide
  • Ability to support Microsoft Excel spreadsheets natively
  • Japanese translation thanks to Dr. Yukiko Yamazaki from the National Institute of Genetics (NIG) in Japan
With this update, GBIF continues to refine and improve the IPT based on feedback from users, while carrying out a number of deliverables that are part of the GBIF Work Programme for 2014-16.

The most significant new feature that has been added in this release is the ability to validate that each record within an Occurrence dataset has a unique identifier. If any missing or duplicate identifiers are found, publishing fails, and the problem records are logged in the publication report.

This new feature will support data publishers who use the Darwin Core term occurrenceID to uniquely identify their occurrence records. The change is intended to make it easier to link to records as they propagate throughout the network, simplifying the mechanism to cross reference databases and potentially help towards tracking use.

Previously, GBIF has asked publishers to use the three Darwin Core terms: institutionCode, collectionCode, and catalogNumber to uniquely identify their occurrence records. This triplet style identifier will continue to be accepted, however, it is notoriously unstable since the codes are prone to change and in many cases are meaningless for datasets originating from outside of the museum collections community. For this reason, GBIF is adopting the recommendations coming from the IPT user community and recommending the use of occurrenceID instead.

Best practices for creating an occurrenceID are that they (a) must be unique within the dataset, (b) should remain stable over time, and (c) should be globally unique wherever possible. By taking advantage of the IPT’s built-in identifier validation, publishers will automatically satisfy the first condition.

Ultimately, GBIF hopes that by transitioning to more widespread use of stable occurrenceIDs, the following goals can be realized:
  • GBIF can begin to resolve occurrence records using an occurrenceID. This resolution service could also help check whether identifiers are globally unique or not.
  • GBIF’s own occurrence identifiers will become inherently more stable as well.
  • GBIF can sustain more reliable cross-linkages to its records from other databases (e.g. GenBank).
  • Record-level citation can be made possible, enhancing attribution and the ability to track data usage.
  • It will be possible to consider tracking annotations and changes to a record over time.
If you’re a new or existing publisher, GBIF hope you’ll agree these goals are worth working towards, and start using occurrenceIDs.

The IPT 2.1 also includes support for uploading Excel files as data sources.

Another enhancement is that the interface has been translated into Japanese. GBIF offer their sincere thanks to Dr. Yukiko Yamazaki from the National Institute of Genetics (NIG) in Japan for this extraordinary effort.

We have also been lucky to collaborate with Yasumasa Shigemoto, also from the NIG, who has helped evaluate specific security aspects of the IPT. By using a security inspection utility, Vulnerability Explorer (VEX), a number of potential security vulnerabilities where discovered and addressed. The IPT 2.1 passes the VEX inspection, where previous releases raised warnings.

In the 11 months since version 2.0.5 was released, a total of 11 enhancements have been added, and 38 bugs have been squashed. So what else has been fixed?

If you like the IPT’s auto publishing feature, you will be happy to know the bug causing the temporary directory to grow until disk space was exhausted has now been fixed. Resources that are configured to auto publish, but fail to be published for whatever reason, are now easily identifiable within the resource tables as shown:

If you ever created a data source by connecting directly to a database like MySQL, you may have noticed an error that caused datasets to truncate unexpectedly upon encountering a row with bad data. Thanks to a patch from Paul Morris (Harvard University Herbaria) bad rows now get skipped and reported to the user without skipping subsequent rows of data.

As always we’d like to give special thanks to the other volunteers who contributed to making this version a reality:
On behalf of the GBIF development team, I can say that we’re really excited to get this new version out to everyone! Happy publishing.

Tuesday, 4 March 2014

Lots of columns with Hive and HBase

We're in the process of rolling out a long awaited feature here at GBIF, namely the indexing of more fields from Darwin Core. Until the launch of our now HBase-backed occurrence store (in the fall of 2013) we couldn't index more than about 30 or so terms from Darwin Core because we were limited by our MySQL schema. Now that we have HBase we can add as many columns as we like!

Or so we thought.

Our occurrence download service gets a lot of use and naturally we want downloaders to have access to all of the newly indexed fields. The way our downloads work is as an Oozie workflow that executes a Hive query of an HDFS table (more details in this Cloudera blog). We use an HDFS table to significantly speed up the scan speed of the query - using an HBase backed Hive table takes something like 4-5x as long. But to generated that HDFS table we need to start from a Hive table that _is_ backed by HBase.

Here's an example of how to write a Hive table definition for an HBase-backed table:

CREATE EXTERNAL TABLE tiny_hive_example (
  key INT,
  kingdom STRING,
  kingdomkey INT
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#b,o:kingdom#s,o:kingdomKey#b")
  "" = "tiny_hbase_table",
  "" = "binary"

But now that we have something like 600 columns to map to HBase, and that we've chosen to name our HBase columns just like the DwC Terms they represent (e.g. the basis of record term's column name is basisOfRecord) we have a very long "SERDEPROPERTIES" string in our Hive table definition. How long? Well, way more than the 4000 character limit of Hive. For our Hive metastore we use PostgreSQL and when Hive creates the SERDE_PARAMS table it gives the PARAM_VALUE column a datatype of VARCHAR(4000). Because 4k should be enough for anyone, right? Sigh.

The solution:

alter table "SERDE_PARAMS" alter column "PARAM_VALUE" type text;

We did lots of testing to make sure the existing definitions didn't get nuked by this change, and can confirm that the Hive code is not checking that 4000 value either (value is turned into a String: the source). Our new super-wide downloads table works, and will be in production soon!

Monday, 28 October 2013

The new (real-time) GBIF Registry has gone live

For the last 4 years, GBIF has operated the GBRDS registry with its own web application on  Previously, when a dataset got registered in the GBRDS registry (for example using an IPT) it wasn't immediately visible in the portal for several weeks until after rollover took place. 

In October, GBIF launched its new portal on  During the launch we indicated that the real-time data management would be starting up in November.  We are excited to inform you that today we made the first step towards making this a reality, by enabling the live operation of the new GBIF registry.   
What does this mean for you?
  • any dataset registered through GBIF (using an IPT, web services, or manually by liaison with the Secretariat) will be visible in the portal immediately because the portal and new registry are fully integrated 
  • the GBRDS web application ( is no longer visiblesince the new portal displays all the appropriate information
  • the GBRDS sandbox registry web application ( is no longer visible, but a new registry sandbox has been setup to provide for IPT installations installed in test mode
Please note that the new registry API supports the same web service API that the GBRDS previously did, so existing tools and services built on the GBRDS API (such as the IPT) will continue to work uninterrupted. 
As you may have noticed, occurrence data crawling has been temporarily suspended since the middle of September to prepare for launching real-time data managementWe aim to resume occurrence data crawling in the first week of November, meaning that updates to the index will be visible immediately afterwards.  
On behalf of the GBIF development team, I thank you for your patience during this transition time, and hope you are looking forward to real-time data management as much as we are. 

Thursday, 24 October 2013

GBIF Backbone in GitHub

For a long time I wanted to experiment with using GitHub as a tool to browse and manage the GBIF backbone taxonomy. Encouraged by similar sentiments from Rod Page, it would be nice to use git to keep track of versions and allow external parties to fork parts of the taxonomic tree and push back changes if desired. To top it off there is the great GitHub Treeslider to browse the taxonomy, so why not give it a try?

A GitHub filesystem taxonomy

I decided to export each taxon in the backbone as a folder that is named according to the canonical name, containing 2 files:

  1., a simple markdown file that gets rendered by github and shows the basic attributes of a taxon
  2. data.json, a complete json representation of the taxon as it is exposed via the new GBIF species API
The filesystem represents the taxonomic classification and taxon folders are nested accordingly, for example the species Amanita arctica is represented as:

This is just a first experimental step. One can improve the readme a lot to render more content in a human friendly way and include more data in the json file such as common names and synonyms.

Getting data into GitHub

It didn't take much to write a small class that exports the GBIF backbone into the filesystem as described above. The export of the entire taxonomy, with it's currently 4.4 million taxa incl synonyms, took about one hour on a MacBook Pro laptop. 
Not bad I thought, but then I tried to add the generated files into git and that's when I started to doubt. After waiting for half a day for git to add the files to my local index I decided to kill the process and start by only adding the smaller kingdoms first, excluding animals and plants. That left about 335.000 folders and 670.000 files to be added to git. Adding these to my local git still took several hours, committing and finally pushing them onto the GitHub server took yet another 2 hours.

Delta compression using up to 8 threads.
Compressing objects: 100% (1010487/1010487), done.
Writing objects: 100% (1010494/1010494), 173.51 MiB | 461 KiB/s, done.
Total 1010494 (delta 405506), reused 0 (delta 0)

After those files were added to the index committing a simple change to the main README file took 15 minutes to commit. Although I like the general idea and the pretty user interface I fear GitHub, and even git itself, are not made to be a repository of millions of files and folders.

First GitHub impressions

Browsing taxa in GitHub is surprisingly responsive. The fungi genus Amanita  contains 746 species, but it loads very quickly. In that regard GitHub is much nicer to use than the one on the new GBIF species pages which of course shows much more information. The rendered readme file is not ideal as it's at the very bottom of the page, but showing information to humans that way is nice - and markdown could also be parsed by machines quite easily if we adopt a simple format, for example for every property create a heading with that name and put the content into the following paragraph(s). 

The Amanita example also reveals a bug in the exporter class when dealing with synonyms (the Amanita readme contains the synonym information) and also with infraspecific taxa. For example Amanita muscaria contains some weird form information which is mapped erroneously to the species. This obviously should be fixed.

The GitHub browser sorts all files alphabetically. When mixing ranks (we skip intermediate unknown ranks in the backbone), for example see the Fungus kingdomsorting by the rank first is desirable. We could enable this by naming the taxon folders accordingly, prefixing with an alphabetically correctly ordered rank.

I have not had the time to try to version branches of the tree and see how usable that is. I suspect the git performance to be really slow, but that might not be a blocker if we only do versioning of larger groups and rarely push & pull.

Monday, 22 July 2013

Validating scientific names with the forthcoming GBIF Portal web service API

This guest post was written by Gaurav Vaidya, Victoria Tersigni and Robert Guralnick, and is being cross-posted to the VertNet Blog. David Bloom and John Wieczorek read through drafts of this post and improved it tremendously.

A whale named Physeter macrocephalus Physeter catodon Physeter macrocephalus (photograph by Gabriel Barathieu, reused under CC-BY-SA from the Wikimedia Commons)
Validating scientific names is one of the hardest parts of cleaning up a biodiversity dataset: as taxonomists' understanding of species boundaries change, the names attached to them can be synonymized, moved between genera or even have their Latin grammar corrected (it's Porphyrio martinicus, not Porphyrio martinica). Different taxonomists may disagree on what to call a species, whether a particular set of populations make up a species, subspecies or species complex, or even which of several published names correspond to our modern understanding of that species, such as the dispute over whether the sperm whale is really Physeter catodon Linnaeus, 1758, or Physeter macrocephalus Linnaeus, 1758.

A good way to validate scientific names is to match them against a taxonomic checklist: a publication that describes the taxonomy of a particular taxonomic group in a particular geographical region. It is up to the taxonomists who write such treatises to catalogue all the synonyms that have ever been used for the names in their checklist, and to identify a single accepted name for each taxon they recognize. While these checklists are themselves evolving over time and sometimes contradict each other, they serve as essential points of reference in an ever-changing taxonomic landscape.

Over a hundred digitized checklists have been assembled by the Global Biodiversity Information Facility (GBIF) and will be indexed in the forthcoming GBIF Portal, currently in development and testing. This collection includes large, global checklists, such as the Catalogue of Life and the International Plant Names Index, alongside smaller, more focussed checklists, such as a checklist of 383 species of seed plants found in the Singhalila National Park in India and the 87 species of moss bug recorded in the Coleorrhyncha Species File. Many of these checklists can be downloaded as Darwin Core Archive files, an important format for working with and exchanging biodiversity data.

So how can we match names against these databases? OpenRefine (the recently-renamed Google Refine) is a popular data cleaning tool, with features that make it easy to clean up many different types of data. Javier Otegui has written a tutorial on cleaning biodiversity data in OpenRefine, and last year Rod Page provided tools and a step-by-step guide to reconciling scientific names, establishing OpenRefine as an essential tool for biodiversity data and scientific name cleanup.

Linnaeus' original description of Felis Tigris. From an 1894 republication of Linnaeus' Systema Naturae, 10th edition, digitized by the Biodiversity Heritage Library.
We extended Rod's work by building a reconciliation service against the forthcoming GBIF web services API. We wanted to see if we could use one of the GBIF Portal's biggest strengths -- the large number of checklists it has indexed -- to identify names recognized in similar ways by different checklists. Searching through multiple checklists containing possible synonyms and accepted names increases the odds of finding an obscure or recently created name; and if the same name is recognized by a number of checklists, this may signify a well-known synonymy -- for example, two of the Portal checklists recognize that the species Linnaeus named Felis tigris is the same one that is known as Panthera tigris today.

To do this, we wrote a new OpenRefine reconciliation service that searches for a queried name in all the checklists on the GBIF Portal. It then clusters names using four criteria and counts how often a particular name has the same:
  • scientific name (for example, "Felis tigris"),
  • authority ("Linnaeus, 1758"),
  • accepted name ("Panthera tigris"), and
  • kingdom ("Animalia").

Once you do a reconciliation through our new service, your results will look like this:

Since OpenRefine limits the number of results it shows for any reconciliation, we know only that at least five checklists in the GBIF Portal matched the name "Felis tigris". Of these,
  1. Two checklists consider Felis tigris Linnaeus, 1758 to be a junior synonym of Panthera tigris (Linnaeus, 1758). Names are always sorted by the number of checklists that contain that interpretation, so this interpretation -- as it happens, the correct one -- is at the top of the list.
  2. The remaining checklists all consider Felis tigris to be an accepted name in its own right. They contain mutually inconsistent information: one places this species in the kingdom Animalia, another in the kingdom Metazoa, and the third contains both a kingdom and an taxonomic authority. You can click on each name to find out more details.

Using our reconciliation service, you can immediately see how many checklists agree on the most important details of the name match, and whether a name should be replaced with an accepted name. The same name may also be spelled identically under different nomenclatural codes: for example, does "Ficus" refer to the genus Ficus Röding, 1798 or the genus Ficus L.? If you know that the former is in kingdom Animalia while the latter is in Plantae, it becomes easier to figure out the right match for your dataset.

We've designed a complete workflow around our reconciliation service, starting with ITIS as a very fast first step to catch the most well recognized names, and ending with EOL's fuzzy matching search as a final step to look for incorrectly spelled names. For VertNet's 2013 Biodiversity Informatics Training Workshop, we wrote two tutorials that walk you through our workflow:

If you're already familiar with OpenRefine, you can add the reconciliation service with the URL:
Give it a try, and let us know if it helps you reconcile names faster!

The Map of Life project is continuing to work on improving OpenRefine for taxonomic use in a project we call TaxRefine. If you have suggestions for features you'd like to see, please let us know! You can leave a comment on this blog post, or add an issue to our issue tracker on GitHub.

Wednesday, 22 May 2013

IPT v2.0.5 Released - A melhor versão até o momento!

The GBIF Secretariat is proud to release version 2.0.5 of the Integrated Publishing Toolkit (IPT), available for download on the project website here.

As with every release, it's your chance to take advantage of the most requested feature enhancements and bug fixes.

The most notable feature enhancements include:
  • A resource can now be configured to publish automatically on an interval (See "Automated Publishing" section in User Manual)
  • The interface has been translated into Portuguese, making the IPT available in five languages: French, Spanish, Traditional Chinese, Portuguese and of course English.
  • An IPT can be configured to back up each DwC-Archive version published (See "Archival Mode" in User Manual)
  • Each resource version now has a resolvable URL (See "Versioned Page" section in User Manual)
Filterable, pageable, and sortable resource overview table in v2.0.5
  • The order of columns in published DwC-Archives is always the same between versions
  • Style (CSS) customizations are easier than ever - check out this new guide entitled "How to Style Your IPT" for more information
  • Hundreds if not thousands of resources can be handled, now that the resource overview tables are filterable, pageable, and sortable (See "Public Resource Table" section in User Manual) 

The most important bug fixes are:
  • Garbled encoding on registration updates has been fixed
  • The problems uploading DwC-Archives in .gzip format has been fixed
  • The problem uploading a resource logo has been fixed
The new look in v2.0.5
The changes mentioned above represent just a fraction of the work that has gone into this version. Since version 2.0.4 was released 7 months ago, a total of 45 issues have been addressed. These are detailed in the issue tracking system.

It is great to see so much feedback from the community in the form of issues especially as the IPT becomes more stable and comprehensive over time. After all, the IPT is a community-driven project and anyone can contribute patches, translations, or have their say simply by adding or voting on issues. 

The single largest community contribution in this version has been the translation into Portuguese done by three volunteers at the Universidade de São Paulo, Research Center on Biodiversity and Computing: Etienne Cartolano, Allan Koch Veiga, and Antonio Mauro Saraiva. With Brazil recently joining the GBIF network, we hope the Portuguese interface for the IPT will help in publication of the wealth of biodiversity data available from Brazilian institutions.   

We’d also like to give special thanks to the other volunteers below:
  • Marie-Elise Lecoq (GBIF France) - Updating French translation
  • Yu-Huang Wang (TaiBIF, Taiwan) - Updating Traditional Chinese translation
  • Dairo Escobar, and Daniel Amariles (Colombian Biodiversity Information System (SiB)) - Updating Spanish translation
  • Carlos Cubillos (Colombian Biodiversity Information System (SiB)) - Contributing style improvements
  • Sijmen Cozijnsen (independent contractor working for NLBIF, Netherlands) - Contributing style improvements
On behalf of the GBIF development team, I hope you enjoy using the latest version of the IPT. 

Tuesday, 14 May 2013

Migrating our hadoop cluster from CDH3 to CDH4

We've written a number of times on the initial setup, eventual upgrade and continued tuning of our hadoop cluster. Our latest project has been upgrading from CDH3u3 to CDH4.2.1. Upgrades are almost always disruptive, but we decided it was worth the hassle for a number of reasons:

  • general performance improvements in the entire Hadoop/HBase stack
  • continued support from the community/user list (a non-trivial concern - anybody asking questions on the user groups and mailing list about problems with older clusters are invariably asked to update before people are interested in tackling the problem)
  • multi-threaded compactions (the need for which we concluded in this post)
  • table-based region balancing (rather than just cluster-wide)
We had been managing our cluster primarily using Puppet, with all the knowledge of how the bits worked together firmly within our dev team. In an effort to make everyone's lives easier, reduce our bus factor, and get the server management back into the hands of our ops team, we've moved to CDH Manager to control our CDH installation. That's been going pretty well so far, but, we're getting ahead of ourselves...

The Process

We have 6 slave nodes that have a lot of disk capacity since we spec'd with a goal of lots of spindles which meant we got lots of space "for free". Rather than upgrading in place, we decided to start fresh with new master & zookeeper nodes, and we calculated that we'd have enough space to pull half the slaves into the new cluster without losing any data. We cleaned up all the tmp files and anything we deemed not worth saving from HBase and hdfs, and started the migration:

Reduce the replication factor

We reduced the replication factor to 2 on the 6 slave nodes to reduce the disk use:

hadoop fs -setrep -R 2 /

Decommission the 3 nodes to move

"Decommissioning" is the civilized and safe way to remove nodes from a cluster where there's risk that they contain the only copies of some data in the cluster (they'll block writes but accept reads until all blocks have finished replicating out). To do it add the names of the target machines to an "excludes" file (one per line) that your hdfs config needs to reference, and then refresh hdfs.

The block in hdfs-site.xml:


then run:

bin/hadoop dfsadmin -refreshNodes

and wait for the "under replicated blocks" count on the hdfs admin page to drop to 0 and the decommissioning nodes to move into state "Decommissioned".

Don't forget HBase

The hdfs datanodes are tidied up now but don't forget to cleanly shutdown the HBase regionservers - run:


from within the HBase directory on the host you're shutting down (specifying the real name for HOSTNAME). It will shed its regions and shutdown when tidied up (more details here).

Now you can shutdown the tasktracker and datanode, and then the machine is ready to be wiped.

Build the new cluster

We wiped the 3 decommissioned slave nodes and installed the latest version of CentOS (our linux of choice, version 6.4 at time of writing). We also pulled 3 much lesser machines from our other cluster after decommissioning them in the same way, and installed CentOS 6.4 there, too. The 3 lesser machines would form our zookeeper ensemble and master nodes in the new cluster.

Enter CDH Manager

The folks at Cloudera have made a free version of their CDH Manager app available, and it makes managing a cluster much, much easier. After setting up the 6 machines that would form the basis of our new cluster with just the barebones OS, we were ready to start wielding the manager. We made a small VM to hold the manager app and installed it there. The manager instructions are pretty good, so I won't recreate them here. We had trouble with the key-based install so had to resort to setting identical passwords for root and allowing root ssh access for the duration of the install, but other than that it all went pretty smoothly. We installed in the following configuration (the master machines are the lesser ones described above, and the slaves the more powerful machines).

Machine and Role assignments
Machine Roles
master1 HDFS Primary NameNode, Zookeeper Member, HBase Master (secondary)
master2 HDFS Secondary NameNode, Zookeeper Member, HBase Master (primary)
master3 Hadoop JobTracker, Zookeeper Member, HBase Master (secondary)
slave1 HDFS DataNode, Hadoop TaskTracker, HBase Regionserver
slave2 HDFS DataNode, Hadoop TaskTracker, HBase Regionserver
slave3 HDFS DataNode, Hadoop TaskTracker, HBase Regionserver

Copy the data

Now we had two running clusters - our old CDH3u3 cluster (with half its machines removed) and the new, empty CDH 4.2.1 cluster. The trick was how to get data from the old cluster into the new, with our primary concern being the data in HBase. The builtin facility for this sort of thing is called CopyTable, and sounds great, except that it doesn't work across major versions of HBase, so that was out. Next we looked at copying the HFiles directly from the old cluster to the new using the HDFS builtin command distcp. Because we could handle shutting down HBase on the old cluster for the duration of the copy this, in theory, should work - newer versions of HBase can read the older versions' HFiles and then write the new versions during compactions (and by shutting down we don't run the risk of missing updates from caches that haven't flushed, etc). And in spite of lots of warnings around the net that it wouldn't work, we tried it anyway. And it didn't work :) We managed to get the -ROOT- table up but it couldn't find .META. and that's where our patience ended. The next, and thankfully successful, attempt was using HBase export, distcp, and HBase import.

On the old cluster we ran:

bin/hadoop jar hbase-0.90.4-cdh3u3.jar export table_name /exports/table_name

for each of our tables, which produced a bunch of sequence files in the old cluster's HDFS. Those we copied over to the new cluster using HDFS's distcp command:

bin/hadoop distcp hftp://old-cluster-namenode:50070/exports/table_name hdfs://master1:8020/imports/table_name

which takes advantage of the builtin http-like interface (hftp) that HDFS provides, which makes the copy process version agnostic.

Finally on the new cluster we can import the copied sequence files into the new HBase:

bin/hadoop jar hbase-0.94.2-cdh4.2.1-security.jar import table_name /imports/table_name

Make sure the table exists before you import, and because the import is a mapreduce job that does Puts, it would also be wise to presplit any large tables at creation time so that you don't crush your new cluster with lots of hot regions and splitting. Also one known issue in this version of HBase is a performance regression from version 0.92 to 0.94 (detailed in HBASE-7868), which you can workaround by adding the following to your table definition:


e.g. create 'test_table', {NAME=>'cf', COMPRESSION=>'SNAPPY', VERSIONS=>1, DATA_BLOCK_ENCODING => 'FAST_DIFF'}

As per that linked issue, you should also enable short-circuit reads from the CDH Manager interface.

And to complete the copying process, run major compactions on all your tables to ensure the best data locality you can for your regionservers.

All systems go

After running checks on the copied data, and updating our software to talk to CDH4, we were happy that our new cluster was behaving as expected. To get back to our normal performance levels we then shutdown the remaining machines in the CDH3u3 cluster, wiped and installed the latest OS, and then told CDH Manager to install on them. A few minutes later we had all our M/R slots back, as well as our regionservers. We ran the HBase balancer to evenly spread out the regions, ran another major compaction on our tables to force data-locality, and we were back in business!