Friday, 8 June 2012

Faster HBase - hardware matters

As I've written earlier, I've been spending some time evaluating the performance of HBase using PerformanceEvaluation. My earlier conclusions amounted to: bond your network ports and get more disks. So I'm happy to report that we got more disks, in the form of 6 new machines that together make up our new cluster:

Master (c1n4): HDFS NameNode, Hadoop JobTracker, HBase Master, and Zookeeper
Zookeeper (c1n1): Zookeeper for this cluster, master for our other cluster

Slaves (c4n1..c4n6): HDFS DataNode, Hadoop TaskTracker, HBase RegionServer (6 GB heap)

c1n*: 1x Intel Xeon X3363 @ 2.83GHz (quad), 8GB RAM, 2x500G SATA 5.4K
c4n*: Dell R720XD, 2x Intel Xeon E5-2640 @ 2.50GHz (6-core), 64GB RAM, 12x1TB SATA 7.2K

Obviously the new machines come with faster everything and lots more RAM, so first I bonded two ethernet ports and then ran the tests again to see how much we had improved:
Figure 1: Scan performance of new cluster (2x 1gig ethernet)
So, 1 million records/second? Yes, please! That's a performance increase of about 300% over the c2 cluster I tested in my previous posts. Given that we doubled the number of machines and quadrupled the number of drives, that improvement feels about right. But the decline in performance as number of mappers is increased is still a bit suspect - we'd hope that performance would go up with more workers. This is the same behaviour we saw when we were limited by network in our original tests, and ganglia again shows a similar pattern, though this time it looks like we're hitting a limit around the 2 gig ethernet bond:

Figure 2: bytes_in (MB/s) with dual gigabit ethernet
Figure 3: bytes_out (MB/s) with dual gigabit ethernet

Unfortunately our master switch is currently full, so we don't have the extra 6 ports needed to test a triple bond - but given our past experience I feel reasonably confident that it would change Figure 1 such that scan performance increases with number of mappers up to some disk I/O contention limit.

Hang-on, what about data locality?

But there's a bigger question here - why are we using so much network bandwidth in the first place? Why stress about major compactions and data locality when it doesn't seem to get used? Therein lies the rub - PerformanceEvaluation can't take advantage of data locality. Tim wrote about the tremendous importance of TableInputFormat in ensuring maximum scan performance from MapReduce, and PerformanceEvaluation doesn't do that. It assigns a block of ids to scan to different mappers at random, meaning that at best one in six mappers (in our setup) will coincidentally have local data to read, and the rest will all transfer their data across the network. This isn't a bug in PerformanceEvaluation, per se, because it was written to try and emulate the tests that Google ran in their seminal white paper on BigTable, rather than act as a true benchmark for scanning performance. But if you're new to this stuff (as I was) it sure can be confusing. When we switched to scanning our real data using TableInputFormat our throughput jumped to 2M/sec from the 1M/sec we got using PerformanceEvaluation.


While we learned a lot from using PerformanceEvaluation to test our clusters, and it helped to uncover any misconfigurations and taught us how to fine tune lots of parameters, it is not a good tool for benchmarking scan performance. As Tim wrote, scans across our real occurrence data (~370M records) using TableInputFormat are finishing in 3 minutes - for our needs that is excellent and means we're happy with our cluster upgrade. Stay tuned for news about the occurrence download service that Lars and I are writing to take advantage of all this speed :)


  1. Hi,

    Considering how your region servers have 64GB of RAM, why only 6GB of heap allocated to the RegionServers?

    In addition, if you don't mind me asking, is it possible if you detailed out the heap distribution for both master and slave?


    1. The machines that host the RegionServers are also hosting and HDFS datanode and lots of Hadoop map and reduce slots. As it happens, in my testing the size of the heap didn't really matter for read performance. In a more recent blog post I talk about our quest to improve write performance, and there heap is paramount. I talk more about heap settings there:

  2. I would be curious if that performanc holds when your node has 4 TB of data on it. Cassandra I know recommends only having 1TB of data storage per node as compactions start taking way too long moving data around. I wonder if hbase has any issues like this as all the data is sorted as well, right? where cassandra data is not sorted between nodes at all but rather is random though on disk.