HBase Performance Tuning | Ways For HBase Optimization


Today, in this HBase article, “HBase Performance Tuning,” we discuss some of the best ways to optimize our HBase environment. We will look at garbage collection tuning, compression in HBase, and configuration properties for HBase. Moreover, we will apply a load test for HBase Performance Tuning.

Since HBase is a key part of the Hadoop architecture and a distributed database, we definitely want to optimize HBase performance as much as possible. Also, we will look at HBase scan performance tuning and HBase read optimizations.

So, let’s explore HBase Performance Tuning.

What is Garbage Collection Tuning?

The Garbage Collection parameters are among the lower-level settings we need to adjust for the region server processes. Note that the master is not a problem here, as data does not pass through it and it does not handle any heavy loads either.

However, we need to add these Garbage Collection parameters only to the HBase Region Servers for HBase Performance Tuning.
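
For illustration only, a minimal sketch of such settings in hbase-env.sh, assuming a CMS-era JVM; the flag values are examples and depend entirely on your heap size and workload:

export HBASE_OPTS="$HBASE_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
    -XX:CMSInitiatingOccupancyFraction=70"

Here, CMS starts collecting once the old generation is 70 percent full, which helps avoid long stop-the-world pauses on a write-heavy region server.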

Memstore-Local Allocation Buffer

In order to mitigate the issue of heap fragmentation caused by too much churn on the memstore instances of an HBase Region Server, version 0.90 of HBase introduced an advanced mechanism, the Memstore-Local Allocation Buffers (MSLAB).

Basically, these MSLABs are buffers of a fixed size that hold KeyValue instances of varying sizes. Whenever a buffer cannot completely fit a newly added KeyValue, it is considered full, and a new buffer of the given fixed size is created.
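
As a sketch, MSLAB behavior is controlled through properties in hbase-site.xml; the chunk size below is the commonly cited 2 MB default and is shown only for illustration:

<property>
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hbase.hregion.memstore.mslab.chunksize</name>
  <!-- fixed buffer size in bytes; 2097152 bytes = 2 MB -->
  <value>2097152</value>
</property>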

HBase Compression 

One more feature of HBase is its support for a number of compression algorithms. Basically, HBase compression can be enabled at the column family level.


In addition, compression usually yields better performance in almost every use case, because the CPU overhead of compressing and decompressing the data is lower than the cost of reading the additional, uncompressed data from disk.

i. Available HBase Codecs

There is a fixed list of supported compression algorithms in HBase we can select from, including GZIP (gz), LZO, and Snappy. However, when it comes to compression ratio, as well as CPU and installation requirements, they have different qualities.

ii. Verifying Installation

As soon as we have installed a supported HBase compression algorithm, it is highly recommended to check that the installation was successful. There are several mechanisms in HBase to do that.

  • HBase Compression test tool

In order to test whether compression is set up properly, there is a tool available in HBase. To use it, run the following command:

./bin/hbase org.apache.hadoop.hbase.util.CompressionTest

Running it without arguments returns information on how to use the tool:

$ ./bin/hbase org.apache.hadoop.hbase.util.CompressionTest
Usage: CompressionTest <path> none|gz|lzo|snappy

For example:

$ ./bin/hbase org.apache.hadoop.hbase.util.CompressionTest file:///tmp/testfile gz

If the codec is installed and configured correctly, the tool reports SUCCESS.

iii. Enabling Compression

Enabling compression requires the installation of the JNI and native compression libraries. For example, to create a table with GZIP compression on a column family:

hbase(main):001:0> create 'testtable', { NAME => 'colfam1', COMPRESSION => 'GZ' }
0 row(s) in 1.1920 seconds
hbase(main):012:0> describe 'testtable'
DESCRIPTION ENABLED
{NAME => 'testtable', FAMILIES => [{NAME => 'colfam1', true
BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS
=> '3', COMPRESSION => 'GZ', TTL => '2147483647', BLOCKSIZE
=> '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
1 row(s) in 0.0400 seconds

In order to read back the schema of the newly created table, we use the describe HBase shell command. Here we can see that the compression is set to GZIP. For existing tables, we use the alter command to enable, change, or disable the compression algorithm.

Also, to disable compression for a given column family, change its compression format to NONE.
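
For example, a sketch in the HBase shell; note that older HBase versions require the table to be disabled before it can be altered:

hbase(main):001:0> disable 'testtable'
hbase(main):002:0> alter 'testtable', { NAME => 'colfam1', COMPRESSION => 'NONE' }
hbase(main):003:0> enable 'testtable'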

  • Load Balancing

There is one built-in feature in the Master, which we call the balancer. By default, the balancer runs every five minutes, and we configure it by the hbase.balancer.period property.

When it starts, it strives to equalize the number of assigned regions per region server, so that each server is within one region of the average number per server. The call first determines a new assignment plan, which describes which regions should be moved where. Then it starts the process of moving the regions by iteratively calling the unassign() method of the administrative API.

Also, there is an upper limit in the balancer, which decides how long it is allowed to run. It is configured by the hbase.balancer.max.balancing property and defaults to half of the balancer period value, i.e., two and a half minutes.
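
As a sketch, the balancer can also be controlled by hand from the HBase shell, which is useful when experimenting with these settings:

hbase(main):001:0> balance_switch false   # temporarily disable the automatic balancer
hbase(main):002:0> balancer               # trigger a single balancer run manually
hbase(main):003:0> balance_switch true    # re-enable the automatic balancer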

  • Merging Regions

Sometimes we may need to merge regions, since it is much more common for regions to split automatically over time as we add data to the corresponding table.

Let’s understand with an example: suppose we want to reduce the number of regions hosted by each server after we have removed a large amount of data. There is a tool in HBase that permits us to merge two adjacent regions, as long as the cluster is offline.

Therefore, below is the command-line tool; running it without arguments prints its usage details:

$ ./bin/hbase org.apache.hadoop.hbase.util.Merge
Usage: bin/hbase merge <table-name> <region-1> <region-2>

  • Client API: Best Practices

There are a handful of optimizations we should consider to gain the best performance while reading or writing data from a client using the API. 

  • Disable auto-flush

While performing a lot of put operations, set the auto-flush feature of HTable to false by using the setAutoFlush(false) method. Otherwise, each Put travels to the region server in its own round trip; with auto-flush disabled, the puts are sent in batches once the client-side write buffer fills up, and flushCommits() sends whatever is still buffered.
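
A minimal sketch using the classic (pre-1.0) HTable API that this article assumes; the table name, column family, and buffer size are hypothetical:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedPutExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "testtable");   // classic, pre-1.0 client API
    table.setAutoFlush(false);                      // buffer puts on the client side
    table.setWriteBufferSize(4 * 1024 * 1024);      // e.g., a 4 MB write buffer
    for (int i = 0; i < 100000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"),
          Bytes.toBytes("val-" + i));
      table.put(put);                               // buffered, not one RPC per call
    }
    table.flushCommits();                           // send any remaining buffered puts
    table.close();
  }
}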

  • Limit scan scope

That is, be aware of which attributes we select when we use Scan to process large numbers of rows. Requesting only the required column families or columns via addFamily() or addColumn() avoids shipping unneeded data to the client (see the combined sketch after the block cache item below).

  • Close ResultScanners

This may not help to improve performance, but it definitely helps to avoid performance problems: a ResultScanner that is never closed can hold its lease on the region server longer than necessary, so always close scanners in a finally block (again, see the sketch below).

  • Block cache usage

Furthermore, by the setCacheBlocks() method, we can set Scan instances to use the block cache in the region server. For full scans, e.g., as input to MapReduce jobs, this should usually be set to false to avoid churning the cache.
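
A minimal sketch combining the last three points, again against the classic client API with hypothetical names:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanBestPractices {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "testtable");
    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1")); // limit scan scope
    scan.setCaching(500);        // rows fetched per RPC; tune to your row size
    scan.setCacheBlocks(false);  // skip the block cache for this batch scan
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result result : scanner) {
        // process each row here
        System.out.println(Bytes.toString(result.getRow()));
      }
    } finally {
      scanner.close();           // always release the scanner lease
    }
    table.close();
  }
}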

  1. Optimal loading of row keys: when a scan only needs the row keys, combining a FirstKeyOnlyFilter with a KeyOnlyFilter minimizes the amount of data transferred to the client.
  2. Turn off WAL on Puts: disabling the write-ahead log increases put throughput, but any unflushed data is lost if a region server fails, so use it with care (see the sketch below).
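
As a sketch of the second point only, reusing the table from the buffered-put example above; setWriteToWAL() is the pre-1.0 call, and newer versions use setDurability(Durability.SKIP_WAL) instead:

Put put = new Put(Bytes.toBytes("row-1"));
put.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"), Bytes.toBytes("val-1"));
put.setWriteToWAL(false);  // skip the WAL: faster, but data loss on server failure
table.put(put);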

HBase Configuration 

In order to fine-tune our HBase cluster setup, there are many configuration properties available in HBase (a few of them are sketched as hbase-site.xml entries after this list):

  1. Decrease ZooKeeper timeout.
  2. Increase handlers.
  3. Increase heap settings.
  4. Enable data compression.
  5. Increase region size.
  6. Adjust block cache size.
  7. Adjust memstore limits.
  8. Increase blocking store files.
  9. Increase block multiplier.
  10. Decrease maximum logfiles.
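
As an illustrative sketch only, a few of these knobs map to hbase-site.xml properties like the following; the values shown are examples, not recommendations:

<property>
  <name>zookeeper.session.timeout</name>
  <value>60000</value>  <!-- 1. decreased ZooKeeper timeout, in milliseconds -->
</property>
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>30</value>  <!-- 2. increased RPC handler count -->
</property>
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>1073741824</value>  <!-- 5. increased region size: 1 GB -->
</property>
<property>
  <name>hfile.block.cache.size</name>
  <value>0.25</value>  <!-- 6. block cache as a fraction of the heap -->
</property>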

Load Tests in HBase Performance Tuning 

After installing our cluster, it is recommended to run HBase performance tests to verify its functionality. Moreover, they provide us a baseline we can refer to when making changes to the configuration of the cluster or to the schemas of our tables.

Basically, doing a burn-in of our cluster will show us how much we can gain from it, but make sure this does not replace a test with the load expected from our actual use case.

i. HBase Performance Evaluation

To execute a performance evaluation, HBase ships with its own tool, called Performance Evaluation (PE). Basically, on using it with no command-line parameters, we get its main usage details:

$./bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation
Usage: java org.apache.hadoop.hbase.PerformanceEvaluation \
[--miniCluster] [--nomapred] [--rows=ROWS] <command> <nclients>

Moreover, to run a single evaluation client:

$ bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 1
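
For example, a sketch of a random read test with ten clients; the --nomapred flag keeps the run local instead of launching a MapReduce job, and the table must have been populated first, e.g., by sequentialWrite:

$ ./bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred randomRead 10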

ii. YCSB (Yahoo! Cloud Serving Benchmark)

Basically, YCSB is a suite of tools we can use to run comparable workloads against different storage systems. While primarily built to compare various systems, it is also a reasonable tool for performing an HBase cluster burn-in, or performance test.

  • YCSB installation

We need to compile a binary version ourselves, since YCSB is available only as source in an online repository. So, at the very first, clone the repository:

$ git clone http://github.com/brianfrankcooper/YCSB.git

Initialized empty Git repository in /private/tmp/YCSB/.git/
...
Resolving deltas: 100% (475/475), done.

This creates a local YCSB directory in our current path. The next step is to change into the newly created directory, copy the required HBase libraries, and compile the executable code:

$ cd YCSB/
$ cp $HBASE_HOME/hbase*.jar db/hbase/lib/
$ cp $HBASE_HOME/lib/*.jar db/hbase/lib/
$ ant
Buildfile: /private/tmp/YCSB/build.xml
...
makejar:
[jar] Building jar: /private/tmp/YCSB/build/ycsb.jar
BUILD SUCCESSFUL
Total time: 1 second
$ ant dbcompile-hbase
...

BUILD SUCCESSFUL
Total time: 1 second

It leaves us with an executable JAR file in the build directory.

Furthermore, even though YCSB can hardly emulate the exact workload of our use case, it can still be useful to test a varying set of loads on the cluster. Use the supplied workloads, or create your own, to emulate cases that are bound to read, write, or both kinds of operations.
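
For example, a sketch of loading the prepackaged workload A into HBase with the classes from the repository cloned above; the column family and record count are hypothetical, and the target table must exist beforehand:

$ java -cp build/ycsb.jar:db/hbase/lib/* com.yahoo.ycsb.Client -load \
    -db com.yahoo.ycsb.db.HBaseClient -P workloads/workloada \
    -p columnfamily=colfam1 -p recordcount=100000 -s

Running the same command with -t instead of -load then executes the transaction phase of the workload.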

So, this was all about HBase Performance Tuning. Hope you like our explanation.

Conclusion 

Hence, in this HBase Performance Tuning tutorial, we saw the best practices for optimizing the performance of our HBase environment. Moreover, we discussed garbage collection tuning, compression, HBase scan performance tuning, and HBase read performance.

Also, we applied a load test for HBase Performance Tuning. Still, if any doubt occurs regarding the performance tuning of HBase, feel free to ask.
