HBase Performance Tuning | Ways For HBase Optimization
In this article on HBase performance tuning, we discuss some of the best ways to optimize an HBase environment. We cover garbage collection tuning, compression in HBase, and HBase configuration, and then run a load test against the cluster.
HBase is a distributed database and a key part of the Hadoop architecture, so we want to optimize its performance as much as possible. We also look at HBase scan performance tuning and HBase read optimizations.
So, let’s explore HBase Performance Tuning.
What is Garbage Collection Tuning?
Garbage collection parameters are among the lower-level settings we need to adjust for the region server processes. The master is not a concern here, since data does not pass through it and it does not handle heavy loads.
Therefore, we need to add these garbage collection parameters only to the HBase region servers.
Memstore-Local Allocation Buffer
To mitigate heap fragmentation caused by heavy churn on the memstore instances of an HBase region server, version 0.90 of HBase introduced an advanced mechanism, the Memstore-Local Allocation Buffers (MSLAB).
MSLABs are fixed-size buffers that hold KeyValue instances of varying sizes. When a newly added KeyValue does not fit into the remaining space of the current buffer, that buffer is considered full and a new buffer of the same fixed size is created.
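The allocation rule described above can be sketched with a toy model. This is not HBase code: the class names ToyMslab and Slab, the 100-byte buffer size, and the KeyValue sizes are all hypothetical and chosen only to show when a buffer is considered full.

```python
from dataclasses import dataclass, field


@dataclass
class Slab:
    """One fixed-size buffer; KeyValues are appended back to back."""
    capacity: int
    used: int = 0

    def try_allocate(self, size: int) -> bool:
        # Succeeds only if the KeyValue fits into the remaining space.
        if self.used + size > self.capacity:
            return False
        self.used += size
        return True


@dataclass
class ToyMslab:
    slab_size: int                      # hypothetical fixed buffer size in bytes
    slabs: list = field(default_factory=list)

    def allocate(self, kv_size: int) -> int:
        """Place a KeyValue and return the index of the slab that holds it."""
        if kv_size > self.slab_size:
            raise ValueError("KeyValue larger than a slab; not modelled here")
        if not self.slabs or not self.slabs[-1].try_allocate(kv_size):
            # Current slab is considered full: open a new one of the same size.
            slab = Slab(self.slab_size)
            slab.try_allocate(kv_size)
            self.slabs.append(slab)
        return len(self.slabs) - 1


mslab = ToyMslab(slab_size=100)
placements = [mslab.allocate(size) for size in (40, 50, 30, 60, 20)]
print(placements)                            # slab index for each KeyValue
print([slab.used for slab in mslab.slabs])   # bytes used per slab
```

Note that the leftover space in a full buffer is simply not used in this model, which is the price paid for keeping memory in uniform, reusable chunks.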
HBase Compression
Another feature of HBase is its support for a number of compression algorithms, which can be enabled at the column family level.
In most use cases, compression yields better performance, because the CPU overhead of compressing and decompressing the data is lower than the cost of reading more data from disk.
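To get a feel for the trade-off, here is a small sketch using Python's standard zlib module, which implements DEFLATE, the algorithm behind gzip. It does not touch HBase at all; the sample data is made up and only stands in for the kind of repetitive values that similar rows often contain.

```python
import zlib

# Hypothetical cell values: repetitive data, as is common for similar rows.
raw = b"".join(b"user%05d|status=active|region=eu-west;" % i for i in range(2000))

compressed = zlib.compress(raw, level=6)   # CPU work spent once on write
restored = zlib.decompress(compressed)     # CPU work spent on every read

ratio = len(compressed) / len(raw)
print(f"raw={len(raw)} bytes, compressed={len(compressed)} bytes, ratio={ratio:.2f}")
```

The smaller the ratio, the fewer bytes have to be read from disk, at the cost of the decompression step. How this plays out for a real table depends on the data and the chosen codec.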
i. Available HBase Codecs
HBase comes with a fixed list of supported compression algorithms to select from. They differ in compression ratio as well as in CPU and installation requirements.
ii. Verifying Installation
As soon as we have installed a supported HBase compression algorithm, it is highly recommended to check that the installation was successful. HBase offers several mechanisms for doing so.
- HBase Compression test tool
HBase includes a tool to test whether compression is set up properly. To use it, run the following command:
./bin/hbase org.apache.hadoop.hbase.util.CompressionTest
When run without arguments, it prints information on how to run the tool:
$ ./bin/hbase org.apache.hadoop.hbase.util.CompressionTest
Usage: CompressionTest <path> none|gz|lzo|snappy
For example:
hbase class org.apache.hadoop.hbase.util.CompressionTest file:///tmp/testfile gz
iii. Enabling Compression
Enabling compression requires the JNI and native compression libraries to be installed. Once they are in place, we can create a table with compression enabled on a column family:
hbase(main):001:0> create 'testtable', { NAME => 'colfam1', COMPRESSION => 'GZ' }
0 row(s) in 1.1920 seconds
hbase(main):012:0> describe 'testtable'
DESCRIPTION                                                    ENABLED
{NAME => 'testtable', FAMILIES => [{NAME => 'colfam1',         true
BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0',
VERSIONS => '3', COMPRESSION => 'GZ', TTL => '2147483647',
BLOCKSIZE => '65536', IN_MEMORY => 'false',
BLOCKCACHE => 'true'}]}
1 row(s) in 0.0400 seconds
We use the describe HBase shell command to read back the schema of the newly created table. Here, we can see that the compression is set to GZIP. For existing tables, we use the alter command to enable, change, or disable the compression algorithm.
To disable compression for a given column family, change its compression format to NONE.
- Load Balancing
The master has a built-in feature called the balancer. By default, the balancer runs every five minutes; the interval is configured with the hbase.balancer.period property.
Once started, it attempts to even out the number of assigned regions per region server so that they are within one region of the average number per server. The call first determines a new assignment plan, which describes which regions should be moved where.
It then starts moving the regions by calling the unassign() method of the administrative API iteratively.
The balancer also has an upper limit on how long it is allowed to run. This is configured with the hbase.balancer.max.balancing property and defaults to half of the balancer period value, or two and a half minutes.
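The balancing goal and the default time limit can be illustrated with a small sketch. It is a simplified model, not the actual balancer: it only tracks region counts per server, whereas the real balancer decides which specific regions to move. The function name plan_moves and the server names are hypothetical.

```python
import math


def plan_moves(load: dict) -> list:
    """Return (source, target) pairs, each standing for one region move,
    until every server is within one region of the average."""
    counts = dict(load)
    average = sum(counts.values()) / len(counts)
    low, high = math.floor(average), math.ceil(average)
    moves = []
    while True:
        busiest = max(counts, key=counts.get)
        idlest = min(counts, key=counts.get)
        if counts[busiest] <= high and counts[idlest] >= low:
            break                      # everyone is within [low, high]
        counts[busiest] -= 1
        counts[idlest] += 1
        moves.append((busiest, idlest))
    return moves


moves = plan_moves({"rs1": 10, "rs2": 4, "rs3": 4})
print(moves)

# Default time limit: half of the balancer period (five minutes by default).
balancer_period_ms = 5 * 60 * 1000
max_balancing_ms = balancer_period_ms // 2
print(max_balancing_ms / 60000, "minutes")
```

In this example all moves originate from the overloaded server, and the plan ends once each server holds six regions.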
- Merging Regions
Although it is much more common for regions to split automatically over time as we add data to the corresponding table, sometimes we may need to merge regions.
For example, after removing a large amount of data we may want to reduce the number of regions hosted by each server. HBase ships with a tool that lets us merge two adjacent regions, as long as the cluster is not online.
We can use the command-line tool as follows to get its usage details:
$ ./bin/hbase org.apache.hadoop.hbase.util.Merge
Usage: bin/hbase merge <table-name> <region-1> <region-2>
- Client API: Best Practices
There are a handful of optimizations we should consider to gain the best performance while reading or writing data from a client using the API.
- Disable auto-flush
When performing a lot of put operations, set the auto-flush feature of HTable to false by calling the setAutoFlush(false) method.
- Limit scan scope
When we use a scan to process large numbers of rows, we should be aware of which attributes we are selecting.
- Close ResultScanners
This may not improve performance directly, but it definitely helps to avoid performance problems.
- Block cache usage
With the setCacheBlocks() method, we can control whether Scan instances use the block cache in the region server.
- Optimal loading of row keys
- Turn off WAL on Puts
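To illustrate why disabling auto-flush helps with many puts, here is a toy model that only counts simulated round trips. It is not the HBase client API: the ToyTable class is hypothetical, and its buffer_limit counts puts for simplicity, which is an assumption of this sketch rather than how the real client write buffer is sized.

```python
class ToyTable:
    """Counts simulated round trips; not the HBase client API."""

    def __init__(self, auto_flush: bool = True, buffer_limit: int = 100):
        self.auto_flush = auto_flush
        self.buffer_limit = buffer_limit   # hypothetical: max buffered puts
        self.buffer = []
        self.round_trips = 0

    def put(self, row: str, value: str) -> None:
        self.buffer.append((row, value))
        if self.auto_flush or len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.round_trips += 1          # one simulated RPC carries the batch
            self.buffer.clear()


def load(table: ToyTable, n: int) -> int:
    for i in range(n):
        table.put(f"row-{i}", "v")
    table.flush()                          # send whatever is still buffered
    return table.round_trips


eager = load(ToyTable(auto_flush=True), 1000)
batched = load(ToyTable(auto_flush=False, buffer_limit=100), 1000)
print(eager, batched)
```

The final explicit flush matters in the buffered case: without it, puts still sitting in the client-side buffer would never be sent.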
HBase Configuration
HBase offers many configuration properties for fine-tuning a cluster setup:
- Decrease ZooKeeper timeout.
- Increase handlers.
- Increase heap settings.
- Enable data compression.
- Increase region size.
- Adjust block cache size.
- Adjust memstore limits.
- Increase blocking store files.
- Increase block multiplier.
- Decrease maximum logfiles.
Load Tests in HBase Performance Tuning
After installing a cluster, it is recommended to run HBase performance tests to verify that the cluster works. The tests also provide a baseline to refer to after making changes to the cluster configuration or to the schemas of our tables.
A burn-in of the cluster shows how much we can get out of it, but it does not replace a test with the load we expect from our actual use case.
i. HBase Performance Evaluation
HBase ships with its own tool for running a performance evaluation, called Performance Evaluation (PE). Running it with no command-line parameters prints its main usage details:
$ ./bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation
Usage: java org.apache.hadoop.hbase.PerformanceEvaluation \
[--miniCluster] [--nomapred] [--rows=ROWS] <command> <nclients>
Moreover, to run a single evaluation client:
$ bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 1
ii. YCSB (Yahoo! Cloud Serving Benchmark)
YCSB is a suite of tools for running comparable workloads against different storage systems. Although primarily built to compare various systems, it is also a reasonable tool for performing an HBase cluster burn-in or performance test.
- YCSB installation
Since YCSB is available only from an online repository, we need to compile a binary version ourselves. First, clone the repository:
$ git clone http://github.com/brianfrankcooper/YCSB.git
Initialized empty Git repository in /private/tmp/YCSB/.git/
…
Resolving deltas: 100% (475/475), done.
This creates a local YCSB directory in our current path. The next step is to change into the newly created directory, copy the required libraries for HBase, and compile the executable code:
$ cd YCSB/
$ cp $HBASE_HOME/hbase*.jar db/hbase/lib/
$ cp $HBASE_HOME/lib/*.jar db/hbase/lib/
$ ant
Buildfile: /private/tmp/YCSB/build.xml
...
makejar:
[jar] Building jar: /private/tmp/YCSB/build/ycsb.jar
BUILD SUCCESSFUL
Total time: 1 second
$ ant dbcompile-hbase
...
BUILD SUCCESSFUL
Total time: 1 second
It leaves us with an executable JAR file in the build directory.
Although YCSB can hardly emulate our exact workload, it can still be useful for testing a varying set of loads on the cluster. Use the supplied workloads, or create your own, to emulate cases that are bound to read operations, write operations, or both.
So, this was all about HBase Performance Tuning. Hope you like our explanation.
Conclusion
In this HBase Performance Tuning tutorial, we saw best practices for optimizing the performance of an HBase environment. We discussed garbage collection tuning, compression, load balancing, client API practices, and HBase configuration.
We also looked at load tests for HBase performance tuning. If you still have any doubts regarding performance tuning of HBase, feel free to ask.