HBase Tutorial – The Beginner's Guide

1. Objective

This tutorial introduces HBase, a NoSQL database. It covers what HBase is, its main components, its characteristics (consistency, sharding, high availability, and the client API), how to store Big Data with HBase, and the prerequisites for setting up an HBase cluster. It also introduces ZooKeeper and its characteristics to round out your HBase knowledge.


2. Introduction to HBase – A NoSQL Database

HBase is a non-relational, column-oriented, distributed database that runs on top of HDFS. It is an open-source NoSQL database in which data is stored in rows and columns. A cell is the intersection of a row and a column.

Versioning tracks changes to a cell and makes it possible to retrieve any stored version of its contents. This is one of the differences between HBase tables and RDBMS tables.

Each cell value carries a “version” attribute, which is a timestamp that identifies one version of the cell's contents. Each value is an uninterpreted array of bytes.

Conceptually, an HBase table is a map indexed by a row key, a column key, and a timestamp. This map is sparse, distributed, persistent, multidimensional, and sorted, and it is designed to be highly scalable.
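The map described above can be sketched in a few lines of Python. This is only a toy, in-memory illustration of the (row key, column key, timestamp) indexing and of versioned reads; it is not how HBase stores data, and the class name, method names, and the sample row and column keys are assumptions made for the example.

```python
from collections import defaultdict


class VersionedTable:
    """Toy in-memory model of a (row key, column key, timestamp) -> bytes map."""

    def __init__(self):
        # row key -> column key -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        """Store one version of a cell; older versions are kept."""
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (newest overall if None)."""
        versions = self._rows.get(row, {}).get(column, {})
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(candidates)] if candidates else None

    def scan(self):
        """Yield cells sorted by row key, then column key, newest version first."""
        for row in sorted(self._rows):
            for column in sorted(self._rows[row]):
                for ts in sorted(self._rows[row][column], reverse=True):
                    yield row, column, ts, self._rows[row][column][ts]


# Hypothetical sample data: keys and values are plain byte strings.
table = VersionedTable()
table.put(b"user1", b"info:email", 1, b"old@example.com")
table.put(b"user1", b"info:email", 2, b"new@example.com")
table.put(b"user2", b"info:name", 1, b"Ada")

latest = table.get(b"user1", b"info:email")
older = table.get(b"user1", b"info:email", timestamp=1)
cells = list(table.scan())
```

Reading without a timestamp returns the newest version, while passing a timestamp retrieves an earlier one, which is the versioning behaviour described above. The table is sparse because a row only holds entries for the columns that were actually written.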

3. Need for HBase

As per Gartner analyst Merv Adrian, “Anyone who wants to keep data within an HDFS environment and wants to do anything other than brute-force reading of the entire file system [with MapReduce] needs to look at HBase. For random access, you need to have HBase.”

HBase allows fast random reads and writes, which HDFS and MapReduce on their own are not designed to handle. Relational databases, in turn, struggle with the variety of data that is growing exponentially. For these reasons HBase is widely used in industry.

4. Components of HBase architecture

The main components of HBase are as described below:

  • HMaster is responsible for coordinating load balancing across all RegionServers and for maintaining the state of the HBase cluster. It is not part of the actual data storage or retrieval path.
  • RegionServers run on the machines of the cluster. They host the data and process I/O requests.
  • ZooKeeper is an open-source server that enables highly reliable distributed coordination. It is a centralized service for maintaining configuration information, naming, distributed synchronization, and group services. HBase is not operational without ZooKeeper.
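Because the master is not on the data path, a client has to work out which RegionServer holds a given row. The sketch below illustrates the idea with a sorted list of region start keys and a binary search. It is a simplified illustration, not the actual HBase lookup mechanism; the region boundaries, server names, and function name are hypothetical.

```python
from bisect import bisect_right

# Hypothetical layout: each region covers the keys from its start key up to
# (but not including) the next region's start key. b"" starts the first region.
REGION_START_KEYS = [b"", b"g", b"p"]
REGION_SERVERS = ["rs1.example", "rs2.example", "rs3.example"]


def locate_region_server(row_key):
    """Return the server hosting the region whose key range contains row_key."""
    # The owning region is the last one whose start key is <= row_key.
    index = bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_SERVERS[index]


server_for_apple = locate_region_server(b"apple")
server_for_grape = locate_region_server(b"grape")
server_for_zebra = locate_region_server(b"zebra")
```

Once the client knows the owning server, it can send reads and writes for that row directly to it.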

5. Characteristics of HBase

HBase has many features that make it widely used in industry. Goibibo uses HBase for customer profiling, while Facebook Messenger is built on the HBase architecture. Many other companies, such as Netflix, LinkedIn, Sears, Yahoo!, Flurry, Adobe, and Explorys, use HBase in production. The important features of HBase are described below:

a. Consistency

HBase provides strongly consistent reads and writes. This makes it suitable for high-speed workloads that do not need RDBMS features such as full transaction support or typed columns.

b. Sharding

HBase tables are divided into regions, and the underlying data is distributed by the supporting file system. HBase splits regions and redistributes their content automatically and transparently.
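Automatic splitting can be pictured with a small Python sketch. It assumes a toy table whose regions are sorted lists of row keys and which splits a region in half once it holds more than a fixed number of rows. Real HBase decides splits based on data size and configurable policies, so the row-count threshold and all names here are simplifying assumptions.

```python
from bisect import insort


class ToyTable:
    """Toy table that splits a region in two when it grows too large."""

    def __init__(self, max_rows_per_region):
        self.max_rows = max_rows_per_region
        self.regions = [[]]  # each region is a sorted list of row keys

    def put(self, row_key):
        # Pick the last region whose first key is <= row_key (default: first region).
        index = 0
        for i, region in enumerate(self.regions):
            if region and region[0] <= row_key:
                index = i
        region = self.regions[index]
        if row_key not in region:
            insort(region, row_key)
        # Split the region at its middle key once it exceeds the threshold.
        if len(region) > self.max_rows:
            middle = len(region) // 2
            self.regions[index:index + 1] = [region[:middle], region[middle:]]


toy = ToyTable(max_rows_per_region=2)
for key in [b"a", b"b", b"c", b"d", b"e"]:
    toy.put(key)
region_count = len(toy.regions)
```

The caller only ever calls `put`; the split happens behind the scenes, which is what "transparent" means in this context. In a real cluster the resulting regions would then be spread across the available RegionServers.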

c. High Availability

HBase supports LAN and WAN failover and recovery. At its core is a master server, which is responsible for monitoring the region servers and managing all metadata for the cluster.

d. Client API

HBase offers programmatic access through Java APIs.

6. Storing Big Data with HBase

HBase is a NoSQL database that does not natively understand structured queries such as SQL. It is well suited to sparse data sets and handles sequential or batch writes, updates, and deletes efficiently. It buffers writes and caches reads in memory, which keeps latency low, while the data itself is persisted to HDFS.

HBase is useful when large amounts of information need to be stored, updated, and processed often at high speed. Examples are lookups and large scans of the data.
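The two access patterns mentioned above, lookups and scans, both rely on rows being kept sorted by key. The following Python sketch shows the idea on an in-memory sorted list; it is an illustration only, and the class name, method names, and sample keys are assumptions rather than part of any HBase API.

```python
from bisect import bisect_left


class SortedStore:
    """Toy key-value store that keeps row keys sorted, like rows in a table."""

    def __init__(self, items):
        pairs = sorted(items.items())
        self._keys = [key for key, _ in pairs]
        self._values = [value for _, value in pairs]

    def get(self, row_key):
        """Point lookup: binary search for one exact row key."""
        i = bisect_left(self._keys, row_key)
        if i < len(self._keys) and self._keys[i] == row_key:
            return self._values[i]
        return None

    def scan(self, start_row, stop_row):
        """Range scan: all rows with start_row <= key < stop_row, in key order."""
        lo = bisect_left(self._keys, start_row)
        hi = bisect_left(self._keys, stop_row)
        return list(zip(self._keys[lo:hi], self._values[lo:hi]))


# Hypothetical sample rows; the "user#" prefix groups related rows together.
store = SortedStore({
    b"user#003": b"c",
    b"user#001": b"a",
    b"order#001": b"x",
    b"user#002": b"b",
})
one_row = store.get(b"user#002")
user_rows = store.scan(b"user#", b"user$")  # b"$" is the byte right after b"#"
```

Because related rows share a key prefix and sit next to each other in sorted order, a scan over a key range reads them in one contiguous pass. This is why row key design matters so much for scan-heavy workloads.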

It can store reference data (demographic data, IP address geolocation lookup tables, and product dimensional data) and make it available for Hadoop tasks.

7. HBase Cluster Prerequisites

To run HBase, an Amazon EMR cluster should meet certain requirements:

  • HBase runs only on persistent clusters. The Amazon EMR command line interface (CLI) and the Amazon EMR console create HBase clusters as persistent clusters automatically.
  • An Amazon EC2 key pair must be set when the HBase cluster is created, because the HBase shell is used by connecting to the master node over the Secure Shell (SSH) network protocol.
  • HBase clusters are supported on the beta version AMI and on Hadoop 0.20.205 or later.
  • The CLI and Amazon EMR console automatically set the correct AMI on HBase clusters.
  • HBase is only supported on the following instance types: m1.large, m1.xlarge, c1.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, cc2.8xlarge, hi1.4xlarge, or hs1.8xlarge.


