This HBase tutorial covers what HBase is, its main components, characteristics such as consistency, sharding, high availability, and the client API, how to store big data with HBase, and the prerequisites for setting up an HBase cluster. It also covers ZooKeeper and its role, to round out your HBase knowledge.
2. Introduction to HBase Tutorial – A NoSQL DataBase
HBase is a non-relational, column-oriented, distributed database that runs on top of HDFS. It is an open-source NoSQL database in which data is stored in rows and columns. A cell is the intersection of a row and a column.
Cells are versioned: to track changes, HBase can keep several versions of a cell's contents and retrieve any retained version. Versioning is one of the main differences between HBase tables and RDBMS tables.
Each cell value carries a “version” attribute, which is nothing more than a timestamp identifying that version of the cell. Each value is an uninterpreted array of bytes.
An HBase table can therefore be viewed as a sparse, distributed, persistent, multidimensional sorted map that is highly scalable. The map is indexed by a row key, a column key, and a timestamp.
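The map view described above can be illustrated with a small in-memory sketch. This is not the HBase API: the class `VersionedTable`, its methods, and the column name `info:city` are hypothetical names chosen for illustration, and the model ignores distribution and persistence entirely.

```python
from collections import defaultdict


class VersionedTable:
    """Toy model of a table as a map: (row key, column key, timestamp) -> bytes."""

    def __init__(self):
        # row key -> column key -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        # Values are opaque bytes; the table never interprets them.
        if not isinstance(value, bytes):
            raise TypeError("cell values must be bytes")
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest version at or before `timestamp` (latest if None)."""
        versions = self._rows.get(row, {}).get(column, {})
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]

    def row_keys(self):
        # Rows are exposed in sorted key order, mirroring the sorted-map view.
        return sorted(self._rows)


table = VersionedTable()
table.put(b"user2", b"info:city", 1, b"Pune")
table.put(b"user1", b"info:city", 1, b"Delhi")
table.put(b"user1", b"info:city", 2, b"Mumbai")  # newer version of the same cell

latest = table.get(b"user1", b"info:city")                # newest version
older = table.get(b"user1", b"info:city", timestamp=1)    # version as of timestamp 1
```

Writing to the same row and column with a new timestamp does not overwrite the old value; it adds another version, which is the behaviour that separates this model from a typical RDBMS cell.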
3. Need for HBase
As per Gartner analyst Merv Adrian, “Anyone who wants to keep data within an HDFS environment and wants to do anything other than brute-force reading of the entire file system [with MapReduce] needs to look at HBase. For random access, you need to have HBase.”
HBase allows fast random reads and writes, which HDFS and MapReduce are not designed to handle on their own. Relational databases, in turn, struggle with the variety of data that is growing exponentially. For these reasons HBase is widely used in industry.
4. Components of HBase architecture
The main components of HBase are as described below:
- The HBase HMaster is responsible for negotiating load balancing across all RegionServers and for maintaining the state of the cluster. It is not part of the actual data storage or retrieval path.
- RegionServers are deployed on each machine; they host the data and process I/O requests.
- ZooKeeper is an open-source server that enables highly reliable distributed coordination. It is a centralized service for maintaining configuration information, naming, distributed synchronization, and group services. HBase is not operational without ZooKeeper.
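The division of labour between the master and the RegionServers can be sketched as follows. This is a simplified illustration, not HBase's real assignment logic: the classes `RegionServer` and `Master`, the helper `locate`, the round-robin policy, and the region start keys are all assumptions, and ZooKeeper's coordination role is left out.

```python
import bisect


class RegionServer:
    """Hosts regions and serves reads and writes for them."""

    def __init__(self, name):
        self.name = name
        self.regions = {}  # region start key -> {row key: value}


class Master:
    """Decides which server hosts which region; it never touches row data."""

    def __init__(self, servers):
        self.servers = servers
        self.assignment = {}  # region start key -> RegionServer

    def assign(self, start_keys):
        # Assumed policy: spread regions round-robin across the servers.
        for i, start in enumerate(sorted(start_keys)):
            server = self.servers[i % len(self.servers)]
            server.regions[start] = {}
            self.assignment[start] = server


def locate(assignment, row):
    """Find the region whose key range contains `row`, and its server."""
    starts = sorted(assignment)
    index = bisect.bisect_right(starts, row) - 1
    start = starts[index]
    return start, assignment[start]


rs1, rs2 = RegionServer("rs1"), RegionServer("rs2")
master = Master([rs1, rs2])
master.assign(["", "h", "p"])  # three regions: ["", "h"), ["h", "p"), ["p", ...)

# The client talks to the RegionServer directly; the master is not on this path.
start, server = locate(master.assignment, "kiwi")
server.regions[start]["kiwi"] = b"fruit"
```

The point of the sketch is that the master only maintains the assignment table, while every read and write lands on a RegionServer.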
5. Characteristics of HBase
HBase has many features that make it widely used in industry. Goibibo uses HBase for customer profiling, while Facebook Messenger uses an HBase-based architecture. Many other companies, such as Netflix, LinkedIn, Sears, Yahoo!, Flurry, Adobe, and Explorys, use HBase in production. The important features of HBase are described below:
a. Consistency
HBase provides strongly consistent reads and writes. It is therefore a good fit for high-speed workloads, as long as you do not need RDBMS features such as full transaction support or typed columns.
b. Sharding
Because the data is distributed by the underlying file system, HBase offers transparent, automatic splitting and redistribution of its content.
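Automatic splitting can be illustrated with a toy table that divides a region in two once it grows past a threshold. This is a sketch under simplifying assumptions: the class `ShardedTable` is hypothetical, the threshold is a row count rather than a size on disk, and the split point is simply the middle row key.

```python
import bisect


class ShardedTable:
    """Toy table whose regions split automatically when they grow too large."""

    def __init__(self, max_rows_per_region=4):
        self.max_rows = max_rows_per_region
        self.start_keys = [""]  # sorted region start keys
        self.regions = [{}]     # regions[i] holds rows with key >= start_keys[i]

    def _region_index(self, row):
        # The owning region is the last one whose start key is <= row.
        return bisect.bisect_right(self.start_keys, row) - 1

    def put(self, row, value):
        i = self._region_index(row)
        self.regions[i][row] = value
        if len(self.regions[i]) > self.max_rows:
            self._split(i)

    def _split(self, i):
        # Split at the middle key: lower half stays, upper half becomes a new region.
        keys = sorted(self.regions[i])
        mid = keys[len(keys) // 2]
        lower = {k: v for k, v in self.regions[i].items() if k < mid}
        upper = {k: v for k, v in self.regions[i].items() if k >= mid}
        self.regions[i] = lower
        self.start_keys.insert(i + 1, mid)
        self.regions.insert(i + 1, upper)

    def get(self, row):
        return self.regions[self._region_index(row)].get(row)


sharded = ShardedTable(max_rows_per_region=4)
for n in range(10):
    sharded.put(f"row{n:02d}", n)
```

The caller only ever uses `put` and `get`; the splits happen behind the scenes, which is what "transparent" means here.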
c. High Availability
HBase supports LAN and WAN failover and recovery. At the core, there is a master server, which is responsible for monitoring the region servers and all metadata for the cluster.
d. Client API
HBase offers programmatic access through Java APIs.
6. Storing Big Data with HBase
HBase is a NoSQL database, so it does not natively understand structured queries such as SQL. It is well suited to sparse data sets. It handles sequential or batch writes, updates, and deletes efficiently. It keeps recent writes and frequently read data in memory, which keeps latency low.
HBase is useful when large amounts of information need to be stored, updated, and processed often at high speed. Examples are lookups and large scans of the data.
It can store reference data (demographic data, IP address geolocation lookup tables, and product dimensional data) and make it available for Hadoop tasks.
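The two access patterns mentioned in this section, lookups and scans, both rely on rows being kept in sorted key order. A minimal sketch, assuming a hypothetical `SortedStore` class and made-up row keys (this is not the HBase client API):

```python
import bisect


class SortedStore:
    """Toy key-value store that keeps row keys in sorted order."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self._keys = sorted(self._rows)

    def get(self, key):
        # Point lookup: fetch a single row by its exact key.
        return self._rows.get(key)

    def scan(self, start, stop):
        """Yield (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


store = SortedStore({
    "ip:10.0.0.1": "Delhi",
    "ip:10.0.0.2": "Pune",
    "user:alice": "gold",
    "user:bob": "silver",
})

city = store.get("ip:10.0.0.2")
# ";" is the character right after ":", so this range covers every "user:" key.
users = list(store.scan("user:", "user;"))
```

Because related rows share a key prefix, a scan reads them as one contiguous range instead of touching the whole data set, which is why row key design matters so much in this kind of store.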
7. HBase Cluster Prerequisites
To run HBase, an Amazon EMR cluster must meet the following requirements:
- HBase runs only on persistent clusters. The Amazon EMR command line interface (CLI) and the Amazon EMR console create such clusters automatically.
- An Amazon EC2 key pair must be set when the HBase cluster is created. The HBase shell requires the Secure Shell (SSH) network protocol to connect to the master node.
- HBase clusters are supported on the beta version AMI and Hadoop 0.20.205 or later.
- The CLI and Amazon EMR console automatically set the correct AMI on HBase clusters.
- HBase is only supported on the following instance types: m1.large, m1.xlarge, c1.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, cc2.8xlarge, hi1.4xlarge, or hs1.8xlarge.