This Hadoop HBase tutorial introduces HBase, a NoSQL database: what it is, its main components, and its characteristics. It also covers how to store big data with HBase and the prerequisites for setting up an HBase cluster. Finally, it describes ZooKeeper and its characteristics to further enhance your knowledge.
2. Introduction to HBase – A NoSQL Database
HBase is a non-relational, column-oriented, distributed database that runs on top of HDFS. It is an open-source NoSQL database that stores data in rows and columns; a cell is the intersection of a row and a column.
To track changes to a cell, HBase versions its contents, which makes it possible to retrieve any stored version. Versioning is a key difference between HBase tables and tables in an RDBMS (Relational Database Management System).
Each cell value includes a “version” attribute, which is nothing more than a timestamp identifying that version of the cell. Each value in the map is an uninterpreted array of bytes.
The map is indexed by a row key, a column key, and a timestamp. HBase is implemented as a scalable, sparse, persistent, multidimensional sorted map.
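The data model described above can be illustrated with a small in-memory sketch. This is a toy model in plain Python, not the HBase client API: the class name `VersionedTable`, its methods, and the sample row and column keys are all hypothetical.

```python
from collections import defaultdict


class VersionedTable:
    """Toy model of a sparse, sorted, multidimensional map (not HBase itself)."""

    def __init__(self):
        # (row key, column key) -> {timestamp: value as bytes}
        # Only cells that were actually written take up space (sparse).
        self._cells = defaultdict(dict)

    def put(self, row, column, timestamp, value):
        if not isinstance(value, bytes):
            raise TypeError("values are stored as uninterpreted bytes")
        self._cells[(row, column)][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest version at or before `timestamp` (newest overall if None)."""
        versions = self._cells.get((row, column), {})
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(candidates)] if candidates else None

    def scan(self):
        """Yield (row, column, timestamp, value) by row, then column, newest version first."""
        for row, column in sorted(self._cells):
            for ts in sorted(self._cells[(row, column)], reverse=True):
                yield row, column, ts, self._cells[(row, column)][ts]


table = VersionedTable()
table.put(b"user1", b"info:email", 1, b"old@example.com")
table.put(b"user1", b"info:email", 2, b"new@example.com")
latest = table.get(b"user1", b"info:email")               # newest version
first = table.get(b"user1", b"info:email", timestamp=1)   # older version is still retrievable
```

Writing a new value does not overwrite the old one; it adds another timestamped version of the same cell, which is what makes earlier contents retrievable.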
3. Need for HBase
As per Gartner analyst Merv Adrian, “Anyone who wants to keep data within an HDFS environment and wants to do anything other than brute-force reading of the entire file system [with MapReduce] needs to look at HBase. For random access, you need to have HBase.”
HBase allows fast random reads and writes, which HDFS and MapReduce on their own are not designed to handle. Relational databases, in turn, struggle with the growing variety of data. For these reasons, HBase is widely used in industry.
4. Components of HBase Architecture
The main components of HBase are:
- HMaster negotiates load balancing across all RegionServers and maintains the state of the cluster. It is not part of the actual data storage or retrieval path.
- RegionServers are deployed on each machine; they host data and process I/O requests.
- ZooKeeper is an open-source server that enables reliable distributed coordination. It is a centralized service that maintains configuration information and naming, and provides distributed synchronization and group services. HBase is not operational without ZooKeeper.
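The division of labour in the list above can be sketched with a toy balancer: the master only decides which server hosts which region, and takes no part in reading or writing the data. The function `assign_regions` and the server and region names are hypothetical, and the greedy rule shown here is an assumption for illustration, not HBase's actual balancing policy.

```python
def assign_regions(regions, servers):
    """Greedy toy balancer: give each region to the currently least-loaded server."""
    assignment = {server: [] for server in servers}
    for region in regions:
        # Ties go to the server listed first.
        target = min(servers, key=lambda s: len(assignment[s]))
        assignment[target].append(region)
    return assignment


servers = ["rs1", "rs2", "rs3"]
regions = ["region-%d" % i for i in range(7)]
assignment = assign_regions(regions, servers)
loads = sorted(len(hosted) for hosted in assignment.values())
```

With seven regions and three servers, no server ends up hosting more than one region above any other.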
5. Characteristics of HBase
HBase has many features that make it popular across industries. Goibibo uses it for customer profiling, while Facebook Messenger uses its architecture. Many other companies, such as Netflix, LinkedIn, Sears, Yahoo!, Flurry, Adobe, and Explorys, also use HBase in production. Its important features are:
a) Consistency
HBase provides consistent reads and writes. Hence, we can use it for high-speed requirements if we do not need RDBMS-supported features such as full transaction support or typed columns.
b) Sharding
Because the underlying file system distributes the data, HBase offers transparent, automatic splitting and redistribution of its content.
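Automatic splitting can be pictured with a toy model in which a region is a sorted run of row keys that divides in two once it grows past a threshold. The `Region` class, `insert_row`, and the `max_rows` threshold are hypothetical; real HBase decides when to split based on stored data size rather than row count.

```python
import bisect


class Region:
    """Toy region: a sorted run of row keys."""

    def __init__(self, rows=None):
        self.rows = sorted(rows or [])


def insert_row(regions, row, max_rows=4):
    """Insert `row` into the region covering it; split that region if it overflows.

    `regions` is a list of Region objects ordered by row key.
    """
    # Pick the last region whose first key is <= row (default: the first region).
    index = 0
    for i, region in enumerate(regions):
        if region.rows and region.rows[0] <= row:
            index = i
    target = regions[index]
    bisect.insort(target.rows, row)
    if len(target.rows) > max_rows:
        # Split at the midpoint into two daughter regions, in place.
        mid = len(target.rows) // 2
        regions[index:index + 1] = [Region(target.rows[:mid]), Region(target.rows[mid:])]
    return regions


regions = [Region()]
for key in ["a", "b", "c", "d", "e", "f"]:
    insert_row(regions, key)
layout = [region.rows for region in regions]
```

The caller never asks for a split; it happens as a side effect of inserting, which is the sense in which splitting is transparent.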
c) High Availability
HBase supports failover and recovery across both LAN and WAN. At its core is a master server, which monitors the region servers and handles all metadata for the cluster.
d) Client API
HBase offers programmatic access through Java APIs.
6. Storing Big Data with HBase
HBase is a NoSQL database that does not understand structured query languages such as SQL, and it is well suited to sparse data sets. It handles sequential or batch writes, updates, and deletes efficiently. It also stores data in memory, which gives it low latency.
Hence, HBase is useful when large amounts of information need to be stored, updated, and processed often at high speed, for example for lookups and large scans of the data.
It can also store reference data (demographic data, IP address geolocation lookup tables, and product dimensional data), making it available to Hadoop tasks.
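The two access patterns mentioned above, lookups and scans, both rely on rows being kept sorted by key. The following is a minimal sketch using only Python's standard library, not the HBase API; the geolocation entries, the key format, and the function names are made up for illustration.

```python
import bisect

# Hypothetical reference data: (row key, value), kept sorted by row key.
rows = sorted([
    ("ip:010.000.000.000", "US"),
    ("ip:010.000.001.000", "DE"),
    ("ip:010.000.002.000", "IN"),
    ("ip:192.168.000.000", "FR"),
])
keys = [key for key, _ in rows]


def lookup(row_key):
    """Point lookup: binary search for one exact row key."""
    i = bisect.bisect_left(keys, row_key)
    if i < len(keys) and keys[i] == row_key:
        return rows[i][1]
    return None


def scan(start_key, stop_key):
    """Range scan: all rows with start_key <= key < stop_key, in key order."""
    lo = bisect.bisect_left(keys, start_key)
    hi = bisect.bisect_left(keys, stop_key)
    return rows[lo:hi]


country = lookup("ip:010.000.001.000")
subnet = scan("ip:010.", "ip:011.")
```

Because related keys share a prefix and therefore sit next to each other, a scan over one prefix reads a contiguous slice instead of the whole table.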
7. HBase Cluster Prerequisites
To run HBase, an Amazon EMR cluster must meet the following requirements:
- HBase runs only on persistent clusters, which are created by the Amazon EMR command line interface (CLI) and the Amazon EMR console.
- An Amazon EC2 key pair must be set when the HBase cluster is created. The HBase shell needs the Secure Shell (SSH) network protocol to connect to the master node.
- HBase clusters are supported on the beta version AMI and Hadoop 0.20.205 or later.
- The CLI and the Amazon EMR console automatically set the correct AMI on HBase clusters.
- HBase supports the following instance types: m1.large, m1.xlarge, c1.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, cc2.8xlarge, hi1.4xlarge, or hs1.8xlarge.
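Some of the requirements above can be checked mechanically before a cluster is launched. The sketch below validates a plain dictionary; the function `check_cluster` and the keys `persistent`, `ec2_key_pair`, and `instance_type` are hypothetical and are not part of any Amazon EMR API.

```python
# Instance types taken from the list above.
SUPPORTED_INSTANCE_TYPES = {
    "m1.large", "m1.xlarge", "c1.xlarge", "m2.2xlarge", "m2.4xlarge",
    "cc1.4xlarge", "cc2.8xlarge", "hi1.4xlarge", "hs1.8xlarge",
}


def check_cluster(config):
    """Return a list of problems found in a hypothetical cluster description."""
    problems = []
    if not config.get("persistent"):
        problems.append("cluster must be persistent")
    if not config.get("ec2_key_pair"):
        problems.append("an EC2 key pair must be set")
    if config.get("instance_type") not in SUPPORTED_INSTANCE_TYPES:
        problems.append("unsupported instance type: %s" % config.get("instance_type"))
    return problems


ok = check_cluster({"persistent": True, "ec2_key_pair": "my-key", "instance_type": "m1.large"})
bad = check_cluster({"persistent": False, "instance_type": "t1.micro"})
```

An empty list means the description passes all three checks; otherwise each entry names one unmet requirement.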