Learn Apache HCatalog Reader Writer With Example


1. Objective

In our last HCatalog tutorial, we discussed the input-output interface. Today, we will learn about the HCatalog Reader Writer. In this HCatalog blog, we will see how HCatalog handles parallel input and output without using MapReduce, offering a way to read data from a Hadoop cluster or to write data into one.

So, let’s start with the HCatalog Reader Writer.


2. What is HCatalog Reader Writer?

Basically, HCatalog offers a data transfer API for parallel input and output without using MapReduce. Using a basic storage abstraction of tables and rows, this API offers a way to read data from a Hadoop cluster or to write data into a Hadoop cluster.

There are 3 essential classes in the HCatalog data transfer API:

  • HCatReader – Reads data from a Hadoop cluster.
  • HCatWriter – Writes data into a Hadoop cluster.
  • DataTransferFactory – Generates reader and writer instances.

Also, there are some auxiliary classes in the HCatalog data transfer API:

  • ReadEntity
  • ReaderContext
  • WriteEntity
  • WriterContext

In addition, the HCatalog data transfer API is designed to facilitate the integration of external systems with Hadoop.

Note, however, that HCatalog is not thread-safe.

3. HCatReader

Basically, reading is a two-step process. The first step occurs on the master node of an external system. The second step is done in parallel on multiple slave nodes.

Reads are done on a “ReadEntity”. Hence, before we start to read, we need to define a ReadEntity from which to read. This is done through ReadEntity.Builder, with which we can specify a database name, table name, partition, and filter string.

Example of HCatalog Reader:

ReadEntity.Builder builder = new ReadEntity.Builder();
ReadEntity entity = builder.withDatabase("mydatabase").withTable("mytable").build();

The above code defines a ReadEntity object (“entity”) for a table named “mytable” in a database named “mydatabase”. We can use it to read all the rows of this table. Note that this table must already exist in HCatalog before the operation starts.

After defining a ReadEntity, we obtain an instance of HCatReader using the ReadEntity and the cluster configuration:

HCatReader reader = DataTransferFactory.getHCatReader(entity, config);

Then, we obtain a ReaderContext from the reader:

ReaderContext cntxt = reader.prepareRead();

All of the above steps occur on the master node only. Afterward, the master node serializes this ReaderContext object and sends it to all the slave nodes. The slave nodes then use this reader context to read data.
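The hand-off described above can be sketched with plain Java serialization. This is only an illustration of the pattern, not HCatalog code: DemoContext is a hypothetical stand-in for the reader context, and the byte array stands in for whatever transport the external system uses to reach its slave nodes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ContextHandoffSketch {

    // Hypothetical stand-in for a reader context: it only records which splits exist.
    static class DemoContext implements Serializable {
        private static final long serialVersionUID = 1L;
        final ArrayList<String> splitNames;

        DemoContext(List<String> splitNames) {
            this.splitNames = new ArrayList<>(splitNames);
        }
    }

    // Master side: turn the context into bytes that can be shipped to the slaves.
    static byte[] serialize(DemoContext context) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(context);
        }
        return bytes.toByteArray();
    }

    // Slave side: rebuild the context from the received bytes.
    static DemoContext deserialize(byte[] payload) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(payload))) {
            return (DemoContext) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        DemoContext onMaster = new DemoContext(Arrays.asList("split-0", "split-1"));
        byte[] payload = serialize(onMaster);        // done once on the master
        DemoContext onSlave = deserialize(payload);  // done on every slave
        System.out.println(onSlave.splitNames);
    }
}
```

The point of the sketch is that the slaves never repeat the preparation step; they only receive its serialized result.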

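// Runs on a slave node; cntxt is the deserialized ReaderContext received from the master.
for (InputSplit split : cntxt.getSplits()) {
    HCatReader reader = DataTransferFactory.getHCatReader(split, cntxt.getConf());
    Iterator<HCatRecord> itr = reader.read();
    while (itr.hasNext()) {
        HCatRecord record = itr.next();
        // Process the record here.
    }
}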
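To make the "in parallel" part of the second step concrete, here is a minimal sketch that simulates the slave nodes with a thread pool. Everything in it is an assumption for illustration: DemoSplitReader is a hypothetical stand-in for a per-split reader, and each split is just a list of strings. Since HCatalog is not thread-safe, the sketch creates one reader per task and never shares it between threads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSplitReadSketch {

    // Hypothetical stand-in for a per-split reader; one instance per task, never shared.
    static class DemoSplitReader {
        private final List<String> rows;

        DemoSplitReader(List<String> rows) {
            this.rows = rows;
        }

        Iterator<String> read() {
            return rows.iterator();
        }
    }

    // Reads every split on its own worker and returns the row count per split, in split order.
    static List<Integer> countRowsPerSplit(List<List<String>> splits) throws Exception {
        ExecutorService workers = Executors.newFixedThreadPool(Math.max(1, splits.size()));
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (final List<String> split : splits) {
                pending.add(workers.submit(() -> {
                    DemoSplitReader reader = new DemoSplitReader(split); // private to this task
                    Iterator<String> rows = reader.read();
                    int count = 0;
                    while (rows.hasNext()) {
                        rows.next();
                        count++;
                    }
                    return count;
                }));
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> result : pending) {
                counts.add(result.get()); // wait for each worker to finish
            }
            return counts;
        } finally {
            workers.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("row-1", "row-2"),
                Arrays.asList("row-3"));
        System.out.println(countRowsPerSplit(splits));
    }
}
```

In a real deployment the workers would be separate slave processes rather than threads, but the rule of one reader per worker stays the same.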

4. HCatWriter

Similarly, writing is also a two-step process. The first step occurs on the master node, and the second one occurs in parallel on slave nodes.

Writes are done on a “WriteEntity”, which we can construct in a fashion similar to reads:

WriteEntity.Builder builder = new WriteEntity.Builder();
WriteEntity entity = builder.withDatabase("mydatabase").withTable("mytable").build();

The above code creates a WriteEntity object (“entity”), which we use to write into a table named “mytable” in the database named “mydatabase”.

After creating a WriteEntity, the next step is to obtain a WriterContext:

HCatWriter writer = DataTransferFactory.getHCatWriter(entity, config);
WriterContext info = writer.prepareWrite();

All of the above steps occur on the master node. Afterward, the master node serializes the WriterContext object and makes it available to all the slaves.

Further, on the slave nodes, we need to obtain an HCatWriter using the WriterContext, like:

HCatWriter writer = DataTransferFactory.getHCatWriter(context);

The writer's write method takes an iterator as its argument:

writer.write(hCatRecordItr);

The writer calls next() on this iterator in a loop and writes out all the records attached to the iterator.
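The loop that the writer runs over the iterator can be sketched as follows. This is an assumption-laden illustration rather than HCatalog's implementation: DemoWriter is a hypothetical stand-in that appends records to an in-memory list instead of writing to a table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class IteratorWriteSketch {

    // Hypothetical stand-in for a writer: records go to an in-memory list, not a table.
    static class DemoWriter {
        private final List<String> sink = new ArrayList<>();

        // Drains the iterator: every record it yields is written exactly once.
        void write(Iterator<String> records) {
            while (records.hasNext()) {
                sink.add(records.next());
            }
        }

        List<String> written() {
            return sink;
        }
    }

    public static void main(String[] args) {
        DemoWriter writer = new DemoWriter();
        writer.write(Arrays.asList("record-1", "record-2").iterator());
        System.out.println(writer.written());
    }
}
```

Taking an iterator rather than a collection means the caller can produce records lazily, so a slave does not need to hold all of its records in memory before writing.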

So, this was all about the HCatalog Reader Writer. Hope you like our explanation.

5. Conclusion

Hence, we have learned the concept of the HCatalog Reader Writer, and we walked through an HCatalog Reader example and an HCatalog Writer example. If you have any doubt regarding the HCatalog Reader Writer, feel free to ask in the comment section.
