HCatalog and Pig Integration | Accessing Pig With HCatalog


In our last HCatalog tutorial, we discussed the HCatalog loader and storer. Today, we will see HCatalog and Pig Integration, which is easy to set up.

Moreover, we will walk through an example of HCatalog and Pig Integration to understand it well.

So, let’s start HCatalog and Pig Integration.

Running Pig with HCatalog

Generally, Pig does not pick up the HCatalog jars on its own. To bring in the necessary jars, we can either use a flag on the pig command or set the environment variables PIG_CLASSPATH and PIG_OPTS, as follows:

a. The -useHCatalog Flag

Hence, to work with HCatalog, simply include the following flag to bring in the appropriate jars:
pig -useHCatalog
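
For instance, a quick way to check that the flag works is to open the Grunt shell and load a Hive table through HCatLoader (here mytable is just a placeholder for an existing table; note that in Hive 0.11 and later the package is org.apache.hive.hcatalog.pig instead):

pig -useHCatalog
grunt> A = LOAD 'mytable' USING org.apache.hcatalog.pig.HCatLoader();
grunt> DESCRIBE A;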

b. Jars and Configuration Files

For Pig commands that omit -useHCatalog, we need to tell Pig where to find our HCatalog jars and the Hive jars used by the HCatalog client. To do this, we define the environment variable PIG_CLASSPATH with the appropriate jars.

In addition, HCatalog can tell us the jars it needs, but for that it needs to know where Hadoop and Hive are installed. Also, we need to tell Pig the URI of our metastore, in the PIG_OPTS variable.
In the case where we have installed Hadoop and Hive via tar, we can do the following:

export HADOOP_HOME=<path_to_hadoop_install>
export HIVE_HOME=<path_to_hive_install>
export HCAT_HOME=<path_to_hcat_install>
export PIG_CLASSPATH=$HCAT_HOME/share/hcatalog/hcatalog-core*.jar:\
$HCAT_HOME/share/hcatalog/hcatalog-pig-adapter*.jar:\
$HIVE_HOME/lib/hive-metastore-*.jar:$HIVE_HOME/lib/libthrift-*.jar:\
$HIVE_HOME/lib/hive-exec-*.jar:$HIVE_HOME/lib/libfb303-*.jar:\
$HIVE_HOME/lib/jdo2-api-*-ec.jar:$HIVE_HOME/conf:$HADOOP_HOME/conf:\
$HIVE_HOME/lib/slf4j-api-*.jar
export PIG_OPTS=-Dhive.metastore.uris=thrift://<hostname>:<port>
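
With these variables set, Pig picks up PIG_CLASSPATH and PIG_OPTS automatically, so we can then launch our script as usual, for example:

<path_to_pig_install>/bin/pig <script.pig>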

Alternatively, we can pass the jars on the command line:

<path_to_pig_install>/bin/pig -Dpig.additional.jars=\
$HCAT_HOME/share/hcatalog/hcatalog-core*.jar:\
$HCAT_HOME/share/hcatalog/hcatalog-pig-adapter*.jar:\
$HIVE_HOME/lib/hive-metastore-*.jar:$HIVE_HOME/lib/libthrift-*.jar:\
$HIVE_HOME/lib/hive-exec-*.jar:$HIVE_HOME/lib/libfb303-*.jar:\
$HIVE_HOME/lib/jdo2-api-*-ec.jar:$HIVE_HOME/lib/slf4j-api-*.jar  <script.pig>
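
Note that this form still needs the metastore URI, since PIG_OPTS is not set here. A minimal sketch, assuming the same placeholders as above, passes it as one more -D property:

<path_to_pig_install>/bin/pig -Dpig.additional.jars=<jar_list_as_above> \
-Dhive.metastore.uris=thrift://<hostname>:<port>  <script.pig>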

Moreover, in each filepath above, * stands in for the version number of the jar in question. As an example, release 0.5.0 of HCatalog uses the following jars and conf files:

$HCAT_HOME/share/hcatalog/hcatalog-core-0.5.0.jar
$HCAT_HOME/share/hcatalog/hcatalog-pig-adapter-0.5.0.jar
$HIVE_HOME/lib/hive-metastore-0.10.0.jar
$HIVE_HOME/lib/libthrift-0.7.0.jar
$HIVE_HOME/lib/hive-exec-0.10.0.jar
$HIVE_HOME/lib/libfb303-0.7.0.jar
$HIVE_HOME/lib/jdo2-api-2.3-ec.jar
$HIVE_HOME/conf
$HADOOP_HOME/conf
$HIVE_HOME/lib/slf4j-api-1.6.1.jar

c. Authentication

If you are using a secure cluster and a failure results in a message like “2010-11-03 16:17:28,225 WARN hive.metastore … – Unable to connect metastore with URI thrift://…” in /tmp/<username>/hive.log, make sure you have run “kinit <username>@FOO.COM” to get a Kerberos ticket, so that you can authenticate to the HCatalog server.
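
For example, on a Kerberos-secured cluster the session might look like this (the principal, realm, and script name are placeholders):

kinit <username>@FOO.COM
klist
pig -useHCatalog <script.pig>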

Example of HCatalog and Pig Integration

For example, let’s suppose we have a file employee_details.txt in HDFS, whose content is:

employee_details.txt
001, Mehul, Chourey, 21, 9848022337, Hyderabad
002, Prerna, Tripathi, 22, 9848022338, Chennai
003, Shreyash, Tiwari, 22, 9848022339, Delhi
004, Kajal, Jain, 21, 9848022330, Goa
005, Revti, Vadjikar, 23, 9848022336, Bangalore
006, Rishabh, Jaiswal, 23, 9848022335, Pune
007, Sagar, Joshi, 24, 9848022334, Mumbai
008, Vaishnavi, Dubey, 24, 9848022333, Indore

Now, we have a sample script named sample1_script.pig in the same HDFS directory. It has some statements performing operations and transformations on the employee relation, like:

employee = LOAD 'hdfs://localhost:9000/pig_data/employee_details.txt' USING
PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray,
age:int, phone:chararray, city:chararray);
employee_order = ORDER employee BY age DESC;
STORE employee_order INTO 'employee_order_table' USING org.apache.hcatalog.pig.HCatStorer();
employee_limit = LIMIT employee_order 4;
DUMP employee_limit;

Now, see: the first statement of the script loads the data in the file named employee_details.txt as a relation named employee.

Afterward, the second statement of the script arranges the tuples of the relation in descending order, on the basis of age, and stores the result as employee_order.

Moreover, the third statement stores the processed data employee_order into a separate table named employee_order_table.

And, the fourth statement of the script stores the first four tuples of employee_order as employee_limit.

Ultimately, the fifth and last statement dumps the content of the relation employee_limit.
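
Note that HCatStorer writes into a table that already exists in the metastore; it does not create it. So, before running the script, employee_order_table must be created first, for example with the hcat command (the column types here are assumptions matching the script’s schema):

hcat -e "CREATE TABLE employee_order_table (id INT, firstname STRING, lastname STRING, age INT, phone STRING, city STRING);"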
Further, execute sample1_script.pig, like:

$./pig -useHCatalog hdfs://localhost:9000/pig_data/sample1_script.pig

Hence, for the output (part_0000, part_0001), check the output directory (/user/tmp/hive in HDFS).
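
To inspect the stored records directly, we can also read the output files from HDFS (the path below follows the directory mentioned above and may differ on your cluster):

hadoop fs -ls /user/tmp/hive
hadoop fs -cat /user/tmp/hive/part_0000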

So, this was all about HCatalog and Pig Integration. Hope it helps.

Conclusion

Hence, we have seen the concept of HCatalog and Pig Integration in detail. Also, we discussed how to run Pig with HCatalog, along with an example. Still, if you have any doubt regarding HCatalog and Pig Integration, ask in the comment section.
