Sqoop HCatalog Integration

1. Objective

In Sqoop, there is a table and storage management service for Hadoop which enables users with different data processing tools to more easily read and write data on the grid. That tool what we call is Sqoop HCatalog. So, in this article, we will learn the whole concept of Sqoop HCatalog Integration.

Sqoop HCatalog

introduction to Sqoop HCatalog Integration

2. Introduction to Sqoop HCatalog 

In Sqoop, there is a table and storage management service for Hadoop which enables users with different data processing tools to more easily read and write data on the grid. That tool what we call is Sqoop HCatalog. In the Hadoop distributed file system (HDFS) HCatalog’s table abstraction presents users with a relational view of data. Also, ensures that users need not worry about where or in what format their data is stored. Like in RCFile format, text files, or SequenceFiles.
In addition, it supports reading as well as writing files in any format for which a Hive SerDe (serializer-deserializer) is present. Also, it supports RCFile, CSV, JSON, and SequenceFile formats. However, we must provide the InputFormat and OutputFormat as well as the SerDe, to use a custom format.
Moreover, it can abstract various storage formats which used in providing the RCFile (and future file types) support to Sqoop.
Follow this link to know more about Sqoop Supported Databases & Sqoop Validation

Get the most demanding skills of IT Industry - Learn Hadoop

3. Exposing Sqoop HCatalog Tables 

Basically, with Sqoop it is patterned on an existing feature set which supports Avro and Hive tables. Moreover, it introduces seven new command line options. Also, some command line options defined for Hive can reuse.
Let’s revise Sqoop Features

a. New Command Line Options

–hcatalog-database
It specifies the database name for the Sqoop HCatalog table. Moreover, we use the default database name, if not specified. However, without –hcatalog-table offering the –hcatalog-database option is an error. 
–hcatalog-table
The argument value for this option is the HCatalog tablename.
–hcatalog-home
The home directory for the Sqoop HCatalog installation. Basically, with necessary HCatalog libraries, the directory must have a lib subdirectory and a share/hcatalog subdirectory. 
–create-hcatalog-table
When importing data it specifies Sqoop HCatalog table should be created automatically. However, HCatalog tables are assumed to exist, by default.
–hcatalog-storage-stanza
It specifies the storage stanza to be appended to the table. 
hcatalogpartition-keys andhcatalogpartition-values
To specify multiple static partition key/value pairs we use these two options.

b. Supported Sqoop Hive Options

To provide additional input to the HCatalog jobs we can use the following Sqoop options along with the –hcatalog-table option. Also, we can reuse some of the existing Hive import job options with HCatalog jobs despite creating Sqoop-HCatalog-specific options for the same purpose.
–map-column-hive
To HCatalog with a specific HCatalog type, this option maps a database column.
–hive-home
It is the Hive home location.
–hive-partition-key
Basically, we use it for static partitioning filter. However, the partitioning key should be of type STRING. Although, there can be only one static partitioning key.
–hive-partition-value
Here, all the value are associated with the partition.

c. Direct Mode support

Sqoop HCatalog integration has been enhanced to support direct mode connectors. Basically, those are high-performance connectors specific to a database. Moreover, to take advantage of this feature it is must enhance Netezza direct mode connector.
Note: Only Netezza direct mode connector is currently ready to work with HCatalog.
Follow this link to know more about Sqoop Connectors & Drivers 

d. Unsupported Sqoop Options

i. Unsupported Sqoop Hive Import Options

HCatalog jobs do not support following Sqoop Hive import options. Such as:
–hive-import
–hive-overwrite

ii. Unsupported Sqoop Export and Import Options

HCatalog job does not support following Sqoop export and import options. Such as:
–export-dir
–target-dir
–warehouse-dir
–append
–as-sequencefile
–as-avrodatafile
–as-parquetfile

e. Ignored Sqoop Options

Basically, HCatalog jobs ignore the following options. Such as:

  1. Basically, it ignores all input delimiter options.
  2. Unless we use –hive-drop-import-delims or –hive-delims-replacement, it generally ignores output delimiters.

4. Automatic Table Creation in Sqoop

While importing into Hadoop, Sqoop manages and create the table metadata. Also, it offers this feature with the option –create-hcatalog-table. Moreover, it offers storage agnosticism to Sqoop data movement jobs. Basically, for that feature, HCatalog import jobs lets a user specify the storage format for the created table.
Although we use –create-hcatalog-table option as an indicator that a table has to be created as part of the HCatalog import job. Although, the table creation will fail and the job will be aborted if the option –create-hcatalog-table is specified and the table exists.
However, if the option –hive-partition-key is specified, hence the value of this option is used as the partitioning key for the newly created table. Basically,  with this option, only one partitioning key can be specified.
Moreover, when mapped to an HCatalog table object names are mapped to the lowercase equivalents. It includes the table name as well as field names.
Follow this link to know about Sqoop Architecture

5. Delimited Text Formats – Field and Line Delimiter Characters

Basically, it supports delimited text format as one of the table storage formats. However, while the delimited text is already used also the imported data has fields which contain those delimiters, So the data may be parsed into a different number of fields and records by Hive, thereby losing data fidelity.
We can use any of these existing Sqoop import options, for this case:
–hive-delims-replacement
–hive-drop-import-delims
Note that any column of type STRING will be formatted with the Hive delimiter processing if either of these options is provided for import. Afterwards, written to the HCatalog table.

6. Sqoop HCatalog Table Requirements

Make sure, if the default table creation options are not enough then it is must that we create Sqoop-HCatalog table before using it as part of a Sqoop job. Moreover, we can use all storage formats which Hcatalog supports the creation of the HCatalog tables. Basically, this makes this feature readily adopt new storage formats that come into the Hive project. For example, ORC files.

7. Support for Partitioning

There are several table types which Sqoop HCatalog feature supports such as:

  1. Unpartitioned tables
  2. Partitioned tables with a static partitioning key specified
  3. Also, Partitioned tables with dynamic partition keys from the database result set
  4. Partitioned tables with a combination of a static key and additional dynamic partitioning keys

8. Support for Sqoop HCatalog Data Types

Basically, it supports all the Hive types which are part of Hive 0.13 version. Although it does not support complex HCatalog types.
However, for imports it only supports, BLOB/CLOB database types.

To know about Sqoop list Databases & Sqoop list Tables follow these link

9. Providing Hive and HCatalog Libraries for the Sqoop Job

Before launching the Sqoop HCatalog jobs, the environment variable HADOOP_CLASSPATH must set, to run HCatalog jobs. Like below.
HADOOP_CLASSPATH=$(hcat -classpath) export HADOOP_CLASSPATH
Also, By the Sqoop job, all the essential HCatalog dependencies will be copied to the distributed cache automatically.

Hadoop Quiz

10. Examples of Sqoop HCatalog

Now, Create an HCatalog table, like:
hcat -e “create table txn(txn_date string, cust_id string, amount float, store_id int) partitioned by (cust_id string) stored as rcfile;”
Afterwards, it invokes import and export of the “txn” HCatalog table. Like below:
Import:
$SQOOP_HOME/bin/sqoop import –connect <jdbc-url> -table <table-name> –hcatalog-table txn <other sqoop options>
Export:
$SQOOP_HOME/bin/sqoop export –connect <jdbc-url> -table <table-name> –hcatalog-table txn <other sqoop options>

11. Conclusion

So, in this Sqoop article, we have learned the whole concept of  Sqoop HCatalog Integration. Moreover, we have covered each and every aspect regarding Sqoop HCatalog. Still, if any doubt occurs regarding Sqoop HCatalog, please ask through the comment section.
See Also- Features of Sqoop & Sqoop troubleshooting Tips
For reference

Leave a Reply

Your email address will not be published. Required fields are marked *