Sqoop HCatalog Integration
In Sqoop, there is a table and storage management service for Hadoop which enables users with different data processing tools to more easily read and write data on the grid. That tool what we call is Sqoop HCatalog. So, in this article, we will learn the whole concept of Sqoop HCatalog Integration.
2. Introduction to Sqoop HCatalog
In Sqoop, there is a table and storage management service for Hadoop which enables users with different data processing tools to more easily read and write data on the grid. That tool what we call is Sqoop HCatalog. In the Hadoop distributed file system (HDFS) HCatalog’s table abstraction presents users with a relational view of data. Also, ensures that users need not worry about where or in what format their data is stored. Like in RCFile format, text files, or SequenceFiles.
In addition, it supports reading as well as writing files in any format for which a Hive SerDe (serializer-deserializer) is present. Also, it supports RCFile, CSV, JSON, and SequenceFile formats. However, we must provide the InputFormat and OutputFormat as well as the SerDe, to use a custom format.
Moreover, it can abstract various storage formats which used in providing the RCFile (and future file types) support to Sqoop.
Follow this link to know more about Sqoop Supported Databases & Sqoop Validation
3. Exposing Sqoop HCatalog Tables
Basically, with Sqoop it is patterned on an existing feature set which supports Avro and Hive tables. Moreover, it introduces seven new command line options. Also, some command line options defined for Hive can reuse.
Let’s revise Sqoop Features
a. New Command Line Options
It specifies the database name for the Sqoop HCatalog table. Moreover, we use the default database name, if not specified. However, without –hcatalog-table offering the –hcatalog-database option is an error.
The argument value for this option is the HCatalog tablename.
The home directory for the Sqoop HCatalog installation. Basically, with necessary HCatalog libraries, the directory must have a lib subdirectory and a share/hcatalog subdirectory.
When importing data it specifies Sqoop HCatalog table should be created automatically. However, HCatalog tables are assumed to exist, by default.
It specifies the storage stanza to be appended to the table.
—hcatalog–partition-keys and —hcatalog–partition-values
To specify multiple static partition key/value pairs we use these two options.
b. Supported Sqoop Hive Options
To provide additional input to the HCatalog jobs we can use the following Sqoop options along with the –hcatalog-table option. Also, we can reuse some of the existing Hive import job options with HCatalog jobs despite creating Sqoop-HCatalog-specific options for the same purpose.
To HCatalog with a specific HCatalog type, this option maps a database column.
It is the Hive home location.
Basically, we use it for static partitioning filter. However, the partitioning key should be of type STRING. Although, there can be only one static partitioning key.
Here, all the value are associated with the partition.
c. Direct Mode support
Sqoop HCatalog integration has been enhanced to support direct mode connectors. Basically, those are high-performance connectors specific to a database. Moreover, to take advantage of this feature it is must enhance Netezza direct mode connector.
Note: Only Netezza direct mode connector is currently ready to work with HCatalog.
Follow this link to know more about Sqoop Connectors & Drivers
d. Unsupported Sqoop Options
i. Unsupported Sqoop Hive Import Options
HCatalog jobs do not support following Sqoop Hive import options. Such as:
ii. Unsupported Sqoop Export and Import Options
HCatalog job does not support following Sqoop export and import options. Such as:
e. Ignored Sqoop Options
Basically, HCatalog jobs ignore the following options. Such as:
- Basically, it ignores all input delimiter options.
- Unless we use –hive-drop-import-delims or –hive-delims-replacement, it generally ignores output delimiters.
4. Automatic Table Creation in Sqoop
While importing into Hadoop, Sqoop manages and create the table metadata. Also, it offers this feature with the option –create-hcatalog-table. Moreover, it offers storage agnosticism to Sqoop data movement jobs. Basically, for that feature, HCatalog import jobs lets a user specify the storage format for the created table.
Although we use –create-hcatalog-table option as an indicator that a table has to be created as part of the HCatalog import job. Although, the table creation will fail and the job will be aborted if the option –create-hcatalog-table is specified and the table exists.
However, if the option –hive-partition-key is specified, hence the value of this option is used as the partitioning key for the newly created table. Basically, with this option, only one partitioning key can be specified.
Moreover, when mapped to an HCatalog table object names are mapped to the lowercase equivalents. It includes the table name as well as field names.
Follow this link to know about Sqoop Architecture
5. Delimited Text Formats – Field and Line Delimiter Characters
Basically, it supports delimited text format as one of the table storage formats. However, while the delimited text is already used also the imported data has fields which contain those delimiters, So the data may be parsed into a different number of fields and records by Hive, thereby losing data fidelity.
We can use any of these existing Sqoop import options, for this case:
Note that any column of type STRING will be formatted with the Hive delimiter processing if either of these options is provided for import. Afterwards, written to the HCatalog table.
6. Sqoop HCatalog Table Requirements
Make sure, if the default table creation options are not enough then it is must that we create Sqoop-HCatalog table before using it as part of a Sqoop job. Moreover, we can use all storage formats which Hcatalog supports the creation of the HCatalog tables. Basically, this makes this feature readily adopt new storage formats that come into the Hive project. For example, ORC files.
7. Support for Partitioning
There are several table types which Sqoop HCatalog feature supports such as:
- Unpartitioned tables
- Partitioned tables with a static partitioning key specified
- Also, Partitioned tables with dynamic partition keys from the database result set
- Partitioned tables with a combination of a static key and additional dynamic partitioning keys
8. Support for Sqoop HCatalog Data Types
Basically, it supports all the Hive types which are part of Hive 0.13 version. Although it does not support complex HCatalog types.
However, for imports it only supports, BLOB/CLOB database types.
9. Providing Hive and HCatalog Libraries for the Sqoop Job
Before launching the Sqoop HCatalog jobs, the environment variable HADOOP_CLASSPATH must set, to run HCatalog jobs. Like below.
HADOOP_CLASSPATH=$(hcat -classpath) export HADOOP_CLASSPATH
Also, By the Sqoop job, all the essential HCatalog dependencies will be copied to the distributed cache automatically.
10. Examples of Sqoop HCatalog
Now, Create an HCatalog table, like:
hcat -e “create table txn(txn_date string, cust_id string, amount float, store_id int) partitioned by (cust_id string) stored as rcfile;”
Afterwards, it invokes import and export of the “txn” HCatalog table. Like below:
$SQOOP_HOME/bin/sqoop import –connect <jdbc-url> -table <table-name> –hcatalog-table txn <other sqoop options>
$SQOOP_HOME/bin/sqoop export –connect <jdbc-url> -table <table-name> –hcatalog-table txn <other sqoop options>
So, in this Sqoop article, we have learned the whole concept of Sqoop HCatalog Integration. Moreover, we have covered each and every aspect regarding Sqoop HCatalog. Still, if any doubt occurs regarding Sqoop HCatalog, please ask through the comment section.
See Also- Features of Sqoop & Sqoop troubleshooting Tips