Sqoop Merge – Tool to Combine Datasets in Sqoop
When we need to combine two datasets in Sqoop, we generally use the merge tool. In this article, we will learn the whole concept of Sqoop Merge. We will also look at the syntax of Sqoop merge and its arguments to understand the topic in depth.
So, let’s start the Sqoop Merge tutorial.
2. What is Sqoop Merge?
The Sqoop merge tool combines two datasets where entries in one dataset should overwrite entries of an older dataset. To understand it well, let’s see an example.
An incremental import run in last-modified mode generates multiple datasets in HDFS, with successively newer data appearing in each dataset. The merge tool “flattens” two such datasets into one, taking the newest available record for each primary key.
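The "newest record per key" rule can be illustrated locally without Hadoop. The following is only a sketch of the semantics, not how Sqoop implements the merge: it uses awk on two small, made-up CSV files, and the column layout (id, name, last_modified) is an assumption for illustration.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)

# Older dataset (hypothetical columns: id,name,last_modified).
cat > "$workdir/older.csv" <<'EOF'
1,alice,2020-01-01
2,bob,2020-01-01
3,carol,2020-01-01
EOF

# Newer dataset: id 2 was updated, id 4 is new.
cat > "$workdir/newer.csv" <<'EOF'
2,bobby,2020-02-01
4,dave,2020-02-01
EOF

# Column 1 plays the role of the merge key.
# Pass 1 (newer file): print every row and remember its key.
# Pass 2 (older file): print a row only if its key was not seen in newer.
merged=$(awk -F, '
  NR == FNR { print; seen[$1] = 1; next }
  !($1 in seen) { print }
' "$workdir/newer.csv" "$workdir/older.csv" | sort -t, -k1,1n)

echo "$merged"
rm -rf "$workdir"
```

In the result, the row for id 2 comes from the newer file, ids 1 and 3 survive from the older file, and id 4 is added.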
3. Sqoop Merge Syntax & Arguments
$ sqoop merge (generic-args) (merge-args)
$ sqoop-merge (generic-args) (merge-args)
The Hadoop generic arguments must precede any merge arguments, but the merge arguments can be entered in any order with respect to one another.
Table. Merge options:
| Argument | Description |
|---|---|
| --class-name <class> | Specify the name of the record-specific class to use during the merge job. |
| --jar-file <file> | Specify the name of the jar to load the record class from. |
| --merge-key <col> | Specify the name of a column to use as the merge key. |
| --new-data <path> | Specify the path of the newer dataset. |
| --onto <path> | Specify the path of the older dataset. |
| --target-dir <path> | Specify the target path for the output of the merge job. |
The merge tool runs a MapReduce job that takes two directories as input: a newer dataset and an older one. We specify them with --new-data and --onto respectively. The output of the MapReduce job is placed in the HDFS directory specified by --target-dir.
When merging the datasets, the tool assumes that each record has a unique primary key value. The --merge-key argument specifies the column for the primary key. Note that data loss may occur if multiple rows in the same dataset have the same primary key.
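Because duplicate keys within one dataset can cause data loss, it may be worth checking for them before merging. The following is a minimal sketch for a text-based dataset, assuming comma-delimited rows with the merge key in the first column; the file name and sample rows are made up.

```shell
workdir=$(mktemp -d)

# Hypothetical part file from a text-based import (id,name,last_modified).
cat > "$workdir/part-00000" <<'EOF'
1,alice,2020-01-01
2,bob,2020-01-01
2,robert,2020-01-15
3,carol,2020-01-01
EOF

# Extract the key column, sort it, and keep only values that repeat.
dup_keys=$(cut -d, -f1 "$workdir/part-00000" | sort | uniq -d)

if [ -n "$dup_keys" ]; then
  echo "duplicate merge keys found: $dup_keys"
else
  echo "merge key is unique"
fi

rm -rf "$workdir"
```

For data that lives in HDFS, the same pipeline could be fed from `hdfs dfs -cat` on the part files instead of a local file.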
To parse the dataset and extract the key column, we must use the auto-generated class from a previous import. We specify the class name and the jar file with --class-name and --jar-file. If the class is not available, we can recreate it using the codegen tool.
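Recreating the class with the codegen tool might look like the sketch below. The JDBC URL, table name, and output directory are hypothetical placeholders, and the script only prints the command unless RUN_SQOOP=1 is set and sqoop is found on the PATH.

```shell
# Hypothetical connection details; replace with your own.
CONNECT="jdbc:mysql://db.example.com/corp"
TABLE="employees"

# Assemble the codegen command as a string so it can be reviewed first.
# --class-name sets the generated class name; --bindir is where the
# compiled class and jar are written.
CODEGEN_CMD="sqoop codegen --connect $CONNECT --table $TABLE --class-name Foo --bindir ./codegen-out"
echo "$CODEGEN_CMD"

# Execute only on explicit request and only if sqoop is installed.
if [ "${RUN_SQOOP:-0}" = "1" ] && command -v sqoop >/dev/null 2>&1; then
  $CODEGEN_CMD
fi
```

The generated jar and the class name are then passed to the merge job through --jar-file and --class-name.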
To be more specific, this tool typically runs after an incremental import in date-last-modified mode (sqoop import --incremental lastmodified …).
Now suppose that two incremental imports were performed, with the older data in an HDFS directory named older and the newer data in an HDFS directory named newer. They could be merged as follows:
$ sqoop merge --new-data newer --onto older --target-dir merged \
    --jar-file datatypes.jar --class-name Foo --merge-key id
This runs a MapReduce job that uses the value in the id column of each row to join rows. Rows in the newer dataset are used in preference to rows in the older dataset.
We can use the merge tool with SequenceFile-, Avro-, and text-based incremental imports. However, the file types of the newer and older datasets must be the same.
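One rough way to confirm that both datasets share a file type is to look at the leading bytes of a part file from each. The sketch below is a heuristic run on local sample files, not part of Sqoop: it relies on SequenceFiles starting with the bytes "SEQ" and Avro container files starting with "Obj", and treats everything else as text. The function name and file names are made up for illustration.

```shell
# Classify a local file by its first three bytes (heuristic).
detect_format() {
  magic=$(head -c 3 "$1")
  case "$magic" in
    SEQ) echo "sequencefile" ;;
    Obj) echo "avro" ;;
    *)   echo "text" ;;
  esac
}

workdir=$(mktemp -d)

# Tiny stand-ins that carry only the header bytes of each format.
printf 'SEQ\006' > "$workdir/older-part"
printf 'Obj\001' > "$workdir/newer-part"
printf '1,alice,2020-01-01\n' > "$workdir/text-part"

older_fmt=$(detect_format "$workdir/older-part")
newer_fmt=$(detect_format "$workdir/newer-part")
text_fmt=$(detect_format "$workdir/text-part")

if [ "$older_fmt" = "$newer_fmt" ]; then
  echo "file types match: $older_fmt"
else
  echo "file type mismatch: older=$older_fmt newer=$newer_fmt"
fi

rm -rf "$workdir"
```

Here the two stand-in files deliberately differ, so the script reports a mismatch, which is the case to resolve before running the merge.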
So, this was all about the Sqoop Merge tutorial. We hope you liked our explanation.
In this article, we covered the purpose of Sqoop Merge, its syntax, and its arguments. If you have any doubts, feel free to ask in the comment section.