Sqoop Job – Creating And Executing | Saved Jobs

Boost your career with Free Big Data Courses!!

In this Sqoop Tutorial, we discuss what is Sqoop Job. Sqoop Job allows us to create and work with saved jobs in sqoop. First, we will start with a brief introduction to a  Sqoop Saved Job. Afterward, we will move forward to the sqoop job, we will learn purpose and syntax of a sqoop job. Also, we will cover the method to create a job in Sqoop and Sqoop Job incremental import.

Saved Jobs in Sqoop

Basically, by issuing the same command multiple times we can perform imports and exports in sqoop repeatedly. Moreover, we can say it is a most expected scenario while using the incremental import capability.
In addition, we can define saved jobs by Sqoop.

Basically, that makes this process easier. Moreover, to execute a Sqoop command at a later time we need some information that configuration information is recorded by a sqoop saved job.

Moreover, note that the job descriptions are saved to a private repository stored in $HOME/.sqoop/, by default. Also, we can configure Sqoop to instead use a shared metastore. However, that makes saved jobs available to multiple users across a shared cluster.
Learn more about Sqoop Codegen

What is Sqoop Job?

Basically, Sqoop Job allows us to create and work with saved jobs. However, to specify a job, Saved jobs remember the parameters we use. Hence, we can re-execute them by invoking the job by its handle. However, we use this re-calling or re-executing in the incremental import.

That can import the updated rows from RDBMS table to HDFS.
In other words, to perform an incremental import if a saved job is configured, then state regarding the most recently imported rows is updated in the saved job. Basically, that allows the job to continually import only the newest rows.

Syntax of Sqoop Job

$ sqoop job (generic-args) (job-args) [– [subtool-name] (subtool-args)]
$ sqoop-job (generic-args) (job-args) [– [subtool-name] (subtool-args)]
However, the Sqoop job arguments can be entered in any order with respect to one another. But the Hadoop generic arguments must precede any job arguments.
Table 1. Job management options

ArgumentDescription
–create <job-id>Define a new saved job with the specified job-id (name). A second Sqoop
–delete <job-id>Delete a saved job.
–exec <job-id>Given a job defined with –create, run the saved job.
–show <job-id>Show the parameters for a saved job.
–listList all saved jobs

Read about Sqoop List Tables in detail

How to Create Sqoop Job?

By using the –create action we can create the saved job in sqoop. Moreover, this operation requires a — followed by a tool name and its arguments. Afterwards, the tool and its arguments will form the basis of the saved job. Consider:
$ sqoop job –create myjob — import –connect jdbc:mysql://example.com/db \
   –table mytable
Basically, it creates a Sqoop job that we call myjob. Also, we can execute it later. Make sure the job is not to run. However, now this job is available in the list of saved jobs:
$ sqoop job –list
Available jobs:
 myjob
Read about Sqoop Eval

Inspect Job in Sqoop

By using the show action we can inspect the configuration of a job:
$ sqoop job –show myjob
Job: myjob
Tool: import
Options:
—————————-
direct.import = false
codegen.input.delimiters.record = 0
hdfs.append.dir = false
db.table = mytable
…
However, we can run the job with exec also, if we are satisfied with it:
$ sqoop job –exec myjob
10/08/19 13:08:45 INFO tool.CodeGenTool: Beginning code generation
…
Basically, here the exec action allows us to override arguments of the saved job. It is possible by supplying them after a –.
Let’s take an example to understand, to require a username, if the database were changed, So, we could specify the username and password with:
For Example,
$ sqoop job –exec myjob — –username someuser -P
Enter password:
…
Table 2. Metastore connection options

ArgumentDescription
–meta-connect <jdbc-uri>Specifies the JDBC connect string used to connect to the metastore

In addition, we can configure sqoop.metastore.client.autoconnect.url with this address, in conf/sqoop-site.xml. Hence, there is no need to supply –meta-connect to use a remote metastore. Although, to move the private metastore to a location on your filesystem other than your home directory we can modify this parameter.
Let’s know about Sqoop Metastore in detail

However, there is one condition, we must explicitly supply –meta-connect if we configure sqoop.metastore.client.enable.autoconnect with the value false.
Table 3. Common options

ArgumentDescription
–helpPrint usage instructions
–verbosePrint more information while working

Sqoop Saved Jobs and Passwords

Basically, multiple users can access Sqoop metastore since it is not a secure resource. Hence, Sqoop does not store passwords in the metastore. So, for the security purpose, you will be prompted for that password each time you execute the job if we create a sqoop job that requires a password.

In addition, by setting sqoop.metastore.client.record.password to true in the configuration we can easily enable passwords in the metastore.

Note: If we are executing saved jobs via Oozie we have to set sqoop.metastore.client.record.password to true. It is important since when executed as Oozie tasks, Sqoop cannot prompt the user to enter passwords.

Sqoop Job Incremental Imports

Basically, by comparing the values in a check column against a reference value for the most recent import all the sqoop incremental imports are performed.

Conclusion

As a result, we have seen the complete content regarding Sqoop Jobs. Also, we have seen the way to create or sqoop job incremental import. Apart from that, we have also learned syntax and purpose of a sqoop job to understand.

Moreover, we have seen some essential arguments also. However, still, if you feel any doubts regarding, feel free to ask in the comment section. We assure you that we will get back to you.

See Also- Sqoop Merge
For reference

Did you know we work 24x7 to provide you best tutorials
Please encourage us - write a review on Google

courses

DataFlair Team

At DataFlair, our team of experts is committed to offering high-quality tutorials on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. We simplify learning to help you advance your career in the tech world.

1 Response

  1. Dillip says:

    1. I have imported data using lastmodified to a spwcific directory. But when I try next time to import to same directory I get FileAlreadyExistOption…Output directory already exist. How this scenario is hanled in real time/production environment. Below is my command

    sqoop import –connect “jdbc:mysql://localhost.localdomain:3306/training” –table sqoopincrement –username training –password training
    –target-dir sqoop/incremental_import_2 –incremental lastmodified –check-column Date –last-value ‘2020-01-25 00:00:00’

    2. The command works fine if I add append along with lastmodified like below

    sqoop import –connect “jdbc:mysql://localhost.localdomain:3306/training” –table sqoopincrement –username training –password training
    –target-dir sqoop/incremental_import_2 –append –incremental lastmodified –check-column Date –last-value ‘2020-01-25 01:08:41’

Leave a Reply

Your email address will not be published. Required fields are marked *