Sqoop Validation – Interfaces & Limitations of Sqoop Validate

1. Objective

Sqoop Validation validates the data copied by an import or an export. In this article, we will learn the whole concept of validation in Sqoop. After the introduction, we will cover the purpose of validation, the Sqoop validation interfaces, the syntax and configuration, the limitations, and some example invocations.

2. Introduction to Sqoop Validation

a. Sqoop validation means validating the data copied, for either an import or an export, by comparing the row counts from the source and the target after the copy.
b. We use this option to compare the row counts between the source and the target just after the data is imported into HDFS.
c. If rows are added or deleted during the import, Sqoop tracks this change and updates the log file.

3. Interfaces of Sqoop Validation

There are three interfaces of Sqoop Validation:

a. ValidationThreshold

We use ValidationThreshold to determine whether the error margin between the source and the target is acceptable: Absolute, Percentage Tolerant, and so on. The default implementation is AbsoluteValidationThreshold, which ensures that the row counts from the source and the target are the same.
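
To make the difference between an absolute and a percentage-tolerant threshold concrete, here is a minimal shell sketch. The function names and the max_percent parameter are illustrative assumptions, not Sqoop's own classes or options; the sketch only mirrors the idea of comparing two row counts.

```shell
#!/usr/bin/env bash
# Illustrative stand-ins for threshold checks; these are NOT Sqoop classes.

# Absolute threshold: passes only when both row counts are identical.
absolute_threshold() {
  local source_rows=$1 target_rows=$2
  [ "$source_rows" -eq "$target_rows" ]
}

# Hypothetical percentage-tolerant threshold: passes when the difference
# is at most max_percent of the source row count.
percentage_threshold() {
  local source_rows=$1 target_rows=$2 max_percent=$3
  local diff=$(( source_rows - target_rows ))
  [ "$diff" -lt 0 ] && diff=$(( -diff ))
  # diff / source <= max_percent / 100, rewritten to stay in integers
  [ $(( diff * 100 )) -le $(( source_rows * max_percent )) ]
}

absolute_threshold 1000 1000 && echo "absolute: pass"
absolute_threshold 1000 990 || echo "absolute: fail"
percentage_threshold 1000 990 2 && echo "2% tolerance: pass"
```

With an absolute threshold a single missing row is a failure, while the tolerant variant accepts the same 10-row gap because it is within 2% of the source count.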

b. ValidationFailureHandler

ValidationFailureHandler is the interface responsible for handling failures: logging an error or warning, aborting, and so on. LogOnFailureHandler logs a warning message to the configured logger, while AbortOnFailureHandler, which the configuration section lists as the default value, aborts with an error.

c. Validator

Validator drives the validation logic by delegating the decision to ValidationThreshold and the failure handling to ValidationFailureHandler. The default implementation is RowCountValidator, which validates the row counts from the source and the target.
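
The delegation described above can be sketched in shell as follows. This is only a simplified model under stated assumptions: the function names are hypothetical stand-ins for the three interfaces, and Sqoop's real implementations are Java classes with their own signatures.

```shell
#!/usr/bin/env bash
# Simplified model of Validator -> ValidationThreshold -> ValidationFailureHandler.
# All function names are illustrative, not part of Sqoop.

# Threshold stand-in: row counts must match exactly.
exact_match_threshold() { [ "$1" -eq "$2" ]; }

# Failure handler stand-ins: one only warns, the other fails the run.
log_on_failure() {
  echo "WARN: row counts differ (source=$1, target=$2)" >&2
  return 0
}
abort_on_failure() {
  echo "ERROR: row counts differ (source=$1, target=$2)" >&2
  return 1
}

# Validator stand-in: asks the threshold for a decision and hands
# any failure to the handler. Both are passed in by name (pluggable).
row_count_validator() {
  local threshold=$1 handler=$2 source_rows=$3 target_rows=$4
  if "$threshold" "$source_rows" "$target_rows"; then
    echo "validation passed"
    return 0
  fi
  "$handler" "$source_rows" "$target_rows"
}

row_count_validator exact_match_threshold log_on_failure 1000 1000
```

Swapping log_on_failure for abort_on_failure changes only what happens after a mismatch, which is the point of keeping the decision and the failure handling in separate pieces.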

4. Purpose to Validate in Sqoop

The main purpose of validation in Sqoop is to validate the data copied, for either an import or an export, by comparing the row counts from the source and the target after the copy.

5. Syntax of Sqoop Validation

$ sqoop import (generic-args) (import-args)
$ sqoop export (generic-args) (export-args)
The validation arguments are part of the import and export arguments.

6. Sqoop Validation Configuration

The validation framework is extensible and pluggable. It comes with default implementations, but the interfaces can be extended to allow custom implementations, which are passed as part of the command line arguments. The three options are described below.

a. Validator

Description: Driver for validation; must implement org.apache.sqoop.validation.Validator
Supported values: The value has to be a fully qualified class name
Default value: org.apache.sqoop.validation.RowCountValidator

b. Validation Threshold

Description: Drives the decision on whether the validation meets the threshold or not; must implement org.apache.sqoop.validation.ValidationThreshold
Supported values: The value has to be a fully qualified class name
Default value: org.apache.sqoop.validation.AbsoluteValidationThreshold

c. Validation Failure Handler

Description: Responsible for handling failures; must implement org.apache.sqoop.validation.ValidationFailureHandler
Supported values: The value has to be a fully qualified class name
Default value: org.apache.sqoop.validation.AbortOnFailureHandler

7. Limitations of Sqoop Validation

Validation currently only validates data copied from a single table into HDFS. The current implementation therefore does not support:

  • The all-tables option.
  • The free-form query option.
  • Data imported into Hive, HBase or Accumulo.
  • Table imports that use the --where argument.
  • Incremental imports.
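
As a rough guard against these cases, a wrapper script could scan its Sqoop arguments before adding --validate. The sketch below is an assumption-laden helper, not part of Sqoop: validation_supported is a hypothetical name, and it only looks for the common long-form spellings of the options listed above.

```shell
#!/usr/bin/env bash
# Hypothetical pre-check: returns 0 when none of the options that
# validation does not cover appear in the argument list, 1 otherwise.
validation_supported() {
  local arg
  for arg in "$@"; do
    case "$arg" in
      import-all-tables|--query|--where|--incremental|--hive-import|--hbase-table|--accumulo-table)
        echo "validation not supported with: $arg" >&2
        return 1
        ;;
    esac
  done
  return 0
}

validation_supported import --connect jdbc:mysql://db.foo.com/corp \
  --table EMPLOYEES && echo "ok to add --validate"
```

The check is deliberately naive: it matches whole arguments only, so short forms or option values that happen to look like one of these flags are not handled.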

8. Example Invocations of Sqoop Validation

A basic import of a table named EMPLOYEES in the corp database that uses validation to validate the row counts:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp \
    --table EMPLOYEES --validate
A basic export to populate a table named bar with validation enabled:
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data --validate
Another example that overrides the validation arguments, here passing the default classes explicitly:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --validate --validator org.apache.sqoop.validation.RowCountValidator \
    --validation-threshold \
    org.apache.sqoop.validation.AbsoluteValidationThreshold \
    --validation-failurehandler \
    org.apache.sqoop.validation.AbortOnFailureHandler
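
When such a command runs from a script, the script usually needs to react to a failed validation. The sketch below assumes that a validation failure under the abort handler surfaces as a non-zero exit status of the sqoop command; run_with_validation is a hypothetical helper name, and the sqoop invocation is shown only as a comment so the block runs without a Hadoop installation.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: runs the given command and reports on its exit status.
run_with_validation() {
  if "$@"; then
    echo "import and validation succeeded"
  else
    local status=$?
    echo "command failed with exit status $status; check the Sqoop log" >&2
    return "$status"
  fi
}

# Intended usage (requires a working Sqoop setup):
# run_with_validation sqoop import --connect jdbc:mysql://db.foo.com/corp \
#     --table EMPLOYEES --validate
```

The wrapper passes the original exit status through, so a scheduler or calling script can still decide whether to retry or stop.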

9. Conclusion

In this blog, we have learned the whole concept of Sqoop Validation and seen several example invocations. If you have any query about validation in Sqoop, please ask in the comment section and we will get back to you.