Sqoop Validation – Interfaces & Limitations of Sqoop Validate
Sqoop validation checks that the data copied by an import or export matches the source. In this article, we cover the whole concept of validation in Sqoop: its purpose, the validation interfaces, the syntax and configuration, example invocations, and the limitations of the current implementation.
Introduction to Sqoop Validation
a. Sqoop validation means validating the data copied by either an import or an export, by comparing the row counts of the source and the target after the copy.
b. We use the --validate option to compare the row counts between the source and the target right after the data has been imported into HDFS.
c. If rows are added or deleted while an import is running, Sqoop tracks the change and updates the log file.
Interfaces of Sqoop Validation
The validation framework is built on three basic interfaces:
a. ValidationThreshold
ValidationThreshold determines whether the error margin between the source and the target is acceptable, for example Absolute or Percentage Tolerant. The default implementation is AbsoluteValidationThreshold, which ensures that the row counts from the source and the target are the same.
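To make the threshold idea concrete, here is a minimal shell sketch of the two kinds of margin mentioned above. It is an illustration only, not Sqoop's code: the function names absolute_threshold and percent_threshold, and the 5 percent tolerance used in the sample call, are assumptions made for this example.

```shell
#!/bin/sh
# Illustrative only: these functions mimic the idea of a validation
# threshold; they are not part of Sqoop.

# Absolute threshold: passes only when both row counts are identical.
absolute_threshold() {
  [ "$1" -eq "$2" ]
}

# Percentage-tolerant threshold (hypothetical): passes when the target
# differs from the source by at most $3 percent of the source count.
percent_threshold() {
  src=$1
  tgt=$2
  pct=$3
  diff=$((src - tgt))
  if [ "$diff" -lt 0 ]; then
    diff=$((0 - diff))
  fi
  # Integer arithmetic: diff/src <= pct/100 is the same as diff*100 <= src*pct
  [ $((diff * 100)) -le $((src * pct)) ]
}

if absolute_threshold 1000 1000; then
  echo "absolute: pass"
else
  echo "absolute: fail"
fi

if percent_threshold 1000 960 5; then
  echo "percent: pass"
else
  echo "percent: fail"
fi
```

With the sample numbers, both checks pass: the counts are equal in the first call, and a difference of 40 rows out of 1000 is within the assumed 5 percent tolerance in the second.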
b. ValidationFailureHandler
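ValidationFailureHandler is responsible for handling failures, for example by logging an error or warning, or by aborting. This interface description names LogOnFailureHandler, which logs a warning message to the configured logger, as the default implementation. Note, however, that the configuration table later in this article lists AbortOnFailureHandler as the default value, so check the behavior of your Sqoop version.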
c. Validator
Validator drives the validation logic: it delegates the pass-or-fail decision to ValidationThreshold and delegates failure handling to ValidationFailureHandler. The default implementation is RowCountValidator, which validates the row counts from the source and the target.
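The way the three interfaces cooperate can be sketched in a few lines of shell. This is only an analogue of the delegation described above, not Sqoop's implementation: all function names here (threshold_ok, log_on_failure, abort_on_failure, validate_row_counts) are hypothetical and merely echo the roles of the interfaces.

```shell
#!/bin/sh
# Illustrative only: a shell analogue of the Validator flow, not Sqoop code.

# Threshold decision: identical row counts (the absolute rule).
threshold_ok() {
  [ "$1" -eq "$2" ]
}

# Failure handlers: one only logs a warning, the other signals that the
# job should stop by returning a non-zero status.
log_on_failure() {
  echo "WARN: row counts differ (source=$1, target=$2)" >&2
  return 0
}

abort_on_failure() {
  echo "ERROR: row counts differ (source=$1, target=$2)" >&2
  return 1
}

# Validator: asks the threshold first and, on a mismatch, delegates to the
# handler whose function name is passed as the third argument.
validate_row_counts() {
  if threshold_ok "$1" "$2"; then
    return 0
  fi
  "$3" "$1" "$2"
}

validate_row_counts 500 500 abort_on_failure && echo "validation passed"
validate_row_counts 500 498 log_on_failure && echo "continuing after warning"
```

Swapping the handler name in the last argument changes what happens on a mismatch without touching the comparison itself, which is the point of keeping the threshold and the failure handler as separate pieces.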
Purpose to Validate in Sqoop
The main purpose of validation in Sqoop is to verify the data copied by an import or an export, by comparing the row counts of the source and the target after the copy.
Syntax of Sqoop Validation
$ sqoop import (generic-args) (import-args)
$ sqoop export (generic-args) (export-args)
The validation arguments are part of the import and export arguments.
Sqoop Validation Configuration
The validation framework is extensible and pluggable. It comes with default implementations, but the interfaces can be extended to allow custom implementations, which are passed as part of the command line arguments as described below.
a. Validator
Property | validator |
Description | Driver for validation; must implement org.apache.sqoop.validation.Validator |
Supported values | The value has to be a fully qualified class name |
Default value | org.apache.sqoop.validation.RowCountValidator |
b. Validation Threshold
Property | validation-threshold |
Description | Drives the decision on whether the validation meets the threshold or not; must implement org.apache.sqoop.validation.ValidationThreshold |
Supported values | The value has to be a fully qualified class name |
Default value | org.apache.sqoop.validation.AbsoluteValidationThreshold |
c. Validation Failure Handler
Property | validation-failurehandler |
Description | Responsible for handling failures; must implement org.apache.sqoop.validation.ValidationFailureHandler |
Supported values | The value has to be a fully qualified class name |
Default value | org.apache.sqoop.validation.AbortOnFailureHandler |
Limitations of Sqoop Validation
Validation currently only validates data copied from a single table into HDFS. The current implementation therefore does not support validation with:
- The all-tables option.
- The free-form query option.
- Data imported into Hive, HBase or Accumulo.
- Table imports that use the --where argument.
- Incremental imports.
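For cases that the built-in validation does not cover, a manual row count comparison can serve a similar purpose. The sketch below is an illustration under stated assumptions: it presumes a text-format import with one record per line, and the HDFS path in the comments is a placeholder. The compare_counts function is a hypothetical helper and is the only part that runs as written; the steps that would fetch the real counts are left as comments because they depend on your database and cluster.

```shell
#!/bin/sh
# Manual row count check for imports that --validate does not cover.
# Assumes text-format output with one record per line.

# Possible ways to obtain the two counts (adjust to your environment):
#   source: run SELECT COUNT(*) with the same WHERE clause against the database
#   target: hdfs dfs -cat /user/someuser/EMPLOYEES/part-* | wc -l

# Compare two counts and report the result; returns non-zero on a mismatch.
compare_counts() {
  if [ "$1" -eq "$2" ]; then
    echo "OK: $1 rows in source and target"
  else
    echo "MISMATCH: source=$1 target=$2"
    return 1
  fi
}

# Example with literal numbers standing in for the real counts.
compare_counts 1200 1200
```

Because the function returns a non-zero status on a mismatch, a wrapper script can decide for itself whether to stop or only log, much like choosing a failure handler.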
Example Invocations of Sqoop Validation
A basic import of a table named EMPLOYEES in the corp database that uses validation to check the row counts:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp  \
--table EMPLOYEES --validate
A basic export to populate a table named bar with validation enabled:
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar  \
--export-dir /results/bar_data --validate
Another example that overrides the validation arguments:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--validate --validator org.apache.sqoop.validation.RowCountValidator \
--validation-threshold \
org.apache.sqoop.validation.AbsoluteValidationThreshold \
--validation-failurehandler \
org.apache.sqoop.validation.AbortOnFailureHandler
Conclusion
In this article, we have learned the whole concept of Sqoop validation and seen several example invocations. If you have any query regarding Sqoop validation, please ask in the comment section and we will get back to you.