Sqoop Export – Exporting From HDFS to RDBMS
After Sqoop Import, there is a tool which exports a set of files from HDFS back to RDBMS, that tool is what we call an Export Tool in Apache Sqoop. In this article, we will learn the whole concept of Sqoop Export. Along with Sqoop Export introduction and Purpose. Moreover, we will learn Sqoop Export syntax with example invocations to understand it well.
So, let’s start the Sqoop Export tutorial.
Introduction to Sqoop Export and its Purpose
A tool which exports a set of files from HDFS back to RDBMS. That tool is what we call a Sqooop Export Tool. There is one condition for it, that in the database, target the table must already exist. However, the input files are read and parsed according to the user-specified delimiters into a set of records.
In addition, the default operation is to transform these into a set of INSERT statements that inject the records into the database. In “update mode,” sqoop will generate UPDATE statements that replace existing records in the database, and in “call mode” sqoop will make a stored procedure call for each record.
Sqoop Export Syntax
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)
However, the export arguments can be entered in any order with respect to one another, but the Hadoop generic arguments must precede any export arguments.
Table 1. Common arguments
Argument | Description |
–connect <jdbc-uri> | Specify JDBC connect string |
–connection-manager <class-name> | Specify connection manager class to use |
–driver <class-name> | Manually specify JDBC driver class to use |
–hadoop-mapred-home <dir> | Override $HADOOP_MAPRED_HOME |
–help | Print usage instructions |
–password-file | Set path for a file containing the authentication password |
-P | Read password from console |
–password <password> | Set authentication password |
–username <username> | Set authentication username |
–verbose | Print more information while working |
–connection-param-file <filename> | Optional properties file that provides connection parameters |
Table 2. Validation arguments More Details
Argument | Description |
–validate | Enable validation of data copied, supports single table copy only. |
–validator <class-name> | Specify validator class to use. |
–validation-threshold <class-name> | Specify validation threshold class to use. |
–validation-failurehandler <class-name> | Specify validation failure handler class to use. |
Table 3. Export control arguments:
Argument | Description |
–columns <col,col,col…> | Columns to export to table |
–direct | Use direct export fast path |
–export-dir <dir> | HDFS source path for the export |
-m,–num-mappers <n> | Use n map tasks to export in parallel |
–table <table-name> | Table to populate |
–call <stored-proc-name> | Stored Procedure to call |
–update-key <col-name> | Anchor column to use for updates. Use a comma separated list of columns if there are more than one column. |
–update-mode <mode> | Specify how updates are performed when new rows are found with non-matching keys in database. |
Legal values for mode include updateonly (default) and allowinsert. | |
–input-null-string <null-string> | The string to be interpreted as null for string columns |
–input-null-non-string <null-string> | The string to be interpreted as null for non-string columns |
–staging-table <staging-table-name> | The table in which data will be staged before being inserted into the destination table. |
–clear-staging-table | Indicates that any data present in the staging table can be deleted. |
–batch | Use batch mode for underlying statement execution. |
Basically, we need both the –export-dir argument and one of –table or –call. Although, it specifies the table to populate in the database (or the stored procedure to call). Also, the directory in HDFS that contains the source data.
In addition, all columns within a table are selected for export in Sqoop, by default. Moreover, by using the –columns argument we can select a subset of columns and control their ordering. Although, it should include a comma-delimited list of columns to export. Like: –columns “col1,col2,col3”.
Note:Â Columns that are not included in the –columns parameter need to have either defined default value or allow NULL values.
Difference between Inserts vs. Updates in Sqoop
As we know, sqoop export appends new rows to a table. Here each input record is transformed into an INSERT statement that adds a row to the target database table. However, if our table already contains data and has constraints, we must take care to avoid inserting records.
Since, that violate these constraints. Make sure that if an INSERT statement fails the export process will fail. However, for exporting records to a new this mode is primarily intended. Since, empty table intended to receive these results.
In addition, Sqoop will modify an existing dataset in the database, if we specify the –update-key argument. Although, here each input record is considered as an UPDATE statement. Basically, that modifies an existing row.Â
CREATE TABLE foo(
id INT NOT NULL PRIMARY KEY,
msg VARCHAR(32),
bar INT);
In HDFS, consider also a dataset containing records like these:
0,this is a test,42
1,some more data,100
…
Also, Running sqoop-export –table foo –update-key id –export-dir /path/to/data –connect … will run an export job that executes SQL statements based on the data like so:
UPDATE foo SET msg=’this is a test’, bar=42 WHERE id=0;
UPDATE foo SET msg=’some more data’, bar=100 WHERE id=1;
…
We can also specify the –update-mode argument with allowinsert mode depending on the target database, if we want to update rows, if they exist in the database already or insert rows if they do not exist yet.
Table 4. Input parsing arguments:
Argument | Description |
–input-enclosed-by <char> | Sets a required field encloser |
–input-escaped-by <char> | Sets the input escape character |
–input-fields-terminated-by <char> | Sets the input field separator |
–input-lines-terminated-by <char> | Sets the input end-of-line character |
–input-optionally-enclosed-by <char> | Sets a field enclosing character |
Table 5. Output line formatting arguments:
Argument | Description |
–enclosed-by <char> | Sets a required field enclosing character |
–escaped-by <char> | Sets the escape character |
–fields-terminated-by <char> | Sets the field separator character |
–lines-terminated-by <char> | Sets the end-of-line character |
–mysql-delimiters | Uses MySQL’s default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: |
–optionally-enclosed-by <char> | Sets a field enclosing character |
Moreover, Sqoop will fail to find enough columns per line, if we specify incorrect delimiters. Basically, that will cause export map tasks to fail by throwing ParseExceptions.
Table 6. Code generation arguments:
Argument | Description |
–bindir <dir> | Output directory for compiled objects |
–class-name <name> | Sets the generated class name. This overrides –package-name. When combined with –jar-file, sets the input class. |
–jar-file <file> | Disable code generation; use specified jar |
–outdir <dir> | Output directory for generated code |
–package-name <name> | Put auto-generated classes in this package |
–map-column-java <m> | Override default mapping from SQL type to Java type for configured columns. |
Exports and Transactions in Sqoop
However, by multiple writers in parallel Exports are performed. Basically, each writer uses a separate connection to the database. These have separate transactions from one another. Although, to insert up to 100 records per statement Sqoop uses the multi-row INSERT syntax. The current transaction within a writing task is committed, every 100 statements, that is causing a commit every 10,000 rows.
That ensures that transaction buffers do not grow without bound. Also, cause out-of-memory conditions. Hence, a Sqoop export is not an atomic process. Note that, before the export is complete Partial results from the export will become visible.
Failed Sqoop Export
There are various reasons due to which export in sqoop may fail such as:
- Due to loss of connectivity from the Hadoop cluster to the database. Either due to a hardware fault, or server software crashes.
- By attempting to INSERT a row. That violates a consistency constraint. For example, inserting a duplicate primary key value.
- Also, attempting to parse an incomplete or malformed record from the HDFS source data
- Through attempting to parse records using incorrect delimiters
- Due to capacity issues. like insufficient RAM or disk space.
Sqoop Exports Example
To populate a table named bar, a basic export in Sqoop is:
$ sqoop export –connect jdbc:mysql://db.example.com/foo –table bar  \
–export-dir /results/bar_data
Here, this example takes the files in /results/bar_data. Afterwards injects their contents into the bar table in the foo database on db.example.com. However, the first condition is that in the database the target table must already exist. Also, Sqoop performs a set of INSERT INTO operations, without regard for existing content.
To populate a table named bar with validation enabled: More Details, another basic Sqoop export
$ sqoop export –connect jdbc:mysql://db.example.com/foo –table bar  \
–export-dir /results/bar_data –validate
For every record in /results/bar_data an export that calls a stored procedure named barproc would look like:
$ sqoop export –connect jdbc:mysql://db.example.com/foo –call barproc \
–export-dir /results/bar_data
Conclusion – Sqoop Export Tutorial
As a result, we have learned the whole concept of Sqoop Export, its purpose, syntax, Sqoop export updates vs insert, Exports and transactions in Sqoop, Failed Export in sqoop an example invocation. However, if you want to ask any Query, feel free to ask in the comment section.
See Also- Sqoop Architecture & Sqoop Jobs
For reference
If you are Happy with DataFlair, do not forget to make us happy with your positive feedback on Google
Hi,
Could you please post a blog on taking incremental backup in hive using sqoop from a RDBMS.
Thanks,
Sumit
Hello, I have a column in my table which has trailing spaces( not every value has space at the end). How to retain the spaces when performing an export?
can we use ‘where clause’ in sqoop export command?
Hi Divyanxi K Mistry,
Thank You for writing your query on “Sqoop Export”, well as per your concern, we would like to say, this clause is not supported with Sqoop Export yet, although we will work more on it to clarify your query in best way, until that you can refer our Introductory blog on Apache Sqoop, there you can see Export control arguments in Table. 7, to learn about the available option in Sqoop export.
Hope it helps!
Data-Flair
I’ve created an external table in HIVE over Parquet:
drop table H_LINEITEM_EXT;
create external table H_LINEITEM_EXT
(
L_ORDERKEY string,
L_PARTKEY string,
L_SUPPKEY string,
L_LINENUMBER string,
L_QUANTITY string,
L_EXTENDEDPRICE string,
L_DISCOUNT string,
L_TAX string,
L_RETURNFLAG string,
L_LINESTATUS string,
L_SHIPDATE date ,
L_COMMITDATE date ,
L_RECEIPTDATE date ,
L_SHIPINSTRUCT string,
L_SHIPMODE string,
L_COMMENT string
)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe’
STORED AS
INPUTFORMAT “org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat”
OUTPUTFORMAT “org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat”
LOCATION ‘/hdata/H_LINEITEM’;
I want to export this table to ORACLE with SQOOP:
sqoop export \
-Dsqoop.export.records.per.statement=1000000 \
-Dsqoop.export.statements.per.transaction=1000000 \
–connect “jdbc:oracle:thin:@192.168.1.32:1522:orcl” –password “system” –username “sys as sysdba” \
–table “TPCH.HIVE_LINEITEM” –hcatalog-table “H_LINEITEM_EXT” –hcatalog-database “default” –hcatalog-home /home/hadoop/hive \
-m 3 \
–batch \
–verbose
The process failed with the following exception:
2019-09-03 18:23:07,643 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.ClassCastException: org.apache.hadoop.hive.common.type.Date cannot be cast to java.sql.Date
at org.apache.sqoop.mapreduce.hcat.SqoopHCatExportHelper.convertToSqoop(SqoopHCatExportHelper.java:193)
at org.apache.sqoop.mapreduce.hcat.SqoopHCatExportHelper.convertToSqoopRecord(SqoopHCatExportHelper.java:138)
at org.apache.sqoop.mapreduce.hcat.SqoopHCatExportMapper.map(SqoopHCatExportMapper.java:56)
at org.apache.sqoop.mapreduce.hcat.SqoopHCatExportMapper.map(SqoopHCatExportMapper.java:35)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
Specific version I use:
hadoop-3.1.1
sqoop-1.4.7
I’ve tried with :
–map-column-java “l_shipdate=String,l_commitdate=String,l_receiptdate=String”
or
–map-column-java “L_SHIPDATE=java.sql.Date,L_COMMITDATE=java.sql.Date,L_RECEIPTDATE=java.sql.Date”
But no luck.
If I create a table in Hive with all string columns from the external table H_LINEITEM_EXT, then I can manage to import that table successfully. But this is not a good solution because it duplicates the time and space.
Could you please advise me on the right solution?