Top 50 Apache Sqoop Interview Questions & Answers

Whenever you go for an interview, even one for Hadoop, there is a good chance the interviewer will throw a curveball at you by starting with questions on Hadoop ecosystem tools. So, here you can prepare one of the most important tools in the Hadoop ecosystem: Sqoop.

This document covers the best Sqoop Interview Questions. Before we dive into the Apache Sqoop Interview Questions and answers, let's take a look at a brief introduction to Sqoop.

So, let's explore the top Sqoop Interview Questions.

Introduction to Apache Sqoop

Sqoop − “SQL to Hadoop and Hadoop to SQL”
When it comes to transferring data between Hadoop and relational database servers, Sqoop is the best tool. To be more specific, we use it to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases. Sqoop is provided by the Apache Software Foundation.

Sqoop provides two main tools:

  1. Sqoop import  (Copy data from RDBMS to HDFS)
  2. Sqoop export  (Copy data from HDFS to RDBMS)

Now, let's start discussing the best Apache Sqoop Interview Questions.

Frequently asked Sqoop Interview Questions

So, here are the top 50 Sqoop Interview Questions:

Que 1. Mention the best features of Apache Sqoop.

Ans. Apache Sqoop is a tool in the Hadoop ecosystem that has several advantages, such as:

  1. Parallel import/export
  2. Connectors for all major RDBMS Databases
  3. Import results of SQL query
  4. Incremental Load
  5. Full Load
  6. Kerberos Security Integration
  7. Load data directly into Hive / HBase
  8. Compression
  9. Support for Accumulo

To learn all the features in detail, follow the link: Sqoop Features

Que 2. What is Sqoop Import? Explain its purpose.

Ans. We use the Sqoop import tool to import tables from an RDBMS into HDFS. Each row in a table is treated as a record in HDFS. In text files, all records are stored as text data, whereas in Avro and sequence files all records are stored as binary data. To be more specific, the import tool imports individual tables from the RDBMS into HDFS.

There are many more insights into Sqoop import; to learn them all in detail, follow the link: Sqoop import

Que 3. What is the default file format to import data using Apache Sqoop?

Ans. Sqoop allows data to be imported using two file formats:

i) Delimited Text File Format

This is the default file format for importing data using Sqoop. It can also be specified explicitly by passing the --as-textfile argument to the import command. Passing this argument produces a string-based representation of all the records in the output files, with delimiter characters between columns and newlines between rows.

ii) Sequence File Format

The sequence file format is a binary file format. Its records are stored in custom record-specific data types that are exposed as Java classes. Sqoop automatically creates these data types and manifests them as Java classes.
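For example, a rough sketch of both formats (the connection string, table, and target paths here are hypothetical placeholders):

sqoop import --connect jdbc:mysql://db.example.com/corp --username user -P --table EMPLOYEES --as-textfile --target-dir /data/employees_text
sqoop import --connect jdbc:mysql://db.example.com/corp --username user -P --table EMPLOYEES --as-sequencefile --target-dir /data/employees_seq

The first command is equivalent to the default behaviour; the second stores the same records as binary SequenceFiles.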

Que 4. How can I import large objects (BLOB and CLOB objects) in Apache Sqoop?

Ans. Direct import of BLOB and CLOB large objects is not supported by the Apache Sqoop import command. So, to import large objects in Sqoop, JDBC-based imports have to be used, without the --direct argument to the import utility.

To learn Sqoop Import in detail, follow this link.
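As a rough, hedged sketch (the connection details and table are hypothetical), the --inline-lob-limit argument controls how large objects are handled: objects up to the limit are stored inline, and setting the limit to 0 forces them into external LobFile storage:

sqoop import --connect jdbc:mysql://db.example.com/corp --username user -P --table DOCUMENTS --target-dir /data/documents --inline-lob-limit 0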

Que 5. How can you execute a free-form SQL query in Sqoop to import the rows in a sequential manner?

Ans. We can accomplish this by using the -m 1 option in the Sqoop import command. It creates only one MapReduce task, which then imports the rows serially.
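A minimal sketch, assuming a hypothetical MySQL database and query (the $CONDITIONS token is still required even with a single mapper):

sqoop import --connect jdbc:mysql://db.example.com/corp --username user -P --query "SELECT * FROM EMPLOYEES WHERE \$CONDITIONS ORDER BY emp_id" --target-dir /data/employees_serial -m 1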

Que 6. Does Apache Sqoop have a default database?

Ans. Yes, MySQL is the default database.
To learn Sqoop List Databases in detail, follow this link.

Que 7. How will you list all the columns of a table using Apache Sqoop?

Ans. There is no direct command such as sqoop-list-columns to list all the columns. However, we can achieve this indirectly by retrieving the columns of the desired table and redirecting them to a file, which can then be viewed manually to see the column names of that particular table. For example:

sqoop import -m 1 --connect 'jdbc:sqlserver://nameofmyserver;database=nameofmydatabase;username=DeZyre;password=mypassword' --query "SELECT column_name, DATA_TYPE FROM INFORMATION_SCHEMA.Columns WHERE table_name='mytableofinterest' AND \$CONDITIONS" --target-dir 'mytableofinterest_column_name'
To learn Sqoop List Tables in detail, follow this link.

Que 8.  If the source data gets updated every now and then, how will you synchronize the data in HDFS that is imported by Sqoop?

Ans. We can synchronize the data by using the --incremental parameter with the data import. The incremental parameter can be used with one of two options:

i) append
We should use incremental import with the append option when the table is continuously being updated with new rows and increasing row-id values. The column to be checked is specified using --check-column, and only rows whose value in that column is greater than the last imported value are appended to HDFS.

ii) lastmodified
In this kind of incremental import, the source table has a date or timestamp column that is checked. Any records that have been updated after the last import, based on that last-modified column in the source, are imported.
To learn Sqoop list Table in detail, follow this link.
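The two modes look roughly like this (table, column names, and last values are hypothetical placeholders):

sqoop import --connect jdbc:mysql://db.example.com/corp --username user -P --table EMPLOYEES --target-dir /data/employees --incremental append --check-column emp_id --last-value 1000
sqoop import --connect jdbc:mysql://db.example.com/corp --username user -P --table EMPLOYEES --target-dir /data/employees --incremental lastmodified --check-column last_updated --last-value "2016-07-20 00:00:00" --merge-key emp_id

In the second command, --merge-key tells Sqoop which column to use when reconciling rows that were updated since the previous import.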

Que 9. Name a few import control commands. How can Sqoop handle large objects?

Ans. To import RDBMS data, we use import control commands such as:
Append: append data to an existing dataset in HDFS (--append).
Columns: specify the columns to import from the table (--columns <col,col,...>).
Where: specify a WHERE clause to use during import (--where).
The common large objects are BLOB and CLOB. If the object is less than 16 MB, it is stored inline with the rest of the data. Larger objects are temporarily stored in a subdirectory named _lobs and are then materialized in memory for processing. If we set the LOB limit to zero (0), the large objects are stored in external storage.
To learn Sqoop Import in detail, follow this link.

Que 10. How can we import data from a particular row or column? What are the destination types allowed in the Sqoop import command?

Ans. Sqoop allows us to import the data from a table on the basis of a WHERE clause or a list of columns. The relevant arguments are:
--columns <col1,col2,...>
--where
--query

For Example:
sqoop import --connect jdbc:mysql://db.one.com/corp --table INTELLIPAAT_EMP --where "start_date > '2016-07-20'"
sqoop eval --connect jdbc:mysql://db.test.com/corp --query "SELECT * FROM intellipaat_emp LIMIT 20"
sqoop import --connect jdbc:mysql://localhost/database --username root --password aaaaa --columns "name,emp_id,jobtitle"
Sqoop supports importing data into the following services:

  1. HDFS
  2. Hive
  3. Hbase
  4. Hcatalog
  5. Accumulo

To learn Sqoop Supported Databases in detail, follow this link.

Sqoop Interview Questions for Freshers are Q. 1,2,5,6,7

Sqoop Interview Questions for Experience are Q. 3,4,8,9,10

Que 11. When to use –target-dir and when to use –warehouse-dir while importing data?

Ans. We use --target-dir to specify a particular directory in HDFS, whereas we use --warehouse-dir to specify the parent directory for all Sqoop jobs. In the latter case, Sqoop creates a directory under that parent directory with the same name as the table.
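A short sketch of the difference (paths and table name are hypothetical placeholders):

sqoop import --connect jdbc:mysql://db.example.com/corp --username user -P --table EMPLOYEES --target-dir /data/emp
sqoop import --connect jdbc:mysql://db.example.com/corp --username user -P --table EMPLOYEES --warehouse-dir /warehouse

The first command writes the files directly into /data/emp, while the second writes them into /warehouse/EMPLOYEES, a subdirectory named after the table.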

Que 12. What is the process to perform an incremental data load in Sqoop?

Ans. In Sqoop, an incremental data load synchronizes the modified or updated data (often referred to as delta data) from the RDBMS to Hadoop. The delta data is handled through the incremental load command.
Incremental load can be performed using the Sqoop import command, or by loading the data into Hive without overwriting it. The different attributes that need to be specified during an incremental load are:
1) Mode (incremental)
It tells Sqoop how to determine which rows are new. Its value can be append or lastmodified.
2) Col (Check-column)
Basically, it specifies the column that should be examined to find out the rows to be imported.
3) Value (last-value)
It denotes the maximum value of the check column from the previous import operation.
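As an illustrative sketch only (job name, table, and column are hypothetical), the three attributes can be wired into a saved Sqoop job so that the last value is remembered between runs:

sqoop job --create employees_incremental -- import --connect jdbc:mysql://db.example.com/corp --username user --password-file /user/sqoop/.db.password --table EMPLOYEES --target-dir /data/employees --incremental append --check-column emp_id --last-value 0
sqoop job --exec employees_incremental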

Que 13. What is the significance of using the --compression-codec parameter?

Ans. We use the --compression-codec parameter to get the output files of a Sqoop import in a format other than .gz, such as .bz2.
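For example, a hedged sketch that produces .bz2 output instead of the default gzip (the connection details are placeholders):

sqoop import --connect jdbc:mysql://db.example.com/corp --username user -P --table EMPLOYEES --target-dir /data/employees_bz2 --compress --compression-codec org.apache.hadoop.io.compress.BZip2Codec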

Que 14. Can free-form SQL queries be used with Sqoop import command? If yes, then how can they be used?

Ans. Yes, we can use free-form SQL queries with the import command in Sqoop. We should use the import command with the -e or --query option to execute free-form SQL queries. Note that the --target-dir value must be specified when using the -e or --query option with the import command.
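A minimal sketch of a free-form query import, assuming hypothetical tables; note the mandatory $CONDITIONS token, the required --target-dir, and --split-by when more than one mapper is used:

sqoop import --connect jdbc:mysql://db.example.com/corp --username user -P --query "SELECT e.emp_id, e.name, d.dept_name FROM EMPLOYEES e JOIN DEPTS d ON e.dept_id = d.dept_id WHERE \$CONDITIONS" --split-by e.emp_id --target-dir /data/emp_dept -m 4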

Que 15. What is the importance of eval tool?

Ans. Sqoop eval helps to run sample SQL queries against the database and preview the results on the console. It also helps us to know what data we can import and whether the desired data was imported or not.

Que 16. How can you import only a subset of rows from a table?

Ans. We can import only a subset of rows by using a WHERE clause (the --where argument) in the Sqoop import statement.

Que 17. What are the limitations of importing RDBMS tables into Hcatalog directly?

Ans. By using the --hcatalog-database option together with --hcatalog-table, we can import RDBMS tables into HCatalog directly. The limitation is that it does not support several arguments such as --as-avrodatafile, --direct, --as-sequencefile, --target-dir, and --export-dir.
To learn Sqoop HCatalog in detail, follow this link.
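A rough sketch, assuming a hypothetical source table and an existing HCatalog table:

sqoop import --connect jdbc:mysql://db.example.com/corp --username user -P --table EMPLOYEES --hcatalog-database default --hcatalog-table employees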

Que 18. What is the advantage of using --password-file rather than the -P option for preventing the display of the password in the sqoop import statement?

Ans. The --password-file option can be used inside a Sqoop script, with the password stored in a permission-restricted file, so jobs can run unattended. The -P option, on the other hand, reads the password from standard input each time the command is run, which prevents automation.
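A hedged sketch of the scripted approach (the paths are placeholders); the file should contain only the password and be readable by the job owner alone:

echo -n "mypassword" > .db.password
hdfs dfs -put .db.password /user/sqoop/.db.password
hdfs dfs -chmod 400 /user/sqoop/.db.password
sqoop import --connect jdbc:mysql://db.example.com/corp --username user --password-file /user/sqoop/.db.password --table EMPLOYEES --target-dir /data/employees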

Que 19. What do you mean by Free Form Import in Sqoop?

Ans. Free-form import means that Sqoop can import data using any SQL query against a relational database, rather than only using the table and column name parameters.

Que 20. What is the role of JDBC driver in Sqoop?

Ans. Sqoop needs a connector to connect to the different relational databases. Almost every DB vendor makes this connector available as a JDBC driver that is specific to that database. Hence, to interact with a database, Sqoop needs the JDBC driver of that database.

Sqoop Interview Questions for Freshers are Q. 11,14,15,16,19

Sqoop Interview Questions for Experience are Q. 12,13,17,18,20

Que 21. Is JDBC driver enough to connect sqoop to the databases?

Ans. No. To connect to a database, Sqoop needs both the JDBC driver and a connector.
To learn Sqoop Connector in detail, follow this link.

Que 22. What is InputSplit in Hadoop?

Ans. When a Hadoop job runs, it splits the input files into chunks called input splits and assigns each split to a mapper to process.

Que 23. What is the work of Export in Hadoop sqoop?

Ans. The export tool transfers data from HDFS to an RDBMS.
To learn Sqoop Export in detail, follow this link.

Que 24. Use of Codegen command in Hadoop sqoop?

Ans. The codegen command generates Java classes to interact with database records.
To learn Sqoop Codegen in detail, follow this link.

Que 25. Use of Help command in Hadoop sqoop?

Ans. The help command in Hadoop Sqoop lists the available commands.

Que 26.  How can you schedule a sqoop job using Oozie?

Ans. Oozie has an in-built Sqoop action inside which we can mention the Sqoop commands to be executed.
To learn Sqoop Job in detail, follow this link.

Que 27. What is the importance of the --split-by clause in running parallel import tasks in Sqoop?

Ans. The --split-by clause specifies the column whose values are used to divide the data into groups of records. These groups of records are then read in parallel by the MapReduce tasks.
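For instance, a sketch where four mappers split the work on a hypothetical numeric primary key:

sqoop import --connect jdbc:mysql://db.example.com/corp --username user -P --table EMPLOYEES --split-by emp_id -m 4 --target-dir /data/employees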

Que 28. What is a sqoop metastore?

Ans. The Sqoop metastore is a tool that hosts a shared metadata repository. Multiple and/or remote users can define and execute saved jobs (created with sqoop job) in this metastore.
Clients must be configured to connect to the metastore, either in sqoop-site.xml or with the --meta-connect argument.
To learn Sqoop Job in detail, follow this link.
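A hedged sketch of defining and listing jobs against a shared metastore (the metastore host is hypothetical; 16000 is the port the metastore listens on by default):

sqoop job --create shared_import --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop -- import --connect jdbc:mysql://db.example.com/corp --username user --password-file /user/sqoop/.db.password --table EMPLOYEES --target-dir /data/employees
sqoop job --list --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop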

Que 29. What is the purpose of sqoop-merge?

Ans. The merge tool combines two datasets, where entries in the newer dataset overwrite entries of the older dataset, preserving only the newest version of each record across both datasets.
To learn Sqoop Merge in detail, follow this link.
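A sketch of a typical merge, assuming codegen has already produced the record class and jar (all names are placeholders); rows in the newer dataset replace rows in the older one that share the same --merge-key value:

sqoop merge --new-data /data/employees_new --onto /data/employees_old --target-dir /data/employees_merged --jar-file EMPLOYEES.jar --class-name EMPLOYEES --merge-key emp_id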

Que 30. How can you see the list of stored jobs in sqoop metastore?

Ans. sqoop job --list

Sqoop Interview Questions for Freshers are Q. 21,22,23,25,28

Sqoop Interview Questions for Experience are Q. 24,26,27,29,30

Que 31. Which database the sqoop metastore runs on?

Ans. Running sqoop-metastore launches a shared HSQLDB database instance on the current machine.

Que 32. Where can the metastore database be hosted?

Ans. Anywhere; we can host the metastore database within or outside of the Hadoop cluster.

Que 33. Give the sqoop command to see the content of the job named myjob?

Ans. sqoop job --show myjob

Que 34. How can you control the mapping between SQL data types and Java types?

Ans. We can configure the mapping between SQL data types and Java types by using the --map-column-java property.
For example:
$ sqoop import ... --map-column-java id=String,value=Integer
To learn Java Data types in detail, follow this link.

Que 35. Is it possible to add a parameter while running a saved job?

Ans. Yes, by using the --exec option we can add an argument to a saved job at runtime:
sqoop job --exec jobname -- --newparameter

Que 36. What is the usefulness of the options file in sqoop.

Ans. The options file in Sqoop lets us store command-line values in a file and reuse them in Sqoop commands.
For example:
The --connect parameter's value and the --username value can be stored in a file and used again and again with different Sqoop commands.
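A hedged sketch following the options-file pattern from the Sqoop documentation (the file path and connection values are placeholders). The options file lists one option or value per line:

import
--connect
jdbc:mysql://db.example.com/corp
--username
user

If the above is saved as /home/user/import.txt, it can be reused for different tables:

sqoop --options-file /home/user/import.txt --table EMPLOYEES
sqoop --options-file /home/user/import.txt --table DEPARTMENTS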

Que 37. How can you avoid importing tables one-by-one when importing a large number of tables from a database?

Ans. We can use the import-all-tables command:
sqoop import-all-tables
--connect
--username
--password
--exclude-tables table1,table2,...
This imports all the tables except the ones mentioned in the --exclude-tables clause.
To learn Sqoop import-all-tables in detail, follow this link.

Que 38. How can you control the number of mappers used by the sqoop command?

Ans. To control the number of mappers executed by a Sqoop command we use the --num-mappers (-m) parameter. We should start with a small number of map tasks and then gradually scale up, as choosing a high number of mappers initially may slow down performance on the database side.
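For example, a conservative starting point (connection details are placeholders):

sqoop import --connect jdbc:mysql://db.example.com/corp --username user -P --table EMPLOYEES --num-mappers 4 --target-dir /data/employees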

Que 39. What is the default extension of the files produced from a sqoop import using the –compress parameter?

Ans. .gz

Que 40. What is the significance of using the --compression-codec parameter?

Ans. We use the --compression-codec parameter to get the output files of a Sqoop import in a format other than .gz, such as .bz2.

Sqoop Interview Questions for Freshers are Q. 31,32,33,34,35

Sqoop Interview Questions for Experience are Q. 36,37,38,39,40

Que 41. What is a disadvantage of using –direct parameter for faster data load by sqoop?

Ans. The native utilities used by databases to support faster loads do not work for binary data formats like SequenceFile.

Que 42. How will you update the rows that are already exported?

Ans. To update existing rows we can use the --update-key parameter. It takes a comma-separated list of columns that uniquely identify a row. All of these columns are used in the WHERE clause of the generated UPDATE query, while all other table columns are used in the SET part of the query.
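A sketch of an export that updates rows already present in a hypothetical table keyed on emp_id:

sqoop export --connect jdbc:mysql://db.example.com/corp --username user -P --table EMPLOYEES --export-dir /data/employees --update-key emp_id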

Que 43. What are the basic commands in Apache Sqoop and its uses?

Ans. The basic commands of Apache Sqoop are:
Codegen, Create-hive-table, Eval, Export, Help, Import, Import-all-tables, List-databases, List-tables, Version.
Moreover, uses of Apache Sqoop basic commands are:

  1. Codegen- It helps to generate code to interact with database records.
  2. Create-hive-table- It helps to import a table definition into Hive.
  3. Eval- It helps to evaluate a SQL statement and display the results.
  4. Export- It helps to export an HDFS directory into a database table.
  5. Help- It helps to list the available commands.
  6. Import- It helps to import a table from a database to HDFS.
  7. Import-all-tables- It helps to import all tables from a database to HDFS.
  8. List-databases- It helps to list the available databases on a server.
  9. List-tables- It helps to list the tables in a database.
  10. Version- It helps to display the version information.

Que 44. Where did the word Sqoop come from? What type of tool is Sqoop, and what is its main use?

Ans. The word Sqoop comes from SQL + HADOOP = SQOOP.
It is a data transfer tool. We use Sqoop to import and export large amounts of data between an RDBMS and HDFS, in both directions.
Follow this link to know more about Sqoop

Que 45. What is Sqoop Validation?

Ans. Sqoop validation means validating the data copied, by either import or export, by comparing the row counts of the source and the target after the copy. We use this option to compare the row counts between the source and the target just after the data has been imported into HDFS. If rows are added or deleted during the import, Sqoop tracks this change and updates the log file.

Learn all insights of Sqoop Validation, follow the link: Sqoop Validation – Interfaces & Limitations of Sqoop Validate 
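As a hedged sketch (connection details are placeholders), validation is enabled by adding --validate to the import or export command:

sqoop import --connect jdbc:mysql://db.example.com/corp --username user -P --table EMPLOYEES --target-dir /data/employees --validate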

Que 46. What is Purpose to Validate in Sqoop?

Ans. The main purpose of validation in Sqoop is to validate the data copied, by either a Sqoop import or export, by comparing the row counts of the source and the target after the copy.

Que 47. What is Sqoop Job?

Ans. A Sqoop job saves the arguments of a Sqoop command so that it can be re-executed later. If a saved job is configured to perform an incremental import, the state of the most recently imported rows is updated in the saved job, which allows the job to continually import only the newest rows.
Learn all insights of Sqoop job, follow the link: Sqoop- Introduction to Sqoop Job Tutorial   

Que 48. What is Sqoop Import Mainframe Tool and its Purpose?

Ans. The import-mainframe tool is used to import all sequential datasets in a partitioned dataset (PDS) on a mainframe into HDFS. A PDS is akin to a directory on open systems. The records in a dataset can contain only character data, and each record is stored as a single text field containing the entire record.
Learn all insights of Sqoop Import Mainframe, follow the link: Learn Sqoop Import Mainframe Tool – Syntax and Examples

Que 49. What is the purpose of Sqoop List Tables?

Ans. The main purpose of sqoop-list-tables is to list the tables present in a database.
Learn all insights of Sqoop List Tables, follow the link: Sqoop List Tables – Arguments and Examples 

Que 50. Difference Between Apache Sqoop vs Flume.

Ans. So, let’s discuss all the differences on the basis of features.

a. Data Flow
Apache Sqoop – Sqoop works with any type of relational database system (RDBMS) that has basic JDBC connectivity. Sqoop can also import data from NoSQL databases like MongoDB and Cassandra, and it allows data transfer to Apache Hive or HDFS.
Apache Flume – Flume works with streaming data sources that are generated continuously in Hadoop environments, such as log files.

b. Type of Loading
Apache Sqoop – Sqoop data loading is not driven by events.
Apache Flume – Here, data loading is completely event-driven.

c. When to use
Apache Sqoop – If the data is available in Teradata, Oracle, MySQL, PostgreSQL or any other JDBC-compatible database, Sqoop is considered an ideal fit.
Apache Flume – When we move bulk streaming data from sources like JMS or spooling directories, Flume is the best choice.

d. Link to HDFS
Apache Sqoop – For importing data with Apache Sqoop, HDFS is the destination.
Apache Flume – In Apache Flume, data generally flows to HDFS through channels.

e. Architecture 
Apache Sqoop – Sqoop has a connector-based architecture, which means the connectors know how to connect to the various data sources and fetch the data accordingly.
Apache Flume – Flume has an agent-based architecture, which means the code written in Flume is called an agent, and the agent is responsible for fetching the data.
To learn the complete comparison, follow the link: Apache Sqoop vs Flume- Comparison

Sqoop Interview Questions for Freshers are Q. 41,42,43,44,45

Sqoop Interview Questions for Experience are Q. 46,47,48,49,50

So, this was all about Apache Sqoop Interview Questions.

Conclusion – Best Sqoop Interview Questions

As a result, we have seen the top 50 Sqoop Interview Questions and Answers. We hope all these questions will help you prepare well for your interviews ahead. Still, if you have any query regarding Sqoop Interview Questions, feel free to ask in the comment section.
