Apache Pig Reading Data and Storing Data Operators
Apache Pig provides two operators for reading and storing data: the LOAD operator and the STORE operator.
In this article, “Apache Pig Reading Data and Storing Data Operators”, we cover how to read and store data in Pig with the LOAD and STORE operators, including the syntax of each operator and examples of its use.
What is Apache Pig Reading Data and Storing Data?
Apache Pig generally runs on top of Hadoop. It is an analytical tool for processing large datasets that are stored in the Hadoop File System.
To analyze data with Apache Pig, we first have to load the data into Pig, which we do with the LOAD operator. Afterwards, we can use the STORE operator to write the data back to the file system.
i. LOAD
The LOAD operator loads data from the file system.
a. Syntax
LOAD 'data' [USING function] [AS schema];
b. Terms
1. ‘data’
It signifies the name of the file or directory, in single quotes.
Once we specify a directory name, all the files in the directory are loaded.
In addition, we can use Hadoop-supported globbing to specify files at the file system or directory level.
2. USING
Keyword.
- The default load function PigStorage is used if the USING clause is omitted.
3. function
The load function.
- We can use a built-in function. PigStorage is the default load function and does not need to be specified (simply omit the USING clause).
- We can also write our own load function if our data is in a format that the built-in functions cannot process.
4. AS
Keyword.
5. schema
A schema, introduced by the AS keyword and enclosed in parentheses.
- The schema specifies the types of the data that the loader produces. If the data does not conform to the schema, then, depending on the loader, either a null value or an error is generated.
Note that, for performance reasons, the loader may not immediately convert the data to the specified format. However, we can still operate on the data assuming the specified type.
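As a small sketch of the schema behavior described above, the script below assumes a hypothetical tab-delimited file, mixed.txt, in which a few values in the first column are not valid integers. With the default PigStorage loader, such values are expected to load as nulls rather than stop the script, so filtering on IS NULL is one way to find them. The file name and the relation names are assumptions made for illustration.

```pig
-- Assumption: 'mixed.txt' is a hypothetical tab-delimited file whose first
-- column contains a few values that cannot be parsed as integers.
A = LOAD 'mixed.txt' AS (f1:int, f2:int, f3:int);

-- Keep only the records where f1 is null. This includes empty fields as
-- well as values that could not be converted to int.
bad_rows = FILTER A BY f1 IS NULL;
DUMP bad_rows;
```

A loader other than PigStorage may behave differently and raise an error instead, so it is worth checking the loader we actually use.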
c. Usage
To load data from the file system, we use the LOAD operator.
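To illustrate the 'data' term, here is a minimal sketch of loading a whole directory and of using Hadoop-style glob patterns. All of the paths and relation names below are hypothetical; each statement uses the default PigStorage load function because the USING clause is omitted.

```pig
-- Assumption: every path below is a hypothetical example.

-- Load every file in a directory.
all_logs = LOAD '/data/logs';

-- Load only the files whose names match a glob pattern.
jan_logs = LOAD '/data/logs/2015-01-*.txt';

-- Braces select among alternatives: here, two subdirectories.
q1_logs = LOAD '/data/logs/{jan,feb}';
```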
d. Examples
Example of LOAD in Pig Reading Data
Let’s assume we have a data file called file1.txt, in which the fields are tab-delimited and the records are newline-separated.
1 2 3
4 2 1
8 3 4
Here, the default load function, PigStorage, loads data from file1.txt to form relation A. The two LOAD statements are equivalent. Note that, because no schema is specified, the fields are not named and all fields default to type bytearray.
A = LOAD 'file1.txt';
A = LOAD 'file1.txt' USING PigStorage('\t');
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
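Since the fields have no names in this case, we can still refer to them by position ($0, $1, and so on) and cast them as needed. The following is a minimal sketch of that idea; the relation name B and the field names f1, f2, and f3 are chosen here only for illustration.

```pig
A = LOAD 'file1.txt';

-- With no schema, fields are addressed by position and cast explicitly.
B = FOREACH A GENERATE (int)$0 AS f1, (int)$1 AS f2, (int)$2 AS f3;

-- Show the names and types that B now carries.
DESCRIBE B;
```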
Here, a schema is specified using the AS keyword. Again, the two LOAD statements are equivalent. To view the schema, we can use the DESCRIBE and ILLUSTRATE operators.
A = LOAD 'file1.txt' AS (f1:int, f2:int, f3:int);
A = LOAD 'file1.txt' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);
DESCRIBE A;
A: {f1: int,f2: int,f3: int}
ILLUSTRATE A;
| A | f1: bytearray | f2: bytearray | f3: bytearray |
|   | 4             | 2             | 1             |
| A | f1: int | f2: int | f3: int |
|   | 4       | 2       | 1       |
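The examples above all rely on PigStorage. As a sketch of how the USING clause changes what LOAD produces, the statements below use PigStorage with a comma delimiter and the built-in TextLoader function, which reads each line as a single chararray field. The file data.csv and the relation names are hypothetical.

```pig
-- Assumption: 'data.csv' is a hypothetical comma-delimited file with three
-- integer columns.
csv = LOAD 'data.csv' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
DUMP csv;

-- TextLoader does not split the line into fields; each record is one
-- chararray holding the whole line.
lines = LOAD 'file1.txt' USING TextLoader() AS (line:chararray);
DUMP lines;
```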
ii. STORE
The STORE operator stores (saves) results to the file system.
a. Syntax
STORE alias INTO 'directory' [USING function];
b. Terms
1. alias
It is the name of a relation.
2. INTO
Required keyword.
3. ‘directory’
The name of the storage directory, in quotes. Note that the STORE operation fails if the directory already exists.
The output data files, named part-nnnnn, are written to this directory.
4. USING
Keyword.
In order to name the store function, we use this clause.
The default store function PigStorage is used if the USING clause is omitted.
5. function
The store function.
- We can use a built-in function. PigStorage is the default store function and does not need to be specified (simply omit the USING clause).
- We can also write our own store function if our data is in a format that the built-in functions cannot process.
c. Usage
We use the STORE operator to run (execute) Pig Latin statements and to save (persist) the results to the file system. STORE is the operator to use for production scripts and batch mode processing.
Note: We can use DUMP to check intermediate results while debugging scripts during development.
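Because STORE fails when the target directory already exists, scripts that are run repeatedly often clear the directory first. The following is a minimal sketch using Pig's rmf file command, which removes a path if it exists and does nothing otherwise. The paths 'data' and 'output1' are placeholders, and rmf deletes the directory permanently, so this only suits output that is safe to overwrite.

```pig
-- Assumption: 'data' and 'output1' are placeholder paths.
A = LOAD 'data' AS (a1:int, a2:int, a3:int);

-- Remove any leftover output directory so the STORE below does not fail.
rmf output1

-- No USING clause, so the default PigStorage store function is used.
STORE A INTO 'output1';
```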
d. Examples
Here, we store the data using PigStorage with the asterisk character (*) as the field delimiter.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
STORE A INTO 'output1' USING PigStorage('*');
CAT output1;
1*2*3
4*2*1
8*3*4
4*3*3
7*2*5
8*4*3
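To read this output back into Pig, the LOAD statement should name the same delimiter that was used in STORE. A minimal sketch, assuming the 'output1' directory written above; the relation name C is chosen here for illustration.

```pig
-- Load the stored files, splitting fields on the asterisk that
-- STORE used as the delimiter.
C = LOAD 'output1' USING PigStorage('*') AS (a1:int, a2:int, a3:int);
DUMP C;
```

If the delimiter is left out, PigStorage splits on tabs, and each line would arrive as a single field that does not fit the three-column schema.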
Here, we use the CONCAT function to format the data before it is stored.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
B = FOREACH A GENERATE CONCAT('a:',(chararray)a1), CONCAT('b:',(chararray)a2), CONCAT('c:',(chararray)a3);
DUMP B;
(a:1,b:2,c:3)
(a:4,b:2,c:1)
(a:8,b:3,c:4)
(a:4,b:3,c:3)
(a:7,b:2,c:5)
(a:8,b:4,c:3)
STORE B INTO 'output1' USING PigStorage(',');
CAT output1;
a:1,b:2,c:3
a:4,b:2,c:1
a:8,b:3,c:4
a:4,b:3,c:3
a:7,b:2,c:5
a:8,b:4,c:3
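Both examples above use PigStorage. As a sketch of naming a different built-in store function in the USING clause, the statements below write relation A as JSON with JsonStorage, which is available in more recent Pig releases. The directory name 'output_json' is hypothetical and, as with any STORE, it must not already exist.

```pig
A = LOAD 'data' AS (a1:int, a2:int, a3:int);

-- JsonStorage writes one JSON object per record, keyed by the field names
-- declared in the schema. 'output_json' is a hypothetical directory.
STORE A INTO 'output_json' USING JsonStorage();
```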
So, this was all about Apache Pig Reading Data and Storing Data Operators. We hope you liked our explanation.
Conclusion: Apache Pig Reading Data and Storing Data
In this article, we have seen both operators for reading and storing data in Apache Pig, LOAD and STORE, along with their syntax and examples. If you still have any doubts regarding Apache Pig Reading Data and Storing Data, feel free to ask in the comment section.