Apache Pig Architecture – Learn Pig Hadoop Working


1. Apache Pig Architecture

In order to write a Pig script, we do require a Pig Latin language. Moreover, we need an execution environment to execute them. So, in this article “Introduction to Apache Pig Architecture”, we will study the complete architecture of Apache Pig. It includes its components, Pig Latin Data Model and Pig Job Execution Flow in depth.

Pig Architecture

Apache Pig Architecture

2. What is Apache Pig Architecture?

In Pig, there is a language we use to analyze data in Hadoop. That is what we call Pig Latin. Also, it is a high-level data processing language that offers a rich set of data types and operators to perform several operations on the data.

Moreover, in order to perform a particular task, programmers need to write a Pig script using the Pig Latin language and execute them using any of the execution mechanisms (Grunt Shell, UDFs, Embedded) using Pig. To produce the desired output, these scripts will go through a series of transformations applied by the Pig Framework, after execution.

Further, Pig converts these scripts into a series of MapReduce jobs internally. Therefore it makes the programmer’s job easy. Here, is the architecture of Apache Pig.

Pig Architecture

Architecture of Apache Pig

3. Apache Pig Components

There are several components in the Apache Pig framework. Let’s study these major components in detail:

i. Parser

At first, all the Pig Scripts are handled by the Parser. Parser basically checks the syntax of the script, does type checking, and other miscellaneous checks. Afterwards, Parser’s output will be a DAG (directed acyclic graph) that represents the Pig Latin statements as well as logical operators.

The logical operators of the script are represented as the nodes and the data flows are represented as edges in DAG (the logical plan)

ii. Optimizer

Afterwards, the logical plan (DAG) is passed to the logical optimizer. It carries out the logical optimizations further such as projection and push down.

iii. Compiler

Then compiler compiles the optimized logical plan into a series of MapReduce jobs.

iv. Execution engine

Eventually, all the MapReduce jobs are submitted to Hadoop in a sorted order. Ultimately, it produces the desired results while these MapReduce jobs are executed on Hadoop.

4. Pig Latin Data Model

Pig architecture

Apache Pig architecture – Pig Latin Data Model

Pig Latin data model is fully nested. Also, it allows complex non-atomic data types like map and tuple. Let’s discuss this data model in detail:

i. Atom

Atom is defined as any single value in Pig Latin, irrespective of their data. Basically, we can use it as string and number and store it as the string. Atomic values of Pig are int, long, float, double, char array, and byte array. Moreover, a field is a piece of data or a simple atomic value in Pig.

For Example − ‘Shubham’ or ‘25’

ii. Tuple

Tuple is a record that is formed by an ordered set of fields. However, the fields can be of any type. In addition, a tuple is similar to a row in a table of RDBMS.

For Example − (Shubham, 25)

iii. Bag

An unordered set of tuples is what we call Bag. To be more specific, a Bag is a collection of tuples (non-unique). Moreover, each tuple can have any number of fields (flexible schema). Generally, we represent a bag by ‘{}’.

For Example − {(Shubham, 25), (Pulkit, 35)}

In addition, when a bag is a field in a relation, in that way it is known as the inner bag.

Example − {Shubham, 25, {9826022258, Shubham@gmail.com,}}

iv. Map

A set of key-value pairs is what we call a map (or data map). Basically, the key needs to be of type char array and should be unique. Also, the value might be of any type. And, we represent it  by ‘[]’

For Example − [name#Shubham, age#25]

v. Relation

A bag of tuples is what we call Relation. In Pig Latin, the relations are unordered. Also, there is no guarantee that tuples are processed in any particular order.

5. Conclusion

As a result, we have seen the whole Apache Pig Architecture in detail. Still, if you want to ask any query about Apache Pig Architecture, feel free to ask in the comment section.

For reference