Pig UDF | Apache Pig User Defined Functions and Its Types
Apache Pig offers extensive support for User Defined Functions (UDFs). In this article, “Apache Pig UDF”, we will cover the concept of Apache Pig UDFs.
We start with an introduction, then discuss the types of Pig UDF and how to write and use these UDFs in detail.
What is Apache Pig UDF?
In addition to the built-in functions, Apache Pig offers extensive support for UDFs, which let us define our own functions and use them. UDF support is available in six programming languages: Java, Jython, Python, JavaScript, Ruby, and Groovy.
However, complete support is provided only in Java; the remaining languages have limited support. Using Java, we can write UDFs involving all parts of the processing, such as data load/store, column transformation, and aggregation.
Also, since Apache Pig itself is written in Java, UDFs written in Java work more efficiently than those written in other languages.
Apache Pig also has a Java repository for UDFs named Piggybank. Using Piggybank, we can access Java UDFs written by other users and contribute our own.
Types of Pig UDF in Java
When writing Pig UDFs in Java, we can create and use three types of functions:
a. Filter Functions
We use filter functions as conditions in FILTER statements. A filter function accepts a Pig value as input and returns a Boolean value.
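The contract of a filter function can be sketched in plain Java. This is only an illustration under stated assumptions: a real Pig filter UDF would extend org.apache.pig.FilterFunc and receive a Tuple, whereas here a List<Object> stands in for the tuple so the sketch runs without Pig on the classpath. The class name MinAgeFilterSketch and the threshold of 23 are hypothetical, and returning null for missing input is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for a Pig filter function (hypothetical name).
// In a real UDF this class would extend org.apache.pig.FilterFunc and
// exec would take an org.apache.pig.data.Tuple instead of a List<Object>.
class MinAgeFilterSketch {
    // Hypothetical threshold, chosen only for illustration.
    static final int MIN_AGE = 23;

    // Takes one "tuple" and returns a Boolean, like a filter function.
    // Returns null when there is no usable first field.
    static Boolean exec(List<Object> input) {
        if (input == null || input.isEmpty() || input.get(0) == null) {
            return null;
        }
        int age = (Integer) input.get(0);
        return age >= MIN_AGE;
    }

    public static void main(String[] args) {
        List<Object> rohit = new ArrayList<>();
        rohit.add(22);
        List<Object> sohail = new ArrayList<>();
        sohail.add(23);
        System.out.println(exec(rohit));  // prints false
        System.out.println(exec(sohail)); // prints true
    }
}
```

In a Pig script, a function of this shape would appear in the condition of a FILTER statement, keeping only the records for which it returns true.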
b. Eval Functions
We use Eval functions in FOREACH GENERATE statements. An Eval function accepts a Pig value as input and returns a Pig result.
c. Algebraic Functions
We use Algebraic functions in FOREACH GENERATE statements to act on inner bags. These functions let us perform full MapReduce operations on an inner bag.
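To see why an algebraic function fits the MapReduce model, consider counting the records in a bag. The following plain-Java sketch splits the count into an initial, an intermediate, and a final step. It does not use the Pig API: in a real UDF the class would implement org.apache.pig.Algebraic and name three helper classes through getInitial(), getIntermed(), and getFinal(). The class and method names below are hypothetical stand-ins for those stages.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of the three stages of an algebraic COUNT
// (hypothetical names; not the Pig API).
class AlgebraicCountSketch {
    // Initial stage: applied to each record on the map side.
    // Every non-null record contributes a count of 1.
    static long initial(Object record) {
        return record == null ? 0L : 1L;
    }

    // Intermediate stage: combines partial counts; it can be applied
    // repeatedly because adding partial sums is associative.
    static long intermediate(List<Long> partialCounts) {
        long total = 0L;
        for (long count : partialCounts) {
            total += count;
        }
        return total;
    }

    // Final stage: produces the overall result on the reduce side.
    static long finalStep(List<Long> partialCounts) {
        return intermediate(partialCounts);
    }

    public static void main(String[] args) {
        // Two simulated map tasks, holding three and two records.
        long mapTaskA = intermediate(Arrays.asList(
                initial("Rohit"), initial("Sohail"), initial("Ankur")));
        long mapTaskB = intermediate(Arrays.asList(
                initial("Vishal"), initial("Dheeraj")));
        System.out.println(finalStep(Arrays.asList(mapTaskA, mapTaskB))); // prints 5
    }
}
```

Because the partial results can be merged in any grouping, the work can be spread across map, combine, and reduce phases instead of being done on one complete bag.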
Writing Pig UDF using Java
To write a UDF using Java, we have to integrate the jar file Pig-0.15.0.jar. First, make sure Eclipse and Maven are installed on the system, because here we discuss how to write a sample UDF using Eclipse.
To write a UDF, follow these steps:
- First, open Eclipse and create a new project (say project1).
- Then convert the newly created project into a Maven project.
- Next, copy the following content into pom.xml. This file contains the Maven dependencies for the Apache Pig and Hadoop-core jar files.
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>Pig_Udf</groupId>
  <artifactId>Pig_Udf</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <build>
    <sourceDirectory>src</sourceDirectory>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.3</version>
        <configuration>
          <source>1.7</source>
          <target>1.7</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <dependencies>
    <dependency>
      <groupId>org.apache.pig</groupId>
      <artifactId>pig</artifactId>
      <version>0.15.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>0.20.2</version>
    </dependency>
  </dependencies>
</project>
- Afterward, save the file and refresh the project. The downloaded jar files appear in the Maven dependencies section.
- Then create a new class file named Sample_Eval and copy the following content into it.
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Eval UDF that converts the first field of the input tuple to upper case.
public class Sample_Eval extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Return null when there is nothing to process.
        if (input == null || input.size() == 0)
            return null;
        String str = (String) input.get(0);
        // A null field stays null instead of raising a NullPointerException.
        if (str == null)
            return null;
        return str.toUpperCase();
    }
}
When writing a UDF, the class must extend EvalFunc and implement the exec() function. The logic of the UDF goes inside this function. In the example above, exec() converts the contents of the given column to uppercase and returns the result.
- After the class compiles without errors, right-click the Sample_Eval.java file. A menu appears; select Export.
- Clicking Export opens a window. In it, click JAR file.
- Click the Next> button to proceed. Another window opens, where we enter the path in the local file system at which to store the jar file.
- Finally, click the Finish button. A Jar file sample_udf.jar is created in the specified folder. This jar file contains the UDF written in Java.
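Before registering the jar with Pig, it can help to check the upper-casing logic of exec() on its own. The sketch below is a plain-Java stand-in that mirrors the logic of Sample_Eval with a List<Object> in place of Pig's Tuple, so it runs without the Pig jar. The class name UpperCaseSketch is hypothetical. Two details are additions for illustration and not part of the original class: the explicit check for a null first field, and the use of Locale.ROOT, which keeps the result independent of the machine's default locale.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java stand-in for the exec() logic of Sample_Eval (hypothetical name).
// A List<Object> plays the role of Pig's Tuple.
class UpperCaseSketch {
    static String exec(List<Object> input) {
        // Nothing to process: return null, as Sample_Eval does.
        if (input == null || input.isEmpty()) {
            return null;
        }
        Object first = input.get(0);
        // Added for illustration: a null field yields null rather than an exception.
        if (first == null) {
            return null;
        }
        // Locale.ROOT avoids locale-specific case mappings.
        return first.toString().toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        List<Object> row = new ArrayList<>();
        row.add("Rohit");
        System.out.println(exec(row)); // prints ROHIT
    }
}
```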
Using Pig UDF
After writing the UDF and generating the Jar file, follow these steps:
Step 1: Registering the Jar file
After writing the UDF (in Java), we have to register the Jar file that contains it using the Register operator. Registering the Jar file tells Apache Pig the location of the UDF.
- Syntax
The syntax of the Register operator is:
REGISTER path;
- Example
For example, let’s register the sample_udf.jar created above.
Start Apache Pig in local mode and register the jar file sample_udf.jar as shown below.
$ cd $PIG_HOME/bin
$ ./pig -x local
grunt> REGISTER '/$PIG_HOME/sample_udf.jar';
This assumes that the Jar file is located at the path /$PIG_HOME/sample_udf.jar
Step 2: Defining Alias
After registering the UDF, we can define an alias for it using the Define operator.
- Syntax
The syntax of the Define operator is:
DEFINE alias {function | [`command` [input] [output] [ship] [cache] [stderr] ] };
- Example
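Let’s define the alias sample_eval for the Sample_Eval class. The function name must match the Java class name exactly, because it is case sensitive.
DEFINE sample_eval Sample_Eval();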
Step 3: Using the UDF
After defining the alias, we can use the UDF just like the built-in functions. Assume there is a file named Stu1.txt in the HDFS directory /pig_data/ with the following content:
001,Rohit,22,new york
002,Sohail,23,Kolkata
003,Ankur,23,Tokyo
004,Vishal,25,London
005,Dheeraj,23,Bhubaneswar
006,Mitali,22,Chennai
007,Raj,22,new york
008,Somesh,23,Kolkata
009,Mehul,25,Tokyo
010,Shreyash,25,London
011,Shan,25,Bhubaneswar
012,Kajal,22,Chennai
Also, suppose we have loaded this file into Pig.
grunt> Stu_data = LOAD 'hdfs://localhost:9000/pig_data/Stu1.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);
Next, let’s convert the names of the students to upper case using the UDF sample_eval.
grunt> Upper_case = FOREACH Stu_data GENERATE sample_eval(name);
Afterward, verify the contents of the relation Upper_case.
grunt> Dump Upper_case;
(ROHIT)
(SOHAIL)
(ANKUR)
(VISHAL)
(DHEERAJ)
(MITALI)
(RAJ)
(SOMESH)
(MEHUL)
(SHREYASH)
(SHAN)
(KAJAL)
So, this was all about Apache Pig User Defined Functions. Hope you like our explanation.
Conclusion
In this article, we have covered the concept of Apache Pig User Defined Functions (UDFs). We discussed the types of Pig UDF and how to write a Pig UDF in Java. Finally, we learned how to use the Pig UDF. If you have any questions, please ask through the comment section.