Pig UDF | Apache Pig User Defined Functions and Its Types

Apache Pig provides extensive support for User Defined Functions (UDFs). This article introduces Pig UDFs, describes the types of Pig UDF available in Java, and shows in detail how to write a UDF and how to use it.

What is Apache Pig UDF?

In addition to its built-in functions, Apache Pig lets us define our own functions, called UDFs, and use them in Pig scripts. UDF support is available in six programming languages: Java, Jython, Python, JavaScript, Ruby, and Groovy.

However, complete support is provided only in Java; the remaining languages have limited support. Using Java, we can write UDFs that cover every part of the processing, such as data load/store, column transformation, and aggregation.

UDFs written in Java also tend to run more efficiently than those written in the other languages, since Apache Pig itself is written in Java.

Apache Pig also has a Java repository for UDFs named Piggybank. Using Piggybank, we can access Java UDFs written by other users and contribute our own.

Types of Pig UDF in Java

When writing Pig UDFs in Java, we can create and use the following types of functions:

a. Filter Functions

Filter functions are used as conditions in FILTER statements. A filter function accepts a Pig value as input and returns a Boolean value.
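
As a rough illustration of this contract, the sketch below models a filter function in plain Java so that it compiles without the Pig jar. In an actual UDF the class would extend org.apache.pig.FilterFunc and exec() would receive an org.apache.pig.data.Tuple; here a java.util.List stands in for the tuple. The class name IsAdultSketch, the MIN_AGE threshold, and the choice to reject rows with a missing age are assumptions made only for this example.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java model of a Pig filter function (not a real Pig UDF).
// A List stands in for org.apache.pig.data.Tuple so no Pig jar is needed.
class IsAdultSketch {
    // Hypothetical threshold, chosen only for illustration.
    static final int MIN_AGE = 23;

    // Returns true when the first field is an Integer >= MIN_AGE.
    // Missing input or a null first field is rejected (returns false).
    // Assumes the first field, when present, is an Integer.
    static Boolean exec(List<?> input) {
        if (input == null || input.isEmpty() || input.get(0) == null) {
            return Boolean.FALSE;
        }
        int age = (Integer) input.get(0);
        return age >= MIN_AGE;
    }

    public static void main(String[] args) {
        System.out.println(exec(Arrays.asList(25))); // prints true
        System.out.println(exec(Arrays.asList(22))); // prints false
    }
}
```

The key point is the shape of the function: one input record in, one Boolean out, which is what lets Pig use the result directly as a condition.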

b. Eval Functions

Eval functions are used in FOREACH GENERATE statements. An eval function accepts a Pig value as input and returns a Pig result.

c. Algebraic Functions

Algebraic functions act on inner bags in a FOREACH GENERATE statement. They are used to perform full MapReduce operations on an inner bag.
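
To make the idea more concrete, here is a minimal plain-Java sketch of how an algebraic function such as a count can be split into stages that run at different points of a MapReduce job. In Pig, a UDF opts in by implementing the org.apache.pig.Algebraic interface, whose getInitial(), getIntermed(), and getFinal() methods name the classes that implement each stage. The sketch does not use the Pig API; the class name AlgebraicCountSketch, its method names, and the chunkSize parameter (which only simulates how a bag might be split across map tasks) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the three stages of an algebraic count (not a real Pig UDF).
class AlgebraicCountSketch {
    // Initial stage (map side): every record contributes a count of 1.
    static long initial(Object record) {
        return 1L;
    }

    // Intermediate stage (combiner): merges partial counts into one partial count.
    static long intermediate(List<Long> partials) {
        long sum = 0L;
        for (long p : partials) {
            sum += p;
        }
        return sum;
    }

    // Final stage (reduce side): merges the remaining partial counts into the result.
    static long finalStage(List<Long> partials) {
        return intermediate(partials);
    }

    // Simulates a job: the bag is cut into chunks, each chunk is counted and
    // combined separately, and the final stage adds up the per-chunk results.
    static long count(List<?> bag, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<Long> combined = new ArrayList<>();
        for (int start = 0; start < bag.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, bag.size());
            List<Long> ones = new ArrayList<>();
            for (Object record : bag.subList(start, end)) {
                ones.add(initial(record));
            }
            combined.add(intermediate(ones));
        }
        return finalStage(combined);
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Rohit", "Sohail", "Ankur", "Vishal", "Dheeraj");
        System.out.println(count(names, 2)); // prints 5
    }
}
```

The result does not depend on how the bag is chunked, which is the property that makes a function suitable for this kind of staged evaluation.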

Writing Pig UDF using Java

To write a UDF in Java, we need the jar file pig-0.15.0.jar on the project's classpath. The steps below show how to write a sample UDF using Eclipse, so first make sure that Eclipse and Maven are installed on the system.
To write the UDF, follow these steps:

  • First, open Eclipse and create a new project (say project1).
  • Then convert the newly created project into a Maven project.
  • Next, copy the following content into pom.xml. This file contains the Maven dependencies for the Apache Pig and Hadoop core jar files.
<project xmlns = "http://maven.apache.org/POM/4.0.0"
  xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation = "http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>Pig_Udf</groupId>
  <artifactId>Pig_Udf</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <build>
     <sourceDirectory>src</sourceDirectory>
     <plugins>
        <plugin>
           <artifactId>maven-compiler-plugin</artifactId>
           <version>3.3</version>
           <configuration>
              <source>1.7</source>
              <target>1.7</target>
           </configuration>
        </plugin>
     </plugins>
  </build>
  <dependencies>
     <dependency>
        <groupId>org.apache.pig</groupId>
        <artifactId>pig</artifactId>
        <version>0.15.0</version>
     </dependency>
     <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-core</artifactId>
        <version>0.20.2</version>
     </dependency>
  </dependencies>
</project>
  • Afterward, save the file and refresh the project. The downloaded jar files appear in the Maven Dependencies section.
  • Then create a new class file named Sample_Eval and copy the following content into it.
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Eval function that converts a chararray field to upper case.
public class Sample_Eval extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
     // Return null for a missing or empty tuple.
     if (input == null || input.size() == 0)
        return null;
     String str = (String) input.get(0);
     // A null field stays null instead of causing a NullPointerException.
     return str == null ? null : str.toUpperCase();
  }
}

When writing an eval UDF, the class must extend EvalFunc and implement the exec() function. The logic of the UDF goes inside this function. In the above example, exec() returns the contents of the given column converted to uppercase.

  • After compiling the class without errors, right-click the Sample_Eval.java file. In the menu that appears, select Export.
  • Clicking Export opens a new window. There, click JAR file.
  • Click the Next> button to proceed. In the next window, enter the path in the local file system where the jar file should be stored.
  • Finally, click the Finish button. A jar file named sample_udf.jar is created in the specified folder. This jar file contains the UDF written in Java.

Using Pig UDF

After writing the UDF and generating the jar file, follow these steps:

Step 1: Registering the Jar file
After writing the UDF in Java, we have to register the jar file that contains it using the Register operator. Registering the jar file tells Apache Pig where the UDF is located.

  • Syntax

The syntax of the Register operator is:

REGISTER path;
  • Example

For example, let’s register the sample_udf.jar created above.

Start Apache Pig in local mode and register the jar file sample_udf.jar as shown below.
$ cd $PIG_HOME/bin
$ ./pig -x local
REGISTER '/$PIG_HOME/sample_udf.jar';

This assumes that the jar file is located at the path /$PIG_HOME/sample_udf.jar.

Step 2: Defining Alias
After registering the jar file, we can define an alias for the UDF using the Define operator.

  • Syntax

The syntax of the Define operator is:

DEFINE alias {function | [`command` [input] [output] [ship] [cache] [stderr] ] };
  • Example

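Let’s define the alias sample_eval for the Sample_Eval class. UDF class names are case-sensitive, so the name must match the Java class exactly.

DEFINE sample_eval Sample_Eval();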

Step 3: Using the UDF
After defining the alias, we can use the UDF in the same way as the built-in functions. Assume there is a file named Stu1.txt in the HDFS directory /pig_data/ with the following content:

001,Rohit,22,new york
002,Sohail,23,Kolkata
003,Ankur,23,Tokyo
004,Vishal,25,London
005,Dheeraj,23,Bhubaneswar
006,Mitali,22,Chennai
007,Raj,22,new york
008,Somesh,23,Kolkata
009,Mehul,25,Tokyo
010,Shreyash,25,London
011,Shan,25,Bhubaneswar
012,Kajal,22,Chennai

Also, suppose we have loaded this file into Pig.

grunt> Stu_data = LOAD 'hdfs://localhost:9000/pig_data/Stu1.txt' USING PigStorage(',')
  as (id:int, name:chararray, age:int, city:chararray);

Next, let’s convert the names of the students to upper case using the UDF sample_eval.

grunt> Upper_case = FOREACH Stu_data GENERATE sample_eval(name);

Afterward, verify the contents of the relation Upper_case.

grunt> Dump Upper_case;
(ROHIT)
(SOHAIL)
(ANKUR)
(VISHAL)
(DHEERAJ)
(MITALI)
(RAJ)
(SOMESH)
(MEHUL)
(SHREYASH)
(SHAN)
(KAJAL)

So, this was all about Apache Pig User Defined Functions. Hope you like our explanation.

Conclusion

In this article, we covered Apache Pig User Defined Functions (UDFs): the types of Pig UDF in Java, how to write a Pig UDF in Java, and how to register, alias, and use the UDF in Pig.
