Hive UDF – User Defined Function with Example

1. Apache Hive UDF – Objective

In the last Hive tutorial, we studied Hive views and indexes. In this blog, we will cover the whole concept of the Apache Hive UDF (User-Defined Function). We will also work through a Hive UDF example for each API and test it, to understand Hive user-defined functions well.

2. What is Hive UDF?

Basically, we can use two different interfaces for writing Apache Hive User Defined Functions.

  1. Simple API
  2. Complex API

As long as our function reads and returns primitive types, we can use the simple API (org.apache.hadoop.hive.ql.exec.UDF). Here, primitive types means the basic Hadoop and Hive writable types, such as Text, IntWritable, LongWritable, and DoubleWritable.
So, let’s discuss each Hive UDF API in detail:

a. Simple API

With the simple UDF API, building a Hive User Defined Function involves little more than writing a class with one method (evaluate). Let’s see an example to understand it well:
Simple API – Hive UDF Example

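import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class SimpleUDFExample extends UDF {
    // Hive looks up and calls the evaluate method for each input value
    public Text evaluate(Text input) {
        return new Text("Hello " + input.toString());
    }
}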


Moreover, since this Hive UDF is just a single method, we can test it with regular testing tools such as JUnit.

import org.apache.hadoop.io.Text;
import org.junit.Assert;
import org.junit.Test;

public class SimpleUDFExampleTest {
    @Test
    public void testUDF() {
        SimpleUDFExample example = new SimpleUDFExample();
        Assert.assertEquals("Hello world", example.evaluate(new Text("world")).toString());
    }
}

b. Complex API

To write functions that work with objects that are not writable types, such as struct, map, and array types, we use the org.apache.hadoop.hive.ql.udf.generic.GenericUDF API.
With this API, we must manage the object inspectors for the function arguments manually, and verify the number and types of the arguments we receive. An object inspector offers a consistent interface to an underlying object type, so that different object implementations can all be accessed in the same way from within Hive. For example, we could implement a struct as a Map, as long as we provide a corresponding object inspector.
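The object inspector idea can be pictured in plain Java, without any Hive dependency. The sketch below is a simplified illustration and not Hive's actual API: the StructInspector interface and both of its implementations are hypothetical names invented for this example. It stores the same struct once as a Map and once as an Object array, and the caller reads the fields through the same interface in both cases.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the object inspector concept.
// This is NOT Hive's real interface; it only illustrates the idea.
interface StructInspector {
    Object getField(Object struct, String fieldName);
}

// Reads fields from a struct that is stored as a Map.
final class MapStructInspector implements StructInspector {
    @Override
    public Object getField(Object struct, String fieldName) {
        return ((Map<?, ?>) struct).get(fieldName);
    }
}

// Reads fields from a struct that is stored as an Object[] with a fixed field order.
final class ArrayStructInspector implements StructInspector {
    private final String[] fieldNames;

    ArrayStructInspector(String... fieldNames) {
        this.fieldNames = fieldNames;
    }

    @Override
    public Object getField(Object struct, String fieldName) {
        Object[] values = (Object[]) struct;
        for (int i = 0; i < fieldNames.length; i++) {
            if (fieldNames[i].equals(fieldName)) {
                return values[i];
            }
        }
        return null; // unknown field
    }
}

final class InspectorSketch {
    // The caller only talks to the inspector, never to the concrete representation.
    static String describe(Object struct, StructInspector inspector) {
        return inspector.getField(struct, "name") + " is " + inspector.getField(struct, "age");
    }

    public static void main(String[] args) {
        Map<String, Object> asMap = new HashMap<>();
        asMap.put("name", "Ann");
        asMap.put("age", 30);
        Object[] asArray = {"Ann", 30};

        System.out.println(describe(asMap, new MapStructInspector()));
        System.out.println(describe(asArray, new ArrayStructInspector("name", "age")));
    }
}
```

Both calls to describe produce the same text, even though the two structs are stored differently. That is the property a real Hive object inspector provides for the types Hive supports.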
Moreover, with this API we need to implement three methods:

// This is like the evaluate method of the simple API. It takes the actual arguments and returns the result.
abstract Object evaluate(GenericUDF.DeferredObject[] arguments) throws HiveException;
// The exact value doesn't really matter, but it should be a string representation of the function.
abstract String getDisplayString(String[] children);
// Called once, before any evaluate() calls. It receives an array of object inspectors that represent the arguments of the function.
// This is where we validate that the function receives the correct number and types of arguments.
abstract ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException;
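The call order described in these comments (initialize once, then evaluate for every row) can be pictured with a small stand-alone sketch. Everything in it is hypothetical and exists only for illustration: MiniGenericUdf is a miniature of the contract that uses plain Class objects in place of Hive object inspectors, and LifecycleSketch.run plays the role of the query engine.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical miniature of the GenericUDF contract, for illustration only.
// Argument types are plain Class objects here instead of Hive ObjectInspectors.
interface MiniGenericUdf {
    Class<?> initialize(Class<?>[] argumentTypes); // validate once, declare the return type
    Object evaluate(Object[] arguments);           // called for every row
}

final class UpperCaseUdf implements MiniGenericUdf {
    int initializeCalls = 0;

    @Override
    public Class<?> initialize(Class<?>[] argumentTypes) {
        initializeCalls++;
        if (argumentTypes.length != 1 || argumentTypes[0] != String.class) {
            throw new IllegalArgumentException("expected exactly one String argument");
        }
        return String.class;
    }

    @Override
    public Object evaluate(Object[] arguments) {
        String value = (String) arguments[0];
        return value == null ? null : value.toUpperCase();
    }
}

final class LifecycleSketch {
    // Plays the role of the query engine: initialize once, then evaluate per row.
    static Object[] run(MiniGenericUdf udf, List<String> rows) {
        udf.initialize(new Class<?>[] {String.class});
        Object[] results = new Object[rows.size()];
        for (int i = 0; i < rows.size(); i++) {
            results[i] = udf.evaluate(new Object[] {rows.get(i)});
        }
        return results;
    }

    public static void main(String[] args) {
        UpperCaseUdf udf = new UpperCaseUdf();
        System.out.println(Arrays.toString(run(udf, Arrays.asList("a", "b"))));
        System.out.println("initialize calls: " + udf.initializeCalls);
    }
}
```

Validation happens once in initialize, so evaluate can stay cheap and assume its arguments already have the right shape.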

To understand this properly, let’s take an example.
Complex API – Apache Hive UDF Example
Here we create a function called containsString. It takes two arguments:

  1. A list of Strings
  2. A String

It returns true or false depending on whether the list contains the string that we pass in, for example:
containsString(List("a", "b", "c"), "b"); // true
containsString(List("a", "b", "c"), "d"); // false
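Stripped of all Hive types, the intended behaviour is a simple membership check. The plain Java sketch below only illustrates these semantics. The class name ContainsStringSketch is hypothetical, and returning null when either argument is null is an assumption that mirrors the null check in the GenericUDF implementation later in this section.

```java
import java.util.Arrays;
import java.util.List;

final class ContainsStringSketch {
    // Returns TRUE/FALSE for membership, or null when either argument is null.
    static Boolean containsString(List<String> list, String value) {
        if (list == null || value == null) {
            return null;
        }
        return list.contains(value);
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("a", "b", "c");
        System.out.println(containsString(list, "b")); // true
        System.out.println(containsString(list, "d")); // false
        System.out.println(containsString(null, "b")); // null
    }
}
```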

i. GenericUDF API

Unlike the simple UDF API, the GenericUDF API needs a little more boilerplate:

import java.util.List;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

public class ComplexUDFExample extends GenericUDF {
    ListObjectInspector listOI;
    StringObjectInspector elementOI;

    @Override
    public String getDisplayString(String[] children) {
        return "arrayContainsExample()"; // this should probably be better
    }

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length != 2) {
            throw new UDFArgumentLengthException("arrayContainsExample only takes 2 arguments: List<T>, T");
        }
        // 1. Check we received the right object types.
        ObjectInspector a = arguments[0];
        ObjectInspector b = arguments[1];
        if (!(a instanceof ListObjectInspector) || !(b instanceof StringObjectInspector)) {
            throw new UDFArgumentException("first argument must be a list / array, second argument must be a string");
        }
        this.listOI = (ListObjectInspector) a;
        this.elementOI = (StringObjectInspector) b;
        // 2. Check that the list contains strings
        if (!(listOI.getListElementObjectInspector() instanceof StringObjectInspector)) {
            throw new UDFArgumentException("first argument must be a list of strings");
        }
        // the return type of our function is a boolean, so we provide the correct object inspector
        return PrimitiveObjectInspectorFactory.javaBooleanObjectInspector;
    }

    @Override
    @SuppressWarnings("unchecked")
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        // get the list and string from the deferred objects using the object inspectors
        // (the unchecked cast assumes the list elements are Java Strings)
        List<String> list = (List<String>) this.listOI.getList(arguments[0].get());
        String arg = elementOI.getPrimitiveJavaObject(arguments[1].get());
        // check for nulls
        if (list == null || arg == null) {
            return null;
        }
        // see if our list contains the value we need
        for (String s : list) {
            if (arg.equals(s)) return Boolean.TRUE;
        }
        return Boolean.FALSE;
    }
}

ii. Testing the Complex Hive UDF

The complex part of testing this function is the setup: we must create the object inspectors and wrap each argument in a DeferredObject before calling evaluate().

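import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredJavaObject;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaBooleanObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.junit.Assert;
import org.junit.Test;

public class ComplexUDFExampleTest {
    @Test
    public void testComplexUDFReturnsCorrectValues() throws HiveException {
        // set up the models we need
        ComplexUDFExample example = new ComplexUDFExample();
        ObjectInspector stringOI = PrimitiveObjectInspectorFactory.javaStringObjectInspector;
        ObjectInspector listOI = ObjectInspectorFactory.getStandardListObjectInspector(stringOI);
        JavaBooleanObjectInspector resultInspector = (JavaBooleanObjectInspector) example.initialize(new ObjectInspector[]{listOI, stringOI});
        // create the actual UDF arguments (the list must contain "a" for the first check to pass)
        List<String> list = Arrays.asList("a", "b", "c");
        // test our results
        // the value exists
        Object result = example.evaluate(new DeferredObject[]{new DeferredJavaObject(list), new DeferredJavaObject("a")});
        Assert.assertEquals(true, resultInspector.get(result));
        // the value doesn't exist
        Object result2 = example.evaluate(new DeferredObject[]{new DeferredJavaObject(list), new DeferredJavaObject("d")});
        Assert.assertEquals(false, resultInspector.get(result2));
        // arguments are null
        Object result3 = example.evaluate(new DeferredObject[]{new DeferredJavaObject(null), new DeferredJavaObject(null)});
        Assert.assertNull(result3);
    }
}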

So, this was all about the Hive User Defined Function tutorial. We hope you liked our explanation of user-defined functions in Hive.

3. Conclusion

Hence, we have seen the whole concept of the Apache Hive UDF and the two interfaces for writing UDFs in Apache Hive, the Simple API and the Complex API, with an example and a test for each. Still, if you have any doubts, feel free to ask in the comment section.

3 Responses

  1. Sreeja says:

    How can we configure required jars in case of extending Generic UDF.
    When I was extending UDF class I was able to override getRequiredJars() for configuring the jars.
    Any help would be much appreciated.

  2. Boris says:

    Great description. Do you know the way to make a UDF to be non deterministic, e.g. generating a UUID, not having a repeated values could be crucial

  3. Rishika says:

    what are @Test, @override in the code?
