Hive UDF – User Defined Function with Example
In the last Hive tutorial, we studied Hive Views & Indexes. In this blog, we will learn the whole concept of Apache Hive UDFs (User-Defined Functions). We will also walk through Hive UDF examples and test them, to understand Hive user-defined functions well.
What is Hive UDF?
Basically, we can use two different interfaces for writing Apache Hive User Defined Functions.
- Simple API
- Complex API
As long as our function reads and returns primitive types, we can use the simple API (org.apache.hadoop.hive.ql.exec.UDF). In other words, it works with the basic Hadoop and Hive writable types, such as Text, IntWritable, LongWritable, and DoubleWritable.
So, let’s discuss each Hive UDF API in detail:
a. Simple API
Basically, with the simple UDF API, building a Hive user-defined function involves little more than writing a class with one method, evaluate(). Let's see an example to understand it well:
Simple API – Hive UDF Example
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

class SimpleUDFExample extends UDF {

  public Text evaluate(Text input) {
    return new Text("Hello " + input.toString());
  }
}
i. TESTING SIMPLE Hive UDF
Moreover, since a simple Hive UDF is just a single function, we can test it with regular testing tools such as JUnit.
import org.apache.hadoop.io.Text;
import org.junit.Assert;
import org.junit.Test;

public class SimpleUDFExampleTest {

  @Test
  public void testUDF() {
    SimpleUDFExample example = new SimpleUDFExample();
    Assert.assertEquals("Hello world", example.evaluate(new Text("world")).toString());
  }
}
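Since the simple API locates evaluate() by reflection, a UDF class may declare more than one evaluate() overload for different writable types. The sketch below is not part of the original tutorial; the class name SimpleUDFOverloadExample and its behaviour are purely illustrative, showing a Text overload that guards against null input and an IntWritable overload:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Hypothetical variant of the simple UDF above: Hive picks the matching
// evaluate() overload by reflection, so one class can handle several writable types.
class SimpleUDFOverloadExample extends UDF {

  // Text overload; returns null when the input is null so Hive propagates NULL.
  public Text evaluate(Text input) {
    if (input == null) {
      return null;
    }
    return new Text("Hello " + input.toString());
  }

  // IntWritable overload: increments the value by one.
  public IntWritable evaluate(IntWritable input) {
    if (input == null) {
      return null;
    }
    return new IntWritable(input.get() + 1);
  }
}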
b. Complex API
However, the simple API cannot handle objects that are not writable types, such as struct, map, and array types. For those, the org.apache.hadoop.hive.ql.udf.generic.GenericUDF API offers a way.
In addition, it requires us to manually manage the object inspectors for the function arguments and to verify the number and types of the arguments we receive. To be more specific, an object inspector offers a consistent interface over the underlying object types, so that different object implementations can all be accessed in a consistent way from within Hive. For example, we could implement a struct as a Map, so long as we provide a corresponding object inspector.
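To make this concrete, here is a minimal sketch (not from the original tutorial) of reading a value through the built-in string object inspector rather than casting the underlying object directly; the class name ObjectInspectorSketch is just for illustration:

import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

public class ObjectInspectorSketch {
  public static void main(String[] args) {
    // A string object inspector knows how to extract a Java String from whatever
    // representation Hive hands us, so our code never depends on the concrete type.
    StringObjectInspector stringOI = PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    Object raw = "hello";                                // the underlying representation
    String value = stringOI.getPrimitiveJavaObject(raw); // "hello"
    System.out.println(value);
  }
}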
Moreover, with this API we need to implement three methods:
// This is like the evaluate method of the simple API. It takes the actual arguments and returns the result.
abstract Object evaluate(GenericUDF.DeferredObject[] arguments);

// Doesn't really matter; we can return anything, but it should be a string representation of the function.
abstract String getDisplayString(String[] children);

// Called once, before any evaluate() calls. We receive an array of object inspectors that represent
// the arguments of the function. This is where we validate that the function is receiving the correct
// argument types and the correct number of arguments.
abstract ObjectInspector initialize(ObjectInspector[] arguments);
To understand this properly, let's take an example.
Complex API – Apache Hive UDF Example
Basically, here we create a function called containsString. It takes two arguments:
- A list of strings
- A string
Further, it returns true or false depending on whether the list contains the string that we provide, for example:
containsString(List("a", "b", "c"), "b"); // true
containsString(List("a", "b", "c"), "d"); // false
i. GenericUDF API
However, the GenericUDF API needs a little more boilerplate than the simple Hive UDF API:
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

class ComplexUDFExample extends GenericUDF {

  ListObjectInspector listOI;
  StringObjectInspector elementOI;

  @Override
  public String getDisplayString(String[] arg0) {
    return "arrayContainsExample()"; // this should probably be better
  }

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    if (arguments.length != 2) {
      throw new UDFArgumentLengthException("arrayContainsExample only takes 2 arguments: List<T>, T");
    }
    // 1. Check we received the right object types.
    ObjectInspector a = arguments[0];
    ObjectInspector b = arguments[1];
    if (!(a instanceof ListObjectInspector) || !(b instanceof StringObjectInspector)) {
      throw new UDFArgumentException("first argument must be a list / array, second argument must be a string");
    }
    this.listOI = (ListObjectInspector) a;
    this.elementOI = (StringObjectInspector) b;
    // 2. Check that the list contains strings.
    if (!(listOI.getListElementObjectInspector() instanceof StringObjectInspector)) {
      throw new UDFArgumentException("first argument must be a list of strings");
    }
    // The return type of our function is a boolean, so we provide the correct object inspector.
    return PrimitiveObjectInspectorFactory.javaBooleanObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    // Get the list and string from the deferred objects using the object inspectors.
    List<String> list = (List<String>) this.listOI.getList(arguments[0].get());
    String arg = elementOI.getPrimitiveJavaObject(arguments[1].get());
    // Check for nulls.
    if (list == null || arg == null) {
      return null;
    }
    // See if our list contains the value we need.
    for (String s : list) {
      if (arg.equals(s)) return Boolean.TRUE;
    }
    return Boolean.FALSE;
  }
}
ii. TESTING Complex Hive UDF
The complex part of testing this function is in the setup: we have to initialize the UDF with the right object inspectors before calling evaluate().
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredJavaObject;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaBooleanObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.junit.Assert;
import org.junit.Test;

public class ComplexUDFExampleTest {

  @Test
  public void testComplexUDFReturnsCorrectValues() throws HiveException {
    // Set up the models we need.
    ComplexUDFExample example = new ComplexUDFExample();
    ObjectInspector stringOI = PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    ObjectInspector listOI = ObjectInspectorFactory.getStandardListObjectInspector(stringOI);
    JavaBooleanObjectInspector resultInspector =
        (JavaBooleanObjectInspector) example.initialize(new ObjectInspector[]{listOI, stringOI});

    // Create the actual UDF arguments.
    List<String> list = new ArrayList<String>();
    list.add("a");
    list.add("b");
    list.add("c");

    // Test our results.
    // The value exists.
    Object result = example.evaluate(new DeferredObject[]{new DeferredJavaObject(list), new DeferredJavaObject("a")});
    Assert.assertEquals(true, resultInspector.get(result));

    // The value doesn't exist.
    Object result2 = example.evaluate(new DeferredObject[]{new DeferredJavaObject(list), new DeferredJavaObject("d")});
    Assert.assertEquals(false, resultInspector.get(result2));

    // Arguments are null.
    Object result3 = example.evaluate(new DeferredObject[]{new DeferredJavaObject(null), new DeferredJavaObject(null)});
    Assert.assertNull(result3);
  }
}
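As a possible extension, not part of the original tutorial, the argument validation in initialize() can be exercised with the same tools. The sketch below assumes the ComplexUDFExample class from above and checks that a non-list first argument is rejected:

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.junit.Assert;
import org.junit.Test;

public class ComplexUDFExampleArgumentTest {

  @Test
  public void testInitializeRejectsWrongArgumentTypes() {
    ComplexUDFExample example = new ComplexUDFExample();
    // Both arguments are plain strings, but the first must be a list, so initialize() should throw.
    ObjectInspector stringOI = PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    try {
      example.initialize(new ObjectInspector[]{stringOI, stringOI});
      Assert.fail("expected a UDFArgumentException for a non-list first argument");
    } catch (UDFArgumentException expected) {
      // Expected: the first argument must be a list / array.
    }
  }
}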
So, this was all about the Hive User-Defined Function tutorial. We hope you liked our explanation of user-defined functions in Hive.
Conclusion
Hence, we have seen the whole concept of Apache Hive UDFs and the two interfaces for writing UDFs in Apache Hive, the Simple API and the Complex API, with examples. Still, if you have any doubts, feel free to ask in the comment section.
Hi,
How can we configure the required jars when extending GenericUDF?
When I was extending the UDF class, I was able to override getRequiredJars() to configure the jars.
Any help would be much appreciated.
Regards
Sreeja
Great description. Do you know of a way to make a UDF non-deterministic, e.g. for generating a UUID? Not having repeated values could be crucial.
What are @Test and @Override in the code?
Thanks for the in-depth explanation of Hive UDFs. I have one doubt: my input data type is string, but my output data type is Map(key, value). Which object inspector should I use for returning a Map type?