Impala UDF (User-Defined Functions) – How to Write UDFs

1. Objective

In this Impala tutorial, we will cover the whole concept of Apache Impala UDFs (User-Defined Functions). After an introduction to Impala UDFs, we will learn how to write them in detail. We will also look at the limitations of Impala UDFs, to help you avoid errors while using them. Finally, we will walk through some Impala UDF samples.
So, let's start with Impala UDFs (User-Defined Functions).

2. Impala User-Defined Functions (UDFs)

To code our own application logic for processing column values during an Impala query, we use User-Defined Functions, frequently abbreviated as UDFs. For example, a UDF could perform calculations using an external math library, combine several column values into one, do geospatial calculations, or apply other kinds of tests and transformations that are outside the scope of the built-in SQL operators and functions.
In other words, we can use UDFs to simplify query logic when producing reports, or to transform data in flexible ways when using INSERT … SELECT syntax to copy from one table to another.
In other database products, this feature is available under names such as stored functions or stored routines.
Impala support for UDFs is available in Impala 1.2 and higher:

  • In Impala 1.1, using UDFs in a query required using the Hive shell.
  • Starting in Impala 1.2, Impala can run both high-performance native-code UDFs written in C++ and Java-based Hive UDFs that you might already have written.
  • Impala can run scalar UDFs, which return a single value for each row of the result set, and user-defined aggregate functions (UDAFs), which return a value based on a set of rows.

Note: Currently, there is no support for user-defined table functions (UDTFs) or window functions in Impala.


3. Impala UDF Concepts

Basically, we can write all-new functions based on our use case. It is also possible to reuse Java UDFs that we have already written for Hive. We can code either scalar functions, which produce results one row at a time, or more complex aggregate functions.


a. Impala UDFs and UDAFs

The UDFs we write might accept or produce different numbers of input and output values, depending on the use case:
The most common form of UDF takes a single input value and returns a single output value. When used in a query, it is called once for each row in the result set.
For example:

select Employee_name, is_frequent_Employee(Employee_id) from Employees;
select obfuscate(sensitive_column) from sensitive_data;
  • A user-defined aggregate function (UDAF), on the other hand, accepts a group of values and returns a single value.

For example:

-- Evaluates multiple rows but returns a single value.
select closest_Hotel(latitude, longitude) from places;

-- Evaluates batches of rows and returns a separate value for each batch.
select most_profitable_location(store_id, sales, expenses, tax_rate, depreciation) from franchise_data group by year;
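A UDAF is implemented as a set of cooperating C++ functions rather than a single one: an init function that sets up the intermediate value, an update function called once per input row, a merge function that combines partial results, and a finalize function that produces the return value. Below is a minimal sketch of a SUM-like aggregate. The `FunctionContext` and `BigIntVal` types here are simplified stand-ins for illustration only; the real definitions live in `<impala_udf/udf.h>`.

```cpp
#include <cassert>

// Stand-in versions of the impala_udf types, just for illustration;
// the real ones live in <impala_udf/udf.h>.
struct FunctionContext {};
struct BigIntVal {
  bool is_null;
  long long val;
  BigIntVal() : is_null(true), val(0) {}
  explicit BigIntVal(long long v) : is_null(false), val(v) {}
};

// A SUM-like aggregate: Impala calls the init function once per group,
// the update function once per input row, the merge function to combine
// partial results from different nodes, and the finalize function at the end.
void SumInit(FunctionContext* ctx, BigIntVal* dst) { *dst = BigIntVal(0); }

void SumUpdate(FunctionContext* ctx, const BigIntVal& src, BigIntVal* dst) {
  if (src.is_null) return;  // NULL inputs are ignored by this aggregate
  dst->val += src.val;
}

void SumMerge(FunctionContext* ctx, const BigIntVal& src, BigIntVal* dst) {
  dst->val += src.val;
}

BigIntVal SumFinalize(FunctionContext* ctx, const BigIntVal& val) {
  return val;
}
```

Aggregating the values 2, 3, and a NULL with this sketch yields 5, mirroring how a built-in SUM treats NULLs.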

b. Native Impala UDF

In addition to supporting existing Hive UDFs written in Java, Impala supports UDFs written in C++ as well. Use C++ UDFs when practical, because compiled native code can yield higher performance: UDF execution is often 10x faster for a C++ UDF compared to the equivalent Java UDF.

c. Using Hive UDF with Impala

There is flexibility here: Impala can run User-Defined Functions (UDFs) that were originally written for Hive, even with no changes, subject to several conditions:

  • The parameters and return value must all use scalar data types supported by Impala. Complex or nested types, for example, are not supported.
  • Impala does not support Hive UDFs that accept or return the TIMESTAMP type.
  • Hive UDAFs and UDTFs are not supported.
  • As noted above, a C++ UDF often executes about 10x faster than the equivalent Java UDF, so prefer C++ when practical.


4. Install – Impala UDF & Development Package

To develop Impala UDFs, first download and install the impala-udf-devel package (or impala-udf-dev, depending on your distribution). This package contains header files, sample source, and build configuration files.

  • Locate the appropriate .repo or list file for your operating system version.
  • Specify impala-udf-devel or impala-udf-dev as the package name.

One advantage is that the UDF development code does not rely on Impala being installed on the same machine. It is possible to write and compile UDFs on a minimal development system, then deploy them on a different one for use with Impala.

5. How to Write Impala UDF?

Follow these guidelines while writing Impala UDFs:

  • Keep the data type differences in mind as you transfer values from the high-level SQL to your lower-level UDF code.
  • Use best practices for function-oriented programming, such as:
  1. Choose arguments carefully.
  2. Avoid side effects.
  3. Make each function do a single thing.

a. Getting Started with UDF Coding

First, examine the header file /usr/include/impala_udf/udf.h to understand the layout, member variables, and functions of the predefined UDF data types:

// This is the only Impala header required to develop UDFs and UDAs. This header
// contains the types that need to be used and the FunctionContext object. The context
// object serves as the interface object between the UDF/UDA and the impala process

Then, for the basic declarations needed to write a scalar UDF, see the header file udf-sample.h within the sample build environment, which describes a simple function named AddUdf():

#include <impala_udf/udf.h>
using namespace impala_udf;
IntVal AddUdf(FunctionContext* context, const IntVal& arg1, const IntVal& arg2);

Also, see the source file within the sample build environment, for sample C++ code for a simple function named AddUdf():

#include "udf-sample.h"

// In this sample, we are declaring a UDF that adds two ints and returns an int.
IntVal AddUdf(FunctionContext* context, const IntVal& arg1, const IntVal& arg2) {
  if (arg1.is_null || arg2.is_null) return IntVal::null();
  return IntVal(arg1.val + arg2.val);
}

// Multiple UDFs can be defined in the same file.


b. Variable-Length Argument Lists

An Impala UDF has a fixed number of arguments in the signature of its C++ function, with each one named explicitly. However, the function can also accept additional optional arguments, all of the same type. For example, we might concatenate two strings, three strings, four strings, and so on; or compare two numbers, three numbers, four numbers, and so on.
To accept a variable-length argument list, code the signature of your function like this:

StringVal Concat(FunctionContext* context, const StringVal& separator,
 int num_var_args, const StringVal* args);

In the CREATE FUNCTION statement, include ... after the type of the first optional argument to indicate that it can be followed by more arguments of the same type.
For example, the following function accepts a STRING argument, followed by one or more additional STRING arguments:

[localhost:21000] > create function my_concat(string, string ...) returns string location '/user/test_user/udfs/' symbol='Concat';

The call from the SQL query must pass at least one argument to the variable-length portion of the argument list.
When Impala calls the function, it fills in the initial set of required arguments, then passes the number of extra arguments and a pointer to the first of those optional arguments.
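A body for the Concat signature shown above could be sketched as follows. The `FunctionContext` and `StringVal` types here are simplified stand-ins for illustration (the real `StringVal` in `<impala_udf/udf.h>` holds a pointer/length pair rather than a `std::string`), but the argument-passing convention matches the signature from the text: fixed arguments first, then the count of optional arguments and a pointer to the first of them.

```cpp
#include <cassert>
#include <string>

// Stand-in StringVal and FunctionContext for illustration only; the real
// definitions come from <impala_udf/udf.h>.
struct FunctionContext {};
struct StringVal {
  bool is_null;
  std::string s;  // the real StringVal holds a ptr/len pair, not std::string
  StringVal() : is_null(true) {}
  explicit StringVal(const std::string& v) : is_null(false), s(v) {}
};

// Joins the variable-length arguments with the separator. Impala passes the
// fixed arguments first, then the number of optional arguments and a pointer
// to the first of them.
StringVal Concat(FunctionContext* context, const StringVal& separator,
                 int num_var_args, const StringVal* args) {
  std::string result;
  for (int i = 0; i < num_var_args; ++i) {
    if (args[i].is_null) return StringVal();  // NULL in, NULL out
    if (i > 0) result += separator.s;
    result += args[i].s;
  }
  return StringVal(result);
}
```

With separator "-" and arguments "a", "b", "c", this sketch produces "a-b-c"; a NULL anywhere in the optional list makes the whole result NULL.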

c. Handling NULL Values

It is essential for each Impala UDF to handle all situations where NULL values are passed to the function, for correctness, performance, and reliability.
If our function should return NULL when passed a NULL value, or in other cases such as when a search string is not found, we can construct a null instance of the return type by using its null() member function.
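The "search string not found" case can be sketched like this, with a hypothetical FindPos function and simplified stand-in types (the real `*Val` types in `<impala_udf/udf.h>` each provide a static `null()` member, mirrored here):

```cpp
#include <cassert>
#include <string>

// Stand-ins for the impala_udf types; illustration only.
struct FunctionContext {};
struct StringVal {
  bool is_null;
  std::string s;
  StringVal() : is_null(true) {}
  explicit StringVal(const std::string& v) : is_null(false), s(v) {}
};
struct IntVal {
  bool is_null;
  int val;
  IntVal() : is_null(true), val(0) {}
  explicit IntVal(int v) : is_null(false), val(v) {}
  static IntVal null() { return IntVal(); }  // mirrors the real API shape
};

// Hypothetical UDF: returns the position of needle within haystack, or a
// NULL IntVal when either input is NULL or the needle is not found.
IntVal FindPos(FunctionContext* ctx, const StringVal& haystack,
               const StringVal& needle) {
  if (haystack.is_null || needle.is_null) return IntVal::null();
  std::string::size_type pos = haystack.s.find(needle.s);
  if (pos == std::string::npos) return IntVal::null();
  return IntVal(static_cast<int>(pos));
}
```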

d. Memory Allocation for Impala UDF

By default, memory allocated within an Impala UDF is deallocated when the function exits, before the query is finished. We can refer to input arguments in the expressions for our return values, because the input arguments remain allocated for the lifetime of the function.
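When a UDF needs a result buffer that outlives the function call, it can allocate through the FunctionContext rather than on the stack. The sketch below illustrates the idea with a hypothetical Repeat2 function; the stand-in `FunctionContext::Allocate()` here is a simplified mock, assumed to track allocations so they can be released when the context is destroyed, which is only an approximation of the real context-managed memory in `<impala_udf/udf.h>`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-in FunctionContext that tracks allocations so they outlive the UDF
// call and are released with the context; illustration only.
struct FunctionContext {
  std::vector<uint8_t*> allocations;
  uint8_t* Allocate(int byte_size) {
    uint8_t* p = new uint8_t[byte_size];
    allocations.push_back(p);
    return p;
  }
  ~FunctionContext() {
    for (size_t i = 0; i < allocations.size(); ++i) delete[] allocations[i];
  }
};
struct StringVal {
  bool is_null;
  uint8_t* ptr;
  int len;
  StringVal(uint8_t* p, int l) : is_null(false), ptr(p), len(l) {}
};

// Hypothetical UDF that builds its result in context-managed memory, so the
// returned buffer stays valid after the function exits.
StringVal Repeat2(FunctionContext* ctx, const StringVal& s) {
  uint8_t* buf = ctx->Allocate(s.len * 2);
  std::memcpy(buf, s.ptr, s.len);
  std::memcpy(buf + s.len, s.ptr, s.len);
  return StringVal(buf, s.len * 2);
}
```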

e. Thread-Safe Work Area for Impala UDF

The performance of UDFs can improve if we specify the optional PREPARE_FN and CLOSE_FN clauses on the CREATE FUNCTION statement. The "prepare" function sets up a thread-safe data structure in memory that we can use as a work area, while the "close" function deallocates that memory.
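The prepare/close pattern can be sketched as below. The `FunctionContext` here is a minimal stand-in keeping only a single function-state slot; in the real `<impala_udf/udf.h>` API, state is attached and retrieved per scope via SetFunctionState()/GetFunctionState(). The Workspace, MyPrepare, MyUdf, and MyClose names are hypothetical.

```cpp
#include <cassert>

// Minimal stand-in FunctionContext keeping just one function-state slot;
// illustration of the prepare/close pattern, not the real class.
struct FunctionContext {
  enum FunctionStateScope { THREAD_LOCAL, FRAGMENT_LOCAL };
  void* state;
  FunctionContext() : state(nullptr) {}
  void SetFunctionState(FunctionStateScope, void* ptr) { state = ptr; }
  void* GetFunctionState(FunctionStateScope) const { return state; }
};
struct IntVal {
  bool is_null;
  int val;
  explicit IntVal(int v) : is_null(false), val(v) {}
};

// Hypothetical work area, allocated once per thread instead of per row.
struct Workspace { int calls; Workspace() : calls(0) {} };

// The function named in PREPARE_FN: sets up the thread-safe work area.
void MyPrepare(FunctionContext* ctx,
               FunctionContext::FunctionStateScope scope) {
  if (scope == FunctionContext::THREAD_LOCAL) {
    ctx->SetFunctionState(scope, new Workspace());
  }
}

// The per-row UDF body reuses the prepared work area.
IntVal MyUdf(FunctionContext* ctx, const IntVal& arg) {
  Workspace* ws = static_cast<Workspace*>(
      ctx->GetFunctionState(FunctionContext::THREAD_LOCAL));
  ws->calls++;
  return IntVal(arg.val * 2);
}

// The function named in CLOSE_FN: deallocates the work area.
void MyClose(FunctionContext* ctx,
             FunctionContext::FunctionStateScope scope) {
  if (scope == FunctionContext::THREAD_LOCAL) {
    delete static_cast<Workspace*>(ctx->GetFunctionState(scope));
    ctx->SetFunctionState(scope, nullptr);
  }
}
```

The design point is that the expensive setup (allocating Workspace) happens once per thread in the prepare function rather than once per row in the UDF body.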

f. Error Handling for Impala UDFs

To handle errors in UDFs, we can call functions that are members of the initial FunctionContext* argument passed to our function.
For conditions that indicate minor, recoverable problems that do not cause the query to stop, a UDF can record one or more warnings:

bool AddWarning(const char* warning_msg);

For a serious problem that requires canceling the query, a UDF can set an error flag that prevents the query from returning any results:

void SetError(const char* error_msg);
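The two calls fit together roughly as in this sketch of a hypothetical safe_ratio UDF. The stand-in `FunctionContext` below exposes only the two members named above (the real class in `<impala_udf/udf.h>` defines them alongside many others); the warning counter and has_error() helper exist purely so the example can be checked.

```cpp
#include <cassert>
#include <string>

// Stand-in FunctionContext exposing just the two error-handling members
// described above; the extra bookkeeping fields are for illustration.
struct FunctionContext {
  std::string error;
  int warnings;
  FunctionContext() : warnings(0) {}
  bool AddWarning(const char*) { ++warnings; return true; }
  void SetError(const char* msg) { error = msg; }
  bool has_error() const { return !error.empty(); }
};
struct DoubleVal {
  bool is_null;
  double val;
  DoubleVal() : is_null(true), val(0) {}
  explicit DoubleVal(double v) : is_null(false), val(v) {}
  static DoubleVal null() { return DoubleVal(); }
};

// Hypothetical UDF: records a warning for a recoverable condition, and sets
// the error flag (canceling the query) for a fatal one.
DoubleVal SafeRatio(FunctionContext* ctx, const DoubleVal& num,
                    const DoubleVal& den) {
  if (num.is_null || den.is_null) {
    ctx->AddWarning("NULL input to safe_ratio; returning NULL");
    return DoubleVal::null();
  }
  if (den.val == 0) {
    ctx->SetError("safe_ratio: division by zero");
    return DoubleVal::null();
  }
  return DoubleVal(num.val / den.val);
}
```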


6. Performance Considerations for Impala UDFs

Because a UDF typically processes each row of a table, it can potentially be called billions of times, so the performance of each Impala UDF is a critical factor in the speed of the overall ETL or ELT pipeline. When a function is called over and over while processing a huge result set, even small optimizations within the function body can yield big results.

7. Impala UDF Security Considerations 

  • To call an Impala UDF in a query, you must have the required read privilege for any databases and tables used in the query.
  • Impala UDFs that are incorrectly coded may cause performance or capacity issues.

8. Limitations & Restrictions for Impala User-Defined Functions

  • Impala does not support Hive UDFs that accept or return composite or nested types, or other types not available in Impala tables.
  • We cannot call the Hive current_user() function from a Java UDF through Impala.
  • Impala currently does not support user-defined table functions (UDTFs).
  • The CHAR and VARCHAR types cannot be used as input arguments or return values for UDFs.

So, this was all about Impala UDF. Hope you like our explanation.

9. Conclusion – Impala UDF

As a result, we have seen the whole concept of Impala User-Defined Functions (UDFs). In addition, we discussed how to write Impala UDFs and how to install the UDF development package. Finally, we covered performance and security considerations for Impala UDFs, along with some of their limitations. If you have any doubt about Impala UDFs, feel free to ask in the comment box.