Apache Pig Built in Functions Cheat Sheet 1


1. Apache Pig Built in Functions

In this article “Apache Pig Built in Functions”, we will discuss all the Apache Pig Built-in Functions in detail. It includes eval, load/store, math, bag and tuple functions and many more. Also, we will see their syntax along with their functions and descriptions to understand them well.

So, let’s start Pig Built in Functions.

Apache Pig Built in Functions Cheat Sheet

Apache Pig Built in Functions Cheat Sheet

2. Pig Functions

There is a huge set of Apache Pig Built in Functions available. Such as the eval, load/store, math, string, date and time, bag and tuple functions. Basically, there are two main properties which differentiate built in functions from user-defined functions (UDFs) such as:

  • We do not need to register built in functions since Pig knows where they are.
  • Also, we do not need to qualify built in functions, while using them, because again Pig knows where to find them.

3. List of Apache Pig Built in Functions

Let’s discuss various Apache Pig Built in Functions namely eval, load, store, math, string, bag, and tuple, one by one in depth.

a. Eval Functions

Eval Functions is the first types of Pig Built in Functions. Here are the Pig Eval functions, offered by Apache Pig.

i. AVG()

  • AVG Syntax

AVG(expression)

We use AVG(), to compute the average of the numerical values within a bag.

  • AVG Example

In this example, the average GPA for each Employee is computed

A = LOAD ‘Employee.txt’ AS (name:chararray, term:chararray, gpa:float);

DUMP A;
(johny,fl,3.9F)
(johny,wt,3.7F)
(johny,sp,4.0F)
(johny,sm,3.8F)
(Mariya,fl,3.8F)
(Mariya,wt,3.9F)
(Mariya,sp,4.0F)
(Mariya,sm,4.0F)

B = GROUP A BY name;

DUMP B;
(johny,{(johny,fl,3.9F),(johny,wt,3.7F),(johny,sp,4.0F),(johny,sm,3.8F)})
(Mariya,{(Mariya,fl,3.8F),(Mariya,wt,3.9F),(Mariya,sp,4.0F),(Mariya,sm,4.0F)})

C = FOREACH B GENERATE A.name, AVG(A.gpa);

DUMP C;
({(johny),(johny),(johny),(johny)},3.850000023841858)
({(Mariya),(Mariya),(Mariya),(Mariya)},3.925000011920929)

ii. BagToString()

This function is used to concatenate the elements of a bag into a string. We can place a delimiter between these values (optional) while concatenating.

iii. CONCAT()

  • The syntax of CONCAT()

CONCAT (expression, expression)

We use this Pig Function to concatenate two or more expressions of the same type.

  • Example of CONCAT()

In this example, fields f1, an underscore string literal, f2 and f3 are concatenated.

X = LOAD ‘data’ as (f1:chararray, f2:chararray, f3:chararray);

DUMP X;
(apache,open,source)
(hadoop,map,reduce)
(pig,pig,latin)

Y = FOREACH X GENERATE CONCAT(f1, ‘_’, f2,f3);

DUMP Y;
(apache_opensource)
(hadoop_mapreduce)
(pig_piglatin)

iv. COUNT()

  • The syntax of COUNT() 

COUNT(expression)

While counting the number of tuples in a bag, we use it to get the number of elements in a bag.

  • Example of COUNT()

In this example, we count the tuples in the bag:

X = LOAD ‘data’ AS (f1:int,f2:int,f3:int);

DUMP X;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

Y = GROUP X BY f1;

DUMP B;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})

A = FOREACH Y GENERATE COUNT(X);

DUMP A;
(1L)
(2L)
(1L)
(2L)

v. COUNT_STAR()

  • The syntax of COUNT_STAR()

COUNT_STAR(expression)

We can say it is similar to the COUNT() function. To get the number of elements in a bag, we use it.

  • Example of COUNT_STAR()

To count the tuples in a bag.

A = FOREACH Y GENERATE COUNT_STAR(X);

vi. DIFF()

  • The syntax of DIFF()

DIFF (expression, expression)

In order to compare two bags (fields) in a tuple.

  • Example of DIFF()

Here, compares the tuples in two bags.

A = LOAD ‘bag_data’ AS (B1:bag{T1:tuple(t1:int,t2:int)},B2:bag{T2:tuple(f1:int,f2:int)});

DUMP A;
({(8,9),(0,1)},{(8,9),(1,1)})
({(2,3),(4,5)},{(2,3),(4,5)})
({(6,7),(3,7)},{(2,2),(3,7)})

DESCRIBE A;
a: {B1: {T1: (t1: int,t2: int)},B2: {T2: (f1: int,f2: int)}}

X = FOREACH A GENERATE DIFF(B1,B2);

grunt> dump x;
({(0,1),(1,1)})
({})
({(6,7),(2,2)})

vii. IsEmpty()

  • The Syntax of IsEmpty ()

IsEmpty(expression)

We use this Apache Pig function to check if a bag or map is empty.

  • Example of IsEmpty ()

Here, all Employees with an SSN but no name are located.

SSN = load ‘ssn.txt’ using PigStorage() as (ssn:long);

SSN_NAME = load ‘Employees.txt’ using PigStorage() as (ssn:long, name:chararray);

/* do a cogroup of SSN with SSN_Name */
X = COGROUP SSN by ssn, SSN_NAME by ssn;

/* Keep ssn’s for which there is no name only */
Y = filter X by IsEmpty(SSN_NAME);

viii. MAX()

  • The Syntax of MAX()

MAX(expression)

Basically, to calculate the highest value for a column (numeric values or chararrays) in a single-column bag.

  • Example of MAX()

A = LOAD ‘Employee’ AS (name:chararray, session:chararray, gpa:float);

DUMP A;
(Johny,fl,3.9F)
(Johny,wt,3.7F)
(Johny,sp,4.0F)
(Johny,sm,3.8F)
(Mariya,fl,3.8F)
(Mariya,wt,3.9F)
(Mariya,sp,4.0F)
(Mariya,sm,4.0F)

B = GROUP A BY name;

DUMP B;
(Johny,{(Johny,fl,3.9F),(Johny,wt,3.7F),(Johny,sp,4.0F),(Johny,sm,3.8F)})
(Mariya,{(Mariya,fl,3.8F),(Mariya,wt,3.9F),(Mariya,sp,4.0F),(Mariya,sm,4.0F)})

X = FOREACH B GENERATE group, MAX(A.gpa);
DUMP X;
(Johny,4.0F)
(Mariya,4.0F)

ix. MIN()

  • The Syntax of MIN()

MIN(expression)

In order to get the minimum (lowest) value (numeric or chararray) for a certain column in a single-column bag.

  • Example of MIN()

A = LOAD ‘Employee’ AS (name:chararray, session:chararray, gpa:float);

DUMP A;
(Johny,fl,3.9F)
(Johny,wt,3.7F)
(Johny,sp,4.0F)
(Johny,sm,3.8F)
(Mariya,fl,3.8F)
(Mariya,wt,3.9F)
(Mariya,sp,4.0F)
(Mariya,sm,4.0F)

B = GROUP A BY name;

DUMP B;
(Johny,{(Johny,fl,3.9F),(Johny,wt,3.7F),(Johny,sp,4.0F),(Johny,sm,3.8F)})
(Mariya,{(Mariya,fl,3.8F),(Mariya,wt,3.9F),(Mariya,sp,4.0F),(Mariya,sm,4.0F)})
X = FOREACH B GENERATE group, MIN(A.gpa);

DUMP X;
(Johny,3.7F)
(Mariya,3.8F)

x. PluckTuple()

We can define a string prefix and filter the columns in a relation that begin with the given prefix, using the Pig Latin PluckTuple() function.

  • Example of PluckTuple

a = load ‘a’ as (x, y);
b = load ‘b’ as (x, y);
c = join a by x, b by x;
DEFINE pluck PluckTuple(‘a::’);
d = foreach c generate FLATTEN(pluck(*));
describe c;
c: {a::x: bytearray,a::y: bytearray,b::x: bytearray,b::y: bytearray}
describe d;
d: {plucked::a::x: bytearray,plucked::a::y: bytearray}
DEFINE pluckNegative PluckTuple(‘a::’,’false’);
d = foreach c generate FLATTEN(pluckNegative(*));
describe d;
d: {plucked::b::x: bytearray,plucked::b::y: bytearray}

xi. SIZE()

  • The Syntax of SIZE()

SIZE(expression)

We use this Pig Function in order to compute the number of elements based on any Pig data type.

  • Example of SIZE()

Here, we are computing the number of characters in the first field

A = LOAD ‘data’ as (f1:chararray, f2:chararray, f3:chararray);
(apache,open,source)
(hadoop,map,reduce)
(pig,pig,latin)

X = FOREACH A GENERATE SIZE(f1);

DUMP X;
(6L)
(6L)
(3L)

xii. SUBTRACT()

Basically, to subtract two bags. As a process, it takes two bags as inputs. Then returns a bag which contains the tuples of the first bag that are not in the second bag.

  • Example of SUBTRACT()

It creates a new bag composed of B1 elements that are not in B2.

A = LOAD ‘bag_data’ AS (B1:bag{T1:tuple(t1:int,t2:int)},B2:bag{T2:tuple(f1:int,f2:int)});

DUMP A;
({(8,9),(0,1),(1,2)},{(8,9),(1,1)})
({(2,3),(4,5)},{(2,3),(4,5)})
({(6,7),(3,7),(3,7)},{(2,2),(3,7)})

DESCRIBE A;
A: {B1: {T1: (t1: int,t2: int)},B2: {T2: (f1: int,f2: int)}}

X = FOREACH A GENERATE SUBTRACT(B1,B2);

DUMP X;
({(0,1),(1,2)})
({})
({(6,7)})

xiii. SUM()

  • Syntax of SUM()

SUM(expression)

This Function in Pig is to get the total of the numeric values of a column in a single-column bag.

  • Example of SUM()

It creates a new bag composed of B1 elements that are not in B2.

A = LOAD ‘bag_data’ AS (B1:bag{T1:tuple(t1:int,t2:int)},B2:bag{T2:tuple(f1:int,f2:int)});

DUMP A;
({(8,9),(0,1),(1,2)},{(8,9),(1,1)})
({(2,3),(4,5)},{(2,3),(4,5)})
({(6,7),(3,7),(3,7)},{(2,2),(3,7)})

DESCRIBE A;
A: {B1: {T1: (t1: int,t2: int)},B2: {T2: (f1: int,f2: int)}}

X = FOREACH A GENERATE SUBTRACT(B1,B2);

DUMP X;
({(0,1),(1,2)})
({})
({(6,7)})
B = GROUP A BY owner;

DUMP B;
(Alice,{(Alice,turtle,1),(Alice,goldfish,5),(Alice,cat,2)})
(Bob,{(Bob,dog,2),(Bob,cat,2)})

X = FOREACH B GENERATE group, SUM(A.pet_num);
DUMP X;
(Alice,8L)
(Bob,4L)

xiv. TOKENIZE()

  • Syntax of TOKENIZE()

TOKENIZE(expression)

For splitting a string (which contains a group of words) in a single tuple. Then return a bag which contains the output of the split operation.

  • Example of TOKENIZE()

The strings in each row are split.

A = LOAD ‘data’ AS (f1:chararray);

DUMP A;
(It is the first string.)
(It is the second string.)
(It is the third string.)

X = FOREACH A GENERATE TOKENIZE(f1);

DUMP X;
({(It),(is),(the),(first),(string.)})
({(It),(is),(the),(second),(string.)})
({(It),(is),(the),(third),(string.)})

Here, a field delimiter is specified.

{code}
A = LOAD ‘data’ AS (f1:chararray);
B = FOREACH A GENERATE TOKENIZE (f1,’||’);
DUMP B;
{code}

b. Load and Store Functions

To determine, how data goes into Pig and comes out of Pig we use Load/store functions – Pig Built in Functions. Also, we can write your own load/store functions. There is a set of built-in load/store functions such as:

i. PigStorage()

  • Syntax of PigStorage()

PigStorage(field_delimiter)

In order to load and store structured files.

  • Example of PigStorage()

Here, PigStorage expects input.txt to contain tab-separated fields as well as newline-separated records. So, statements are equivalent.

A = LOAD ‘Employee’ USING PigStorage(‘\t’) AS (name: chararray, age:int, gpa: float);

A = LOAD ‘Employee’ AS (name: chararray, age:int, gpa: float);

Also, here, PigStorage stores the contents of X into files along with fields that are delimited with an asterisk ( * ). And, the STORE statement specifies that the files will be located in a directory named output and that the files will be named part-nnnnn (for example, part-00000).

STORE X INTO ‘output’ USING PigStorage(‘*’);

Here, PigStorage loads data with complex data type, a bag of map and double.

a = load ‘1.txt’ as (a0:{t:(m:map[int],d:double)});

{([foo#1,bar#2],34.0),([white#3,yellow#4],45.0)} : valid
{([foo#badint],baddouble)} : conversion fail for badint/baddouble, get {([foo#],)}
{} : valid, empty bag

ii. TextLoader()

  • TextLoader() Syntax 

TextLoader()

This Pig Function is used for loading unstructured data into Pig.

  • Example of TextLoader()

Here, TextLoader is used with the LOAD function.

A = LOAD ‘data’ USING TextLoader();

iii. BinStorage()

  • BinStorage() Syntax 

BinStorage()

By using machine-readable format, for loading and storing data into Pig.

  • Example of BinStorage()

Here BinStorage is used with the LOAD and STORE functions.

A = LOAD ‘data’ USING BinStorage();
STORE X into ‘output’ USING BinStorage();

Here BinStorage is used to load multiple locations.

A = LOAD ‘input1.bin, input2.bin’ USING BinStorage();

raw = load ‘sampledata’ using BinStorage() as (col1,col2, col3);
–filter out null columns
A = filter raw by col1#’bcookie’ is not null;

B = foreach A generate col1#’bcookie’ as reqcolumn;
describe B;
–B: {regcolumn: bytearray}
X = limit B 5;
dump X;
(36co9b55onr8s)
(36co9b55onr8s)
(36hilul5oo1q1)
(36hilul5oo1q1)
(36l4cj15ooa8a)

B = foreach A generate (chararray)col1#’bcookie’ as convertedcol;
describe B;
–B: {convertedcol: chararray}
X = limit B 5;
dump X;
()
()
()
()
()

iv. Handling Compression

We can load and store compressed data in Pig Latin.

c. Bag and Tuple Functions

Here is the list of Bag and Tuple – Pig Built in functions. Such as:

i. TOBAG()

  • Syntax of TOBAG()

TOBAG(expression [, expression …])

This Pig Built in function is used in order to convert two or more expressions into a bag.

  • Example of TOBAG()

Here, fields f1 and f3 are converted to tuples that are then placed in a bag.

a = LOAD ‘Employee’ AS (f1:chararray, f2:int, f3:float);
DUMP a;

(Johny,18,4.0)
(Mariya,19,3.8)
(Billy,20,3.9)
(joey,18,3.8)

b = FOREACH a GENERATE TOBAG(f1,f3);
DUMP b;

({(Johny),(4.0)})
({(Mariya),(3.8)})
({(Billy),(3.9)})
({(joey),(3.8)})

ii. TOP()

  • The syntax of Top()

TOP(topN,column,relation)

Forgetting the top N tuples of a relation.

  • Example of TOP()

Here the top 10 occurrences are returned.

DEFINE asc TOP(‘ASC’); — ascending order
DEFINE desc TOP(‘DESC’); — descending order

A = LOAD ‘data’ as (first: chararray, second: chararray);
B = GROUP A BY (first, second);
C = FOREACH B generate FLATTEN(group), COUNT(A) as count;
D = GROUP C BY first; — again group by first
topResults = FOREACH D {
result = asc(10, 1, C); — and retain top 10 (in ascending order) occurrences of ‘second’ in first
GENERATE FLATTEN(result);
}

bottomResults = FOREACH D {
result = desc(10, 1, C); — and retain top 10 (in descending order) occurrences of ‘second’ in first
GENERATE FLATTEN(result);
}

iii. TOTUPLE()

  • The Syntax of TOTUPLE()

TOTUPLE(expression [, expression …])

In order to convert one or more expressions into a tuple.

  • Example of TOTUPLE()

Here, fields f1, f2 and f3 are converted to a tuple.

a = LOAD ‘Employee’ AS (f1:chararray, f2:int, f3:float);
DUMP a;

(Johny,18,4.0)
(Mariya,19,3.8)
(Billy,20,3.9)
(joey,18,3.8)

b = FOREACH a GENERATE TOTUPLE(f1,f2,f3);
DUMP b;
((Johny,18,4.0))
((Mariya,19,3.8))
((Billy,20,3.9))
((joey,18,3.8))

iv. TOMAP()

Forgetting to convert the key-value pairs into a Map.

  • Example of TOMAP()

Here, Employee names (type chararray) and Employee GPAs (type float) are used to create three maps.

A = load ‘Employees’ as (name:chararray, age:int, gpa:float);
B = foreach A generate TOMAP(name, gpa);
store B into ‘results’;

Input (Employees)
joey smith 20 3.5
amy chen 22 3.2
leo allen 18 2.1

Output (results)
[joey smith#3.5]
[amy chen#3.2]
[leo allen#2.1]

d. String Functions

Here, is the list of String – Pig Built in functions. Such as:

i. ENDSWITH(string, testAgainst)

For verifying whether a given string ends with a particular substring.

ii. STARTSWITH(string, substring)

This Pig Function verifies whether the first string starts with the second, after accepting two string parameters.

iii. SUBSTRING(string, startIndex, stopIndex)

  • Syntax

SUBSTRING(string, startIndex, stopIndex)

It returns a substring from a given string.

iv. EqualsIgnoreCase(string1, string2)

In order to compare two strings ignoring the case.

v. INDEXOF(string, ‘character’, startIndex)

  • Syntax

INDEXOF(string, ‘character’, startIndex)

It returns the first occurrence of a character in a string, searching forward from a start index.

vi. LAST_INDEX_OF(expression)

  • Syntax

LAST_INDEX_OF(expression)

To return the index of the last occurrence of a character in a string, searching backward from a start index.

vi. LCFIRST(expression)

  • Syntax

LCFIRST(expression)

This Pig Function is used for conversion of the first character in a string to lowercase.

vii. UCFIRST(expression)

  • Syntax

UCFIRST(expression)

It returns a string with the first character converted to uppercase.

viii. UPPER(expression)

  • Syntax

UPPER(expression)

In order to get a string converted to upper case.

ix. LOWER(expression)

  • Syntax

LOWER(expression)

This function in Pig converts all characters in a string to lower case.

x. REPLACE(string, ‘oldChar’, ‘newChar’);

  • Syntax

REPLACE(string, ‘oldChar’, ‘newChar’);

For replacing existing characters in a string with new characters.

xi. STRSPLIT(string, regex, limit)

  • Syntax

STRSPLIT(string, regex, limit)

In order to split a string around matches of a given regular expression.

xii. STRSPLITTOBAG(string, regex, limit)

It splits the string by given delimiter and returns the result in a bag.

xiii. TRIM(expression)

  • Syntax

TRIM(expression)

This Pig Built in Function is used to return a copy of a string with leading and trailing whitespaces removed.

xiv. LTRIM(expression)

It returns a copy of a string with leading whitespaces removed.

xv. RTRIM(expression)

For returning a copy of a string with trailing whitespaces removed.

e. Date and Time Functions

Here is the list of Date and Time – Pig Built in functions.

i. ToDate(milliseconds)

According to the given parameters, it returns a date-time object. There are more alternative for this functions. Such as ToDate(iosstring), ToDate(userstring, format), ToDate(userstring, format, timezone)

ii. CurrentTime()

It returns the date-time object of the current time.

iii. GetDay(datetime)

To get the day of a month as a return from the date-time object, we use it.

iv. GetHour(datetime)

GetHour returns the hour of a day from the date-time object.

v. GetMilliSecond(datetime)

It returns the millisecond of a second from the date-time object.

vi. GetMinute(datetime)

To get the minute of an hour in return from the date-time object, we use it.

vii. GetMonth(datetime)

GetMonth returns the month of a year from the date-time object.

viii. GetSecond(datetime)

It returns the second of a minute from the date-time object.

ix. GetWeek(datetime)

To get the week of a year as a return from the date-time object, we use it.

x. GetWeekYear(datetime)

GetWeekYear returns the week year from the date-time object.

xi. GetYear(datetime)

It returns the year from the date-time object.

xii. AddDuration(datetime, duration)

To get the result of a date-time object as a result along with the duration object, we use it.

xiii. SubtractDuration(datetime, duration)

SubtractDuration subtracts the duration object from the Date-Time object and returns the result.

xiv. DaysBetween(datetime1, datetime2)

DaysBetween returns the number of days between the two date-time objects.

xv. HoursBetween(datetime1, datetime2)

It returns the number of hours between two date-time objects.

xvi. MilliSecondsBetween(datetime1, datetime2)

To get the number of milliseconds as result between two date-time objects, we use it.

xvii. MinutesBetween(datetime1, datetime2)

MinutesBetween returns the number of minutes between two date-time objects.

xviii. MonthsBetween(datetime1, datetime2)

To get the number of months as a return between two date-time objects, we use it.

xix. SecondsBetween(datetime1, datetime2)

It returns the number of seconds between two date-time objects.

xx. WeeksBetween(datetime1, datetime2)

WeeksBetween returns the number of weeks between two date-time objects.

xxi. YearsBetween(datetime1, datetime2)

To get the number of years as a return between two date-time objects, we use it.

Any doubt yet in Pig Built in functions? Please Comment.

f. Math Functions

We have the following Math – Pig Built in functions−

i. ABS(expression)

  • Syntax

ABS(expression)

In order to get the absolute value of an expression.

ii. ACOS(expression)

  • Syntax

ACOS(expression)

It gives the arc cosine of an expression.

iii. ASIN(expression)

  • Syntax

ASIN(expression)

ASIN gives the arc sine of an expression.

iv. ATAN(expression)

  • Syntax

ATAN(expression

To get the arc tangent of an expression, we use it.

v. CBRT(expression)

  • Syntax

CBRT(expression)

It gives the cube root of an expression.

vi. CEIL(expression)

  • Syntax

CEIL(expression)

CEIL is used to get the value of an expression rounded up to the nearest integer.

vii. COS(expression)

  • Syntax

COS(expression)

In order to get the trigonometric cosine of an expression.

viii. COSH(expression)

  • Syntax

COSH(expression)

COSH gives the hyperbolic cosine of an expression.

ix. EXP(expression)

  • Syntax

EXP(expression)

To get the Euler’s number e raised to the power of x.

x. FLOOR(expression)

  • Syntax

FLOOR(expression)

In order to get the value of an expression rounded down to the nearest integer.

xi. LOG(expression)

  • Syntax

LOG(expression)

LOG gives the natural logarithm (base e) of an expression.

xii. LOG10(expression)

  • Syntax

LOG10(expression)

It gives the base 10 logarithms of an expression.

xiii. RANDOM( )

  • Syntax

RANDOM( )

In order to get a pseudo random number (type double) greater than or equal to 0.0 and less than 1.0.

xiv. ROUND(expression)

  • Syntax

ROUND(expression)

ROUND gives the value of an expression rounded to an integer (if the result type is float) or rounded to a long (if the result type is double).

xv. SIN(expression)

  • Syntax

SIN(expression)

In order to get the sine of an expression.

xvi. SINH(expression)

  • Syntax

SINH(expression)

It gives the hyperbolic sine of an expression.

xvii. SQRT(expression)

  • Syntax

SQRT(expression)

SQRT gives the positive square root of an expression.

xviii. TAN(expression)

  • Syntax

TAN(expression)

In order to get the trigonometric tangent of an angle.

xix. TANH(expression)

  • Syntax

TANH(expression)

It gives the hyperbolic tangent of an expression.

So, this was all about Pig Built in Functions. Hope you like our explanation.

4. Conclusion: Apache Pig Built in Functions

As a result, we have seen all the Apache Pig Built in Functions in detail. Still, if any doubt occurs, feel free to ask in the comments.

For Reference


One thought on “Apache Pig Built in Functions Cheat Sheet

Comments are closed.