
Hive SerDe – Custom & Built-in SerDe in Hive

Hive SerDe

Apache Hive uses the SerDe interface for IO. However, there is much more to know about Hive SerDe, so this document covers the whole concept, including how to write your own Hive SerDe.

Also, we will learn about registration of native Hive SerDes, built-in SerDes, how to write custom SerDes in Hive, ObjectInspector, the Hive CSV, JSON, and Regex SerDes, and a Hive JSON SerDe example. In this way, we will cover each aspect of Hive SerDe to understand it well.

What is Hive SerDe?

Basically, Hive SerDe is an acronym for Serializer/Deserializer. Hive uses the SerDe interface for IO: it handles both serialization and deserialization in Hive, and it interprets the results of serialization as individual fields for processing.

In addition, a SerDe allows Hive to read in data from a table and write it back out to HDFS in any custom format. Anyone can write their own SerDe for their own data formats.
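As a sketch of this binding, the JSON SerDe that ships with Hive's HCatalog component can be attached to a table so that each line of JSON text is deserialized into columns; the table name and schema below are illustrative, not from the original article:

```sql
-- Each line of the backing file is one JSON document (illustrative schema)
CREATE TABLE person_json (name STRING, age INT)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
```

Depending on the Hive version, the hive-hcatalog-core jar may need to be added to the session (via ADD JAR) before this SerDe class is visible.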

It is very important to note that the “key” part is ignored when reading and is always a constant when writing; the row object is stored in the “value”.

Moreover, Hive does not own the HDFS file format.

Types of SerDe in Hive

Also, note that org.apache.hadoop.hive.serde is the deprecated old Hive SerDe library; look at org.apache.hadoop.hive.serde2 for the latest version.

a. Built-in SerDes in Hive

Basically, Hive currently uses these FileFormat classes to read and write HDFS files:

– TextInputFormat/HiveIgnoreKeyTextOutputFormat: reads/writes data in plain text file format.

– SequenceFileInputFormat/SequenceFileOutputFormat: reads/writes data in Hadoop SequenceFile format.

Moreover, Hive currently uses these SerDe classes to serialize and deserialize data:


– MetadataTypedColumnsetSerDe: we use this Hive SerDe to read/write delimited records such as CSV and tab- or control-A-separated records (quoting is not supported yet).
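In practice, delimited records like these are usually declared with the ROW FORMAT DELIMITED clause, which Hive maps to a delimited-record SerDe internally; the table below is an illustrative example, not from the original article:

```sql
-- A tab-separated text table; Hive binds a delimited-record SerDe to it
CREATE TABLE person_tsv (name STRING, age INT)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
```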

– LazySimpleSerDe: we can use this Hive SerDe to read the same data formats as MetadataTypedColumnsetSerDe and TCTLSeparatedProtocol. Moreover, it creates objects in a lazy way, which offers better performance.

Starting in Hive 0.14.0, it also supports reading and writing data with a specified character encoding.
For example:

ALTER TABLE person SET SERDEPROPERTIES ('serialization.encoding'='GBK');
When the configuration property hive.lazysimple.extended_boolean_literal is set to true (Hive 0.14.0 and later), LazySimpleSerDe treats ‘T’, ‘t’, ‘F’, ‘f’, ‘1’, and ‘0’ as legal, extended boolean literals.

However, the default is false, which means only ‘TRUE’ and ‘FALSE’ are treated as legal boolean literals.
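A minimal sketch of turning the extended literals on for a session (Hive 0.14.0 or later):

```sql
-- With this set, 't'/'f'/'1'/'0' in the text data parse as BOOLEAN values
SET hive.lazysimple.extended_boolean_literal=true;
```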

– ThriftSerDe: we use this Hive SerDe to read/write Thrift serialized objects. However, make sure the class file for the Thrift object is loaded first.

– DynamicSerDe: we also use this Hive SerDe to read/write Thrift serialized objects. However, it understands Thrift DDL, so the schema of the object can be provided at runtime.

Also, it supports a lot of different protocols, including TBinaryProtocol, TJSONProtocol, and TCTLSeparatedProtocol (which writes data in delimited records).
Also, Hive ships with further built-in SerDes, such as AvroSerDe, OrcSerde, ParquetHiveSerDe, RegexSerDe, OpenCSVSerde, and JsonSerDe.
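For instance, the built-in RegexSerDe deserializes rows by matching each line against a regular expression, with one capturing group per column; the table and log layout below are illustrative:

```sql
-- Split each line into host and request, one capturing group per column
CREATE TABLE simple_log (host STRING, request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex' = '([^ ]*) (.*)')
STORED AS TEXTFILE;
```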

b. Custom Serde in Hive

How to write your own Hive SerDe: in most cases you extend the AbstractSerDe class (or implement the SerDe interface in org.apache.hadoop.hive.serde2) and implement the initialize(), serialize(), deserialize(), getObjectInspector(), and getSerDeStats() methods.

Some important points about writing a Hive SerDe: deserialize() turns a Writable record into a row object that the table's ObjectInspector can navigate, serialize() does the reverse, and the compiled SerDe class must be on Hive's classpath before the table is queried.
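Once a custom SerDe is compiled and packaged, it can be attached to a table as sketched below; com.example.MySerDe and the jar path are placeholder names for illustration, not real artifacts:

```sql
-- Make the jar containing the custom SerDe visible to Hive (placeholder path)
ADD JAR /tmp/my_serde.jar;

-- Bind the custom SerDe class (placeholder name) to a table
CREATE TABLE my_table (id INT, name STRING)
ROW FORMAT SERDE 'com.example.MySerDe'
STORED AS TEXTFILE;
```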

ObjectInspector

Basically, Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns.

To be more specific, ObjectInspector offers a uniform way to access complex objects that can be stored in multiple formats in memory, including:

– An instance of a Java class, either Thrift or native Java.
– A standard Java object. We use java.util.List to represent Struct and Array, and java.util.Map to represent Map.
– A lazily-initialized object. For example, a Struct of string fields stored in a single Java string object with a starting offset for each field.

Moreover, a complex object can be represented by a pair of ObjectInspector and Java Object. The ObjectInspector not only gives us information about the structure of the Object but also ways to access the internal fields inside the Object.

Again, it is important to note that, for serialization purposes, Hive recommends that custom ObjectInspectors created for use with custom SerDes have a no-argument constructor in addition to their normal constructors.

Registration of Native SerDes

As of Hive 0.14, a registration mechanism has been introduced for native Hive SerDes. It allows dynamic binding between a “STORED AS” keyword and a triplet of {SerDe, InputFormat, OutputFormat} specifications, in place of spelling out the triplet in CreateTable statements.

Moreover, through this registration mechanism, we can add the following mappings:
Table 1 – Native SerDes in Hive

Syntax | Equivalent

STORED AS AVRO /
STORED AS AVROFILE
  ROW FORMAT SERDE
    'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'

STORED AS ORC /
STORED AS ORCFILE
  ROW FORMAT SERDE
    'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

STORED AS PARQUET /
STORED AS PARQUETFILE
  ROW FORMAT SERDE
    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
  OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'

STORED AS RCFILE
  STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
  OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'

STORED AS SEQUENCEFILE
  STORED AS INPUTFORMAT
    'org.apache.hadoop.mapred.SequenceFileInputFormat'
  OUTPUTFORMAT
    'org.apache.hadoop.mapred.SequenceFileOutputFormat'

STORED AS TEXTFILE
  STORED AS INPUTFORMAT
    'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
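With this mapping in place, the shorthand and the expanded form create identically stored tables; the table names and columns below are illustrative:

```sql
-- Shorthand enabled by the registration mechanism
CREATE TABLE orders_short (id INT, amount DOUBLE)
STORED AS ORC;

-- The equivalent explicit triplet
CREATE TABLE orders_long (id INT, amount DOUBLE)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
```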

Further, follow these steps to add a new native Hive SerDe with the “STORED AS” keyword:

– First, create a storage format descriptor class extending AbstractStorageFormatDescriptor.java that returns a “stored as” keyword and the names of the InputFormat, OutputFormat, and Hive SerDe classes.

– Then, add the name of the storage format descriptor class to the StorageFormatDescriptor registration file.
This was all about Apache Hive SerDe Tutorial. Hope you like our explanation of SerDes in Hive.

Conclusion

As a result, we have seen the whole concept of Hive SerDe: how to write your own Hive SerDe, registration of native SerDes, built-in SerDes in Hive, how to write custom SerDes in Hive, ObjectInspector, and some examples of SerDes in Hive. However, if you have any query, feel free to ask in the comment section.
