Hive SerDe – Custom & Built-in SerDe in Hive

1. Hive SerDe – Objective

Apache Hive uses the SerDe interface for IO. However, there is much more to know about Hive SerDe, so this document covers the whole concept: what a Hive SerDe is, built-in SerDes, how to write your own custom SerDe, registration of native SerDes, ObjectInspector, and the CSV, JSON, and Regex SerDes with examples. In this way, we will cover each aspect of Hive SerDe to understand it well.


2. What is Hive SerDe?

Basically, Hive SerDe is an acronym for Serializer/Deserializer. Hive uses the SerDe interface for IO: it handles both serialization and deserialization, and also interprets the results of serialization as individual fields for processing.
In addition, a SerDe allows Hive to read in data from a table, and to write it back out to HDFS in any custom format. Anyone can write their own SerDe for their own data formats.

  • HDFS files –> InputFileFormat –> <key, value> –> Deserializer –> Row object
  • Row object –> Serializer –> <key, value> –> OutputFileFormat –> HDFS files

It is very important to note that the “key” part is ignored when reading, and is always a constant when writing. The row object is stored in the “value”.

Moreover, Hive does not own the HDFS file format.
Let’s revise Apache Hive Operators in detail
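For example, here is a minimal sketch of a table definition that names all three pieces of this pipeline explicitly (the table and columns are placeholders; the classes are Hive's built-in text defaults mentioned in the next section):

CREATE TABLE person (id INT, name STRING)
-- SerDe that turns the "value" into a row object and back
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
-- InputFormat/OutputFormat own the HDFS file layout
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Writing STORED AS TEXTFILE is the usual shorthand for this triplet (see the table in section 5).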


3. Types of SerDe in Hive

Also, note that org.apache.hadoop.hive.serde is the deprecated old Hive SerDe library. Hence, look at org.apache.hadoop.hive.serde2 for the latest version.

a. Built-in SerDes in Hive

Basically, Hive currently uses these FileFormat classes to read and write HDFS files:

  • TextInputFormat/HiveIgnoreKeyTextOutputFormat

It reads/writes data in plain text file format.

  • SequenceFileInputFormat/SequenceFileOutputFormat

It reads/writes data in Hadoop SequenceFile format.
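As a quick sketch (table names are placeholders), these classes are normally selected through the corresponding STORED AS shorthand:

-- TextInputFormat/HiveIgnoreKeyTextOutputFormat
CREATE TABLE plain_text_tab (id INT, name STRING) STORED AS TEXTFILE;
-- SequenceFileInputFormat/SequenceFileOutputFormat
CREATE TABLE sequence_tab (id INT, name STRING) STORED AS SEQUENCEFILE;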
Moreover, Hive currently uses these Hive SerDe classes to serialize and deserialize data:

  • MetadataTypedColumnsetSerDe

We use this Hive SerDe to read/write delimited records such as CSV, tab-separated, and control-A separated records (sorry, quote is not supported yet).
Read about Hive UDF – User-Defined Function 

  • LazySimpleSerDe

We can use this Hive SerDe to read the same data formats as MetadataTypedColumnsetSerDe and TCTLSeparatedProtocol. Moreover, it creates objects in a lazy way, which offers better performance.
Starting in Hive 0.14.0, it supports read/write data with a specified encode charset.
For example:

ALTER TABLE person SET SERDEPROPERTIES ('serialization.encoding'='GBK');
If the configuration property hive.lazysimple.extended_boolean_literal is set to true (Hive 0.14.0 and later), LazySimpleSerDe can treat ‘T’, ‘t’, ‘F’, ‘f’, ‘1’, and ‘0’ as extended, legal boolean literals. However, the default is false, which means only ‘TRUE’ and ‘FALSE’ are treated as legal boolean literals.
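A small sketch of the effect, assuming a text-backed table with a BOOLEAN column (names are placeholders):

-- by default only 'TRUE' and 'FALSE' in the file parse as booleans;
-- other values such as 't' or '1' typically deserialize as NULL
CREATE TABLE flags (id INT, active BOOLEAN) STORED AS TEXTFILE;
-- with extended literals enabled, 'T', 't', 'F', 'f', '1' and '0'
-- are also accepted as legal boolean values
SET hive.lazysimple.extended_boolean_literal=true;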

  • Thrift SerDe in Hive

We use this Hive SerDe to read/write Thrift serialized objects. However, make sure the class file for the Thrift object is loaded first.

  • Dynamic SerDe in Hive

We also use this Hive SerDe to read/write Thrift serialized objects. However, it understands Thrift DDL, so the schema of the object can be provided at runtime. It also supports a lot of different protocols, including TBinaryProtocol, TJSONProtocol, and TCTLSeparatedProtocol (which writes data in delimited records).
Also:

  • For JSON files, a JsonSerDe was added in Hive 0.12.0. For releases prior to 0.12.0, an Amazon SerDe is available at s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar.
  • An Avro SerDe was added in Hive 0.9.1. Starting in Hive 0.14.0 its specification is implicit with the STORED AS AVRO clause.
  • A SerDe for the ORC file format was added in Hive 0.11.0.
  • A SerDe for Parquet was added via a plug-in in Hive 0.10 and natively in Hive 0.13.0.
  • A SerDe for CSV was added in Hive 0.14 (see the JSON and CSV examples after this list).
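For instance, here are sketches of table definitions using the JSON and CSV SerDes from this list (the table layouts are placeholders; JsonSerDe ships in the hcatalog package, so its jar must be on the classpath):

-- JSON records, one object per line
CREATE TABLE json_tab (id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;

-- CSV records; note that OpenCSVSerde treats every column as STRING
CREATE TABLE csv_tab (id STRING, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"')
STORED AS TEXTFILE;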

Let’s discuss Hive DDL Commands: Types of DDL Hive Commands

b. Custom SerDe in Hive

How to write your own Hive SerDe:

  • In most cases, users want to write a Deserializer rather than a full Hive SerDe. It is because users just want to read their own data format instead of writing to it.
  • For example, the RegexDeserializer deserializes the data using the configuration parameter ‘regex’, and possibly a list of column names (see serde2.MetadataTypedColumnsetSerDe). A sketch using the related RegexSerDe is shown below.
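As an illustration, here is a sketch that reads a tab-delimited two-field format with the built-in RegexSerDe (the table name and pattern are placeholders; this SerDe requires all columns to be STRING):

CREATE TABLE regex_tab (id STRING, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- one capturing group per declared column
  'input.regex' = '(\\d+)\\t(.*)'
)
STORED AS TEXTFILE;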

Some important points about Writing Hive SerDe:

  • Basically, the Hive SerDe, not the DDL, defines the table schema. Some SerDe implementations use the DDL for configuration, but the SerDe can also override that.
  • Moreover, column types can be arbitrarily nested arrays, maps, and structures (see the example after this list).
  • The callback design of ObjectInspector allows lazy deserialization, with CASE/IF or when using complex or nested types.
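For example, a sketch of a table whose columns use such nested types (all names are placeholders):

CREATE TABLE nested_tab (
  id    INT,
  tags  ARRAY<STRING>,                         -- list of values
  attrs MAP<STRING, STRING>,                   -- key/value pairs
  addr  STRUCT<street:STRING, city:STRING>     -- named fields, nestable further
);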

4. ObjectInspector

Basically, Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns.
Let’s look at What is Hive Partitions

To be more specific, ObjectInspector offers a uniform way to access complex objects that can be stored in multiple formats in memory, including:

– An instance of a Java class (Thrift or native Java).
– A standard Java object (we use java.util.List to represent Struct and Array, and java.util.Map to represent Map).
– A lazily-initialized object (for example, a Struct of string fields stored in a single Java string object, with a starting offset for each field).
Moreover, a complex object can be represented by a pair of ObjectInspector and Java Object. The ObjectInspector not only tells us the structure of the Object, but also gives us ways to access the internal fields inside the Object.

Again, it is important to note that for serialization purposes, Hive recommends that custom ObjectInspectors created for use with custom SerDes have a no-argument constructor in addition to their normal constructors.


5. Registration of Native SerDes

As of Hive 0.14, a registration mechanism has been introduced for native Hive SerDes. It allows dynamic binding between a “STORED AS” keyword and a triplet of {SerDe, InputFormat, and OutputFormat} specification, in place of spelling out the triplet in CreateTable statements.

Read about Hive Join – HiveQL Select Joins Query and Its Types
Moreover, this registration mechanism provides the following mappings:
Table 1 – Native SerDes in Hive

Each “STORED AS” syntax expands to the equivalent clause shown beneath it:

STORED AS AVRO / STORED AS AVROFILE
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'

STORED AS ORC / STORED AS ORCFILE
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

STORED AS PARQUET / STORED AS PARQUETFILE
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'

STORED AS RCFILE
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'

STORED AS SEQUENCEFILE
  STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat'

STORED AS TEXTFILE
  STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
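So, under the ORC mapping above, the following two statements are equivalent (a sketch with placeholder tables):

-- shorthand enabled by the registration mechanism
CREATE TABLE orc_tab (id INT, name STRING) STORED AS ORC;

-- the explicit triplet it expands to
CREATE TABLE orc_tab_explicit (id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';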

Further, follow these steps to add a new native Hive SerDe with a “STORED AS” keyword:

– First, create a storage format descriptor class extending AbstractStorageFormatDescriptor.java that returns a “stored as” keyword and the names of the InputFormat, OutputFormat, and Hive SerDe classes.

– Then, add the name of the storage format descriptor class to the StorageFormatDescriptor registration file.
This was all about Apache Hive SerDe Tutorial. Hope you like our explanation of SerDes in Hive.

6. Conclusion

As a result, we have seen the whole concept of Hive SerDe: what it is, built-in SerDes in Hive, how to write custom SerDes, registration of native SerDes, ObjectInspector, and some examples of SerDes in Hive. If you have any query, feel free to ask in the comment section.
Related Topic- Hive Internal Tables vs External Tables
