Hive SerDe – Custom & Built-in SerDe in Hive
Apache Hive uses the SerDe interface for IO. This document walks through the whole concept of Hive SerDe, including how to write your own.
We will cover the registration of native Hive SerDes, built-in SerDes, how to write custom SerDes in Hive, ObjectInspector, Hive SerDe CSV, Hive SerDe JSON, Hive SerDe Regex, and a Hive JSON SerDe example. In this way, we will cover each aspect of Hive SerDe to understand it well.
What is Hive SerDe?
SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for IO: it handles both serialization and deserialization, and it also interprets the results of serialization as individual fields for processing.
A SerDe allows Hive to read data from a table and write it back out to HDFS in any custom format. Anyone can write their own SerDe for their own data formats.
- HDFS files –> InputFileFormat –> <key, value> –> Deserializer –> Row object
- Row object –> Serializer –> <key, value> –> OutputFileFormat –> HDFS files
Note that the “key” part is ignored when reading and is always a constant when writing. The row object is stored in the “value”.
Moreover, Hive does not own the HDFS file format: other tools can read the files behind a Hive table directly, or write files that are later loaded into Hive.
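To make the two pipelines concrete, here is a minimal sketch of a table definition that names all three pieces explicitly. The table and column names are hypothetical, the class names are the plain-text defaults mentioned in this article, and the tab delimiter is only an assumption for illustration.

```sql
-- Hypothetical table: names are illustrative only.
CREATE TABLE page_views (
  user_id  BIGINT,
  page_url STRING
)
-- Deserializer/Serializer: turns each line into a row object and back.
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = '\t')
STORED AS
  -- InputFileFormat: produces <key, value> pairs from the HDFS files.
  INPUTFORMAT  'org.apache.hadoop.mapred.TextInputFormat'
  -- OutputFileFormat: writes the serialized value and ignores the key.
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```

On a SELECT, the input format hands each line to the SerDe as the value, and the SerDe splits it into the user_id and page_url fields. On an INSERT, the path runs in reverse.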
Types of SerDe in Hive
Note that org.apache.hadoop.hive.serde is the deprecated old Hive SerDe library. Look at org.apache.hadoop.hive.serde2 for the latest version.
a. Built-in SerDes in Hive
Hive currently uses these FileFormat classes to read and write HDFS files:
- TextInputFormat/HiveIgnoreKeyTextOutputFormat
These classes read and write data in plain text file format.
- SequenceFileInputFormat/SequenceFileOutputFormat
These classes read and write data in Hadoop SequenceFile format.
Hive currently uses these SerDe classes to serialize and deserialize data:
- MetadataTypedColumnsetSerDe
This SerDe reads and writes delimited records such as CSV, tab-separated, or control-A separated records (quoting is not supported yet).
- LazySimpleSerDe
This SerDe can read the same data format as MetadataTypedColumnsetSerDe and TCTLSeparatedProtocol. It creates objects lazily, which offers better performance.
Starting in Hive 0.14.0, it also supports reading and writing data with a specified encoding charset.
For example:
ALTER TABLE person SET SERDEPROPERTIES ('serialization.encoding'='GBK');
When the configuration property hive.lazysimple.extended_boolean_literal is set to true (Hive 0.14.0 and later), LazySimpleSerDe treats ‘T’, ‘t’, ‘F’, ‘f’, ‘1’, and ‘0’ as extended, legal boolean literals.
The default is false, which means only ‘TRUE’ and ‘FALSE’ are treated as legal boolean literals.
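The boolean setting can be sketched as follows. The table and column names are hypothetical, and setting the property per session with SET is an assumption about how you would choose to configure it.

```sql
-- Hypothetical table with a BOOLEAN column stored as delimited text.
-- ROW FORMAT DELIMITED uses LazySimpleSerDe under the hood.
CREATE TABLE feature_flags (
  flag_name  STRING,
  is_enabled BOOLEAN
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Opt in to the extended literals (Hive 0.14.0 and later).
SET hive.lazysimple.extended_boolean_literal=true;

-- With the property on, a data line such as "dark_mode,t" should be read
-- with is_enabled = true. With the default (false), 't' is not a legal
-- boolean literal.
SELECT flag_name, is_enabled FROM feature_flags;
```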
- Thrift SerDe in Hive
This SerDe reads and writes Thrift serialized objects. The class file for the Thrift object must be loaded first.
- Dynamic SerDe in Hive
This SerDe also reads and writes Thrift serialized objects, but it understands Thrift DDL, so the schema of the object can be provided at runtime.
It also supports many different protocols, including TBinaryProtocol, TJSONProtocol, and TCTLSeparatedProtocol (which writes data in delimited records).
Also:
- For JSON files, JsonSerDe was added in Hive 0.12.0. An Amazon SerDe is available at s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar for releases prior to 0.12.0.
- An Avro SerDe was added in Hive 0.9.1. Starting in Hive 0.14.0, its specification is implicit with the STORED AS AVRO clause.
- A SerDe for the ORC file format was added in Hive 0.11.0.
- A SerDe for Parquet was added via plug-in in Hive 0.10 and natively in Hive 0.13.0.
- A SerDe for CSV was added in Hive 0.14.
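As a rough illustration of the JSON and CSV SerDes from the list above, the sketch below declares one table for each. The table and column names are hypothetical. The JSON class name is the one shipped in hive-hcatalog-core (that jar may need to be on the classpath), OpenCSVSerde is the CSV SerDe added in Hive 0.14, and the property values are assumptions picked for illustration.

```sql
-- Expects one JSON object per line, e.g. {"event_id": 1, "event_name": "login"}
CREATE TABLE events_json (
  event_id   BIGINT,
  event_name STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;

-- Comma-separated values with optionally double-quoted fields.
CREATE TABLE events_csv (
  event_id   STRING,
  event_name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar'     = '"',
  'escapeChar'    = '\\'
)
STORED AS TEXTFILE;
```

OpenCSVSerde treats every column as a string, which is why both CSV columns are declared STRING here. Cast in a query or a view if other types are needed.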
b. Custom SerDe in Hive
How to write your own Hive SerDe:
- In most cases, users want to write a Deserializer rather than a full SerDe, because they just want to read their own data format instead of writing to it.
- For example, the RegexDeserializer deserializes the data using the configuration parameter ‘regex’, and possibly a list of column names (see serde2.MetadataTypedColumnsetSerDe).
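For a regex-driven, read-only example, here is a minimal sketch using the RegexSerDe class that ships with Hive. Note the assumptions: it uses org.apache.hadoop.hive.serde2.RegexSerDe and its 'input.regex' property rather than the RegexDeserializer and ‘regex’ parameter named above, and the table, columns, and pattern are made up for illustration.

```sql
-- Each data line is assumed to hold three space-separated fields,
-- e.g. "10.0.0.1 /index.html 200".
CREATE TABLE access_log (
  client_host  STRING,
  request_path STRING,
  status_code  STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- One capture group per column, in column order.
  'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*)'
)
STORED AS TEXTFILE;
```

This fits the "deserializer only" case: the table is meant to be queried over existing files rather than written to through Hive.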
Some important points about writing a Hive SerDe:
- The SerDe, not the DDL, defines the table schema. Some SerDe implementations use the DDL for configuration, but the SerDe can also override that.
- Column types can be arbitrarily nested arrays, maps, and structures.
- The callback design of ObjectInspector allows lazy deserialization with CASE/IF or when using complex or nested types.
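Once a custom SerDe has been written and packaged, plugging it into a table might look like the sketch below. Everything specific here is a hypothetical placeholder: the jar path, the class name com.example.hive.MyCustomSerDe, the property 'my.serde.option', and the table itself.

```sql
-- Make the jar containing the custom SerDe visible to the session.
-- The path is a placeholder.
ADD JAR /path/to/my-serde.jar;

-- The DDL lists nested column types; a SerDe may use these for
-- configuration or override them with its own schema.
CREATE TABLE customers (
  customer_id BIGINT,
  tags        ARRAY<STRING>,
  props       MAP<STRING, STRING>,
  addr        STRUCT<city:STRING, zip:STRING>
)
ROW FORMAT SERDE 'com.example.hive.MyCustomSerDe'
WITH SERDEPROPERTIES ('my.serde.option' = 'value')
STORED AS TEXTFILE;
```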
ObjectInspector
Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns.
More specifically, ObjectInspector offers a uniform way to access complex objects that can be stored in multiple formats in memory, including:
– An instance of a Java class (either Thrift or native Java).
– A standard Java object. We use java.util.List to represent Struct and Array, and java.util.Map to represent Map.
– A lazily-initialized object. For example, a Struct of string fields stored in a single Java string object with a starting offset for each field.
A complex object can be represented by a pair of ObjectInspector and Java Object. The ObjectInspector not only tells us the structure of the Object but also gives us ways to access the internal fields inside the Object.
Note that, for serialization purposes, Hive recommends that custom ObjectInspectors created for use with custom SerDes have a no-argument constructor in addition to their normal constructors.
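From the query side, this uniform access is roughly what lets the same expressions work over nested columns whatever their in-memory representation. The sketch below assumes a hypothetical table customers with columns tags ARRAY<STRING>, props MAP<STRING, STRING>, and addr STRUCT<city:STRING, zip:STRING>; the map key 'plan' is also made up.

```sql
SELECT
  tags[0]       AS first_tag,   -- array element by index
  props['plan'] AS plan_name,   -- map value by key
  addr.city     AS city,        -- struct field by name
  -- addr.zip only needs to be looked at for rows where the condition holds
  CASE WHEN size(tags) > 0 THEN addr.zip END AS zip_if_tagged
FROM customers;
```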
Registration of Native SerDes
As of Hive 0.14, a registration mechanism has been introduced for native Hive SerDes. It allows dynamic binding of a “STORED AS” keyword, so that CreateTable statements can use the keyword in place of a triplet of {SerDe, InputFormat, OutputFormat} specification.
The following mappings have been added through this registration mechanism:
Table 1. Hive SerDe – Native SerDes in Hive
Syntax | Equivalent |
STORED AS AVRO / STORED AS AVROFILE | ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' |
STORED AS ORC / STORED AS ORCFILE | ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
STORED AS PARQUET / STORED AS PARQUETFILE | ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' |
STORED AS RCFILE | STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat' |
STORED AS SEQUENCEFILE | STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat' |
STORED AS TEXTFILE | STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' |
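To see what the mapping buys you, the two definitions below should describe the same storage according to the ORC row of the table. The table and column names are hypothetical, and the short form assumes Hive 0.14 or later.

```sql
-- Short form: the keyword is bound to the SerDe and file formats
-- through the registration mechanism.
CREATE TABLE sales_orc_short (
  sale_id BIGINT,
  amount  DOUBLE
)
STORED AS ORC;

-- Long form: the same triplet spelled out explicitly.
CREATE TABLE sales_orc_long (
  sale_id BIGINT,
  amount  DOUBLE
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS
  INPUTFORMAT  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
```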
Follow these steps to add a new native Hive SerDe with a “STORED AS” keyword:
– First, create a storage format descriptor class extending AbstractStorageFormatDescriptor.java that returns a “stored as” keyword and the names of the InputFormat, OutputFormat, and SerDe classes.
– Then, add the name of the storage format descriptor class to the StorageFormatDescriptor registration file.
That covers the Apache Hive SerDe tutorial. We hope you liked our explanation of SerDes in Hive.
Conclusion
We have now seen the whole concept of Hive SerDe: how to write your own Hive SerDe, the registration of native SerDes, built-in SerDes in Hive, how to write custom SerDes in Hive, ObjectInspector, and some examples of SerDes in Hive. If you have any questions, feel free to ask in the comment section.