
Avro Serialization | Serialization In Java & Hadoop


Today, we will learn Avro serialization in detail. This includes the serialization encodings in Avro, a brief look at Avro serialization in Java, and a closer look at Avro serialization in Hadoop.

We will also see the advantages and disadvantages of Hadoop serialization compared with Java serialization.

So, let’s begin with the introduction to Avro Serialization.


What is Avro Serialization?

To transport data over a network or store it on persistent storage, we translate data structures or object state into binary or textual form. This translation process is what we call serialization in Avro.

However, once the data has been transported over the network or retrieved from persistent storage, it needs to be deserialized again. Serialization in Avro is also known as marshalling, and deserialization in Avro is known as unmarshalling.

Moreover, Avro data is always serialized together with its schema. Files that store Avro data must also include the schema for that data in the same file.

Likewise, Remote Procedure Call (RPC) systems based on Avro must guarantee that the remote recipients of the data have a copy of the schema that was used to write it.

Because the schema used to write the data is always available when the data is read, Avro data is not itself tagged with type information. We need the schema to parse the data.


Generally, both Avro serialization and deserialization proceed as a depth-first, left-to-right traversal of the schema, serializing primitive types as they are encountered.
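To make this concrete, here is a minimal sketch of schema-based serialization using Avro's Java API. The record schema, field names, and values are invented for this example, and an org.apache.avro dependency on the classpath is assumed.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import java.io.ByteArrayOutputStream;

public class AvroSerializeSketch {
  public static void main(String[] args) throws Exception {
    // A made-up schema; real schemas are usually loaded from .avsc files.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}");

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "alice");
    user.put("age", 30);

    // Serialize: the writer walks the schema depth-first, left to right,
    // emitting each primitive value as it is encountered.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    writer.write(user, encoder);
    encoder.flush();

    System.out.println("Serialized " + out.size() + " bytes");
  }
}

Note that the output contains no field names or type tags; a reader needs the same schema to make sense of the bytes.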

Encodings in Avro Serialization

There are two serialization encodings available in Avro: binary encoding and JSON encoding. Most applications use the binary encoding, since it is smaller and faster.

However, the JSON encoding is sometimes more appropriate, for example for debugging or for web-based applications. So, let's learn about both encodings in detail:


a. Binary Encoding in Avro

Primitive Types

In binary, primitive types are encoded as follows:

- null is written as zero bytes.
- a boolean is written as a single byte whose value is either 0 (false) or 1 (true).
- int and long values are written using variable-length zig-zag coding.
- a float is written as 4 bytes and a double as 8 bytes, in little-endian byte order.
- bytes are encoded as a long (the byte count) followed by that many bytes of data.
- a string is encoded as a long (the byte count) followed by that many bytes of UTF-8 encoded character data.

Example of Binary Encoding in Avro Serialization:

value    hex
0        00
-1       01
1        02
-2       03
2        04
-64      7f
64       80 01
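To see how the zig-zag varint coding in the table comes about, here is a minimal sketch of such an encoder in Java. The class and method names are invented; Avro's own implementation lives in the org.apache.avro.io package.

import java.io.ByteArrayOutputStream;

public class ZigZagSketch {
  // Zig-zag maps signed ints to unsigned so small magnitudes stay small:
  // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
  static int zigZag(int n) {
    return (n << 1) ^ (n >> 31);
  }

  // Variable-length encoding: 7 bits per byte, low bits first,
  // with the high bit of each byte set while more bytes follow.
  static byte[] encodeInt(int n) {
    int v = zigZag(n);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((v & ~0x7f) != 0) {
      out.write((v & 0x7f) | 0x80);
      v >>>= 7;
    }
    out.write(v);
    return out.toByteArray();
  }

  public static void main(String[] args) {
    for (int n : new int[] {0, -1, 1, -2, 2, -64, 64}) {
      StringBuilder hex = new StringBuilder();
      for (byte b : encodeInt(n)) hex.append(String.format("%02x ", b));
      System.out.println(n + " -> " + hex.toString().trim());
    }
  }
}

Running this reproduces the table above; for example, 64 zig-zags to 128, which needs two bytes: 80 01.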

b. JSON Encoding in Avro

Basically, the JSON encoding in Avro serialization is the same as the encoding used for field default values, except for unions.

In JSON, the value of a union is encoded as follows: if its type is null, it is encoded as a JSON null; otherwise, it is encoded as a JSON object with one name/value pair, whose name is the type's name and whose value is the recursively encoded value.

Now, let's understand this with an example. Given the union schema ["null","string","Foo"], where Foo is a record name, values would encode as:

- null as null;
- the string "a" as {"string": "a"}; and
- a Foo instance as {"Foo": {...}}, where {...} indicates the JSON encoding of a Foo instance.

Note that we still need a schema to correctly process JSON-encoded data. For example, the JSON encoding does not distinguish between records and maps, between int and long, or between float and double, and so on.

i. Single-object encoding

Sometimes we need to store a single Avro-serialized object for a longer period of time. A very common example of this is storing Avro records for several weeks in an Apache Kafka topic.

ii. Single-object encoding specification

We encode such an object as:

  1. A two-byte marker, C3 01, to show that the message is Avro and uses the single-object format (version 1).
  2. The 8-byte little-endian CRC-64-AVRO fingerprint of the object's schema.
  3. The Avro object itself, encoded using Avro's binary encoding.

In addition, implementations use the 2-byte marker to determine whether a payload is Avro. When a message does not contain an Avro payload, this check helps avoid the expensive lookup that resolves a schema from its fingerprint.
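As a sketch, such a cheap check might look like this in Java; the class and method names are invented for the example.

public class SingleObjectCheck {
  // Minimal sketch: reject non-Avro payloads cheaply before any schema lookup.
  static boolean hasSingleObjectHeader(byte[] payload) {
    // The single-object encoding starts with the 2-byte marker C3 01,
    // followed by the 8-byte little-endian CRC-64-AVRO schema fingerprint.
    return payload.length >= 10
        && (payload[0] & 0xff) == 0xc3
        && (payload[1] & 0xff) == 0x01;
  }

  public static void main(String[] args) {
    byte[] notAvro = {0x7b, 0x22};              // e.g. the start of a JSON message
    byte[] header = new byte[10];
    header[0] = (byte) 0xc3;
    header[1] = 0x01;                            // marker + zeroed fingerprint
    System.out.println(hasSingleObjectHeader(notAvro)); // false
    System.out.println(hasSingleObjectHeader(header));  // true
  }
}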

Avro Serialization in Java

Java has a built-in mechanism known as object serialization. It lets us represent an object as a byte sequence that includes the object's data as well as information about the object's type and the types of the data stored in it.

Once a serialized object has been written to a file, it can be read back from the file and deserialized. In Java, the ObjectOutputStream and ObjectInputStream classes are used to serialize and deserialize an object, respectively.
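For illustration, here is a minimal sketch of this round trip; the Person class and the file name are invented for the example.

import java.io.*;

public class JavaSerializationSketch {
  // A class must implement java.io.Serializable to be serialized this way.
  static class Person implements Serializable {
    String name;
    int age;
    Person(String name, int age) { this.name = name; this.age = age; }
  }

  public static void main(String[] args) throws Exception {
    File file = new File("person.ser");

    // Serialize: write the object's state (plus type information) to the file.
    try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
      out.writeObject(new Person("alice", 30));
    }

    // Deserialize: read the byte sequence back and rebuild the object.
    try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
      Person p = (Person) in.readObject();
      System.out.println(p.name + ", " + p.age);
    }
  }
}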

Avro Serialization in Hadoop

In distributed systems like Hadoop, the concept of serialization is used mainly for interprocess communication and for persistent storage.


a. Interprocess Communication

  1. Basically, the RPC technique is used to establish interprocess communication between the nodes connected in a network.
  2. RPC uses internal serialization to convert the message into binary format before sending it to the remote node over the network. At the other end, the remote system deserializes the binary stream back into the original message.
  3. An RPC serialization format needs to be −

Compact − it should use network bandwidth efficiently, since bandwidth is the scarcest resource in a data center.

Fast − serialization and deserialization should be quick, with little overhead, because communication between the nodes is crucial in distributed systems.

Extensible − protocols change over time to meet new requirements, so it should be straightforward to evolve the protocol in a controlled manner for clients and servers.

Interoperable − the message format should support nodes written in different languages.

b. Persistent Storage

Persistent storage is digital storage that does not lose its data when the power supply fails. Examples include magnetic disks and hard disk drives.

The Writable Interface

Basically, the Writable interface in Hadoop offers two methods, one for deserialization and one for serialization:

void readFields(DataInput in) − we use this method to deserialize the fields of the given object.

void write(DataOutput out) − we use this method to serialize the fields of the given object.
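As an illustration, a minimal custom Writable might look like the following sketch; the PairWritable class and its fields are invented for the example.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A made-up Writable wrapping two ints, showing both interface methods.
public class PairWritable implements Writable {
  private int first;
  private int second;

  @Override
  public void write(DataOutput out) throws IOException {
    // Serialize the fields in a fixed order.
    out.writeInt(first);
    out.writeInt(second);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // Deserialize the fields in the same order they were written.
    first = in.readInt();
    second = in.readInt();
  }
}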

WritableComparable Interface

The WritableComparable interface is a combination of two interfaces: the Writable interface and the Comparable interface.

Basically, this interface inherits the Comparable interface of Java and the Writable interface of Hadoop. Hence, it offers methods for data serialization, deserialization, and comparison.

So, the method it adds is:

int compareTo(WritableComparable obj) − this method compares the current object with the given object obj.
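Continuing the made-up PairWritable sketch above, making it sortable only requires adding compareTo:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// The same invented pair type, now comparable by its first field, then its second.
public class ComparablePairWritable
    implements WritableComparable<ComparablePairWritable> {
  private int first;
  private int second;

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(first);
    out.writeInt(second);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    first = in.readInt();
    second = in.readInt();
  }

  @Override
  public int compareTo(ComparablePairWritable other) {
    // Compare the current object with the given object.
    int cmp = Integer.compare(first, other.first);
    return cmp != 0 ? cmp : Integer.compare(second, other.second);
  }
}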
Also, there are a number of wrapper classes that implement the WritableComparable interface in Hadoop.

Here each class wraps a Java primitive type. Now, we can see the Hadoop serialization class hierarchy in the following figure −

Avro Serialization – WritableComparable Interface

Hence, these classes are useful for serializing various types of data in Hadoop.

IntWritable Class

The IntWritable class implements the Writable, Comparable, and WritableComparable interfaces. Basically, it wraps an integer data type. This class also offers constructors and methods to serialize and deserialize integer data:
a. Constructors
IntWritable()
IntWritable(int value)

b. Methods

int get() − we can get the integer value present in the current object by using this method.

void readFields(DataInput in) − we use this method to deserialize the data in the given DataInput object.

void set(int value) − we use this method to set the value of the current IntWritable object.

void write(DataOutput out) − we use this method to serialize the data in the current object to the given DataOutput object.

Serializing the Data in Hadoop

Now, the procedure to serialize an integer in Hadoop is as follows (see the sketch after this list):

  1. Instantiate the IntWritable class, wrapping the integer value to serialize.
  2. Instantiate a ByteArrayOutputStream, which collects the output in a byte array.
  3. Instantiate a DataOutputStream, passing the ByteArrayOutputStream object to it.
  4. Serialize the integer value using the write() method of the IntWritable object, passing the DataOutputStream.
  5. Retrieve the serialized bytes from the ByteArrayOutputStream.
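Here is a minimal sketch of these steps; the class name is invented, and IntWritable comes from org.apache.hadoop.io.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;

public class IntSerializeSketch {
  public static byte[] serialize(int value) throws IOException {
    IntWritable intWritable = new IntWritable(value);               // step 1
    ByteArrayOutputStream byteStream = new ByteArrayOutputStream(); // step 2
    DataOutputStream dataStream = new DataOutputStream(byteStream); // step 3
    intWritable.write(dataStream);                                  // step 4
    return byteStream.toByteArray();                                // step 5
  }

  public static void main(String[] args) throws IOException {
    byte[] bytes = serialize(18);
    System.out.println("Serialized into " + bytes.length + " bytes"); // 4 bytes
  }
}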

Deserializing the Data in Hadoop

After serialization, the procedure to deserialize the integer is as follows (again, a sketch follows the list):

  1. Instantiate the IntWritable class.
  2. Instantiate a ByteArrayInputStream over the serialized byte array.
  3. Instantiate a DataInputStream, passing the ByteArrayInputStream object to it.
  4. Deserialize the data using the readFields() method of the IntWritable object, passing the DataInputStream.
  5. Read the deserialized value using the get() method.
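And a matching sketch for the reverse direction, reusing the invented serializer above:

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;

public class IntDeserializeSketch {
  public static int deserialize(byte[] bytes) throws IOException {
    IntWritable intWritable = new IntWritable();                        // step 1
    ByteArrayInputStream byteStream = new ByteArrayInputStream(bytes);  // step 2
    DataInputStream dataStream = new DataInputStream(byteStream);       // step 3
    intWritable.readFields(dataStream);                                 // step 4
    return intWritable.get();                                           // step 5
  }

  public static void main(String[] args) throws IOException {
    byte[] bytes = IntSerializeSketch.serialize(18);
    System.out.println(deserialize(bytes)); // prints 18
  }
}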

Advantage of Hadoop Over Java Serialization

Basically, by reusing Writable objects, Hadoop's Writable-based serialization reduces object-creation overhead, which Java's native serialization framework cannot do.
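To sketch what this reuse looks like, the following example fills a single IntWritable over and over while reading a stream of records; the byte stream here is hand-built for the example. Java's readObject(), by contrast, allocates a fresh object for every record it reads.

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;

public class ReuseSketch {
  public static void main(String[] args) throws IOException {
    // Three serialized ints back to back (big-endian, 4 bytes each).
    byte[] stream = {0,0,0,1, 0,0,0,2, 0,0,0,3};
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(stream));

    // One IntWritable is allocated once and refilled for every record.
    IntWritable reused = new IntWritable();
    for (int i = 0; i < 3; i++) {
      reused.readFields(in);
      System.out.println(reused.get()); // 1, 2, 3
    }
  }
}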

Disadvantages of Hadoop Serialization

There are two main mechanisms for serializing Hadoop data: Writable classes and SequenceFiles.

However, the main disadvantage of these two mechanisms is that both Writables and SequenceFiles have only a Java API, which means the data cannot be written or read in any other language.

This makes Hadoop a limited box. To address this drawback, Doug Cutting created Avro, a language-independent data format.

So, this was all in Apache Avro Serialization. Hope you like our explanation.

Conclusion: Avro Serialization

Hence, we have seen the concept of Avro serialization in detail. In this Avro serialization tutorial, we looked at serialization in Java, serialization in Hadoop, and the encodings in Avro serialization. Moreover, we discussed the advantages and disadvantages of Hadoop serialization.

Also, we saw the Writable interface and the IntWritable class. Furthermore, if any doubt occurs regarding serialization in Apache Avro, feel free to ask in the comment section. We are happy to help.
