Avro SerDe | Avro Serialization & Deserialization Using Parsers

1. Avro SerDe Using Parsers

In our previous Avro tutorial, we discussed Avro SerDe with code generation. Today, we will see Avro SerDe using parsers. In this article on Avro serialization and deserialization, we will learn to read a schema using the parsers library and to serialize and deserialize data with Avro. As discussed in the previous article, we can read an Avro schema into a program in two ways: with code generation, or without it by using the parsers library. We recommend reading our previous article, “Avro SerDe by generating class”, first, because we use the same example here, only without code generation.

So, let’s start with serialization and deserialization in Avro using parsers.

2. Serialization and Deserialization in Avro

When we transport data over a network or store it on persistent storage, we need to translate data structures or object state into a binary or textual form. This process is called serialization. To use the data again, we must deserialize it, that is, rebuild the original structure from the serialized form.
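
To make the idea concrete, here is a minimal round-trip sketch that uses only the Java standard library. It is not Avro's binary format; the RoundTripSketch class and its User type are hypothetical names introduced only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RoundTripSketch {
    // Hypothetical value type, used only for this illustration.
    static final class User {
        final String name;
        final int favoriteNumber;

        User(String name, int favoriteNumber) {
            this.name = name;
            this.favoriteNumber = favoriteNumber;
        }
    }

    // Serialization: translate the object's state into bytes.
    static byte[] serialize(User user) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(user.name);
            out.writeInt(user.favoriteNumber);
        }
        return bytes.toByteArray();
    }

    // Deserialization: rebuild the object by reading the fields
    // back in the same order they were written.
    static User deserialize(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String name = in.readUTF();
            int favoriteNumber = in.readInt();
            return new User(name, favoriteNumber);
        }
    }

    public static void main(String[] args) throws IOException {
        User copy = deserialize(serialize(new User("Chandler", 256)));
        System.out.println(copy.name + " " + copy.favoriteNumber);
    }
}
```

In this hand-written version, the reader has to know the field order in advance. Avro avoids that by storing the schema together with the data.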

In Avro, data is always stored with its corresponding schema. Hence, we can serialize and deserialize data even without code generation. The example involves the following steps:

  • Creating users
  • Serializing
  • Deserializing
  • Compiling and running the example code

3. Creating Users For Avro SerDe

First, we use a Parser to read our schema definition and create a Schema object.

Schema schema = new Schema.Parser().parse(new File("user.avsc"));     

Now, let’s create some users, using this schema.

GenericRecord user1 = new GenericData.Record(schema);
user1.put("name", "Chandler");
user1.put("favorite_number", 256);
// Leave favorite color null
GenericRecord user2 = new GenericData.Record(schema);
user2.put("name", "Liza");
user2.put("favorite_number", 7);
user2.put("favorite_color", "red");

Since we are not using code generation, we use GenericRecords to represent users. GenericRecord uses the schema to verify that we only specify valid fields. If we try to set a non-existent field (e.g., user1.put("favorite_animal", "cat")), we will get an AvroRuntimeException when we run the program.

Note that we do not set user1’s favorite color. Since that field is of type ["string", "null"], we can either set it to a string or leave it null; it is essentially optional.
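
The following sketch shows both behaviours together: the unset optional field and the rejected unknown field. It assumes the Avro library is on the classpath. Since user.avsc is not reproduced in this article, the schema is inlined as a string and is an assumption modeled on the User example from the Avro getting-started guide; FieldCheckSketch and rejectsUnknownField are hypothetical names.

```java
import org.apache.avro.AvroRuntimeException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class FieldCheckSketch {
    // Assumed schema, inlined so the sketch does not depend on user.avsc.
    static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"User\",\"namespace\":\"example.avro\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"favorite_number\",\"type\":[\"int\",\"null\"]},"
      + "{\"name\":\"favorite_color\",\"type\":[\"string\",\"null\"]}]}";

    // Returns true if putting a field that is absent from the schema
    // is rejected with an AvroRuntimeException.
    static boolean rejectsUnknownField(GenericRecord record) {
        try {
            record.put("favorite_animal", "cat");
            return false;
        } catch (AvroRuntimeException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        GenericRecord user1 = new GenericData.Record(schema);
        user1.put("name", "Chandler");
        user1.put("favorite_number", 256);

        // favorite_color was never set, so the record holds no value for it.
        System.out.println("favorite_color = " + user1.get("favorite_color"));
        System.out.println("unknown field rejected: " + rejectsUnknownField(user1));
    }
}
```

The check in this sketch covers field names only. Whether a value matches the declared type of its field is a separate question that this sketch does not exercise.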

4. Serializing in Apache Avro

Serializing and deserializing user objects is almost identical to the example with code generation in the previous article. The main difference is that here we use generic rather than specific readers and writers.

First, we’ll serialize our users to a data file on disk.

// Serialize user1 and user2 to disk
File file = new File("users.avro");
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
dataFileWriter.create(schema, file);
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.close();        

As before, we create a DatumWriter, which converts Java objects into an in-memory serialized format. Since we are not using code generation, we create a GenericDatumWriter. It requires the schema both to determine how to write the GenericRecords and to verify that all non-nullable fields are present.

As in the code generation example, we also create a DataFileWriter, which writes the serialized records, along with the schema, to the file specified in the dataFileWriter.create call. We write our users to the file via calls to the dataFileWriter.append method, and we close the data file when we are done writing.

5. Deserializing in Avro SerDe Using Parser

Now, after serializing we’ll deserialize the data file:

// Deserialize users from disk
DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema);
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(file, datumReader);
GenericRecord user = null;
while (dataFileReader.hasNext()) {
    // Reuse user object by passing it to next(). This saves us from
    // allocating and garbage collecting many objects for files with
    // many items.
    user = dataFileReader.next(user);
    System.out.println(user);
}
// Release the underlying file once all records have been read
dataFileReader.close();

The output is:
{"name": "Chandler", "favorite_number": 256, "favorite_color": null}
{"name": "Liza", "favorite_number": 7, "favorite_color": "red"}

Analogous to the GenericDatumWriter we used in serialization, we create a GenericDatumReader, which converts in-memory serialized items into GenericRecords. Likewise, analogous to the DataFileWriter, we pass the DatumReader and the previously created File to a DataFileReader, which reads the data file on disk.
Next, we use the DataFileReader to iterate through the serialized users and print each deserialized object to stdout.

Now consider how we perform the iteration. First, we create a single GenericRecord object in which we store the current deserialized user, and we pass this record object to every call of dataFileReader.next. This is a performance optimization that allows the DataFileReader to reuse the same record object rather than allocating a new GenericRecord for every iteration, which can be very expensive in terms of object allocation and garbage collection if we deserialize a large data file. While this technique is the standard way to iterate through a data file, it is also possible to use for (GenericRecord user : dataFileReader) if performance is not a concern.
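
As a rough, self-contained sketch, the two iteration styles can be compared side by side. It assumes the Avro library is on the classpath and again inlines an assumed User schema instead of reading user.avsc. The class IterationStyles and its helper methods are hypothetical names, and a temporary file stands in for users.avro.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class IterationStyles {
    // Assumed schema, inlined so the sketch does not depend on user.avsc.
    static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"User\",\"namespace\":\"example.avro\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"favorite_number\",\"type\":[\"int\",\"null\"]},"
      + "{\"name\":\"favorite_color\",\"type\":[\"string\",\"null\"]}]}";

    // Writes the two example users to a temporary Avro data file.
    static File writeUsers(Schema schema) throws IOException {
        File file = File.createTempFile("users", ".avro");
        file.deleteOnExit();

        GenericRecord user1 = new GenericData.Record(schema);
        user1.put("name", "Chandler");
        user1.put("favorite_number", 256);

        GenericRecord user2 = new GenericData.Record(schema);
        user2.put("name", "Liza");
        user2.put("favorite_number", 7);
        user2.put("favorite_color", "red");

        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user1);
            writer.append(user2);
        }
        return file;
    }

    // Style 1: pass the previous record to next() so it can be reused.
    static List<String> namesWithReuse(File file, Schema schema) throws IOException {
        List<String> names = new ArrayList<>();
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>(schema))) {
            GenericRecord user = null;
            while (reader.hasNext()) {
                user = reader.next(user);
                // Copy out what we need; the record may be overwritten next time.
                names.add(user.get("name").toString());
            }
        }
        return names;
    }

    // Style 2: for-each loop, simpler when performance is not a concern.
    static List<String> namesWithForEach(File file, Schema schema) throws IOException {
        List<String> names = new ArrayList<>();
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>(schema))) {
            for (GenericRecord user : reader) {
                names.add(user.get("name").toString());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        File file = writeUsers(schema);
        System.out.println(namesWithReuse(file, schema));
        System.out.println(namesWithForEach(file, schema));
    }
}
```

One consequence of the reuse style is worth noting: because the same record object may be overwritten on each call, values that must outlive the loop iteration should be copied out, as the sketch does with the name.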

6. Compiling and Running the Example Code

Now, to build and run the example, execute the following commands:

$ mvn compile
$ mvn -q exec:java -Dexec.mainClass=example.GenericMain

This completes the serialization and deserialization example in Avro using parsers.

7. Conclusion: Avro SerDe Using Parsers

Hence, we have seen the concept of Avro SerDe using parsers in detail: creating users, serializing them to a data file, and deserializing them again without code generation. If you have any doubts regarding Avro serialization and deserialization, feel free to ask in the comment section. Hope it helps!
