AVRO Serialization and Deserialization: With Code Generation


Today in this Avro tutorial, we will learn Avro serialization and deserialization with code generation. We will also cover defining and compiling an Avro schema, and then serializing and deserializing data with Avro.

There are two possible ways to read an Avro schema into a program: by generating a class (code generation) corresponding to the schema, or by using the parsers library. Here we will use code generation, and we will also see how to deserialize the data with Avro in detail.

So, let’s start Avro Serialization and Deserialization with Code Generation.

Avro Serialization and Deserialization

Serialization is the process of translating data structures or object state into a binary or textual form, so that the data can be transported over the network or stored on persistent storage.

Conversely, we need to deserialize the data once it has been transported over the network or retrieved from persistent storage.

So, the steps to serialize and deserialize data using Avro are:

  • Define an Avro schema.
  • Compile the schema using the Avro utility.
  • Create Users.
  • Serialize them using the Avro library.
  • Deserialize with code generation.
  • Compile and run.

Now, let’s learn Avro Serialization and Deserialization steps in detail.

Defining an Avro Schema

Avro schemas are defined using JSON. These schemas are composed of primitive types as well as complex types.

The primitive types are null, boolean, int, long, float, double, bytes, and string, while the complex types are record, enum, array, map, union, and fixed.

So, let’s start with a simple schema example, user.avsc:

{"namespace": "example1.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number",  "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}

This schema defines a record representing a hypothetical user. Note that a schema file can only contain a single schema definition. A record definition must contain its type ("type": "record"), a name ("name": "User"), and fields.

In this case, the fields are name, favorite_number, and favorite_color. We also define a namespace ("namespace": "example1.avro"), which together with the name attribute defines the "full name" of the schema (example1.avro.User in this case).

Fields are defined via an array of objects, each of which defines a name and a type. The type attribute of a field is itself another schema object, which can be either a primitive or a complex type.

For example, the name field of our User schema is the primitive type string, while the favorite_number and favorite_color fields are unions, represented by JSON arrays. As we have seen earlier, a union is a complex type that can be any of the types listed in the array.

Here, favorite_number can be either an int or null, essentially making it an optional field.

Compiling the Avro Schema

Code generation allows us to automatically create classes based on our previously defined schema. Once we have the generated classes, there is no need to use the schema directly in our programs. To generate code, we use the avro-tools jar as follows:

java -jar /path/to/avro-tools-1.8.2.jar compile schema <schema file> <destination>

This will generate the appropriate source files, in a package based on the schema's namespace, in the provided destination folder. For example, to generate a User class in package example1.avro from the schema defined above, run:

java -jar /path/to/avro-tools-1.8.2.jar compile schema user.avsc .

Note that there is no need to invoke the schema compiler manually if we are using the Avro Maven plugin; the plugin automatically performs code generation on any .avsc files present in the configured source directory.
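As a minimal sketch, the plugin can be configured in pom.xml along these lines (the plugin version and directory paths here are assumptions; adjust them to match your project layout and Avro version):

```xml
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>1.8.2</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <!-- Directory containing the .avsc schema files -->
        <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
        <!-- Where the generated Java classes are written -->
        <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With this in place, running mvn compile regenerates the classes whenever the schemas change.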

Creating Users For Avro Serialization

Now that we have completed the code generation, let's create some Users, serialize them to a data file on disk, and then read the file back and deserialize the User objects.
First, let's create some Users and set their fields.

User user1 = new User();
user1.setName("Chandler");
user1.setFavoriteNumber(256);
// Leave favorite color null

// Alternate constructor
User user2 = new User("Liza", 7, "red");

// Construct via builder
User user3 = User.newBuilder()
             .setName("Ross")
             .setFavoriteColor("blue")
             .setFavoriteNumber(null)
             .build();

As you can see, we can create Avro objects either by invoking a constructor directly or by using a builder. Unlike constructors, builders will automatically set any default values specified in the schema.

In addition, builders validate the data as it is set, whereas objects constructed directly will not cause an error until the object is serialized. However, using constructors directly generally offers better performance, as builders create a copy of the data structure before it is written.

Note that we do not set user1's favorite color. Since that field is of type ["string", "null"], we can either set it to a string or leave it null; it is essentially optional. Similarly, we set user3's favorite number to null, since using a builder requires setting all fields, even if they are null.

Serializing in Apache Avro

Further, let’s serialize our Users to disk.

// Serialize user1, user2 and user3 to disk
DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
dataFileWriter.create(user1.getSchema(), new File("users.avro"));
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.append(user3);
dataFileWriter.close();

Here we create a DatumWriter, which converts Java objects into an in-memory serialized format. The SpecificDatumWriter class is used with generated classes and extracts the schema from the specified generated type.

Next, we create a DataFileWriter, which writes the serialized records, as well as the schema, to the file specified in the dataFileWriter.create call. We write our users to the file via calls to the dataFileWriter.append method. When we are done writing, we close the data file.

Deserializing With Code Generation

Now that we have serialized the data, we will read it back using the generated class, i.e. deserialize the data with Avro.

Now, let’s deserialize the data file:

// Deserialize Users from disk
DatumReader<User> userDatumReader = new SpecificDatumReader<User>(User.class);
DataFileReader<User> dataFileReader = new DataFileReader<User>(file, userDatumReader);
User user = null;
while (dataFileReader.hasNext()) {
    // Reuse user object by passing it to next(). This saves us from
    // allocating and garbage collecting many objects for files with
    // many items.
    user = dataFileReader.next(user);
    System.out.println(user);
}

This outputs:

{"name": "Chandler", "favorite_number": 256, "favorite_color": null}
{"name": "Liza", "favorite_number": 7, "favorite_color": "red"}
{"name": "Ross", "favorite_number": null, "favorite_color": "blue"}

Avro serialization and deserialization are very similar. Analogous to the SpecificDatumWriter used for serialization, here we create a SpecificDatumReader, which converts in-memory serialized items into instances of our generated class, in this case User. We then pass the DatumReader and the previously created File to a DataFileReader, the counterpart of DataFileWriter, which reads the data file on disk.

We then use the DataFileReader to iterate through the serialized Users and print the deserialized objects to stdout. Note how the iteration is performed: we create a single User object, store the current deserialized user in it, and pass this record object to every call of dataFileReader.next.

This is a performance optimization: it allows the DataFileReader to reuse the same User object rather than allocating a new one for every iteration, which can be very expensive in terms of object allocation and garbage collection when deserializing a large data file. While this is the standard way to iterate through a data file, it is also possible to use for (User user : dataFileReader) if performance is not a concern.

Compiling and Running the Example Code

Finally, to build and run the example, execute the following commands:

$ mvn compile # includes code generation via Avro Maven plugin
$ mvn -q exec:java -Dexec.mainClass=example1.SpecificMain

So, this was all in Avro Serialization and Deserialization. Hope you like our explanation.

Conclusion: AVRO Serialization and Deserialization 

Hence, we have seen the whole concept of AVRO serialization and deserialization with code generation. If any doubts occur regarding AVRO serialization and deserialization, feel free to ask in the comments.

And, in our next tutorial, we will see how to perform AVRO Serialization and Deserialization without code generation. Keep visiting, keep learning!
