Flume Event Serializers – Apache Flume


Apache Flume is a distributed system for transferring data from external sources to HDFS or HBase. An event serializer is the mechanism that converts a Flume event into another format for output.

In this article, you will explore what an event serializer is and look at the different event serializers that ship with Apache Flume, along with their configuration options and examples.

Introduction to Event Serializers

An event serializer is a mechanism for converting a Flume event into another format for output. EventSerializer is an interface that allows arbitrary serialization of an event. Both the hdfs sink and the file_roll sink support the EventSerializer interface.

Event serializers are similar in function to the Layout class in log4j. Flume ships with several event serializers: the text serializer outputs the Flume event body, while the avro_event serializer creates an Avro representation of the event.

Let us now explore the EventSerializers that ship with Flume in detail.

1. Body Text Serializer

It is the default serializer. It writes the body of the Flume event to the output stream without any modification or transformation, and it ignores the event headers: if headers exist on the Flume event, they are discarded.

The Configuration options for Body Text serializer are as follows:

Property Name | Default Value | Description
appendNewline | true | Whether a newline is appended to each Flume event at write time. The default of true assumes, for legacy reasons, that Flume events do not contain newlines.

Example for agent named agent1, sink name sk1, and channel ch1:

agent1.sinks = sk1
agent1.sinks.sk1.type = file_roll
agent1.sinks.sk1.channel = ch1
agent1.sinks.sk1.sink.directory = /var/log/flume
agent1.sinks.sk1.sink.serializer = text
agent1.sinks.sk1.sink.serializer.appendNewline = false
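Conceptually, the Body Text serializer's behavior can be modeled in a few lines. The following Python sketch is an illustration only, not Flume's actual (Java) implementation:

```python
# Python sketch modeling the Body Text serializer's behavior
# (illustrative only; Flume's real serializer is written in Java).

def serialize_body_text(event_body, headers=None, append_newline=True):
    """Write the event body as-is; any event headers are discarded."""
    # Headers are deliberately ignored, mirroring the text serializer.
    return event_body + (b"\n" if append_newline else b"")

print(serialize_body_text(b"log line", {"host": "web01"}))
print(serialize_body_text(b"log line", append_newline=False))
```

Note how the headers argument has no effect on the output, and how appendNewline = false suppresses the trailing newline.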

2. “Flume Event” Avro Event Serializer

Alias: avro_event

This Event Serializer serializes the Flume events into an Avro container file. It uses the same schema which is used for the Flume events in the Avro RPC mechanism. This event serializer inherits from the AbstractAvroEventSerializer class.

The Configuration options for “Flume Event” Avro Event Serializer are as follows:

Property Name | Default Value | Description
syncIntervalBytes | 2048000 | The Avro sync interval, in approximate bytes.
compressionCodec | null | The Avro compression codec. For the supported codecs, see Avro's CodecFactory docs.

Example for agent named agent1, sink name sk1, and channel ch1:

agent1.sinks = sk1
agent1.sinks.sk1.type = hdfs
agent1.sinks.sk1.channel = ch1
agent1.sinks.sk1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
agent1.sinks.sk1.serializer = avro_event
agent1.sinks.sk1.serializer.compressionCodec = snappy

3. Avro Event Serializer

Alias: none

This serializer does not have an alias. It must be specified using its fully qualified class name.

This event serializer also serializes Flume events into an Avro container file, just like the “Flume Event” Avro Event Serializer. However, the record schema is configurable: it can be specified either as a Flume configuration property or in a Flume event header.

To pass the record schema as a Flume configuration property, use the schemaURL property.

To pass the record schema in the Flume event header, choose one of the following:

  • Specify the event header flume.avro.schema.literal containing the JSON-format representation of the schema, or
  • Specify the event header flume.avro.schema.url with a URL where the schema may be found.
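As an illustration, one way to attach the flume.avro.schema.literal header to every event is with a static interceptor on the source. In the sketch below, the source name src1 and the one-field schema are hypothetical, chosen only for this example:

```properties
# Hypothetical sketch: set the schema-literal header on every event
# via a static interceptor. Source name (src1) and the schema itself
# are assumptions for illustration, not from this article.
agent1.sources.src1.interceptors = i1
agent1.sources.src1.interceptors.i1.type = static
agent1.sources.src1.interceptors.i1.key = flume.avro.schema.literal
agent1.sources.src1.interceptors.i1.value = {"type":"record","name":"Event","fields":[{"name":"body","type":"bytes"}]}
```

In practice, applications more commonly set this header themselves when they create the event, so that different events can carry different schemas.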

This event serializer inherits from the AbstractAvroEventSerializer class.

The Configuration options for Avro Event Serializer are as follows:

Property Name | Default Value | Description
syncIntervalBytes | 2048000 | The Avro sync interval, in approximate bytes.
compressionCodec | null | The Avro compression codec. For the supported codecs, see Avro's CodecFactory docs.
schemaURL | null | The Avro schema URL. A schema specified in the event header overrides this option.

Example for agent named agent1, sink name sk1, and channel ch1:

agent1.sinks = sk1
agent1.sinks.sk1.type = hdfs
agent1.sinks.sk1.channel = ch1
agent1.sinks.sk1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
agent1.sinks.sk1.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder
agent1.sinks.sk1.serializer.compressionCodec = snappy
agent1.sinks.sk1.serializer.schemaURL = hdfs://namenode/path/to/schema.avsc
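For reference, the file behind schemaURL is an ordinary Avro schema written in JSON. A minimal sketch of what such a schema.avsc might contain follows; the record name, namespace, and fields here are hypothetical:

```json
{
  "type": "record",
  "name": "LogLine",
  "namespace": "example.flume",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "message", "type": "string"}
  ]
}
```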

Summary

In short, an event serializer is a mechanism that converts Flume events into other formats for output. Several event serializers ship with Apache Flume: the Body Text serializer, the “Flume Event” Avro Event Serializer, and the Avro Event Serializer. This article explained each of them along with their configuration properties and examples.


DataFlair Team

