Apache Flume Channel | Types of Channels in Flume

Boost your career with Data Engineering Courses!!

It’s time to explore different Apache Flume Channel with properties and examples.

In this article, we will explore different Flume channels. The article describes various flume channels like memory channel, JDBC channel, Kafka channel, File channel, and spillable memory channel.

The article also covers the Pseudo transactional channel and custom channel. You will explore various flume channels along with properties and examples.

Let us first see a short introduction to the Flume channel.

Introduction to Flume channel

Flume channel is one of the components of a Flume agent. It sits in between flume sources and flume sinks. Channels ensure no loss of data in Flume. Sources write data to the Flume channel.

The data written to the Flume channel are consumed by Flume sinks. A flume sink can read events from only one channel while multiple sinks can read data from the same channel. They are transactional.

In simple words, it is a passive store that stores data received from sources until the data is consumed by the sink. They are the repositories where the flume events are staged on an agent.
In Flume channels, the Sources adds the events and the Sinks removes it.

Let us now explore different Flume channels.

1. Memory Channel

The memory channel is an in-memory queue where the sources write events to its tail and sinks read the events from its head. The memory channel stores the event written to it by the sources on the heap. We can configure the max size.

Since it stores all data in memory thus provides high throughput. It is best for those flows where data loss is not a concern. It is not suitable for the data flows where data loss is concerned.

Some of the properties for Memory channel are:

Property Name	Default values	Description
type		The component type name must be memory.
capacity	100	It specifies the maximum number of events the channel can store.
transactionCapacity		It specifies the maximum number of events that the channel will take from a flume source or give it to a flume sink per transaction.
keep-alive	3	This property specifies the timeout in seconds for adding or removing an event.
byteCapacityBufferPercentage	20	It defines the percentage of buffer in between the byteCapacity and the estimated total size of all flume events in the channel,in order to account for data in the headers.
byteCapacity		It specifies the maximum total bytes of memory which are allowed as a sum of all flume events in the memory channel. The implementation only counts for the Event body, which is the reason for providing configuration parameters of the byteCapacityBufferPercentage as well.

Example for Memory channel for agent named agent1 and channel ch1:

agent1.channels = ch1
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000
agent1.channels.ch1.transactionCapacity = 10000
agent1.channels.ch1.byteCapacityBufferPercentage = 20
agent1.channels.ch1.byteCapacity = 800000

2. JDBC Channel

The JDBC channel stores the Flume events on persistent storage that is backed by a database. It currently supports embedded Derby.
It is best for the flows where recoverability is important.

Some of the properties for JDBC channel are:

Property Name	Default values	Description
type	–	The component type name must be jdbc.
db.type	DERBY	It specifies the Database vendor. It needs to be DERBY.
driver.class	org.apache.derby.jdbc.EmbeddedDriver	It specifies the Class for vendor’s JDBC driver
driver.url	(constructed from other properties)	It specifies the JDBC connection URL
db.username	“sa”	It specifies the User id for db connection
db.password	–	It specifies the password for db connection
connection.properties.file	–	It specifies the JDBC Connection property file path
maximum.capacity	0 (unlimited)	It specifies the maximum number of events in the channel

Example for JDBC channel for agent named agent1 and channel ch1:

agent1.channels = ch1
agent1.channels.ch1.type = jdbc

3. Kafka Channel

The Kafka channel stores flume events in a Kafka cluster which must be installed separately.

The Kafka channel provides high availability and replication. If the agent or a Kafka broker crashes, then the flume events are immediately available to the other sinks

We can use Kafka channel for multiple scenarios:

a) With Flume source and sink

This provides a highly available and reliable channel for events.

b) With Flume source and interceptor but no sink

This allows writing of Flume events into a Kafka topic, for use by other applications.

c) With Flume sink, but no source

This is a low-latency, fault-tolerant way of sending flume events from Kafka to Flume sinks like HDFS, HBase or Solr

Presently it supports Kafka server 0.10.1.0 releases or higher.

The testing was done up to Kafka 2.0.1 which was the highest Kafka version during release.

The configuration parameters were organized as such:

1. Configuration values which are related to the channel generically were applied at the channel config level.

For eg: agent1.channel.ch1.type =

2. Configuration values which are related to Kafka or how the Channels operates were prefixed with “kafka.”

Foreg: agent1.channels.ch1.kafka.topic and agent1.channels.ch1.kafka.bootstrap.servers.

3. The properties which are specific to the producer/consumer are prefixed as kafka.producer or kafka.consumer

4. Wherever possible, the Kafka parameter names are used
For eg: bootstrap.servers and sacks.

Some properties for Kafka are:

Property Name	Default values	Description
type	–	The component type name must be org.apache.flume.channel.kafka.KafkaChannel.
kafka.bootstrap.servers	–	It specifies the list of brokers in the Kafka cluster used by the channel. It can be a partial list of brokers. We recommend at least two for HA. The format is a comma-separated list of hostname:port
kafka.topic	flume-channel	It specifies the Kafka topic which the Kafka channel will use
kafka.consumer.group.id	flume	It specifies the Consumer group ID the channel uses to register with Kafka. Multiple channels must use the same topic and group to ensure that when one agent fails another can get the data Note: Having non-channel consumers with the same ID can lead to data loss.

Example for Kafka channel for agent named agent1 and channel ch1:

agent1.channels.ch1.type= org.apache.flume.channel.kafka.KafkaChannel
agent1.channels.ch1.kafka.bootstrap.servers= kafka-1:9092,kafka-2:9092,kafka-3:9092
agent1.channels.ch1.kafka.topic = channel1
agent1.channels.ch1.kafka.consumer.group.id = flume-consumer

4. File Channel

It is the Flume’s persistent channel. File channel writes out all the flume events to the disk. It does not lose data even when the process or machine shuts down or crashes.

This channel ensures that any events which are committed into the channel are removed from it only when a sink consumes the events and commits the transaction. It does this even if the machine/agent crashed and was restarted.

The file channel is highly concurrent and handles several flume sources and sinks at the same time.

It is best for flows where we require data durability and can’t tolerate data loss.
As this channel writes data onto the disk, so it does not lose any data even on crash or failure. This is also advantageous for having a very large capacity, especially when compared to the Memory Channel.

Some properties for the File channel are:

Property Name	Default values	Description
type	–	The component type name must be file.
checkpointDir	~/.flume/file-channel/checkpoint	It specifies the directory where checkpoint file will be stored
dataDirs	~/.flume/file-channel/data	It specifies the comma-separated list of directories for storing log files. Use of multiple directories on the separate disks improves the file channel performance.
transactionCapacity	10000	It specifies the maximum size of transaction supported by the channel
capacity	1000000	It specifies the maximum capacity of the channel

Example for File channel for agent named agent1 and channel ch1:

agent1.channels = ch1
agent1.channels.ch1.type = file
agent1.channels.ch1.checkpointDir = /mnt/flume/checkpoint
agent1.channels.ch1.dataDirs = /mnt/flume/data

5. Spillable Memory Channel

The Spillable memory channel stores events in an in-memory queue and on disk. The in-memory queue is the primary store and the disk is the overflow.
Using the embedded File channel, the disk store is managed.

In case when the in-memory queue gets full then the File channel stores the additional incoming events.

It is best for flows that require high throughput of the memory channel during normal operation, and at the same time, the flow requires the larger capacity of the file channel for good tolerance of intermittent flume sink side outages or drop in drain rates.

During the agent crash or restart, only the flume events which are stored on disk are recovered when the flume agent comes online. The Spillable memory channel is currently experimental and is not recommended for use in the production.

Some of the properties for the Spillable memory channel are:

Property Name	Default values	Description
type	–	The component type name must be SPILLABLEMEMORY.
memoryCapacity	10000	It specifies the maximum number of events stored in-memory queue. Set this to zero if you want to disable the use of an in-memory queue.
overflowCapacity	100000000	It specifies the maximum number of events stored in the overflow disk. Set this to zero if you want to disable the use of overflow disk (i.e file channel)
avgEventSize	500	It specifies the estimated average size of events (in bytes) which are going into the channel
byteCapacityBufferPercentage	20	It defines the percentage of buffer in between the byteCapacity and the estimated total size of all flume events in the channel,in order to account for data in the headers.
byteCapacity		It specifies the maximum total bytes of memory which are allowed as a sum of all flume events in the memory channel. The implementation only counts for the Event body, which is the reason for providing configuration parameters of the byteCapacityBufferPercentage as well.

Example for Spillable memory channel for agent named agent1 and channel ch1:

agent1.channels = ch1
agent1.channels.ch1.type = SPILLABLEMEMORY
agent1.channels.ch1.memoryCapacity = 10000
agent1.channels.ch1.overflowCapacity = 1000000
agent1.channels.ch1.byteCapacity = 800000
agent1.channels.ch1.checkpointDir = /mnt/flume/checkpoint
agent1.channels.ch1.dataDirs = /mnt/flume/data

Program for disabling the use of in-memory queue and functioning like a file channel:

agent1.channels = ch1
agent1.channels.ch1.type = SPILLABLEMEMORY
agent1.channels.ch1.memoryCapacity = 0
agent1.channels.ch1.overflowCapacity = 1000000
agent1.channels.ch1.checkpointDir = /mnt/flume/checkpoint
agent1.channels.ch1.dataDirs = /mnt/flume/data

Example for disabling use of overflow disk that is file channel and functioning purely like a in-memory channel:

agent1.channels = ch1
agent1.channels.ch1.type = SPILLABLEMEMORY
agent1.channels.ch1.memoryCapacity = 100000
agent1.channels.ch1.overflowCapacity = 0

6. Pseudo Transactional Channel

It is designed only for testing purposes. It must not be used in production.

Some of the properties for Pseudo transactional channel are:

Property Name	Default values	Description
type	–	The component type name must be org.apache.flume.channel.PseudoTxnMemoryChannel.
capacity	50	It specifies the max number of events the channel can store.
keep-alive	3	It specifies the timeout in seconds for adding or removing an event

7. Custom Channel

It is our own implementation for the Channel interface. We must include the custom channel’s class and its dependencies in the agent’s classpath while starting the Flume agent. The custom channel type must be its FQCN.

Some of the property for the custom channel is:

Property Name	Default values	Description
type	–	The component type name must be FQCN.

Example for Custom channel for agent named agent1 and channel ch1:

agent1.channels = ch1
agent1.channels.ch1.type = org.example.MyChannel

Summary

I hope after reading this article you clearly understood different channels in the flume. Now you can set configuration properties of different flume channels.

The article has explained Flume channels like memory channel, Kafka channel, file channel, pseudo transactional channel, JDBC channel, custom channel, and spillable memory channel.

If you are Happy with DataFlair, do not forget to make us happy with your positive feedback on Google

Apache Flume Channel | Types of Channels in Flume

Introduction to Flume channel

1. Memory Channel

2. JDBC Channel

3. Kafka Channel

a) With Flume source and sink

b) With Flume source and interceptor but no sink

c) With Flume sink, but no source

4. File Channel

5. Spillable Memory Channel

6. Pseudo Transactional Channel

7. Custom Channel

Summary

Leave a Reply Cancel reply

About DataFlair

Trending Courses

Trending Data Science Courses

Free Big Data Courses

Trending Programming Courses

Trending Data Science Tutorials

Trending Projects

Trending Programming Tutorials

Trending Tutorials