Apache Flume Channel | Types of Channels in Flume


In this article, we will explore the different Apache Flume channels along with their properties and configuration examples. The article covers the memory channel, JDBC channel, Kafka channel, file channel, and spillable memory channel, as well as the pseudo transactional channel and the custom channel.

Let us first see a short introduction to the Flume channel.

Introduction to Flume channel

A Flume channel is one of the components of a Flume agent. It sits between the Flume sources and the Flume sinks: sources write events to the channel, and sinks consume them from it. Durable channel types ensure that no data is lost on the way.

A sink can read events from only one channel, while multiple sinks can read from the same channel. Channel operations are transactional.

In simple words, a channel is a passive store that holds events received from sources until they are consumed by a sink. Channels are the repositories where Flume events are staged on an agent: sources add events to a channel and sinks remove them.
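To see how a channel fits into an agent, here is a minimal sketch of an agent configuration that wires one source to one sink through a channel (the source, sink, and channel names and the netcat/logger types are only illustrative):

agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# An illustrative source that listens for newline-terminated text on a TCP port
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444

# The channel that stages events between the source and the sink
agent1.channels.ch1.type = memory

# An illustrative sink that simply logs the events it consumes
agent1.sinks.sink1.type = logger

# A source can write to one or more channels; a sink reads from exactly one channel
agent1.sources.src1.channels = ch1
agent1.sinks.sink1.channel = ch1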

Let us now explore different Flume channels.

1. Memory Channel

The memory channel is an in-memory queue: sources write events to its tail and sinks read events from its head. The channel keeps the events on the JVM heap, and the maximum number of events it can hold is configurable.

Since it stores all data in memory, it provides high throughput, but any events still in the queue are lost if the agent process or machine dies. It is therefore best for flows where data loss is not a concern, and unsuitable for flows where data loss matters.

Some of the properties for Memory channel are:

Property Name | Default Value | Description
type | – | The component type name; must be memory.
capacity | 100 | The maximum number of events the channel can store.
transactionCapacity | 100 | The maximum number of events the channel will take from a Flume source or give to a Flume sink per transaction.
keep-alive | 3 | The timeout in seconds for adding or removing an event.
byteCapacityBufferPercentage | 20 | The percentage of buffer between the byteCapacity and the estimated total size of all Flume events in the channel, to account for data in the event headers.
byteCapacity | – | The maximum total bytes of memory allowed as the sum of all Flume events in the memory channel. The implementation only counts the event body, which is why the byteCapacityBufferPercentage parameter is also provided.

Example for the memory channel, for an agent named agent1 and a channel named ch1:

agent1.channels = ch1
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000
agent1.channels.ch1.transactionCapacity = 10000
agent1.channels.ch1.byteCapacityBufferPercentage = 20
agent1.channels.ch1.byteCapacity = 800000

2. JDBC Channel

The JDBC channel stores Flume events in persistent storage backed by a database. Currently, embedded Derby is the only supported database.
It is best for flows where recoverability is important.

Some of the properties for JDBC channel are:

Property Name | Default Value | Description
type | – | The component type name; must be jdbc.
db.type | DERBY | The database vendor; it needs to be DERBY.
driver.class | org.apache.derby.jdbc.EmbeddedDriver | The class of the vendor's JDBC driver.
driver.url | (constructed from other properties) | The JDBC connection URL.
db.username | "sa" | The user id for the database connection.
db.password | – | The password for the database connection.
connection.properties.file | – | The path of the JDBC connection properties file.
maximum.capacity | 0 (unlimited) | The maximum number of events in the channel.

Example for the JDBC channel, for an agent named agent1 and a channel named ch1:

agent1.channels = ch1
agent1.channels.ch1.type = jdbc
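
If needed, the optional properties from the table above can be set in the same way; a hedged sketch (the credentials and capacity here are only illustrative values):

agent1.channels = ch1
agent1.channels.ch1.type = jdbc
agent1.channels.ch1.db.type = DERBY
agent1.channels.ch1.db.username = sa
agent1.channels.ch1.db.password = secret
agent1.channels.ch1.maximum.capacity = 100000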

3. Kafka Channel

The Kafka channel stores Flume events in a Kafka cluster, which must be installed separately.

The Kafka channel provides high availability and replication. If the agent or a Kafka broker crashes, the Flume events are immediately available to other sinks.

We can use Kafka channel for multiple scenarios:

a) With Flume source and sink

This provides a highly available and reliable channel for events.

b) With Flume source and interceptor but no sink

This allows writing of Flume events into a Kafka topic, for use by other applications.

c) With Flume sink, but no source

This is a low-latency, fault-tolerant way of sending Flume events from Kafka to Flume sinks such as HDFS, HBase, or Solr.

Presently, it supports Kafka server versions 0.10.1.0 or higher.

Testing was done up to Kafka 2.0.1, which was the highest available Kafka version at the time of release.

The configuration parameters are organized as follows:

1. Configuration values related to the channel generically are applied at the channel configuration level, for example: agent1.channels.ch1.type

2. Configuration values related to Kafka or to how the channel operates are prefixed with "kafka.", for example: agent1.channels.ch1.kafka.topic and agent1.channels.ch1.kafka.bootstrap.servers

3. Properties that are specific to the producer or consumer are prefixed with kafka.producer or kafka.consumer.

4. Wherever possible, the original Kafka parameter names are used, for example bootstrap.servers and acks. Putting these conventions together gives a configuration like the sketch below.
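A hedged sketch combining channel-level, kafka.-prefixed, and producer/consumer-prefixed properties (the broker addresses are illustrative, and the security settings merely assume a SASL-secured cluster):

agent1.channels = ch1
agent1.channels.ch1.type = org.apache.flume.channel.kafka.KafkaChannel
agent1.channels.ch1.kafka.bootstrap.servers = kafka-1:9092,kafka-2:9092
agent1.channels.ch1.kafka.topic = channel1
# Producer/consumer-specific Kafka settings take the kafka.producer. / kafka.consumer. prefix
agent1.channels.ch1.kafka.producer.security.protocol = SASL_PLAINTEXT
agent1.channels.ch1.kafka.consumer.security.protocol = SASL_PLAINTEXT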

Some properties for Kafka are:

Property Name | Default Value | Description
type | – | The component type name; must be org.apache.flume.channel.kafka.KafkaChannel.
kafka.bootstrap.servers | – | The list of brokers in the Kafka cluster used by the channel. It can be a partial list of brokers, but we recommend at least two for HA. The format is a comma-separated list of hostname:port.
kafka.topic | flume-channel | The Kafka topic which the Kafka channel will use.
kafka.consumer.group.id | flume | The consumer group ID the channel uses to register with Kafka. Multiple channels must use the same topic and group to ensure that when one agent fails another can get the data. Note: having non-channel consumers with the same ID can lead to data loss.

Example for the Kafka channel, for an agent named agent1 and a channel named ch1:

agent1.channels = ch1
agent1.channels.ch1.type = org.apache.flume.channel.kafka.KafkaChannel
agent1.channels.ch1.kafka.bootstrap.servers = kafka-1:9092,kafka-2:9092,kafka-3:9092
agent1.channels.ch1.kafka.topic = channel1
agent1.channels.ch1.kafka.consumer.group.id = flume-consumer
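
To use this channel in scenario (a) above, a source and a sink bind to it just like any other channel; a minimal sketch (the source and sink names, the netcat/HDFS types, and the HDFS path are only illustrative):

agent1.sources = src1
agent1.sinks = sink1
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events
agent1.sinks.sink1.channel = ch1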

4. File Channel

The file channel is Flume's persistent channel. It writes all Flume events out to disk, so it does not lose data even when the process or machine shuts down or crashes.

This channel ensures that any events which are committed into the channel are removed from it only when a sink consumes the events and commits the transaction. It does this even if the machine/agent crashed and was restarted.

The file channel is highly concurrent and handles several flume sources and sinks at the same time.

It is best for flows that require data durability and cannot tolerate data loss.
Because this channel writes data to disk, it does not lose any data on a crash or failure. It also offers a very large capacity, especially when compared to the memory channel.

Some properties for the File channel are:

Property Name | Default Value | Description
type | – | The component type name; must be file.
checkpointDir | ~/.flume/file-channel/checkpoint | The directory where the checkpoint file will be stored.
dataDirs | ~/.flume/file-channel/data | A comma-separated list of directories for storing the log files. Using multiple directories on separate disks improves file channel performance.
transactionCapacity | 10000 | The maximum size of a transaction supported by the channel.
capacity | 1000000 | The maximum capacity of the channel.

Example for the file channel, for an agent named agent1 and a channel named ch1:

agent1.channels = ch1
agent1.channels.ch1.type = file
agent1.channels.ch1.checkpointDir = /mnt/flume/checkpoint
agent1.channels.ch1.dataDirs = /mnt/flume/data
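
As the dataDirs description above notes, spreading the data directories across separate disks can improve performance; a hedged variant (the mount points are only illustrative):

agent1.channels = ch1
agent1.channels.ch1.type = file
agent1.channels.ch1.checkpointDir = /mnt/disk0/flume/checkpoint
# Comma-separated data directories, ideally on separate physical disks
agent1.channels.ch1.dataDirs = /mnt/disk1/flume/data,/mnt/disk2/flume/data
agent1.channels.ch1.transactionCapacity = 10000
agent1.channels.ch1.capacity = 1000000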

5. Spillable Memory Channel

The spillable memory channel stores events in an in-memory queue and on disk. The in-memory queue is the primary store and the disk is the overflow; the disk store is managed by an embedded file channel.

When the in-memory queue fills up, the file channel stores the additional incoming events.

It is best for flows that need the high throughput of the memory channel during normal operation, and at the same time need the larger capacity of the file channel to tolerate intermittent sink-side outages or drops in drain rate.

After an agent crash or restart, only the events stored on disk are recovered when the agent comes back online. The spillable memory channel is currently experimental and is not recommended for use in production.

Some of the properties for the Spillable memory channel are:

Property Name | Default Value | Description
type | – | The component type name; must be SPILLABLEMEMORY.
memoryCapacity | 10000 | The maximum number of events stored in the in-memory queue. Set this to zero to disable the in-memory queue.
overflowCapacity | 100000000 | The maximum number of events stored on the overflow disk. Set this to zero to disable the overflow disk (i.e. the file channel).
avgEventSize | 500 | The estimated average size (in bytes) of the events going into the channel.
byteCapacityBufferPercentage | 20 | The percentage of buffer between the byteCapacity and the estimated total size of all Flume events in the channel, to account for data in the event headers.
byteCapacity | – | The maximum total bytes of memory allowed as the sum of all Flume events in the memory channel. The implementation only counts the event body, which is why the byteCapacityBufferPercentage parameter is also provided.

Example for the spillable memory channel, for an agent named agent1 and a channel named ch1:

agent1.channels = ch1
agent1.channels.ch1.type = SPILLABLEMEMORY
agent1.channels.ch1.memoryCapacity = 10000
agent1.channels.ch1.overflowCapacity = 1000000
agent1.channels.ch1.byteCapacity = 800000
agent1.channels.ch1.checkpointDir = /mnt/flume/checkpoint
agent1.channels.ch1.dataDirs = /mnt/flume/data

Example of disabling the in-memory queue so that the channel functions like a file channel:

agent1.channels = ch1
agent1.channels.ch1.type = SPILLABLEMEMORY
agent1.channels.ch1.memoryCapacity = 0
agent1.channels.ch1.overflowCapacity = 1000000
agent1.channels.ch1.checkpointDir = /mnt/flume/checkpoint
agent1.channels.ch1.dataDirs = /mnt/flume/data

Example of disabling the overflow disk (the file channel) so that the channel functions purely as an in-memory channel:

agent1.channels = ch1
agent1.channels.ch1.type = SPILLABLEMEMORY
agent1.channels.ch1.memoryCapacity = 100000
agent1.channels.ch1.overflowCapacity = 0

6. Pseudo Transactional Channel

It is designed only for testing purposes. It must not be used in production.

Some of the properties for Pseudo transactional channel are:

Property Name | Default Value | Description
type | – | The component type name; must be org.apache.flume.channel.PseudoTxnMemoryChannel.
capacity | 50 | The maximum number of events the channel can store.
keep-alive | 3 | The timeout in seconds for adding or removing an event.
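
For consistency with the other channels, a minimal sketch for an agent named agent1 and a channel ch1 (again, for testing only):

agent1.channels = ch1
agent1.channels.ch1.type = org.apache.flume.channel.PseudoTxnMemoryChannel
agent1.channels.ch1.capacity = 50
agent1.channels.ch1.keep-alive = 3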

7. Custom Channel

A custom channel is our own implementation of the Channel interface. Its class and its dependencies must be included in the agent's classpath when starting the Flume agent, and the channel's type must be its fully qualified class name (FQCN).

The property for the custom channel is:

Property Name | Default Value | Description
type | – | The component type name; must be the FQCN of the custom channel class.

Example for a custom channel, for an agent named agent1 and a channel named ch1:

agent1.channels = ch1
agent1.channels.ch1.type = org.example.MyChannel

Summary

I hope that after reading this article you clearly understand the different channels in Flume and can now set the configuration properties of each of them.

The article has explained Flume channels like memory channel, Kafka channel, file channel, pseudo transactional channel, JDBC channel, custom channel, and spillable memory channel.
