Apache Flume Channel | Types of Channels in Flume
In this article, we will explore the different Apache Flume channels along with their properties and configuration examples. The article describes the memory channel, JDBC channel, Kafka channel, file channel, and spillable memory channel, and also covers the pseudo transactional channel and the custom channel.
Let us first see a short introduction to the Flume channel.
Introduction to Flume channel
Flume channel is one of the components of a Flume agent. It sits in between flume sources and flume sinks. Channels ensure no loss of data in Flume. Sources write data to the Flume channel.
The data written to the Flume channel is consumed by Flume sinks. A Flume sink can read events from only one channel, while multiple sinks can read data from the same channel. Channels are transactional.
In simple words, it is a passive store that stores data received from sources until the data is consumed by the sink. They are the repositories where the flume events are staged on an agent.
In Flume channels, the sources add events and the sinks remove them.
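The source-to-channel-to-sink wiring described above can be seen in a minimal agent definition (the agent, source, channel, and sink names here are illustrative):

```properties
# One source, one channel, one sink (illustrative names)
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# The source writes its events into ch1
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

# The sink drains events from the same channel
agent1.sinks.sink1.type = logger
agent1.sinks.sink1.channel = ch1

agent1.channels.ch1.type = memory
```

Note that a source is configured with `channels` (plural, since a source can fan out to several channels) while a sink is configured with `channel` (singular), matching the rule that a sink reads from only one channel.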
Let us now explore different Flume channels.
1. Memory Channel
The memory channel is an in-memory queue: sources write events to its tail and sinks read events from its head. The memory channel stores the events written to it on the heap, and we can configure its maximum size.
Since it stores all data in memory, it provides high throughput. It is best suited for flows where data loss is not a concern, and unsuitable for flows where data loss would be a problem.
Some of the properties for Memory channel are:
Property Name | Default values | Description |
type | – | The component type name must be memory. |
capacity | 100 | It specifies the maximum number of events the channel can store. |
transactionCapacity | 100 | It specifies the maximum number of events that the channel will take from a flume source or give to a flume sink per transaction. |
keep-alive | 3 | This property specifies the timeout in seconds for adding or removing an event. |
byteCapacityBufferPercentage | 20 | It defines the percentage of buffer between the byteCapacity and the estimated total size of all flume events in the channel, in order to account for data in the headers. |
byteCapacity | 80% of the JVM heap (-Xmx) | It specifies the maximum total bytes of memory allowed as a sum of all flume events in the memory channel. The implementation counts only the event body, which is why the byteCapacityBufferPercentage parameter is also provided. |
Example for Memory channel for agent named agent1 and channel ch1:
agent1.channels = ch1
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000
agent1.channels.ch1.transactionCapacity = 10000
agent1.channels.ch1.byteCapacityBufferPercentage = 20
agent1.channels.ch1.byteCapacity = 800000
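Assuming the documented semantics of byteCapacityBufferPercentage (a slice of byteCapacity reserved for event headers), the bytes actually available for event bodies can be estimated as follows. This is a sketch of the arithmetic, not Flume's internal accounting:

```python
def usable_body_bytes(byte_capacity: int, buffer_pct: int = 20) -> int:
    """Estimate the bytes left for event bodies after reserving
    buffer_pct percent of byteCapacity for event headers."""
    return byte_capacity * (100 - buffer_pct) // 100

# With the example configuration above (byteCapacity = 800000, 20% buffer):
print(usable_body_bytes(800000))  # 640000 bytes available for event bodies
```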
2. JDBC Channel
The JDBC channel stores the Flume events on persistent storage that is backed by a database. It currently supports embedded Derby.
It is best for the flows where recoverability is important.
Some of the properties for JDBC channel are:
Property Name | Default values | Description |
type | – | The component type name must be jdbc. |
db.type | DERBY | It specifies the Database vendor. It needs to be DERBY. |
driver.class | org.apache.derby.jdbc.EmbeddedDriver | It specifies the Class for vendor’s JDBC driver |
driver.url | (constructed from other properties) | It specifies the JDBC connection URL |
db.username | “sa” | It specifies the User id for db connection |
db.password | – | It specifies the password for db connection |
connection.properties.file | – | It specifies the JDBC Connection property file path |
maximum.capacity | 0 (unlimited) | It specifies the maximum number of events in the channel |
Example for JDBC channel for agent named agent1 and channel ch1:
agent1.channels = ch1
agent1.channels.ch1.type = jdbc
3. Kafka Channel
The Kafka channel stores flume events in a Kafka cluster which must be installed separately.
The Kafka channel provides high availability and replication. If the agent or a Kafka broker crashes, the flume events are immediately available to other sinks.
We can use Kafka channel for multiple scenarios:
a) With Flume source and sink
This provides a highly available and reliable channel for events.
b) With Flume source and interceptor but no sink
This allows writing of Flume events into a Kafka topic, for use by other applications.
c) With Flume sink, but no source
This is a low-latency, fault-tolerant way of sending flume events from Kafka to Flume sinks like HDFS, HBase or Solr
Presently it supports Kafka server 0.10.1.0 releases or higher.
The testing was done up to Kafka 2.0.1 which was the highest Kafka version during release.
The configuration parameters are organized as follows:
1. Configuration values related to the channel generically are applied at the channel config level.
For eg: agent1.channels.ch1.type
2. Configuration values related to Kafka or to how the channel operates are prefixed with “kafka.”
For eg: agent1.channels.ch1.kafka.topic and agent1.channels.ch1.kafka.bootstrap.servers.
3. Properties specific to the producer/consumer are prefixed with kafka.producer or kafka.consumer.
4. Wherever possible, the Kafka parameter names are used.
For eg: bootstrap.servers and acks.
Some properties for the Kafka channel are:
Property Name | Default values | Description |
type | – | The component type name must be org.apache.flume.channel.kafka.KafkaChannel. |
kafka.bootstrap.servers | – | It specifies the list of brokers in the Kafka cluster used by the channel. It can be a partial list of brokers. We recommend at least two for HA. The format is a comma-separated list of hostname:port |
kafka.topic | flume-channel | It specifies the Kafka topic which the Kafka channel will use |
kafka.consumer.group.id | flume | It specifies the consumer group ID the channel uses to register with Kafka. Multiple channels must use the same topic and group to ensure that when one agent fails, another can get the data. Note: Having non-channel consumers with the same ID can lead to data loss. |
Example for Kafka channel for agent named agent1 and channel ch1:
agent1.channels.ch1.type = org.apache.flume.channel.kafka.KafkaChannel
agent1.channels.ch1.kafka.bootstrap.servers = kafka-1:9092,kafka-2:9092,kafka-3:9092
agent1.channels.ch1.kafka.topic = channel1
agent1.channels.ch1.kafka.consumer.group.id = flume-consumer
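Following naming rules 2–4 above, producer- and consumer-level Kafka settings are passed through with the kafka.producer. and kafka.consumer. prefixes. The values below are illustrative (acks and auto.offset.reset are standard Kafka client parameters):

```properties
agent1.channels.ch1.type = org.apache.flume.channel.kafka.KafkaChannel
agent1.channels.ch1.kafka.bootstrap.servers = kafka-1:9092,kafka-2:9092
agent1.channels.ch1.kafka.topic = channel1

# Producer-level Kafka settings use the kafka.producer. prefix
agent1.channels.ch1.kafka.producer.acks = all

# Consumer-level Kafka settings use the kafka.consumer. prefix
agent1.channels.ch1.kafka.consumer.auto.offset.reset = earliest
```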
4. File Channel
It is Flume’s persistent channel. The file channel writes all flume events to disk, so it does not lose data even when the process or machine shuts down or crashes.
This channel ensures that any events which are committed into the channel are removed from it only when a sink consumes the events and commits the transaction. It does this even if the machine/agent crashed and was restarted.
The file channel is highly concurrent and handles several flume sources and sinks at the same time.
It is best for flows where we require data durability and can’t tolerate data loss.
Because this channel writes data to disk, it does not lose any data on crash or failure. It can also have a very large capacity, especially compared to the memory channel.
Some properties for the File channel are:
Property Name | Default values | Description |
type | – | The component type name must be file. |
checkpointDir | ~/.flume/file-channel/checkpoint | It specifies the directory where the checkpoint file is stored |
dataDirs | ~/.flume/file-channel/data | It specifies a comma-separated list of directories for storing log files. Using multiple directories on separate disks improves file channel performance. |
transactionCapacity | 10000 | It specifies the maximum size of transaction supported by the channel |
capacity | 1000000 | It specifies the maximum capacity of the channel |
Example for File channel for agent named agent1 and channel ch1:
agent1.channels = ch1
agent1.channels.ch1.type = file
agent1.channels.ch1.checkpointDir = /mnt/flume/checkpoint
agent1.channels.ch1.dataDirs = /mnt/flume/data
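As noted in the table, pointing dataDirs at directories on separate physical disks can improve throughput. A variant of the example above with two data directories (the mount points are illustrative):

```properties
agent1.channels = ch1
agent1.channels.ch1.type = file
agent1.channels.ch1.checkpointDir = /mnt/flume/checkpoint
# Two data directories on separate physical disks (illustrative paths)
agent1.channels.ch1.dataDirs = /mnt/disk1/flume/data,/mnt/disk2/flume/data
```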
5. Spillable Memory Channel
The Spillable memory channel stores events in an in-memory queue and on disk. The in-memory queue is the primary store and the disk is the overflow.
The disk store is managed by an embedded File channel. When the in-memory queue fills up, the File channel stores the additional incoming events.
It is best for flows that require high throughput of the memory channel during normal operation, and at the same time, the flow requires the larger capacity of the file channel for good tolerance of intermittent flume sink side outages or drop in drain rates.
During an agent crash or restart, only the flume events stored on disk are recovered when the flume agent comes back online. The Spillable memory channel is currently experimental and is not recommended for use in production.
Some of the properties for the Spillable memory channel are:
Property Name | Default values | Description |
type | – | The component type name must be SPILLABLEMEMORY. |
memoryCapacity | 10000 | It specifies the maximum number of events stored in the in-memory queue. Set this to zero to disable the in-memory queue. |
overflowCapacity | 100000000 | It specifies the maximum number of events stored on the overflow disk. Set this to zero to disable the overflow disk (i.e., the file channel). |
avgEventSize | 500 | It specifies the estimated average size (in bytes) of the events going into the channel |
byteCapacityBufferPercentage | 20 | It defines the percentage of buffer between the byteCapacity and the estimated total size of all flume events in the channel, in order to account for data in the headers. |
byteCapacity | 80% of the JVM heap (-Xmx) | It specifies the maximum total bytes of memory allowed as a sum of all flume events in the memory channel. The implementation counts only the event body, which is why the byteCapacityBufferPercentage parameter is also provided. |
Example for Spillable memory channel for agent named agent1 and channel ch1:
agent1.channels = ch1
agent1.channels.ch1.type = SPILLABLEMEMORY
agent1.channels.ch1.memoryCapacity = 10000
agent1.channels.ch1.overflowCapacity = 1000000
agent1.channels.ch1.byteCapacity = 800000
agent1.channels.ch1.checkpointDir = /mnt/flume/checkpoint
agent1.channels.ch1.dataDirs = /mnt/flume/data
Example for disabling the use of the in-memory queue so that the channel functions like a file channel:
agent1.channels = ch1
agent1.channels.ch1.type = SPILLABLEMEMORY
agent1.channels.ch1.memoryCapacity = 0
agent1.channels.ch1.overflowCapacity = 1000000
agent1.channels.ch1.checkpointDir = /mnt/flume/checkpoint
agent1.channels.ch1.dataDirs = /mnt/flume/data
Example for disabling the use of the overflow disk (i.e., the file channel) so that the channel functions purely as an in-memory channel:
agent1.channels = ch1
agent1.channels.ch1.type = SPILLABLEMEMORY
agent1.channels.ch1.memoryCapacity = 100000
agent1.channels.ch1.overflowCapacity = 0
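As a rough sanity check on these settings, the channel's total event capacity and an estimated worst-case disk footprint can be computed from the parameters in the table above. This is a sketch based on the documented meanings of the parameters, not Flume's internal accounting:

```python
def spillable_totals(memory_capacity: int, overflow_capacity: int,
                     avg_event_size: int = 500):
    """Total events the channel can hold, plus a rough estimate of
    worst-case disk usage if the overflow fills with average-size events."""
    total_events = memory_capacity + overflow_capacity
    est_disk_bytes = overflow_capacity * avg_event_size
    return total_events, est_disk_bytes

# Using the first spillable example: 10000 events in memory, 1000000 on disk
events, disk = spillable_totals(10000, 1000000)
print(events)  # 1010000 events total
print(disk)    # 500000000 bytes (~500 MB) of overflow data
```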
6. Pseudo Transactional Channel
It is designed only for testing purposes. It must not be used in production.
Some of the properties for Pseudo transactional channel are:
Property Name | Default values | Description |
type | – | The component type name must be org.apache.flume.channel.PseudoTxnMemoryChannel. |
capacity | 50 | It specifies the max number of events the channel can store. |
keep-alive | 3 | It specifies the timeout in seconds for adding or removing an event |
7. Custom Channel
A custom channel is your own implementation of the Channel interface. The custom channel's class and its dependencies must be included in the agent's classpath when starting the Flume agent. The type of the custom channel must be its fully qualified class name (FQCN).
The property for the custom channel is:
Property Name | Default values | Description |
type | – | The component type name must be the FQCN of the custom channel class. |
Example for Custom channel for agent named agent1 and channel ch1:
agent1.channels = ch1
agent1.channels.ch1.type = org.example.MyChannel
Summary
I hope after reading this article you clearly understand the different channels in Flume and can now set the configuration properties for each of them.
The article has explained Flume channels like the memory channel, Kafka channel, file channel, pseudo transactional channel, JDBC channel, custom channel, and spillable memory channel.