Apache Flume Architecture – Flume Agent, Event, Client
1. Flume Architecture – Objective
In this Apache Flume tutorial, we will study the Apache Flume architecture and each of its components: the Flume event, the Flume agent, and the Flume client. We will also discuss the parts of a Flume agent (source, channel, and sink) and its additional components: interceptors, channel selectors, and sink processors. Before we get to the architecture, we begin with a brief introduction to Flume, and we refer to the Apache Flume architecture diagram and an example along the way.
2. What is Flume?
Apache Flume is a tool for collecting log data from a large number of hosts and transporting it into a centralized store such as HDFS. In other words, it is a highly reliable, distributed, and configurable tool for collecting streaming data.
Now, let’s discuss Flume Architecture and its components in detail.
3. Apache Flume Architecture
As the image below shows, data generators (such as Facebook and Twitter) produce data, which is collected by individual Flume agents running on them. A further agent, the data collector, then gathers the data from these agents, aggregates it, and pushes it into a centralized store such as HDFS or HBase.
a. Flume Events
The basic unit of data transported inside Flume is called an event. An event consists of a byte-array payload, which is carried from the source to the destination, accompanied by optional headers. Refer to the image below for the typical structure of a Flume event.
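To make the event structure concrete, here is a minimal Python sketch. It is a toy model, not the Flume API: the class name `Event` and the sample header `host` are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Toy model of a Flume event: optional string headers plus a byte-array body."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


# A log line travels as raw bytes; headers carry optional metadata about it.
event = Event(body=b"GET /index.html 200", headers={"host": "web01"})
print(event.headers["host"], len(event.body))
```

The headers default to an empty dictionary, which mirrors the fact that they are optional.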
b. Flume Agent
In Apache Flume, an agent is an independent daemon process (a JVM). It receives events from clients or from other agents and forwards them to its next destination, which is either a sink or another agent. Note that a Flume deployment can have more than one agent. Refer to the image below to understand the Flume agent.
As the image shows, an agent contains three main components: source, channel, and sink.
i. Flume Source
A Flume source receives data from the data generators and transfers it to one or more channels in the form of Flume events.
Apache Flume supports various types of sources, and each source receives events from a specified data generator.
Examples: Avro source, Thrift source, Twitter 1% firehose source, etc.
ii. Flume Channel
A Flume channel is a transient store that receives events from the source and buffers them until they are consumed by sinks. In other words, it acts as a bridge between the sources and the sinks in Flume.
These channels are fully transactional, and they can work with any number of sources and sinks.
Examples: JDBC channel, File channel, Memory channel, etc.
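The buffering and transactional behaviour of a channel can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how Flume implements its channels: the class `MemoryChannel` and its methods `put`, `take`, `commit`, and `rollback` are hypothetical names, and only the take side is transactional here.

```python
from collections import deque


class MemoryChannel:
    """Toy in-memory channel: a bounded FIFO buffer with a take-side transaction."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self._queue = deque()
        self._in_flight = []  # events taken by a sink but not yet committed

    def put(self, event):
        """Called by a source to buffer an event."""
        if len(self._queue) + len(self._in_flight) >= self.capacity:
            raise OverflowError("channel is full")
        self._queue.append(event)

    def take(self):
        """Hand the next event to a sink, remembering it until commit or rollback."""
        if not self._queue:
            return None
        event = self._queue.popleft()
        self._in_flight.append(event)
        return event

    def commit(self):
        """The sink delivered the taken events, so forget them."""
        self._in_flight.clear()

    def rollback(self):
        """Delivery failed: return the taken events to the front, in order."""
        while self._in_flight:
            self._queue.appendleft(self._in_flight.pop())

    def __len__(self):
        return len(self._queue)


channel = MemoryChannel(capacity=2)
channel.put(b"event-1")
channel.put(b"event-2")

first = channel.take()
channel.rollback()        # simulated sink failure: event-1 goes back to the channel
retry = channel.take()
channel.commit()          # delivery succeeded this time
```

The point of the sketch is that an event leaves the channel for good only after the sink confirms delivery, so a failed delivery does not lose it.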
iii. Flume Sink
We use the Flume sink component to store data in centralized stores such as HBase and HDFS. A sink consumes events from the channels and delivers them to the destination, which may be another agent or a central store.
Example: HDFS sink
In addition, it is important to note that a Flume agent can have multiple sources, sinks, and channels.
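The flow through an agent, from source to channel to sink, can be illustrated with a small Python sketch. It is a toy model rather than Flume code: the functions `source` and `sink`, the queue standing in for a channel, and the list standing in for the central store are all assumptions made for the example, and the two stages run one after the other instead of concurrently.

```python
from queue import Queue


def source(lines, channel):
    """Source: turn each incoming line into an event and put it on the channel."""
    for line in lines:
        channel.put({"headers": {}, "body": line.encode("utf-8")})


def sink(channel, store):
    """Sink: drain the channel and deliver each event body to the destination."""
    while not channel.empty():
        event = channel.get()
        store.append(event["body"])


channel = Queue()        # stands in for a Flume channel
central_store = []       # stands in for HDFS or HBase

source(["login ok", "login failed"], channel)
sink(channel, central_store)
print(central_store)
```

Because the source and the sink only ever talk to the channel, either side can be swapped out or multiplied without the other noticing.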
c. Flume Client
Flume clients are the components that generate events and send them to one or more agents.
4. Additional Components of Flume Agent
The list of Apache Flume architecture components does not end here: the components we have seen so far are the primitive components of a Flume agent. There are a few more components that play a vital role in transferring Flume events from the data generator to the centralized stores.
a. Apache Flume Interceptors
We use Flume interceptors to inspect or alter Flume events as they are transferred between the source and the channel.
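As a rough illustration of the idea, the Python sketch below passes an event through a chain of interceptors before it would reach the channel. The function names, the `host` header, and the rule that drops DEBUG lines are hypothetical choices for this example and do not correspond to Flume's built-in interceptors.

```python
def add_host_header(event, host="web01"):
    """Alter: tag the event with the (hypothetical) host that produced it."""
    event["headers"]["host"] = host
    return event


def drop_debug_lines(event):
    """Inspect: return None to drop events whose body starts with DEBUG."""
    return None if event["body"].startswith(b"DEBUG") else event


def apply_interceptors(event, interceptors):
    """Run the event through each interceptor in order; stop if one drops it."""
    for intercept in interceptors:
        event = intercept(event)
        if event is None:
            return None
    return event


chain = [add_host_header, drop_debug_lines]
kept = apply_interceptors({"headers": {}, "body": b"ERROR disk full"}, chain)
dropped = apply_interceptors({"headers": {}, "body": b"DEBUG tick"}, chain)
```

Here `kept` comes back with the extra header attached, while `dropped` is `None` and would never be placed on a channel.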
b. Channel Selectors
When a source is connected to multiple channels, a channel selector determines which channel or channels an event should be sent to. Channel selectors are generally of two types −
i. Default channel selectors − The default is the replicating channel selector, which copies every event to each channel.
ii. Multiplexing channel selectors − These decide which channel to send an event to based on a value in the header of that event.
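The difference between the two selector types can be sketched in Python as follows. This is only an illustration: the function names, the `region` header, and the mapping from header values to channel names are assumptions for the example, and plain lists stand in for channels.

```python
def replicating_selector(event, channels):
    """Default behaviour: every channel receives a copy of the event."""
    return list(channels.values())


def multiplexing_selector(event, channels, header, mapping, default):
    """Pick channels by looking up the value of one header in a mapping."""
    value = event["headers"].get(header)
    names = mapping.get(value, default)
    return [channels[name] for name in names]


channels = {"c1": [], "c2": []}
event = {"headers": {"region": "eu"}, "body": b"order placed"}

# Replicating: the event lands in both c1 and c2.
for ch in replicating_selector(event, channels):
    ch.append(event)

# Multiplexing: the header value "eu" routes the event to c1 only.
routed = multiplexing_selector(
    event, channels, header="region",
    mapping={"eu": ["c1"], "us": ["c2"]}, default=["c2"],
)
for ch in routed:
    ch.append(event)
```

The `default` argument covers events whose header is missing or has a value that the mapping does not list.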
c. Sink Processors
We use sink processors to invoke a particular sink from a selected group of sinks. They also let us create failover paths for our sinks or load-balance events from a channel across multiple sinks.
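The two roles mentioned here, failover and load balancing, might look like this in a simplified Python sketch. The class names `FailoverProcessor` and `RoundRobinProcessor`, the helper `make_sink`, and the sink names are hypothetical, and round-robin is just one possible balancing strategy chosen for the example.

```python
import itertools


class FailoverProcessor:
    """Try sinks in priority order; fall through to the next one on failure."""

    def __init__(self, sinks):
        self.sinks = sinks  # highest priority first

    def process(self, event):
        for sink in self.sinks:
            try:
                return sink(event)
            except ConnectionError:
                continue  # this sink is down, try the next one
        raise RuntimeError("all sinks failed")


class RoundRobinProcessor:
    """Spread events across sinks by cycling through them in turn."""

    def __init__(self, sinks):
        self._cycle = itertools.cycle(sinks)

    def process(self, event):
        return next(self._cycle)(event)


def broken_sink(event):
    """A sink that is always unreachable, used to trigger failover."""
    raise ConnectionError("sink unreachable")


def make_sink(name, delivered):
    """Build a sink that records what it delivered and returns its own name."""
    def sink(event):
        delivered.append((name, event))
        return name
    return sink


delivered = []
failover = FailoverProcessor([broken_sink, make_sink("backup", delivered)])
used = failover.process(b"event-1")

balancer = RoundRobinProcessor([make_sink("k1", delivered), make_sink("k2", delivered)])
order = [balancer.process(b"event-%d" % i) for i in range(4)]
```

In the failover case the event ends up at the backup sink because the preferred sink raised an error; in the load-balancing case consecutive events alternate between the two sinks.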
So, this was all about Flume Architecture. We hope you liked our explanation.
5. Conclusion – Flume Architecture
In this Apache Flume tutorial, we have seen the Apache Flume architecture and all of its components: Flume agents, Flume events, and Flume clients. We have also looked briefly at a few additional components of a Flume agent: interceptors, channel selectors, and sink processors. If you have any questions, feel free to ask in the comment section.