Flume Data Flow – Types & Failure Handling in Apache Flume
1. Objective – Flume Data Flow
As we know, we use Flume to move log data into HDFS. However, there is a pattern in which data flows in Apache Flume, and there are several types of data flow, such as multi-hop flow, fan-out flow, and fan-in flow. So, in this article, we will learn about the Flume data flow model and its types. We will also cover failure handling to understand this topic well.
2. Introduction to Flume Data Flow
Basically, we use the Flume framework to transfer log data into HDFS. Events and log data are generated by log servers, and these servers have Flume agents running on them. These agents receive the data from the data generators.
To be more specific, in Flume there are intermediate nodes that collect the data from these agents; such a node is what we call a Collector. As with agents, there can be multiple collectors in Flume.
Let’s revise Apache Flume Architecture & Flume Features
Afterwards, the data from all these collectors is aggregated and pushed to a centralized store, such as HBase or HDFS. To understand better, refer to the following Flume data flow diagram, which explains the Flume data flow model.
3. Types of Data Flow in Flume
a. Multi-hop Flow
Basically, within Flume, an event may travel through more than one agent before reaching its final destination. This is what we call multi-hop data flow in Flume.
Let’s revise Flume Troubleshooting – Flume Known Issues in detail
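A minimal two-hop setup can be sketched in agent configuration like this. Here, a first agent forwards events over Avro RPC to a second agent, which writes them to HDFS; the agent names, hostnames, ports, and paths below are illustrative assumptions, not values from this article.

```properties
# Hop 1: agent1 tails a log file and forwards events to agent2 over Avro
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = snk1

agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app.log
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.snk1.type = avro
agent1.sinks.snk1.hostname = host2
agent1.sinks.snk1.port = 4141
agent1.sinks.snk1.channel = ch1

# Hop 2: agent2 (on host2) receives the Avro events and writes them to HDFS
agent2.sources = src2
agent2.channels = ch2
agent2.sinks = snk2

agent2.sources.src2.type = avro
agent2.sources.src2.bind = 0.0.0.0
agent2.sources.src2.port = 4141
agent2.sources.src2.channels = ch2

agent2.channels.ch2.type = memory

agent2.sinks.snk2.type = hdfs
agent2.sinks.snk2.hdfs.path = hdfs://namenode:8020/flume/events
agent2.sinks.snk2.channel = ch2
```

The Avro sink of one agent paired with the Avro source of the next is the standard way to chain hops in Flume.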
b. Fan-out Flow
In very simple terms, when data flows from one source to multiple channels, that is what we call fan-out flow. In Flume, fan-out flow is of two categories:
Replicating – the data flow in which the data is replicated to all the configured channels.
Multiplexing – the data flow in which the data is sent only to selected channels, based on a value mentioned in the header of the event.
Read about Apache Flume Source & Flume Event Serializers
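Both fan-out categories are configured through the source's channel selector. The sketch below shows a replicating selector, with the multiplexing variant commented out; the agent name, port, and the `region` header are illustrative assumptions.

```properties
# Fan-out with a replicating selector: every event goes to both channels
agent1.sources = src1
agent1.channels = ch1 ch2

agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1 ch2
agent1.sources.src1.selector.type = replicating

# Multiplexing alternative: route by the "region" event header instead
# agent1.sources.src1.selector.type = multiplexing
# agent1.sources.src1.selector.header = region
# agent1.sources.src1.selector.mapping.us = ch1
# agent1.sources.src1.selector.mapping.eu = ch2
# agent1.sources.src1.selector.default = ch1
```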
c. Fan-in Flow
When it comes to fan-in flow, it is the data flow in which the data is transferred from many sources into one channel.
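Fan-in needs no special selector: several sources simply name the same channel. A minimal sketch, with illustrative names, paths, and ports:

```properties
# Fan-in: two sources feeding the same channel, drained by one HDFS sink
agent1.sources = src1 src2
agent1.channels = ch1
agent1.sinks = snk1

agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/web.log
agent1.sources.src1.channels = ch1

agent1.sources.src2.type = netcat
agent1.sources.src2.bind = localhost
agent1.sources.src2.port = 44444
agent1.sources.src2.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.snk1.type = hdfs
agent1.sinks.snk1.hdfs.path = hdfs://namenode:8020/flume/fanin
agent1.sinks.snk1.channel = ch1
```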
4. Flume Failure Handling
For each event, two transactions take place: one at the sender and one at the receiver. The sender sends events to the receiver. Soon after receiving the data, the receiver commits its own transaction and sends a "received" signal back to the sender. The sender then commits its transaction only after receiving that signal.
Note: The sender will not commit its transaction until it receives the signal from the receiver.
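The handoff described above can be simulated with a short sketch. This is not Flume's actual API, just an illustration of the rule that the receiver commits first and the sender commits only after the acknowledgement, keeping the event for redelivery otherwise; all class and variable names are hypothetical.

```python
# Simplified model (not Flume's real API) of the two-transaction handoff.
class Receiver:
    def __init__(self):
        self.committed = []

    def deliver(self, event, fail=False):
        if fail:
            return False          # receiver rolls back: no "received" signal
        self.committed.append(event)  # receiver commits its transaction first
        return True               # "received" signal back to the sender


class Sender:
    def __init__(self, receiver):
        self.receiver = receiver
        self.channel = []         # events awaiting delivery

    def put(self, event):
        self.channel.append(event)

    def send(self, fail=False):
        event = self.channel[0]   # read, but do not remove until committed
        ack = self.receiver.deliver(event, fail=fail)
        if ack:
            self.channel.pop(0)   # commit: event leaves the sender's channel
        return ack                # on failure the event stays for redelivery


receiver = Receiver()
sender = Sender(receiver)
sender.put("evt-1")

sender.send(fail=True)            # receiver failed: sender keeps the event
print(sender.channel)             # ['evt-1']
sender.send()                     # retry succeeds: both sides commit
print(sender.channel, receiver.committed)  # [] ['evt-1']
```

Because the sender only removes the event after the acknowledgement, a crash between delivery and commit can at worst cause a duplicate, never a loss.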
So, in this Apache Flume data flow article, we have learned all the types of data flow possible in Apache Flume, namely multi-hop flow, fan-out flow, and fan-in flow, as well as failure handling in the Flume data flow model. Hope this content helps you. If you have any query, feel free to ask in the comment section.
Apache Flume Installation
Best Apache Flume Books to Learn Flume