Apache Flume Tutorial – Flume Introduction, Features & Architecture


1. Objective

Apache Flume is an open source data collection service for moving data from a source to a destination. In this Apache Flume tutorial, we will discuss what Apache Flume is, why it is needed, and its main features. The tutorial also covers the architecture of Flume, how Flume works, and the advantages of using it.

2. Apache Flume Tutorial – Introduction

Apache Flume is a tool used to collect, aggregate, and transport large amounts of streaming data, such as log files and events, from many different sources to a centralized data store (for example, the Hadoop Distributed File System, HDFS).
Flume is a distributed, reliable, and configurable tool. It was designed mainly to collect streaming data (log data) from various web servers and deliver it to HDFS.

3. Why Apache Flume?

A company typically has many services running on multiple servers, and these services produce large volumes of data (logs) that need to be analyzed together. To process those logs, we need a reliable, scalable, extensible, and manageable distributed data collection service that can move unstructured data (logs) from where it is produced to the location where it will be processed (say, HDFS). Apache Flume is an open source data collection service built for moving data from source to destination in this way.
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data (logs) into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows. It is robust and fault-tolerant, with tunable reliability mechanisms and with fail-over and recovery mechanisms. Flume allows data collection in batch as well as streaming mode.

4. Flume Features

Some of the notable features of Flume are as follows:

  • Flume collects log data from multiple servers and efficiently ingests it into a centralized store (HDFS, HBase).
  • With Flume, we can collect data from multiple servers in real time as well as in batch mode.
  • Flume can also import huge volumes of event data generated by social media websites like Facebook and Twitter and by e-commerce websites such as Amazon and Flipkart, so that the data can be analyzed in real time.
  • Flume can collect data from a large set of sources and move it to multiple destinations.
  • Flume supports multi-hop flows, fan-in and fan-out flows, contextual routing, etc.
  • We can scale Flume horizontally.
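
Contextual routing means that the channel (and therefore the destination) for an event is chosen from the event's own headers or content. The sketch below is a plain-Python illustration of that idea, not Flume's API: the function name `select_channels`, the `level` header, and the channel names are all assumptions made for the example. It is loosely modelled on the behaviour of Flume's multiplexing channel selector, which maps a header value to one or more channels and falls back to a default.

```python
def select_channels(headers, selector_header, mapping, default):
    """Pick the channel names for one event based on a single header value."""
    value = headers.get(selector_header)
    return mapping.get(value, default)


# Hypothetical routing table: events tagged "error" fan out to two channels,
# "info" events go to one, and anything else falls back to the default.
mapping = {
    "error": ["alerts", "archive"],
    "info": ["archive"],
}
default = ["archive"]

# Plain lists stand in for Flume channels in this sketch.
channels = {"alerts": [], "archive": []}

events = [
    ({"level": "error"}, "disk full on web-01"),
    ({"level": "info"}, "user login"),
    ({}, "untagged line"),
]

for headers, body in events:
    for name in select_channels(headers, "level", mapping, default):
        channels[name].append(body)

print(channels)
```

Here the error event is copied to both channels (fan-out), while the other two events reach only the `archive` channel.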

4. Apache Flume Architecture

In this section of the Apache Flume tutorial, we will discuss the various components of Flume and how these components work:

  • Event: A single data record transported by Flume is known as an event.
  • Source: A source is an active component that listens for events and writes them to one or more channels. It is the main component through which data enters Flume. Sources are available for a variety of inputs, such as exec, Avro, JMS, and spooling directory.
  • Sink: A sink is the component that removes events from a channel and delivers them to the destination or the next hop. Multiple sinks are available that deliver data to a wide range of destinations, for example HDFS, HBase, logger, null, and file.
  • Channel: A channel is a conduit between the source and the sink that queues event data within transactions. Sources ingest events into the channel, and sinks drain them from it.
  • Agent: An agent is a Java virtual machine (JVM) process in which Flume runs. It hosts the sources, sinks, channels, and other components through which events move from one place to another.
  • Client: A client generates events and sends them to one or more agents.
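
The way these components fit together can be sketched with a toy model. This is a conceptual illustration in plain Python, not Flume's actual classes or API: the names `Event`, `Channel`, `Source`, and `Sink`, their methods, and the sample log lines are all assumptions for the example, and a Python list stands in for a destination such as HDFS.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """A single record: a byte payload plus optional headers."""
    body: bytes
    headers: dict = field(default_factory=dict)


class Channel:
    """Conduit that queues events between a source and a sink."""

    def __init__(self):
        self._queue = deque()

    def put(self, event):
        self._queue.append(event)

    def take(self):
        # Returns None when the channel is empty.
        return self._queue.popleft() if self._queue else None


class Source:
    """Receives data from a client and writes events to one or more channels."""

    def __init__(self, channels):
        self.channels = channels

    def receive(self, line):
        event = Event(body=line.encode("utf-8"))
        for channel in self.channels:
            channel.put(event)


class Sink:
    """Drains events from one channel and delivers them to a destination."""

    def __init__(self, channel, destination):
        self.channel = channel
        self.destination = destination  # a list standing in for HDFS, HBase, etc.

    def drain(self):
        delivered = 0
        while (event := self.channel.take()) is not None:
            self.destination.append(event.body.decode("utf-8"))
            delivered += 1
        return delivered


# An "agent" here is simply the wiring of one source, one channel, and one sink.
channel = Channel()
source = Source([channel])
store = []
sink = Sink(channel, store)

# The client generates events and sends them to the agent's source.
for line in ["GET /index.html 200", "GET /missing 404"]:
    source.receive(line)

delivered = sink.drain()
print(delivered, store)
```

In a real deployment, the source and sink run independently inside the agent, and the channel buffers events between them; the sketch runs them one after the other only to keep the example short.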

5. Flume Tutorial – Advantages

Below are some important advantages of using Apache Flume:

  • Using Apache Flume, we can store data in any of the centralized stores, such as HDFS or HBase.
  • When the rate of incoming data exceeds the rate at which data can be written to the destination, Flume acts as a mediator between the data producers and the centralized stores and provides a steady flow of data between them.
  • Flume also provides contextual routing.
  • Flume provides reliable message delivery because its transactions are channel-based: two transactions (one on the sender side and one on the receiver side) are maintained for each message.
  • Flume is highly fault-tolerant, reliable, manageable, scalable, and customizable.
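
The channel-based transactions mentioned above can be illustrated with a small sketch. This is a simplified plain-Python model, not Flume's implementation: the class `TransactionalChannel` and its methods `put_batch` and `take_batch` are hypothetical names. What it aims to show is that a batch leaves the channel only once the receiver has delivered it successfully, and that a failed delivery rolls the batch back so it can be retried.

```python
from collections import deque


class TransactionalChannel:
    """Toy channel with a sender-side put and a receiver-side take transaction."""

    def __init__(self):
        self._queue = deque()

    def __len__(self):
        return len(self._queue)

    def put_batch(self, events):
        # Sender-side transaction: the whole batch is queued together.
        self._queue.extend(events)

    def take_batch(self, max_events, deliver):
        """Take up to max_events and hand them to deliver().

        Returns True (commit) if deliver() succeeds, or False (rollback)
        if it raises, in which case the events go back to the channel.
        """
        count = min(max_events, len(self._queue))
        batch = [self._queue.popleft() for _ in range(count)]
        try:
            deliver(batch)
        except Exception:
            # Rollback: restore the batch at the head of the queue, in order.
            self._queue.extendleft(reversed(batch))
            return False
        return True


def failing_destination(batch):
    # Simulates a destination that is temporarily unavailable.
    raise IOError("destination unavailable")


channel = TransactionalChannel()
channel.put_batch(["e1", "e2", "e3"])

delivered = []
first_attempt = channel.take_batch(2, failing_destination)  # rolled back
second_attempt = channel.take_batch(2, delivered.extend)    # committed

print(first_attempt, second_attempt, delivered, len(channel))
```

Because events stay in the channel until a take is committed, the channel also acts as the buffer that absorbs bursts when producers are faster than the destination.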

So, that was all for this Apache Flume tutorial. We hope you liked our explanation.

6. Conclusion – Apache Flume Tutorial

In this Flume tutorial, we discussed what Flume is and why it is needed. We also looked at Flume's features, architecture, and advantages. If you still have any doubts about this Apache Flume tutorial, you can ask in the comments.

6 Responses

  1. jhustec says:

    while collection can we process the data as well in flume in real-time ?

  2. fumio says:

    nice tutorial, i tried kinesis before flume, but flume is more feature rich.

  3. iwunuftuh says:

    Flume is a unique component which is very efficient and reliable. nice info..

  4. Sharnam says:

    What is the meaning of contextual routing with respect of Flume?

  5. praveen says:

    how does the data get into the spooling directory in the local file system that we have set for transferring data into hdfs?
