Data Transfer from Flume to HDFS – Load Log Data Into HDFS
We have studied that the main purpose of Apache Flume's design is to transfer data from various sources to the Hadoop Distributed File System. In this article, we will see how to transfer data from Apache Flume to HDFS, along with a practical example of writing log data from Flume to HDFS.
Before studying how data transfers from Flume to HDFS, let us first see what log data is.
What is Streaming or Log Data?
Usually, most of the data that we analyze is generated by various data sources such as social networking sites, cloud servers, application servers, and enterprise servers. The data generated by these sources is in the form of log files and events. A log file is an ordinary file that records the events or actions that occur in a system.
For example, web servers record every request made to them in log files.
Analysis of such log data provides information about −
- the performance of the application, which helps in locating hardware and software failures.
- users’ behavior, which helps in deriving better business insights.
We can analyze such data in Apache Hadoop. But for analyzing such log data with Hadoop, the data must be stored in HDFS. We can use Apache Flume for transferring data from different web servers to HDFS.
Let us now see how data transfers from Flume to Hadoop HDFS.
Data Transfer from Flume to HDFS
Prerequisites for transferring data from Flume to HDFS
- You must have Hadoop installed on your system. Refer to Hadoop 3 installation guide for installing Hadoop in your system.
- Flume installation must be there on your machine. If not installed, follow the Flume installation guide for installing Apache Flume.
System Requirements for transferring data from Flume to HDFS
- Sufficient memory for the configurations used by Flume sources, channels, or sinks.
- Sufficient disk space for the configurations used by Flume channels or sinks.
- Read/Write permissions for the directories used by the flume agent.
Setup for transferring Data from Flume to HDFS
For transferring data from Flume to any central repository such as HDFS, HBase, etc., we need to do the following setup.
1. Setting up the Flume agent
We store the Flume agent configuration in a local configuration file. This configuration file is a text file that follows the Java properties file format. We specify the configurations for one or more flume agents in the same configuration file.
This configuration file includes properties of each flume source, sink and channel in an agent. It also includes information about how source, sink, and channel are wired together to form data flows.
2. Configuring individual components
We have to configure each Flume component. Each component in the flow, whether source, sink, or channel, has a name, a type, and a rich set of properties that are specific to the component type and instantiation.
For example, an Avro source has a name and a type, and also needs a hostname and a port number to receive data on.
Similarly, a memory channel can have a max queue size (“capacity”).
The HDFS sink requires the file system URI, a path under which to create files, etc. All such component attributes must be set in the properties file of the hosting Flume agent.
3. Writing the pieces together
The Flume agent must know which individual components to load and how they are connected to constitute the flow. This is done by listing the names of each of the sources, sinks, and channels in the agent. Along with the names, we have to specify the connecting channel for each source and sink.
Consider an example where the flume agent flows events from an Avro source called avro_web_server to the HDFS sink called hdfs_cluster_1 through a file channel called file-channel.
In such a case, the configuration file will contain the names of these components and file-channel as a shared channel for both the avro_web_server source and the hdfs_cluster_1 sink.
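Under these assumptions (the component names come from the example above; the agent name agent1 is illustrative), the wiring section of such a configuration file might look like:

```properties
# components in the agent (agent name 'agent1' is assumed)
agent1.sources = avro_web_server
agent1.sinks = hdfs_cluster_1
agent1.channels = file-channel

# wire the source and the sink to the shared channel
agent1.sources.avro_web_server.channels = file-channel
agent1.sinks.hdfs_cluster_1.channel = file-channel
```

Because both the source and the sink name file-channel, events received by avro_web_server are staged in that channel until hdfs_cluster_1 drains them.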
4. Starting an Agent
We can start the flume agent by using a shell script called flume-ng. This shell script is located in the bin directory of the Apache Flume distribution.
We need to specify agent name, config directory, and config file on the command line:
$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template
Now the flume agent will start running the source and sinks configured in the given properties file.
Let us now see, through an example, how we define the flow from Flume to HDFS.
Defining the flow from Flume to HDFS
For defining the flow within a single flume agent, we need to link the flume sources and sinks through a channel. We are required to list the flume sources, channels, and sinks for the given flume agent, and then link the flume source and sink to a channel.
A flume source can send events to multiple channels, but a flume sink consumes events from only one channel.
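As an illustration of this asymmetry (all component names here are assumptions), a source fanning out to two channels, each drained by its own sink, could be wired like this:

```properties
# one source replicating events to two channels (names are illustrative)
agent_demo.sources.avro_src_1.channels = mem_channel_1 file_channel_2

# each sink reads from exactly one channel
agent_demo.sinks.hdfs_cluster_1.channel = mem_channel_1
agent_demo.sinks.avro_forward_sink_2.channel = file_channel_2
```

Note the plural `channels` key on the source and the singular `channel` key on each sink.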
The generic format is as follows:
# list the sources, sinks and channels for the agent
<Agent>.sources = <Source>
<Agent>.sinks = <Sink>
<Agent>.channels = <Channel1> <Channel2>

# set channel for source
<Agent>.sources.<Source>.channels = <Channel1> <Channel2> ...

# set channel for sink
<Agent>.sinks.<Sink>.channel = <Channel1>
Consider an example where we have an agent named agent_demo. The agent_demo reads data from an external Avro client and sends it to HDFS through a memory channel.
The config file weblogs.config would look like:
# list the sources, sinks and channels for the agent
agent_demo.sources = avro_src_1
agent_demo.sinks = hdfs_cluster_1
agent_demo.channels = mem_channel_1

# set channel for source
agent_demo.sources.avro_src_1.channels = mem_channel_1

# set channel for sink
agent_demo.sinks.hdfs_cluster_1.channel = mem_channel_1
This will make the data flow from avro_src_1 to hdfs_cluster_1 through the memory channel mem_channel_1.
When we start the agent_demo with the weblogs.config as its configuration file then it will instantiate that flow.
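Assuming weblogs.config has been placed in the conf directory of the Flume installation (the path is an assumption), the agent could be started like this:

```shell
# start the agent named agent_demo with the example configuration file
bin/flume-ng agent -n agent_demo -c conf -f conf/weblogs.config
```

The -n value must match the agent name used in the configuration file, otherwise no components are loaded.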
Configuring Individual Components
Now after defining the flow, we have to set the properties of each source, channel, and sink.
This can be done in the same hierarchical namespace fashion. Here we set the component type and other properties specific to each component.
The generic format is as follows:
# properties for sources
<Agent>.sources.<Source>.<someProperty> = <someValue>

# properties for channels
<Agent>.channels.<Channel>.<someProperty> = <someValue>

# properties for sinks
<Agent>.sinks.<Sink>.<someProperty> = <someValue>
The property “type” is required for every component; it tells Flume what kind of object it needs to instantiate.
Each type of Flume source, sink, and channel has its own set of properties required for it to function as intended. We need to set these properties as required.
In the above example, we have a flow from the avro_src_1 source to the hdfs_cluster_1 sink via the memory channel mem_channel_1.
Below is the example that shows configuration of each of those components:
agent_demo.sources = avro_src_1
agent_demo.sinks = hdfs_cluster_1
agent_demo.channels = mem_channel_1

# set channel for source and sink
agent_demo.sources.avro_src_1.channels = mem_channel_1
agent_demo.sinks.hdfs_cluster_1.channel = mem_channel_1

# properties of avro_src_1
agent_demo.sources.avro_src_1.type = avro
agent_demo.sources.avro_src_1.bind = localhost
agent_demo.sources.avro_src_1.port = 10000

# properties of mem_channel_1
agent_demo.channels.mem_channel_1.type = memory
agent_demo.channels.mem_channel_1.capacity = 1000
agent_demo.channels.mem_channel_1.transactionCapacity = 100

# properties of hdfs_cluster_1
agent_demo.sinks.hdfs_cluster_1.type = hdfs
agent_demo.sinks.hdfs_cluster_1.hdfs.path = hdfs://namenode/flume/webdata
#...
Hence, in this manner data is written from Flume to HDFS.
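To exercise such a flow by hand, the same flume-ng script also ships a simple Avro client that can send events to a running Avro source; assuming the agent above is running and a sample file exists at the path shown (the file path is an assumption), it could be invoked like this:

```shell
# send the lines of a local file as Avro events to the running source
bin/flume-ng avro-client -H localhost -p 10000 -F /tmp/sample.log
```

The host and port must match the bind and port properties of the Avro source; the events then appear as files under the configured hdfs.path.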
A single Flume agent can also contain several independent flows.
Let us see how to add multiple flows in a Flume agent.
Adding Multiple Flows in a Flume Agent
For adding multiple flows in a flume agent we can list multiple sources, channels, and sinks in a configuration file. These components are then linked to form multiple flows.
The generic format is as follows:
# list the sources, sinks and channels for the agent
<Agent>.sources = <Source1> <Source2>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>
Then we have to link the sources and sinks to their corresponding channels setting up two different flows.
Consider an example where we set up two flows in a Flume agent named agent_demo. The first goes from an external Avro client to HDFS. The second goes from the output of a tail command to an Avro sink.
Here is a configuration to do that:
In the weblogs.config we have to list all the flume sources, sinks, and channels as follows:
# list the sources, sinks and channels in the agent
agent_demo.sources = avro_src_1 exec_tail_src_2
agent_demo.sinks = hdfs_cluster_1 avro_forward_sink_2
agent_demo.channels = mem_channel_1 file_channel_2

# flow #1 configuration
agent_demo.sources.avro_src_1.channels = mem_channel_1
agent_demo.sinks.hdfs_cluster_1.channel = mem_channel_1

# flow #2 configuration
agent_demo.sources.exec_tail_src_2.channels = file_channel_2
agent_demo.sinks.avro_forward_sink_2.channel = file_channel_2
Thus in this way, we can add multiple flows to a flume agent.
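The components of the second flow would still need their type-specific properties. A minimal sketch, assuming an exec source tailing a hypothetical log file and an Avro sink forwarding to a downstream collector (the file path, host name, and port are all illustrative assumptions), might look like:

```properties
# flow #2 component properties (values are illustrative assumptions)
agent_demo.sources.exec_tail_src_2.type = exec
agent_demo.sources.exec_tail_src_2.command = tail -F /var/log/app/web.log

agent_demo.channels.file_channel_2.type = file

agent_demo.sinks.avro_forward_sink_2.type = avro
agent_demo.sinks.avro_forward_sink_2.hostname = collector.example.com
agent_demo.sinks.avro_forward_sink_2.port = 4545
```

The file channel persists events to disk, which suits this flow since the Avro sink depends on a remote collector being reachable.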
I hope after reading this article, you clearly understand how to transfer data from Flume to HDFS. For transferring data, we need to set up a configuration file that contains the list of sources, sinks, and channels, along with the properties specific to each source, channel, and sink type. The article also explains how we can add multiple flows to a Flume agent.