Data Transfer from Flume to HDFS – Load Log Data Into HDFS
In this blog on data transfer from Flume to HDFS, we will learn how to use Apache Flume to move data into Hadoop. We will also look at the Hadoop put command and its drawbacks for loading log data into HDFS. Finally, we will see the tools available for sending streaming data to HDFS.
2. Data Transfer From Flume to HDFS
Big Data is a collection of large datasets that cannot be processed with traditional computing techniques. Analyzing Big Data, however, yields valuable results. Hadoop is an open-source framework that stores and processes Big Data in a distributed environment, across clusters of computers, using simple programming models.
3. What is Streaming / Log Data?
Log data is data produced by various sources, such as application servers, social networking sites, cloud servers, and enterprise servers, and it usually needs to be analyzed. This data is generally in the form of log files or events.
a. Log file
A log file is a file that lists the events/actions occurring in an operating system. For example, a web server lists every request made to it in its log files.
Harvesting such log data gives us information about:
- Application performance, which helps us locate various software and hardware failures.
- User behavior, from which we can derive better business insights.
4. HDFS Put Command
How to Use the HDFS put Command to Transfer Data to HDFS?
In handling log data, the main challenge is to move the logs produced by multiple servers into the Hadoop environment.
The Hadoop File System Shell offers commands to insert data into Hadoop and read from it. We can insert data by using the put command:
$ hadoop fs -put <path of the required file> <path in HDFS where to save the file>
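As a minimal sketch of the put command, the following script assembles and prints the command, and runs it only if a Hadoop client is installed and the file exists. The local file and the HDFS directory are hypothetical paths chosen for illustration.

```shell
# Hypothetical paths, for illustration only.
LOCAL_FILE="/var/log/httpd/access_log"
HDFS_DIR="/user/hadoop/weblogs"

# Assemble the put command; printing it first makes it easy to check.
PUT_CMD="hadoop fs -put ${LOCAL_FILE} ${HDFS_DIR}"
echo "${PUT_CMD}"

# Execute only where a Hadoop client is installed and the file exists.
if command -v hadoop >/dev/null 2>&1 && [ -f "${LOCAL_FILE}" ]; then
  hadoop fs -mkdir -p "${HDFS_DIR}"             # create the target directory if missing
  hadoop fs -put "${LOCAL_FILE}" "${HDFS_DIR}"  # copy the local file into HDFS
  hadoop fs -ls "${HDFS_DIR}"                   # list the directory to confirm the upload
fi
```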
a. Problem with put Command
We can use Hadoop's put command to transfer log data to HDFS. Still, it has some drawbacks:
- The put command copies only files that already exist when it runs, whereas the data generators keep producing data at a much higher rate. The data in HDFS is therefore always somewhat old, and analysis made on old data is less accurate. So, we need a solution to transfer the data in real time.
- With the put command, the data must be packaged and ready for upload. Since the web servers generate data continuously, packaging it is a difficult task.
So, we need to overcome the drawbacks of the put command and transfer the "streaming data" from the data generators to centralized stores (especially HDFS) without delay.
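The packaging problem can be illustrated with a small sketch of a batch upload. The function name batch_upload, the DRY_RUN switch, and the paths are assumptions made for this example; with DRY_RUN left at its default, the function only prints the hadoop command it would run.

```shell
# Hypothetical helper; DRY_RUN=1 (the default) only prints the hadoop command.
batch_upload() {
  src_log="$1"    # live log file that is still being appended to
  hdfs_dir="$2"   # target directory in HDFS
  stamp="$(date +%Y%m%d%H%M%S)"
  snapshot="${TMPDIR:-/tmp}/$(basename "${src_log}").${stamp}"

  # Package: freeze the current contents of the log into a closed file.
  cp "${src_log}" "${snapshot}" || return 1

  # Upload: anything appended to src_log after the copy waits for the next batch.
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "hadoop fs -put ${snapshot} ${hdfs_dir}"
  else
    hadoop fs -put "${snapshot}" "${hdfs_dir}"
  fi
}
```

If a scheduler calls this every few minutes, HDFS trails the live log by up to one interval, which is the delay a streaming tool is meant to remove.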
5. Problem with HDFS
One of the major issues with HDFS is that a file exists as a directory entry, and its length is considered to be zero until it is closed. For example, if a source is writing data into HDFS and the network is interrupted in the middle of the operation (without the file being closed), the data written to the file will be lost.
Hence, we need a reliable, configurable, and maintainable system to transfer the log data into HDFS.
6. Tools available to send the streaming data to HDFS
To transfer streaming data (log files, events, etc.) from various sources to HDFS, the following tools are available:
a. Facebook’s Scribe
Scribe is a very popular tool for aggregating and streaming log data. It is designed to scale to a very large number of nodes and to be robust to network and node failures.
b. Apache Kafka
Apache Kafka is an open-source message broker developed by the Apache Software Foundation. By using Kafka, we can handle feeds with high throughput and low latency.
c. Apache Flume
Apache Flume is a tool/service/data ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data, such as log data and events, from various web servers to a centralized data store.
To be more specific, we use Apache Flume to transfer streaming data from various sources to HDFS in a reliable, distributed, and configurable way.
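To make this concrete, here is a minimal sketch of a Flume agent configuration that follows a log file and writes the events to HDFS. The agent and component names (a1, r1, c1, k1), the log path, and the NameNode address are hypothetical. The script only writes the configuration file and prints the command that would start the agent.

```shell
# Hypothetical location for the generated agent configuration.
CONF_FILE="${CONF_FILE:-${TMPDIR:-/tmp}/flume-hdfs.conf}"

cat > "${CONF_FILE}" <<'EOF'
# Name the components of this agent (all names are illustrative)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: follow a web server log as it grows
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access_log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: write the events to HDFS as plain data
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/user/flume/weblogs
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1
EOF

# Print the command that would start the agent (not executed here).
echo "flume-ng agent --conf ./conf --conf-file ${CONF_FILE} --name a1"
```

A memory channel is fast but loses buffered events if the agent stops; a file channel is a common choice when durability matters more than speed.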
As a result, we have seen how to transfer log data into HDFS, and why Apache Flume suits this task better than the put command. Still, if you have any doubt regarding data transfer from Flume to HDFS, feel free to ask in the comment section.