Difference between Flume and Kafka?

This topic has 1 reply, 1 voice, and was last updated 7 years, 8 months ago by DataFlair Team.

Viewing 1 reply thread

Author

Posts
- October 10, 2018 at 4:56 pm #6650
  
  DataFlair Team
  Spectator
  
  What is the difference between Apache Flume and Kafka?
  What are the most significant differences between Flume and Kafka?
- October 10, 2018 at 4:56 pm #6651
  
  DataFlair Team
  Spectator
  
  Kafka
  
  1) Kafka is a distributed publish-subscribe messaging system. Most of its development effort goes into letting subscribers read exactly the messages they are interested in.
  2) Kafka is a pull system: consumers fetch messages at their own pace, which keeps them from being overwhelmed. Incoming messages are stored persistently until they expire, so late consumers can still pick them up.
  3) Kafka can retain days or weeks of traffic, which can be reprocessed any number of times by any number of consumer groups. Most importantly, the rate at which events are created will not overload the databases or the processes that load data into them.
  4) It can connect any systems that require enterprise-level messaging (website activity tracking, operational metrics, stream processing, etc.). It is a general-purpose publish-subscribe or queue system and can mesh with any producer or consumer.
  5) Kafka is very scalable. One of its key benefits is that it is easy to add a large number of consumers without affecting performance and without downtime.
  6) High availability of events (they are recoverable in case of failures).
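
To make points 2, 3 and 5 concrete, here is a minimal in-memory sketch of the pull model. It is a toy, not the Kafka client API: the `RetainedLog` class, its `poll` method and the group names are made up for illustration, and retention expiry is left out.

```python
from collections import defaultdict


class RetainedLog:
    """Toy append-only log; a stand-in for one Kafka topic partition."""

    def __init__(self):
        self.messages = []               # retained messages, oldest first
        self.offsets = defaultdict(int)  # consumer group -> next offset to read

    def append(self, message):
        # Producers only append; they never wait for any consumer.
        self.messages.append(message)

    def poll(self, group, max_records):
        # The consumer decides when to fetch and how much (pull model).
        start = self.offsets[group]
        batch = self.messages[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch


log = RetainedLog()
for i in range(5):
    log.append(f"event-{i}")

fast = log.poll("dashboard", max_records=5)  # reads everything at once
slow = log.poll("db-loader", max_records=2)  # reads only a small batch
late = log.poll("audit", max_records=5)      # a group added later still sees all events

print(fast)
print(slow)
print(late)
```

Each group keeps its own offset into the same stored messages, so adding the hypothetical "audit" group needs no change on the producer side, and the slow "db-loader" group simply resumes from its offset on the next poll.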
  
  Flume
  1) Flume was developed to ingest data into Hadoop. It is tightly integrated with Hadoop's monitoring system, file system, file formats, and utilities. Most Flume development goes into keeping it compatible with Hadoop.
  2) Flume is a push system, so data can be lost when consumers can't keep up. It is built around the Hadoop ecosystem, with the primary purpose of sending messages to HDFS and HBase.
  3) Flume is not as scalable as Kafka. Adding more consumers means changing the topology of the Flume pipeline and replicating the channel to deliver the messages to a new sink, so it does not scale well to a huge number of consumers. Because the topology has to change, it also requires some downtime.
  4) Flume does not replicate events. If a Flume agent fails, events held in a memory channel are lost, and events in a durable file channel are unavailable until the agent recovers.
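
A rough sketch of the push behaviour described in points 2 and 3 follows. It is a simplified model, not Flume code: `BoundedChannel` and `push_to_all` are hypothetical names, and dropping on overflow is an assumption made to illustrate the claim (how a real Flume source reacts to a full channel depends on the source and its configuration).

```python
from collections import deque


class BoundedChannel:
    """Toy fixed-capacity buffer between a source and one sink."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()
        self.dropped = 0

    def put(self, event):
        # The source pushes regardless of how fast the sink drains.
        if len(self.events) >= self.capacity:
            self.dropped += 1  # assumption: overflow is dropped, not retried
            return False
        self.events.append(event)
        return True

    def take(self):
        # The sink removes the oldest buffered event, if any.
        return self.events.popleft() if self.events else None


def push_to_all(channels, event):
    # Replicating fan-out: every event is copied into every channel.
    for channel in channels:
        channel.put(event)


hdfs_channel = BoundedChannel(capacity=3)
channels = [hdfs_channel]

# The sink never calls take() here, so the channel fills up.
for i in range(5):
    push_to_all(channels, f"event-{i}")

# Adding a consumer means changing the topology: a new channel (and sink).
hbase_channel = BoundedChannel(capacity=3)
channels.append(hbase_channel)
push_to_all(channels, "event-5")

print(hdfs_channel.dropped, list(hdfs_channel.events))
print(hbase_channel.dropped, list(hbase_channel.events))
```

In this model the channel added later receives only events pushed after it was wired in, and the full channel keeps rejecting new events until its sink drains it.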
  
  When to use:
  1) Flume: When working with non-relational data sources, such as log files, that are to be streamed into Hadoop.
  Kafka: When you need a highly reliable and scalable enterprise messaging system to connect multiple systems (including Hadoop).
  2) Kafka for Hadoop: Kafka acts as a pipeline that collects data in real time and delivers it to Hadoop. Hadoop processes the data and then, as required, either serves it to other consumers (dashboards, BI, etc.) or stores it for further processing.