Difference between Flume and Kafka?

  • Author
    • #6650
      DataFlair Team

      What is the difference between Apache Flume and Kafka?
      What are the most significant differences between Flume and Kafka?

    • #6651
      DataFlair Team


      Kafka:
      1) Kafka is a distributed publish-subscribe messaging system. Most of its development effort goes into letting subscribers read exactly the messages they are interested in.
      2) Kafka is a pull-based system: consumers fetch messages at their own pace. Because the broker persists incoming messages until they expire, slow or late consumers are not overwhelmed and can catch up later, which gives a natural form of back pressure.
      3) Kafka retains data for a configurable period or size (typically days or weeks), so it can be reprocessed any number of times by any number of consumer groups. Most importantly, the rate at which events are produced does not overload the databases, or the processes loading data into them, because they read at their own rate.
      4) Any system that needs enterprise-level messaging (website activity tracking, operational metrics, stream processing, etc.) can use it to connect to other systems. It is a general-purpose publish-subscribe or queue system and can work with any producer or consumer.
      5) Kafka is very scalable. One of its key benefits is that a large number of consumers can be added easily, without affecting performance and without downtime.
      6) High availability of events (recoverable in case of failures).
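
To make the pull model and replay concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the Kafka client API: `ToyLog`, `poll`, `seek`, and the group names are hypothetical names chosen for illustration. It only shows the idea that the log keeps messages, each consumer group tracks its own read position, and a group can rewind to reprocess retained data.

```python
from collections import defaultdict


class ToyLog:
    """In-memory stand-in for a single Kafka partition (illustrative only)."""

    def __init__(self):
        self.records = []                # append-only message log
        self.offsets = defaultdict(int)  # next offset to read, per consumer group

    def append(self, message):
        """Producer side: store the message and return its offset."""
        self.records.append(message)
        return len(self.records) - 1

    def poll(self, group, max_records):
        """Consumer side: pull up to max_records from the group's position."""
        start = self.offsets[group]
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        """Move a group's position, e.g. back to 0 to reprocess messages."""
        self.offsets[group] = offset


log = ToyLog()
for i in range(5):
    log.append(f"event-{i}")

fast = log.poll("dashboard", 5)        # a fast group reads everything at once
slow = log.poll("db-loader", 2)        # a slow group takes a small batch
slow_rest = log.poll("db-loader", 10)  # and catches up later, losing nothing
log.seek("db-loader", 0)               # rewind to replay from the beginning
replayed = log.poll("db-loader", 5)
```

Reading never removes anything from the log, so the slow group neither loses messages nor slows down the fast one. A real Kafka cluster adds partitions, replication, and retention limits on top of this idea.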

      Flume:
      1) Flume was developed to ingest data into Hadoop. It is tightly integrated with Hadoop’s monitoring system, file system, file formats, and utilities, and most Flume development goes into keeping it compatible with Hadoop.
      2) Flume is a push-based system, which can mean data loss when sinks cannot keep up. It is built around the Hadoop ecosystem, primarily to send messages to HDFS and HBase.
      3) Flume is not as scalable as Kafka. Adding a consumer means changing the topology of the Flume pipeline: the channel has to be replicated to deliver messages to the new sink. This does not scale to a huge number of consumers, and because the topology has to change, it requires some downtime.
      4) Flume does not replicate events. If a Flume agent fails, you lose access to the events in its channel (with a memory channel they are gone; with a file channel they are unavailable until the agent recovers).
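
The push-side risk described above can be sketched with a bounded buffer. This is a toy model in Python, not Flume code: `ToyChannel`, its `capacity`, and the drop-on-full behaviour are assumptions made for illustration. In a real agent, what happens when a channel fills up depends on the source and channel types, and some sources can retry or slow down the upstream sender.

```python
from collections import deque


class ToyChannel:
    """Bounded in-memory buffer between a source and a sink (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()
        self.dropped = 0

    def put(self, event):
        """Source side: push an event; count it as dropped if the buffer is full."""
        if len(self.events) >= self.capacity:
            self.dropped += 1
            return False
        self.events.append(event)
        return True

    def take(self):
        """Sink side: remove and return the oldest event, or None if empty."""
        return self.events.popleft() if self.events else None


channel = ToyChannel(capacity=3)
delivered = []

# The source pushes 2 events per tick, but the sink drains only 1 per tick.
for tick in range(5):
    channel.put(f"event-{2 * tick}")
    channel.put(f"event-{2 * tick + 1}")
    event = channel.take()
    if event is not None:
        delivered.append(event)
```

Once the buffer is full, every extra pushed event is lost in this model, because the producer sets the pace rather than the consumer. In the pull model, the slow reader would simply fall behind in the log.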

      When to use:
      1) Flume: when working with non-relational data sources, such as log files, that are to be streamed into Hadoop.
      Kafka: when you need a highly reliable and scalable enterprise messaging system to connect multiple systems (including Hadoop).
      2) Kafka for Hadoop: Kafka acts as a pipeline that collects data in real time and feeds it to Hadoop. Hadoop processes the data and then, as required, either serves it to other consumers (dashboards, BI, etc.) or stores it for further processing.
