Explain Distributed Cache in Hadoop?

    • #6103
      DataFlair Team
      Spectator

      What is the need of distributed cache in Hadoop?

    • #6105
      DataFlair Team
      Spectator

      In Hadoop, input data is divided into blocks, and these blocks are stored on different DataNodes. MapReduce jobs then process the blocks in parallel, independently of one another.

      However, some files may be required on every DataNode that executes tasks of a MapReduce job.
      The framework copies such a file to each slave node before any tasks for the job are executed on that node. In this way, the Distributed Cache minimizes data transfer by placing those files locally on the DataNodes.

      For Example – Consider the WordCount example, which counts the number of times each word occurs. Suppose we want to skip certain words and not count them. We can place these skip words in a file and distribute that file to every node through the Distributed Cache.
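      The skip-word pattern above can be sketched in plain Java. This is a minimal, hypothetical illustration of the mapper-side logic only: a temporary file stands in for the local copy that the Distributed Cache would place on the node, and the names `SkipWordCount`, `loadSkipWords`, and `countWords` are invented for this sketch, not part of the Hadoop API.

      ```java
      import java.io.IOException;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.util.HashMap;
      import java.util.HashSet;
      import java.util.List;
      import java.util.Map;
      import java.util.Set;

      public class SkipWordCount {

          // Load the locally cached skip-word file into memory, one word
          // per line — what a Mapper would typically do in its setup().
          static Set<String> loadSkipWords(Path cachedFile) throws IOException {
              return new HashSet<>(Files.readAllLines(cachedFile));
          }

          // Count words while ignoring any word in the skip set,
          // mirroring the per-record filtering done in map().
          static Map<String, Integer> countWords(String text, Set<String> skip) {
              Map<String, Integer> counts = new HashMap<>();
              for (String word : text.toLowerCase().split("\\s+")) {
                  if (word.isEmpty() || skip.contains(word)) continue;
                  counts.merge(word, 1, Integer::sum);
              }
              return counts;
          }

          public static void main(String[] args) throws IOException {
              // Stand-in for the file the framework would copy to the node.
              Path skipFile = Files.createTempFile("skipwords", ".txt");
              Files.write(skipFile, List.of("the", "a", "is"));

              Set<String> skip = loadSkipWords(skipFile);
              Map<String, Integer> counts = countWords("the cache is a cache", skip);
              System.out.println(counts); // "cache" counted twice; skip words ignored
          }
      }
      ```

      In a real job, the file would be registered with `job.addCacheFile(...)` before submission, and the Mapper would read it from the task's local working directory in `setup()`; the filtering itself would follow the same logic as `countWords` here.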

      DistributedCache tracks the modification timestamps of the cache files. The cache files should not be modified, either by the application or externally, while the job is executing.

      These files are deleted from the slave nodes once the job is over.

      For more detail follow: Distributed Cache in Hadoop
