In Hadoop, input data is divided into blocks, and these blocks are stored on different DataNodes. MapReduce jobs then process this data in parallel and independently.
However, some files may be required on every DataNode for the execution of a MapReduce job.
The framework copies such files to each slave node before any task of the job is executed on that node. In this way, the Distributed Cache minimizes network data transfer by placing those files locally on the DataNodes.
For example, consider the WordCount program, which counts the number of times each word occurs. Suppose we want to skip certain words and not count them. We can place these words in a file and distribute that file to all nodes through the Distributed Cache.
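The skip-word logic above can be sketched as follows. This is a minimal, self-contained illustration, not the actual Hadoop mapper: the file name `skip-words.txt` is an assumed example, and the cache lookup is simulated with an ordinary local file. In a real job, the file would be registered with `job.addCacheFile(...)` in the driver and loaded from the task's local working directory in the mapper's `setup()` method.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class SkipWordCount {

    // Load the skip list, as a mapper's setup() would load the task-local
    // copy that the Distributed Cache placed on the node.
    static Set<String> loadSkipWords(Path skipFile) throws IOException {
        Set<String> skip = new HashSet<>();
        for (String line : Files.readAllLines(skipFile)) {
            skip.add(line.trim().toLowerCase());
        }
        return skip;
    }

    // Count words, ignoring any word found in the skip set -- the same
    // check a mapper would perform before emitting a (word, 1) pair.
    static Map<String, Integer> countWords(String text, Set<String> skip) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty() || skip.contains(token)) continue;
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // "skip-words.txt" is an illustrative name; in a real job it would
        // be shipped via job.addCacheFile(new URI("/path/skip-words.txt")).
        Path skipFile = Files.createTempFile("skip-words", ".txt");
        Files.write(skipFile, Arrays.asList("the", "a", "an"));

        Set<String> skip = loadSkipWords(skipFile);
        Map<String, Integer> counts =
                countWords("the quick fox and the lazy dog", skip);
        System.out.println(counts); // "the" is absent; other words counted
    }
}
```

The same filter placed inside a mapper means the skipped words never reach the shuffle phase at all, which is the point of distributing the skip list rather than filtering after the fact.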
DistributedCache tracks the modification timestamps of the cache files. The cache files must not be modified by the application or externally while the job is executing.
These files are deleted from the slave nodes once the task is complete.
For more detail follow: Distributed Cache in Hadoop