In Hadoop, input data is divided into blocks, and these blocks are stored on different DataNodes. MapReduce jobs then process this data in parallel and independently.
However, some files may be required on every DataNode for the execution of a MapReduce job.
The framework copies such files to each slave node before any task of the job is executed on that node. In this way, the Distributed Cache minimizes network data transfer by placing those files locally on the DataNodes.
For example, consider the WordCount program, which counts the number of times each word occurs. Suppose we want to skip certain words and not count them. We can place these words in a file and distribute that file to all nodes through the Distributed Cache.
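The skip-word logic above can be sketched as follows. This is a minimal, self-contained illustration, not the actual Hadoop mapper: the file name `skip-words.txt` is an assumed example, and the cache lookup is simulated with an ordinary local file. In a real job, the file would be registered with `job.addCacheFile(...)` in the driver and loaded from the task's local working directory in the mapper's `setup()` method.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class SkipWordCount {

    // Load the skip list, as a mapper's setup() would load the task-local
    // copy that the Distributed Cache placed on the node.
    static Set<String> loadSkipWords(Path skipFile) throws IOException {
        Set<String> skip = new HashSet<>();
        for (String line : Files.readAllLines(skipFile)) {
            skip.add(line.trim().toLowerCase());
        }
        return skip;
    }

    // Count words, ignoring any word found in the skip set -- the same
    // check a mapper would perform before emitting a (word, 1) pair.
    static Map<String, Integer> countWords(String text, Set<String> skip) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty() || skip.contains(token)) continue;
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // "skip-words.txt" is an illustrative name; in a real job it would
        // be shipped via job.addCacheFile(new URI("/path/skip-words.txt")).
        Path skipFile = Files.createTempFile("skip-words", ".txt");
        Files.write(skipFile, Arrays.asList("the", "a", "an"));

        Set<String> skip = loadSkipWords(skipFile);
        Map<String, Integer> counts =
                countWords("the quick fox and the lazy dog", skip);
        System.out.println(counts); // "the" is absent; other words counted
    }
}
```

The same filter placed inside a mapper means the skipped words never reach the shuffle phase at all, which is the point of distributing the skip list rather than filtering after the fact.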
DistributedCache tracks the modification timestamps of the cache files. The cache files must not be modified by the application or externally while the job is executing.
These files are deleted from the slave nodes once the task is complete.
For more detail follow: Distributed Cache in Hadoop