Distributed Cache in Hadoop: Most Comprehensive Guide


1. Distributed Cache in Hadoop: Objective

In this blog about the Hadoop distributed cache, you will learn what the distributed cache in Hadoop is, how it works, and how to implement it in the Hadoop framework. This tutorial also covers the advantages and limitations of the Apache Hadoop Distributed Cache.


2. Introduction to Hadoop

Apache Hadoop is an open-source software framework for distributed storage and processing of large data sets. Hadoop follows a master-slave architecture in which the NameNode is the master and the DataNodes are the slaves. The NameNode stores metadata, i.e. the number of blocks, their locations, and their replicas. The DataNodes store the actual data in HDFS and perform read and write operations as requested by clients.

In Hadoop, data chunks are processed in parallel across the DataNodes by a program written by the user. If we want to access some files from all the DataNodes, we put those files in the distributed cache.

3. What is Distributed Cache in Hadoop?

Distributed Cache is a facility provided by the Hadoop MapReduce framework. It caches files needed by applications. It can cache read-only text files, archives, jar files, etc. Once we have cached a file for our job, Hadoop makes it available on each DataNode where map/reduce tasks are running.

Thus, we can access the cached files from all the DataNodes in our map and reduce tasks.
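
To make this concrete, here is a minimal sketch of how a task might use a cached lookup file once it is available locally. It uses plain Java only, so it runs without a cluster. The class name `CachedLookupSketch`, the tab-separated "code, name" layout, and the sample records are all assumptions made for illustration; in a real mapper the lines would be read from the localized cache file, typically once in `setup()`.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CachedLookupSketch {
    // Hypothetical lookup table, filled once per task from the cached file.
    private final Map<String, String> nameByCode = new HashMap<>();

    // Parse "code<TAB>name" lines; malformed lines are skipped.
    public void load(List<String> lines) {
        for (String line : lines) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                nameByCode.put(parts[0], parts[1]);
            }
        }
    }

    // Map-style function: enrich one input record using the in-memory table.
    public String enrich(String code) {
        return code + "," + nameByCode.getOrDefault(code, "UNKNOWN");
    }

    public static void main(String[] args) {
        CachedLookupSketch task = new CachedLookupSketch();
        // Stand-in for the contents of the cached file.
        task.load(Arrays.asList("IN\tIndia", "US\tUnited States"));
        System.out.println(task.enrich("IN"));
        System.out.println(task.enrich("FR"));
    }
}
```

Because every task loads the same read-only file, each record can be enriched locally without a network lookup.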

3.1. Working and Implementation of Distributed Cache in Hadoop

First of all, an application that needs to use the distributed cache to distribute a file:

  • Should make sure that the file is available.
  • Should also make sure that the file can be accessed via URLs. URLs can be either hdfs:// or http://.
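
As a small illustration of the second requirement, the sketch below checks whether a URL uses one of the two schemes mentioned above. It relies only on `java.net.URI`; the class name `CacheUriCheck`, the helper `isCacheable`, and the host `namenode:8020` are hypothetical and chosen for this example.

```java
import java.net.URI;

public class CacheUriCheck {
    // Returns true when the URI uses one of the two schemes named in the text.
    static boolean isCacheable(String uri) {
        String scheme = URI.create(uri).getScheme();
        return "hdfs".equals(scheme) || "http".equals(scheme);
    }

    public static void main(String[] args) {
        // Hypothetical NameNode address, for illustration only.
        System.out.println(isCacheable("hdfs://namenode:8020/user/dataflair/lib/jar_file.jar"));
        System.out.println(isCacheable("file:///tmp/jar_file.jar"));
    }
}
```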

Now, if the file is present at the above URLs, the user registers it as a cache file with the distributed cache. The MapReduce job copies the cache file to every node before any tasks start on that node.
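
The copy step can be pictured with a local simulation. The sketch below is not Hadoop's implementation; it only imitates the idea by copying one shared file into a directory per "node" before any work starts. The class name `LocalizeSketch`, the directory names `node1` and `node2`, and the file `lookup.txt` are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LocalizeSketch {
    // Copy one shared file into each node's local directory before tasks run.
    static List<Path> localize(Path sharedFile, List<Path> nodeDirs) throws IOException {
        List<Path> localCopies = new ArrayList<>();
        for (Path nodeDir : nodeDirs) {
            Files.createDirectories(nodeDir);
            Path local = nodeDir.resolve(sharedFile.getFileName());
            Files.copy(sharedFile, local, StandardCopyOption.REPLACE_EXISTING);
            localCopies.add(local);
        }
        return localCopies;
    }

    public static void main(String[] args) throws IOException {
        // Temporary directories stand in for the shared store and the nodes.
        Path root = Files.createTempDirectory("cache-sketch");
        Path shared = Files.write(root.resolve("lookup.txt"), "IN\tIndia\n".getBytes());
        List<Path> copies = localize(shared, Arrays.asList(root.resolve("node1"), root.resolve("node2")));
        for (Path copy : copies) {
            System.out.println(copy + " -> " + Files.size(copy) + " bytes");
        }
    }
}
```

Each simulated node ends up with its own copy, so later reads are local.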

The process is as follows:

  • Copy the requisite file to HDFS:

$ hdfs dfs -put jar_file.jar /user/dataflair/lib/jar_file.jar

  • Set up the application’s JobConf:

DistributedCache.addFileToClassPath(new Path("/user/dataflair/lib/jar_file.jar"), conf);

  • Add the line above to the Driver class.

3.2. Size of Distributed Cache in Hadoop

The cache size property in mapred-site.xml controls the size of the distributed cache. By default, the size of the Hadoop distributed cache is 10 GB.
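
If the property expects a value in bytes (as `local.cache.size` did in Hadoop 1.x, to the best of our knowledge; check the documentation for your release, since newer versions may use a different setting), the 10 GB default can be worked out as below. The class and method names are hypothetical.

```java
public class CacheSizeBytes {
    // Convert gigabytes to bytes using binary units (1 GB = 1024^3 bytes).
    static long gigabytesToBytes(long gb) {
        return gb * 1024L * 1024L * 1024L;
    }

    public static void main(String[] args) {
        // 10 GB expressed in bytes.
        System.out.println(gigabytesToBytes(10)); // prints 10737418240
    }
}
```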

4. Benefits of Distributed Cache in Hadoop

Below are some advantages of the MapReduce Distributed Cache:

4.1. Store Complex Data

It distributes simple, read-only text files as well as complex types such as jars and archives. These archives are then un-archived on the slave nodes.

4.2. Data Consistency

Hadoop Distributed Cache tracks the modification timestamps of cache files and requires that the files not change while a job is executing. Using a hashing algorithm, the cache engine can always determine on which node a particular key-value pair resides. Since there is always a single state of the cache cluster, it is never inconsistent.
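
The timestamp idea can be sketched with plain Java file APIs. This is not Hadoop's own code; it only shows the general technique of recording a file's modification time up front and checking it again later. The class name `TimestampCheckSketch` and its two helpers are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

public class TimestampCheckSketch {
    // Record the modification time when the job is set up.
    static FileTime snapshot(Path file) throws IOException {
        return Files.getLastModifiedTime(file);
    }

    // Later, verify that the file still carries the recorded timestamp.
    static boolean unchangedSince(Path file, FileTime recorded) throws IOException {
        return Files.getLastModifiedTime(file).equals(recorded);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("cache-file", ".txt");
        FileTime recorded = snapshot(file);
        System.out.println(unchangedSince(file, recorded));

        // Simulate a modification by moving the timestamp forward one minute.
        Files.setLastModifiedTime(file, FileTime.fromMillis(recorded.toMillis() + 60_000));
        System.out.println(unchangedSince(file, recorded));
    }
}
```

The first check reports the file as unchanged; after the simulated modification, the second check detects the mismatch.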

4.3. No Single Point of Failure

A distributed cache runs as an independent process across many nodes. Thus, failure of a single node does not result in a complete failure of the cache.

5. Overhead of Distributed Cache

A MapReduce distributed cache has overhead that will make it slower than an in-process cache:

5.1. Object serialization

A distributed cache must serialize objects. But the serialization mechanism has two major problems:

  • Very slow: Serialization uses reflection to inspect type information at runtime. Reflection is very slow compared to pre-compiled code.
  • Very bulky: Serialization stores the complete class name, cluster, and assembly details. It also stores references to other instances held in member variables. All this makes the serialized form very bulky.
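
To give a feel for the "bulky" point, the sketch below compares the size of an object written with Java's built-in serialization against a hand-written encoding of the same two fields. Java serialization is used here only as an illustrative stand-in for a general-purpose serializer; the `Point` class and both helper methods are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationSizeSketch {
    // Hypothetical record type used only for this comparison.
    static class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Built-in serialization: writes class metadata plus the field values.
    static int javaSerializedSize(Point p) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(p);
        }
        return bytes.size();
    }

    // Hand-written encoding: writes only the two int fields (8 bytes).
    static int manualSize(Point p) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(p.x);
            out.writeInt(p.y);
        }
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        Point p = new Point(3, 4);
        System.out.println("built-in serialization: " + javaSerializedSize(p) + " bytes");
        System.out.println("manual encoding:        " + manualSize(p) + " bytes");
    }
}
```

The manual encoding is always 8 bytes for this type, while the built-in form is larger because it also carries the class name and field descriptors.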

6. Distributed Cache in Hadoop – Conclusion

In conclusion, the distributed cache is a mechanism supported by the Hadoop MapReduce framework. Using the distributed cache, we can broadcast small or moderately sized read-only files to all the worker nodes. The distributed cache files are deleted from the worker nodes once the job has run successfully.
