What type of data we should put in Distributed Cache?

  • Author
    Posts
    • #5418
      DataFlair Team
      Spectator

      What type of data should we put in the Distributed Cache?
      When should we put data in the Distributed Cache? Are there any best practices for this?
      How much data should we put in it? What is the upper limit?

    • #5420
      DataFlair Team
      Spectator


      The MapReduce framework provides a Distributed Cache to cache files needed by applications.
      It can cache read-only text/data files, archives, jar files, etc.
      Once we have cached a file for our job, Hadoop makes it available on every datanode where map/reduce tasks are running, so we can easily access the file from any of the datanodes in our map and reduce jobs.

      An application that needs to use the Distributed Cache should make sure the files are available via URLs, either hdfs:// or http://. If a file is present at one of these URLs, the user registers it as a cache file with the Distributed Cache, and the framework copies it to all the nodes before any tasks start on those nodes (see the driver-side sketch below).
      By default, the size of the Distributed Cache is 10 GB. We can adjust it using the local.cache.size property.
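
      To make this concrete, here is a minimal driver-side sketch using the newer org.apache.hadoop.mapreduce API. The HDFS path, job name, and class name are hypothetical placeholders, not from the original post:

      import java.net.URI;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.mapreduce.Job;

      public class CacheFileDriver {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              Job job = Job.getInstance(conf, "distributed-cache-example");
              job.setJarByClass(CacheFileDriver.class);

              // Register a read-only lookup file that already exists on HDFS.
              // The "#lookup" fragment asks the framework to create a symlink
              // named "lookup" in each task's working directory.
              job.addCacheFile(new URI("hdfs:///user/hadoop/lookup.txt#lookup"));

              // ... set mapper/reducer classes and input/output paths here,
              // then submit the job:
              // System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
      }

      The "#lookup" fragment keeps the task code independent of where the node-local copy of the file actually lives.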

      Benefits of Distributed Cache:

      1) Data consistency - The Distributed Cache tracks the modification timestamps of cache files and ensures that the files are not changed while a job is executing. Using a hashing algorithm, the cache engine can determine the node on which a particular key-value pair resides, so there is always a single state of the cache cluster; it is never inconsistent.
      2) Stores complex data - It distributes read-only text/data files as well as complex types such as jars and archives. Archives are un-archived at the slave nodes.
      3) No single point of failure - The Distributed Cache runs as an independent process across many nodes, so the failure of a single node does not result in a complete failure of the cache.

      Follow the link to learn more about Distributed Cache in Hadoop


    • #5423
      DataFlair Team
      Spectator

      In Hadoop, data is processed in chunks, each of which is processed independently. If you want a file to be accessible on all Datanodes, you can cache it; this facility is called the Distributed Cache in Hadoop.

      It can cache read-only text files, archives, jar files, etc.

      Files can be accessed via URLs, either hdfs:// or http://.

      Once you cache a file, the MapReduce framework makes sure the file is copied to all Datanodes where tasks run (the file is registered through the job configuration, JobConf in the old API). A mapper can then read the node-local copy, as in the sketch below.
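
      Here is a minimal mapper-side sketch, assuming the file was registered with a "#lookup" fragment as in the driver sketch above; the class and file names are illustrative only:

      import java.io.BufferedReader;
      import java.io.FileReader;
      import java.io.IOException;
      import java.util.HashMap;
      import java.util.Map;

      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
          private final Map<String, String> lookup = new HashMap<>();

          @Override
          protected void setup(Context context) throws IOException, InterruptedException {
              // The cached file has already been copied to this node; the
              // "lookup" symlink in the working directory points at the local copy.
              try (BufferedReader reader = new BufferedReader(new FileReader("lookup"))) {
                  String line;
                  while ((line = reader.readLine()) != null) {
                      String[] parts = line.split("\t", 2);
                      if (parts.length == 2) {
                          lookup.put(parts[0], parts[1]);
                      }
                  }
              }
          }

          @Override
          protected void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              // Enrich each input record from the in-memory lookup table.
              String enriched = lookup.getOrDefault(value.toString(), "UNKNOWN");
              context.write(value, new Text(enriched));
          }
      }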

      By default, the size of the Distributed Cache is 10 GB. We can adjust it using local.cache.size in mapred-site.xml (see the configuration sketch below).
      But it is better to keep only a few MBs of data in the Distributed Cache; otherwise, it will affect the performance of your application.
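
      As an illustration, the limit can be raised or lowered in mapred-site.xml as below. Note that local.cache.size is the legacy property name (later releases rename it mapreduce.tasktracker.cache.local.size), and the value is given in bytes:

      <configuration>
        <property>
          <name>local.cache.size</name>
          <!-- size in bytes; 10737418240 bytes = 10 GB, the default -->
          <value>10737418240</value>
        </property>
      </configuration>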

      For more detail, follow: Distributed Cache in Hadoop
