What type of data we should put in Distributed Cache?

  • Author
    Posts
    • #5418
      DataFlair Team
      Spectator

      What type of data should we put in the Distributed Cache?
      When should we put data in the Distributed Cache? Are there any best practices for this?
      How much data should we put in it? What is the upper limit?

    • #5420
      DataFlair Team
      Spectator


      The MapReduce framework provides a Distributed Cache to cache files needed by applications.
      It can cache read-only text/data files, archives, jar files, etc.
      Once we have cached a file for our job, Hadoop makes it available on every datanode where map/reduce tasks are running, so we can easily access the file from any of the datanodes in our map and reduce jobs.

      An application that needs to use the Distributed Cache should make sure the files are available via URLs, either hdfs:// or http://. If a file is present at one of these URLs, the user registers it as a cache file with the Distributed Cache, and the framework copies it to all the nodes before any tasks start on those nodes (see the driver-side sketch below).
      By default, the size of the Distributed Cache is 10 GB. We can adjust it using the local.cache.size property.
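
      To make this concrete, here is a minimal driver-side sketch using the newer org.apache.hadoop.mapreduce API. The HDFS path, job name, and class name are hypothetical placeholders, not from the original post:

      import java.net.URI;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.mapreduce.Job;

      public class CacheFileDriver {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              Job job = Job.getInstance(conf, "distributed-cache-example");
              job.setJarByClass(CacheFileDriver.class);

              // Register a read-only lookup file that already exists on HDFS.
              // The "#lookup" fragment asks the framework to create a symlink
              // named "lookup" in each task's working directory.
              job.addCacheFile(new URI("hdfs:///user/hadoop/lookup.txt#lookup"));

              // ... set mapper/reducer classes and input/output paths here,
              // then submit the job:
              // System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
      }

      The "#lookup" fragment keeps the task code independent of where the node-local copy of the file actually lives.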

      Benefits of Distributed Cache:

      1) Data consistency - The Distributed Cache tracks the modification timestamps of cache files and ensures that the files are not changed while a job is executing. Using a hashing algorithm, the cache engine can determine the node on which a particular key-value pair resides, so there is always a single state of the cache cluster; it is never inconsistent.
      2) Stores complex data - It distributes read-only text/data files as well as complex types such as jars and archives. Archives are un-archived at the slave nodes.
      3) No single point of failure - The Distributed Cache runs as an independent process across many nodes, so the failure of a single node does not result in a complete failure of the cache.

      Follow the link to learn more about Distributed Cache in Hadoop


    • #5423
      DataFlair Team
      Spectator

      In Hadoop, data is processed in chunks, each of which is processed independently. If you want a file to be accessible on all Datanodes, you can cache it; this facility is called the Distributed Cache in Hadoop.

      It can cache read-only text files, archives, jar files, etc.

      Files can be accessed via URLs, either hdfs:// or http://.

      Once you cache a file, the MapReduce framework makes sure the file is copied to all Datanodes where tasks run (the file is registered through the job configuration, JobConf in the old API). A mapper can then read the node-local copy, as in the sketch below.
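
      Here is a minimal mapper-side sketch, assuming the file was registered with a "#lookup" fragment as in the driver sketch above; the class and file names are illustrative only:

      import java.io.BufferedReader;
      import java.io.FileReader;
      import java.io.IOException;
      import java.util.HashMap;
      import java.util.Map;

      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
          private final Map<String, String> lookup = new HashMap<>();

          @Override
          protected void setup(Context context) throws IOException, InterruptedException {
              // The cached file has already been copied to this node; the
              // "lookup" symlink in the working directory points at the local copy.
              try (BufferedReader reader = new BufferedReader(new FileReader("lookup"))) {
                  String line;
                  while ((line = reader.readLine()) != null) {
                      String[] parts = line.split("\t", 2);
                      if (parts.length == 2) {
                          lookup.put(parts[0], parts[1]);
                      }
                  }
              }
          }

          @Override
          protected void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              // Enrich each input record from the in-memory lookup table.
              String enriched = lookup.getOrDefault(value.toString(), "UNKNOWN");
              context.write(value, new Text(enriched));
          }
      }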

      By default, the size of the Distributed Cache is 10 GB. We can adjust it using local.cache.size in mapred-site.xml (see the configuration sketch below).
      But it is better to keep only a few MBs of data in the Distributed Cache; otherwise, it will affect the performance of your application.
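
      As an illustration, the limit can be raised or lowered in mapred-site.xml as below. Note that local.cache.size is the legacy property name (later releases rename it mapreduce.tasktracker.cache.local.size), and the value is given in bytes:

      <configuration>
        <property>
          <name>local.cache.size</name>
          <!-- size in bytes; 10737418240 bytes = 10 GB, the default -->
          <value>10737418240</value>
        </property>
      </configuration>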

      For more detail, follow: Distributed Cache in Hadoop
