

{"id":2609,"date":"2017-05-19T12:39:52","date_gmt":"2017-05-19T12:39:52","guid":{"rendered":"http:\/\/data-flair.training\/blogs\/?p=2609"},"modified":"2018-11-16T13:58:11","modified_gmt":"2018-11-16T08:28:11","slug":"apache-spark-rdd-persistence-caching","status":"publish","type":"post","link":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-persistence-caching\/","title":{"rendered":"RDD Persistence and Caching Mechanism in Apache Spark"},"content":{"rendered":"<h2>1. Objective<\/h2>\n<p>This blog takes a detailed look at <strong><a href=\"http:\/\/data-flair.training\/blogs\/apache-spark-introduction-spark-comprehensive-tutorial\/\">Apache Spark<\/a><\/strong> RDD persistence and caching. This tutorial answers the following questions: what is RDD persistence, why do we need to call <strong><em>cache<\/em><\/strong> or <strong><em>persist<\/em><\/strong> on an RDD, what is the difference between the cache() and persist() methods in Spark, what are the different storage levels in Spark for storing a persisted RDD, and how do we unpersist an RDD? The need for persisting RDDs and the advantages of persistence are also discussed in this Spark tutorial.<\/p>\n<div id=\"attachment_42329\" style=\"width: 1210px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/RDD-Persistence-and-Caching-Mechanism-in-Apache-Spark-2.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-42329\" class=\"size-full wp-image-42329\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/RDD-Persistence-and-Caching-Mechanism-in-Apache-Spark-2.jpg\" alt=\"RDD Persistence and Caching Mechanism in Apache Spark\" width=\"1200\" height=\"628\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/RDD-Persistence-and-Caching-Mechanism-in-Apache-Spark-2.jpg 1200w, 
https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/RDD-Persistence-and-Caching-Mechanism-in-Apache-Spark-2-150x79.jpg 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/RDD-Persistence-and-Caching-Mechanism-in-Apache-Spark-2-300x157.jpg 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/RDD-Persistence-and-Caching-Mechanism-in-Apache-Spark-2-768x402.jpg 768w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/RDD-Persistence-and-Caching-Mechanism-in-Apache-Spark-2-1024x536.jpg 1024w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/RDD-Persistence-and-Caching-Mechanism-in-Apache-Spark-2-520x272.jpg 520w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><\/a><p id=\"caption-attachment-42329\" class=\"wp-caption-text\">RDD Persistence and Caching Mechanism in Apache Spark<\/p><\/div>\n<h2>2. What is RDD Persistence and Caching in Spark?<\/h2>\n<p><span style=\"font-weight: 400\">Spark <strong><a href=\"http:\/\/data-flair.training\/blogs\/rdd-in-apache-spark\/\">RDD<\/a><\/strong> persistence is an optimization technique that saves the result of an RDD evaluation. It lets us keep an intermediate result so that we can reuse it later, which reduces the computation overhead.<\/span><br \/>\n<span style=\"font-weight: 400\">We can persist an RDD with the <strong>cache()<\/strong> and <strong>persist()<\/strong> methods. The cache() method stores the RDD in memory, so that it can be reused efficiently across parallel operations. 
<\/span><br \/>\n<span style=\"font-weight: 400\">The difference between cache() and persist() is that cache() always uses the default storage level, <strong>MEMORY_ONLY<\/strong>, while persist() lets us choose among various storage levels (described below). <\/span><span style=\"font-weight: 400\">Caching is a key tool for iterative algorithms and fast interactive use. When we persist an RDD, each node stores the partitions of it that it computes in memory and reuses them in other actions on that dataset (or on datasets derived from it). This can make later actions much faster, often by more than 10x.<\/span><br \/>\n<span style=\"font-weight: 400\">The first time the RDD is computed in an action, it is kept in memory on the nodes. The Spark cache is <a href=\"http:\/\/data-flair.training\/blogs\/apache-spark-streaming-fault-tolerance\/\">fault tolerant<\/a>: if any partition of an RDD is lost, it is automatically recomputed using the <a href=\"http:\/\/data-flair.training\/blogs\/rdd-transformations-actions-apis-apache-spark\/\">transformations<\/a> that originally created it.<\/span><\/p>\n<h2>3. Need for Persistence in Apache Spark<\/h2>\n<p><span style=\"font-weight: 400\">In Spark, we often use the same RDD multiple times. By default, Spark repeats the whole <strong>RDD<\/strong> <strong>evaluation<\/strong> each time an action is called on the RDD. This can be time and memory consuming, especially for iterative algorithms that look at the data multiple times. Persistence solves this problem of repeated computation.<\/span><\/p>\n<h2>4. Benefits of RDD Persistence in Spark<\/h2>\n<p><span style=\"font-weight: 400\">The RDD caching and persistence mechanism in Spark has several advantages. 
It makes the whole system<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Time efficient<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Cost efficient<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Faster, because execution time is reduced<\/span><\/li>\n<\/ul>\n<h2>5. Storage levels of Persisted RDDs<\/h2>\n<p>Using <strong>persist()<\/strong>, we can choose among various storage levels for persisted RDDs in Apache Spark. Let&#8217;s discuss each RDD storage level one by one.<\/p>\n<h3>a. MEMORY_ONLY<\/h3>\n<p><span style=\"font-weight: 400\">At this storage level, the RDD is stored as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions are not cached and are recomputed each time they are needed. At this level the space used for storage is very high, the CPU computation time is low, and the data is stored in memory. It does not make use of the disk.<\/span><\/p>\n<h3>b. MEMORY_AND_DISK<\/h3>\n<p><span style=\"font-weight: 400\">At this level, the RDD is stored as deserialized Java objects in the JVM. If the RDD does not fit in memory, the partitions that do not fit are stored on disk and read from there when they are needed. At this level the space used for storage is high, the CPU computation time is medium, and it makes use of both in-memory and on-disk storage.<\/span><\/p>\n<h3>c. MEMORY_ONLY_SER<\/h3>\n<p><span style=\"font-weight: 400\">This level stores the RDD as serialized Java objects (one byte array per partition). It is generally more space efficient than deserialized objects, especially with a fast serializer, but it is more CPU-intensive to read. At this level the storage space is low, the CPU computation time is high, and the data is stored in memory. It does not make use of the disk.<\/span><\/p>\n<h3>d. 
MEMORY_AND_DISK_SER<\/h3>\n<p><span style=\"font-weight: 400\">It is similar to <strong>MEMORY_ONLY_SER<\/strong>, but it spills partitions that do not fit in memory to disk, rather than recomputing them each time they are needed. At this storage level, <\/span><span style=\"font-weight: 400\">the space used for storage is low, the CPU computation time is high, and it makes use of both in-memory and on-disk storage.<\/span><\/p>\n<h3>e. DISK_ONLY<\/h3>\n<p><span style=\"font-weight: 400\">At this storage level, the RDD partitions are stored only on disk. The space used for storage is low, the CPU computation time is high, and it makes use of on-disk storage only.<\/span><br \/>\nRefer to this guide for a <a href=\"http:\/\/data-flair.training\/blogs\/apache-spark-in-memory-computing\/\">detailed description of Spark in-memory computation<\/a>.<\/p>\n<h2>6. How to Unpersist RDD in Spark?<\/h2>\n<p><span style=\"font-weight: 400\">Spark automatically monitors cache usage on each node and drops old data partitions in least-recently-used (LRU) fashion: when space is needed, the partitions that have gone unused the longest are evicted first. We can also remove an RDD from the cache manually, without waiting for eviction, using the <strong>RDD.unpersist()<\/strong> method.<\/span><\/p>\n<h2>7. Conclusion<\/h2>\n<p>Hence,<strong> caching<\/strong> and <strong>persistence<\/strong> are optimization techniques for interactive and iterative Spark computations. They save intermediate results so that we can reuse them in subsequent stages. 
These intermediate results, stored as RDDs, are kept in memory (the default) or in more durable storage such as disk, and can also be replicated.<br \/>\n<strong>What Next &#8211;\u00a0<\/strong><\/p>\n<ul>\n<li><a href=\"http:\/\/data-flair.training\/blogs\/apache-spark-features\/\">Features of Apache Spark.<\/a><\/li>\n<li><a href=\"http:\/\/data-flair.training\/blogs\/limitations-of-apache-spark-overcome-spark-drawbacks\/\">Limitations of Apache Spark.<\/a><\/li>\n<li><a href=\"http:\/\/data-flair.training\/blogs\/how-apache-spark-works-run-time-spark-architecture\/\">How Apache Spark works.<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>1. Objective This blog covers the detailed view of\u00a0Apache Spark\u00a0RDD Persistence and Caching. This tutorial gives the answers for &#8211; What is RDD persistence, Why do we need to call cache or persist on&#46;&#46;&#46;<\/p>\n","protected":false},"author":6,"featured_media":42329,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[10],"tags":[1802,6339,11347,13105,13172,13862],"class_list":["post-2609","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-spark","tag-benifits-of-persistance-in-spark","tag-how-to-unpersist-rdd-in-spark","tag-rdd-persist-and-caching-mechanism","tag-spark-rdd-caching","tag-sparkrdd-persistcnce","tag-storage-levels-of-rdd-in-spark"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>RDD Persistence and Caching Mechanism in Apache Spark - DataFlair<\/title>\n<meta name=\"description\" content=\"Learn what is Rdd persistence and caching in spark,when to persist &amp; unpersist RDDs,why persistance,RDD caching &amp; persisting benefits,storage levels of RDD.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link 
rel=\"canonical\" href=\"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-persistence-caching\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"RDD Persistence and Caching Mechanism in Apache Spark - DataFlair\" \/>\n<meta property=\"og:description\" content=\"Learn what is Rdd persistence and caching in spark,when to persist &amp; unpersist RDDs,why persistance,RDD caching &amp; persisting benefits,storage levels of RDD.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-persistence-caching\/\" \/>\n<meta property=\"og:site_name\" content=\"DataFlair\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DataFlairWS\/\" \/>\n<meta property=\"article:published_time\" content=\"2017-05-19T12:39:52+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2018-11-16T08:28:11+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/RDD-Persistence-and-Caching-Mechanism-in-Apache-Spark-2.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"DataFlair Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:site\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"DataFlair Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"RDD Persistence and Caching Mechanism in Apache Spark - DataFlair","description":"Learn what is Rdd persistence and caching in spark,when to persist & unpersist RDDs,why persistance,RDD caching & persisting benefits,storage levels of RDD.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-persistence-caching\/","og_locale":"en_US","og_type":"article","og_title":"RDD Persistence and Caching Mechanism in Apache Spark - DataFlair","og_description":"Learn what is Rdd persistence and caching in spark,when to persist & unpersist RDDs,why persistance,RDD caching & persisting benefits,storage levels of RDD.","og_url":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-persistence-caching\/","og_site_name":"DataFlair","article_publisher":"https:\/\/www.facebook.com\/DataFlairWS\/","article_published_time":"2017-05-19T12:39:52+00:00","article_modified_time":"2018-11-16T08:28:11+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/RDD-Persistence-and-Caching-Mechanism-in-Apache-Spark-2.jpg","type":"image\/jpeg"}],"author":"DataFlair Team","twitter_card":"summary_large_image","twitter_creator":"@DataFlairWS","twitter_site":"@DataFlairWS","twitter_misc":{"Written by":"DataFlair Team","Est. 
reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-persistence-caching\/#article","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-persistence-caching\/"},"author":{"name":"DataFlair Team","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/2c58ecb4f73a39f0ef993f1ddfcd7b89"},"headline":"RDD Persistence and Caching Mechanism in Apache Spark","datePublished":"2017-05-19T12:39:52+00:00","dateModified":"2018-11-16T08:28:11+00:00","mainEntityOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-persistence-caching\/"},"wordCount":864,"commentCount":4,"publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-persistence-caching\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/RDD-Persistence-and-Caching-Mechanism-in-Apache-Spark-2.jpg","keywords":["benifits of persistance in spark","how to unpersist rdd in spark","RDD Persist and caching mechanism","Spark RDD caching","SparkRdd persistcnce","storage levels of rdd in spark"],"articleSection":["Apache Spark Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/data-flair.training\/blogs\/apache-spark-rdd-persistence-caching\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-persistence-caching\/","url":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-persistence-caching\/","name":"RDD Persistence and Caching Mechanism in Apache Spark - 
DataFlair","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/#website"},"primaryImageOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-persistence-caching\/#primaryimage"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-persistence-caching\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/RDD-Persistence-and-Caching-Mechanism-in-Apache-Spark-2.jpg","datePublished":"2017-05-19T12:39:52+00:00","dateModified":"2018-11-16T08:28:11+00:00","description":"Learn what is Rdd persistence and caching in spark,when to persist & unpersist RDDs,why persistance,RDD caching & persisting benefits,storage levels of RDD.","breadcrumb":{"@id":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-persistence-caching\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/data-flair.training\/blogs\/apache-spark-rdd-persistence-caching\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-persistence-caching\/#primaryimage","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/RDD-Persistence-and-Caching-Mechanism-in-Apache-Spark-2.jpg","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/RDD-Persistence-and-Caching-Mechanism-in-Apache-Spark-2.jpg","width":1200,"height":628,"caption":"RDD Persistence and Caching Mechanism in Apache Spark"},{"@type":"BreadcrumbList","@id":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-persistence-caching\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Blog Home","item":"https:\/\/data-flair.training\/blogs\/"},{"@type":"ListItem","position":2,"name":"Apache Spark Tutorials","item":"https:\/\/data-flair.training\/blogs\/category\/spark\/"},{"@type":"ListItem","position":3,"name":"RDD Persistence and Caching Mechanism in Apache 
Spark"}]},{"@type":"WebSite","@id":"https:\/\/data-flair.training\/blogs\/#website","url":"https:\/\/data-flair.training\/blogs\/","name":"DataFlair","description":"Learn Today. Lead Tomorrow.","publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/data-flair.training\/blogs\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/data-flair.training\/blogs\/#organization","name":"DataFlair","url":"https:\/\/data-flair.training\/blogs\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","width":106,"height":48,"caption":"DataFlair"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DataFlairWS\/","https:\/\/x.com\/DataFlairWS","https:\/\/www.linkedin.com\/company\/dataflair-web-services-pvt-ltd\/","https:\/\/www.youtube.com\/user\/DataFlairWS"]},{"@type":"Person","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/2c58ecb4f73a39f0ef993f1ddfcd7b89","name":"DataFlair Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","caption":"DataFlair Team"},"description":"The 
DataFlair Team provides industry-driven content on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. Our expert educators focus on delivering value-packed, easy-to-follow resources for tech enthusiasts and professionals.","url":"https:\/\/data-flair.training\/blogs\/author\/dfteam2\/"}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/2609","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/comments?post=2609"}],"version-history":[{"count":6,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/2609\/revisions"}],"predecessor-version":[{"id":42330,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/2609\/revisions\/42330"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media\/42329"}],"wp:attachment":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media?parent=2609"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/categories?post=2609"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/tags?post=2609"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}