{"id":2889,"date":"2017-06-17T06:15:19","date_gmt":"2017-06-17T06:15:19","guid":{"rendered":"http:\/\/data-flair.training\/blogs\/?p=2889"},"modified":"2018-11-16T14:48:16","modified_gmt":"2018-11-16T09:18:16","slug":"spark-streaming-checkpoint","status":"publish","type":"post","link":"https:\/\/data-flair.training\/blogs\/spark-streaming-checkpoint\/","title":{"rendered":"Spark Streaming Checkpoint in Apache Spark"},"content":{"rendered":"<h2>1. Objective<\/h2>\n<p>This <strong>Apache Spark<\/strong> Streaming Checkpoint tutorial covers Spark checkpointing in detail. First, we discuss what checkpointing is in Spark and how it helps to achieve fault tolerance in Apache Spark. We then cover the types of Apache Spark checkpointing, i.e. metadata checkpointing and data checkpointing, and compare checkpointing with persist() in Spark.<\/p>\n<div id=\"attachment_42372\" style=\"width: 1210px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/06\/spark-streaming-checkpoint-in-apache-spark-1.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-42372\" class=\"size-full wp-image-42372\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/06\/spark-streaming-checkpoint-in-apache-spark-1.jpg\" alt=\"Spark Streaming Checkpoint in Apache Spark\" width=\"1200\" height=\"628\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/06\/spark-streaming-checkpoint-in-apache-spark-1.jpg 1200w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/06\/spark-streaming-checkpoint-in-apache-spark-1-150x79.jpg 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/06\/spark-streaming-checkpoint-in-apache-spark-1-300x157.jpg 300w, 
https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/06\/spark-streaming-checkpoint-in-apache-spark-1-768x402.jpg 768w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/06\/spark-streaming-checkpoint-in-apache-spark-1-1024x536.jpg 1024w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/06\/spark-streaming-checkpoint-in-apache-spark-1-520x272.jpg 520w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><\/a><p id=\"caption-attachment-42372\" class=\"wp-caption-text\">Spark Streaming Checkpoint in Apache Spark<\/p><\/div>\n<p>If you want to learn Apache Spark from the basics, our previous post on <a href=\"http:\/\/data-flair.training\/blogs\/apache-spark-tutorial\/\"><strong>Apache Spark Introduction<\/strong><\/a> will help you. It includes a Spark video that helps you learn Apache Spark in a simple way.<\/p>\n<h2>2. Introduction to Spark Streaming Checkpoint<\/h2>\n<p>A <a href=\"http:\/\/data-flair.training\/blogs\/apache-spark-streaming-comprehensive-guide\/\"><strong>Spark Streaming<\/strong><\/a> application must be operational 24\/7, so the system must also be fault tolerant. If any data is lost, recovery should be fast. Spark Streaming accomplishes this using <strong>checkpointing<\/strong>.<br \/>\n<em>Checkpointing<\/em> is a process that truncates the <a href=\"http:\/\/data-flair.training\/blogs\/directed-acyclic-graph-dag-in-apache-spark\/\">RDD lineage graph<\/a>. It periodically saves the application state to reliable storage (<a href=\"http:\/\/data-flair.training\/blogs\/comprehensive-hdfs-guide-introduction-architecture-data-read-write-tutorial\/\"><strong>HDFS<\/strong><\/a>). When the driver restarts, the application recovers from the saved checkpoint.<br \/>\n<em>There are two types of data that we checkpoint in Spark:<\/em><\/p>\n<ol>\n<li><strong>Metadata Checkpointing &#8211;<\/strong>\u00a0<em>Metadata<\/em> means data about data. 
Metadata checkpointing saves this metadata to fault-tolerant storage such as HDFS. Metadata includes configurations, <strong>DStream<\/strong> operations, and incomplete batches. Configuration refers to the configuration used to create the streaming application. <a href=\"http:\/\/data-flair.training\/blogs\/apache-spark-streaming-transformation-operations\/\"><strong>DStream operations<\/strong><\/a> are the operations that define the streaming application. Incomplete batches are batches whose jobs are in the queue but have not yet completed.<\/li>\n<li><strong>Data Checkpointing &#8211;<\/strong> It refers to saving the generated <strong><a href=\"http:\/\/data-flair.training\/blogs\/apache-spark-rdd-tutorial\/\">RDD<\/a><\/strong>s to reliable storage. This is needed in some <em>stateful transformations<\/em>, where an upcoming RDD depends on the RDDs of previous batches. Because of this, the dependency chain keeps growing over time, and recovery time grows with it. To avoid such an increase in recovery time, the intermediate RDDs are periodically checkpointed to reliable storage, which cuts down the dependency chain.<\/li>\n<\/ol>\n<p>To set the Spark checkpoint directory, call <em>SparkContext.setCheckpointDir(directory: String)<\/em>. In a Spark Streaming application, the equivalent call is <em>StreamingContext.checkpoint(directory: String)<\/em>.<\/p>\n<h2>3. Types of Checkpointing in Apache Spark<\/h2>\n<p>There are two types of Apache Spark checkpointing:<\/p>\n<ol>\n<li><strong>Reliable Checkpointing &#8211;<\/strong> In this type of checkpointing, the actual RDD is saved in a reliable distributed file system, e.g. HDFS. To set the checkpoint directory, call\u00a0<em>SparkContext.setCheckpointDir(directory: String)<\/em>. When running on a cluster, the directory must be an HDFS path. With a local path, the driver would try to recover the checkpointed RDD from its own local file system, while the checkpoint files are actually on the executors&#8217; machines.<\/li>\n<li><strong>Local Checkpointing &#8211;<\/strong> This type of checkpointing truncates the RDD lineage graph, for example in <em>Spark Streaming<\/em> or <em>GraphX<\/em>, without writing to a reliable distributed file system. 
Here, the RDD is persisted to local storage on the executors, which is faster but sacrifices fault tolerance.<\/li>\n<\/ol>\n<h2>4. Difference between Spark Checkpointing and Persist<\/h2>\n<p>There are several differences between Spark checkpointing and persist. Let&#8217;s discuss them one by one.<br \/>\n<strong>Persist<\/strong><\/p>\n<ul>\n<li>When we <strong><a href=\"http:\/\/data-flair.training\/blogs\/apache-spark-rdd-persistence-caching\/\">persist an RDD<\/a><\/strong> with the <strong>DISK_ONLY<\/strong> storage level, the RDD is stored on disk so that subsequent uses of that RDD do not recompute the lineage beyond that point.<\/li>\n<li>After persist() is called, Spark still remembers the lineage of the RDD, even though it does not need to recompute it.<\/li>\n<li>After the application finishes, the cache is cleared and the files are destroyed.<\/li>\n<\/ul>\n<p><strong>Checkpointing<\/strong><\/p>\n<ul>\n<li>Checkpointing stores the RDD in HDFS and removes the lineage that created it.<\/li>\n<li>Unlike the cache, the checkpoint files are not deleted when the application finishes.<\/li>\n<li>Checkpointing an RDD results in double computation: the RDD is computed once for the actual job and a second time when it is written to the checkpoint directory. Caching the RDD before checkpointing it avoids this recomputation.<\/li>\n<\/ul>\n<h2>5. Conclusion<\/h2>\n<p>In conclusion, checkpointing is a way to achieve <a href=\"http:\/\/data-flair.training\/blogs\/apache-spark-streaming-fault-tolerance\/\"><strong>fault tolerance<\/strong><\/a> for streaming data. Unlike persisted data, checkpoint files are not removed once the application finishes. Thus, an RDD should be checkpointed if its computation takes a long time, its computing chain is too long, or it depends on too many RDDs.<br \/>\n<em>If you have any questions about Spark Streaming <\/em>Checkpoint<em>, leave a comment in the section below. 
\u00a0<\/em><strong> \u00a0<\/strong><br \/>\n<strong>See Also-<\/strong><\/p>\n<ul>\n<li><a href=\"http:\/\/data-flair.training\/blogs\/apache-spark-performance-tuning\/\">Performance tuning in Apache Spark <\/a><\/li>\n<li><a href=\"http:\/\/data-flair.training\/blogs\/apache-storm-vs-apache-spark-streaming-comparison-guide\/\">Apache Storm vs Spark Streaming \u00a0 \u00a0<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>1. Objective In this blog of Apache Spark Streaming Checkpoint, you will read all about Spark Checkpoint. First of all, we will discuss What is Checkpointing in Spark, then, How Checkpointing helps to achieve&#46;&#46;&#46;<\/p>\n","protected":false},"author":6,"featured_media":42372,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[10],"tags":[896,903,13038,13055,13104,13130],"class_list":["post-2889","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-spark","tag-apache-spark","tag-apache-spark-checkpinting","tag-spark-checkpoint","tag-spark-dstream","tag-spark-rdd","tag-spark-streaming"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v28.0 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Spark Streaming Checkpoint in Apache Spark - DataFlair<\/title>\n<meta name=\"description\" content=\"Apache Spark Streaming Checkpoint introduction - what is Spark Checkpoint,spark Checkpointing types,Spark Checkpointing vs Persist,Caching and checkpointing\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/data-flair.training\/blogs\/spark-streaming-checkpoint\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Spark Streaming Checkpoint in Apache Spark - DataFlair\" 
\/>\n<meta property=\"og:description\" content=\"Apache Spark Streaming Checkpoint introduction - what is Spark Checkpoint,spark Checkpointing types,Spark Checkpointing vs Persist,Caching and checkpointing\" \/>\n<meta property=\"og:url\" content=\"https:\/\/data-flair.training\/blogs\/spark-streaming-checkpoint\/\" \/>\n<meta property=\"og:site_name\" content=\"DataFlair\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DataFlairWS\/\" \/>\n<meta property=\"article:published_time\" content=\"2017-06-17T06:15:19+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2018-11-16T09:18:16+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/06\/spark-streaming-checkpoint-in-apache-spark-1.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"DataFlair Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:site\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"DataFlair Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Spark Streaming Checkpoint in Apache Spark - DataFlair","description":"Apache Spark Streaming Checkpoint introduction - what is Spark Checkpoint,spark Checkpointing types,Spark Checkpointing vs Persist,Caching and checkpointing","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/data-flair.training\/blogs\/spark-streaming-checkpoint\/","og_locale":"en_US","og_type":"article","og_title":"Spark Streaming Checkpoint in Apache Spark - DataFlair","og_description":"Apache Spark Streaming Checkpoint introduction - what is Spark Checkpoint,spark Checkpointing types,Spark Checkpointing vs Persist,Caching and checkpointing","og_url":"https:\/\/data-flair.training\/blogs\/spark-streaming-checkpoint\/","og_site_name":"DataFlair","article_publisher":"https:\/\/www.facebook.com\/DataFlairWS\/","article_published_time":"2017-06-17T06:15:19+00:00","article_modified_time":"2018-11-16T09:18:16+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/06\/spark-streaming-checkpoint-in-apache-spark-1.jpg","type":"image\/jpeg"}],"author":"DataFlair Team","twitter_card":"summary_large_image","twitter_creator":"@DataFlairWS","twitter_site":"@DataFlairWS","twitter_misc":{"Written by":"DataFlair Team","Est. 
reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/data-flair.training\/blogs\/spark-streaming-checkpoint\/#article","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/spark-streaming-checkpoint\/"},"author":{"name":"DataFlair Team","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/2c58ecb4f73a39f0ef993f1ddfcd7b89"},"headline":"Spark Streaming Checkpoint in Apache Spark","datePublished":"2017-06-17T06:15:19+00:00","dateModified":"2018-11-16T09:18:16+00:00","mainEntityOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/spark-streaming-checkpoint\/"},"wordCount":704,"commentCount":2,"publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/spark-streaming-checkpoint\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/06\/spark-streaming-checkpoint-in-apache-spark-1.jpg","keywords":["apache spark","Apache Spark Checkpinting","Spark Checkpoint","Spark DStream","spark rdd","spark streaming"],"articleSection":["Apache Spark Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/data-flair.training\/blogs\/spark-streaming-checkpoint\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/data-flair.training\/blogs\/spark-streaming-checkpoint\/","url":"https:\/\/data-flair.training\/blogs\/spark-streaming-checkpoint\/","name":"Spark Streaming Checkpoint in Apache Spark - 
DataFlair","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/#website"},"primaryImageOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/spark-streaming-checkpoint\/#primaryimage"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/spark-streaming-checkpoint\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/06\/spark-streaming-checkpoint-in-apache-spark-1.jpg","datePublished":"2017-06-17T06:15:19+00:00","dateModified":"2018-11-16T09:18:16+00:00","description":"Apache Spark Streaming Checkpoint introduction - what is Spark Checkpoint,spark Checkpointing types,Spark Checkpointing vs Persist,Caching and checkpointing","breadcrumb":{"@id":"https:\/\/data-flair.training\/blogs\/spark-streaming-checkpoint\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/data-flair.training\/blogs\/spark-streaming-checkpoint\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/spark-streaming-checkpoint\/#primaryimage","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/06\/spark-streaming-checkpoint-in-apache-spark-1.jpg","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/06\/spark-streaming-checkpoint-in-apache-spark-1.jpg","width":1200,"height":628,"caption":"Spark Streaming Checkpoint in Apache Spark"},{"@type":"BreadcrumbList","@id":"https:\/\/data-flair.training\/blogs\/spark-streaming-checkpoint\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Blog Home","item":"https:\/\/data-flair.training\/blogs\/"},{"@type":"ListItem","position":2,"name":"Apache Spark Tutorials","item":"https:\/\/data-flair.training\/blogs\/category\/spark\/"},{"@type":"ListItem","position":3,"name":"Spark Streaming Checkpoint in Apache 
Spark"}]},{"@type":"WebSite","@id":"https:\/\/data-flair.training\/blogs\/#website","url":"https:\/\/data-flair.training\/blogs\/","name":"DataFlair","description":"Learn Today. Lead Tomorrow.","publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/data-flair.training\/blogs\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/data-flair.training\/blogs\/#organization","name":"DataFlair","url":"https:\/\/data-flair.training\/blogs\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","width":106,"height":48,"caption":"DataFlair"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DataFlairWS\/","https:\/\/x.com\/DataFlairWS","https:\/\/www.linkedin.com\/company\/dataflair-web-services-pvt-ltd\/","https:\/\/www.youtube.com\/user\/DataFlairWS"]},{"@type":"Person","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/2c58ecb4f73a39f0ef993f1ddfcd7b89","name":"DataFlair Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","caption":"DataFlair Team"},"description":"The 
DataFlair Team provides industry-driven content on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. Our expert educators focus on delivering value-packed, easy-to-follow resources for tech enthusiasts and professionals.","url":"https:\/\/data-flair.training\/blogs\/author\/dfteam2\/"}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/2889","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/comments?post=2889"}],"version-history":[{"count":5,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/2889\/revisions"}],"predecessor-version":[{"id":42373,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/2889\/revisions\/42373"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media\/42372"}],"wp:attachment":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media?parent=2889"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/categories?post=2889"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/tags?post=2889"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}