

{"id":18844,"date":"2018-06-23T04:20:16","date_gmt":"2018-06-23T04:20:16","guid":{"rendered":"https:\/\/data-flair.training\/blogs\/?p=18844"},"modified":"2021-05-12T11:09:11","modified_gmt":"2021-05-12T05:39:11","slug":"pyspark-broadcast-and-accumulator","status":"publish","type":"post","link":"https:\/\/data-flair.training\/blogs\/pyspark-broadcast-and-accumulator\/","title":{"rendered":"PySpark Broadcast and Accumulator With Examples"},"content":{"rendered":"<p><span style=\"font-weight: 400\"><strong>Apache Spark<\/strong> uses shared variables for parallel processing. Shared variables are of two types, Broadcast &amp; Accumulator. In this PySpark article, &#8220;PySpark Broadcast and Accumulator&#8221;, we will learn the concepts of Broadcast &amp; Accumulator using <strong>PySpark<\/strong>.<\/span><\/p>\n<p>So, let&#8217;s start with PySpark Broadcast and Accumulator.<\/p>\n<h2><span style=\"font-weight: 400\">PySpark Broadcast and Accumulator<\/span><\/h2>\n<p><span style=\"font-weight: 400\">In parallel processing, when the driver sends a task to the executors on the cluster, a copy of each shared variable goes to each node of the cluster, so that the tasks can use it.<\/span><br \/>\n<span style=\"font-weight: 400\">Apache Spark supports two types of shared variables in PySpark:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Broadcast<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Accumulator<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400\">Let\u2019s learn PySpark Broadcast and Accumulator in detail:<\/span><\/p>\n<h2>Broadcast Variables \u2013 PySpark<\/h2>\n<p><span style=\"font-weight: 400\">Broadcast variables are used to keep a read-only copy of data on all nodes. The variable is cached on every machine rather than being shipped to the machines with each task. 
<\/span><\/p>\n<p><span style=\"font-weight: 400\">We can also use it to broadcast information to all the executors. The value can be of any type, from a primitive type to a hash map. For example:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Single value <\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400\">A single value is a value common to all the products.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Hashmap <\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400\">A hash map, on the other hand, serves as a lookup table or enables a map-side join.<\/span><\/p>\n<p><span style=\"font-weight: 400\">Moreover, broadcasting the dimension can improve performance considerably when a very large data set (fact) is joined with a smaller data set (dimension). In addition, these variables are immutable.<\/span><\/p>\n<p><span style=\"font-weight: 400\">For PySpark, the following code block shows the signature of the Broadcast class:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\">class pyspark.Broadcast (\r\n  sc = None,\r\n  value = None,\r\n  pickle_registry = None,\r\n  path = None\r\n)<\/pre>\n<p><span style=\"font-weight: 400\">The following example shows how to use a Broadcast variable. It has an attribute called value, which stores the data and returns the broadcast value:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\"># broadcast.py\r\nfrom pyspark import SparkContext\r\n\r\nsc = SparkContext(\"local\", \"Broadcast app\")\r\n# Broadcast a read-only list to all executors\r\nwords_new = sc.broadcast([\"shreyash\", \"rishabh\", \"Shubham\", \"Sagar\", \"Mehul\"])\r\n# The value attribute returns the broadcast data\r\ndata = words_new.value\r\nprint(\"Stored data -&gt; %s\" % (data))\r\nelem = words_new.value[2]\r\nprint(\"Printing a particular element in the list -&gt; %s\" % 
(elem))<\/pre>\n<p><strong>Command<\/strong><\/p>\n<pre class=\"EnlighterJSRAW\">$SPARK_HOME\/bin\/spark-submit broadcast.py<\/pre>\n<p><strong style=\"font-family: Verdana, Geneva, sans-serif\">Output<\/strong><br \/>\n<b>Stored data -&gt; [&#8216;shreyash&#8217;, &#8216;rishabh&#8217;, &#8216;Shubham&#8217;, &#8216;Sagar&#8217;, &#8216;Mehul&#8217;]<\/b><br \/>\n<b>Printing a particular element in the list -&gt; Shubham<\/b><\/p>\n<h2>Accumulators \u2013 PySpark<\/h2>\n<p><span style=\"font-weight: 400\">Accumulator variables are used to aggregate information through associative and commutative operations. For example, we can use an accumulator for a sum operation or for counters (as in MapReduce). <\/span><span style=\"font-weight: 400\">In addition, we can use Accumulators with any of the Spark APIs.<\/span><\/p>\n<p><span style=\"font-weight: 400\">For PySpark, the following code block shows the signature of the Accumulator class:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\">class pyspark.Accumulator(aid, value, accum_param)<\/pre>\n<p><span style=\"font-weight: 400\">Like a Broadcast variable, an Accumulator has an attribute called value, which stores the data and returns the accumulated value. 
However, the value is readable only in the driver program.<\/span><\/p>\n<p><span style=\"font-weight: 400\">In this example, multiple workers add to an accumulator variable, and the driver reads the accumulated value.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\"># accumulator.py\r\nfrom pyspark import SparkContext\r\n\r\nsc = SparkContext(\"local\", \"Accumulator app\")\r\n# Accumulator with an initial value of 1\r\nnum = sc.accumulator(1)\r\n\r\ndef f(x):\r\n  # Runs on the workers: add each element to the accumulator\r\n  global num\r\n  num += x\r\n\r\nrdd = sc.parallelize([2, 3, 4, 5])\r\nrdd.foreach(f)\r\n# Only the driver can read the accumulated value\r\nfinal = num.value\r\nprint(\"Accumulated value is -&gt; %i\" % (final))<\/pre>\n<p><strong>Command <\/strong><\/p>\n<pre class=\"EnlighterJSRAW\">$SPARK_HOME\/bin\/spark-submit accumulator.py<\/pre>\n<p><strong>Output<\/strong><br \/>\n<b>Accumulated value is -&gt; 15<\/b><br \/>\nSo, this was all about PySpark Broadcast and Accumulator. We hope you like our explanation.<\/p>\n<h2><span style=\"font-weight: 400\">Conclusion<\/span><\/h2>\n<p><span style=\"font-weight: 400\">Hence, we have seen the concepts of PySpark Broadcast and Accumulator along with their examples. We hope this helps you understand them well. Still, if you have any doubts, ask in the comment tab.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>As we know, Apache Spark uses shared variables, for parallel processing. Well, Shared Variables are of two types, Broadcast &amp; Accumulator. 
So, in this PySpark article, &#8220;PySpark Broadcast and Accumulator&#8221; we will learn the&#46;&#46;&#46;<\/p>\n","protected":false},"author":6,"featured_media":19431,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[44],"tags":[10288,10289,10292,10293,10294,10295],"class_list":["post-18844","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-pyspark","tag-pyspark-accumualtor","tag-pyspark-accumulator-example","tag-pyspark-broadcast-dataframe","tag-pyspark-broadcast-example","tag-pyspark-broadcast-rdd","tag-pyspark-broadcast-variable"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.8 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>PySpark Broadcast and Accumulator With Examples - DataFlair<\/title>\n<meta name=\"description\" content=\"PySpark Broadcast and Accumulator, PySpark Broadcast variable, PySpark Accumulator variable, example of PySpark Broadcast, PySpark Broadcast RDD\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/data-flair.training\/blogs\/pyspark-broadcast-and-accumulator\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"PySpark Broadcast and Accumulator With Examples - DataFlair\" \/>\n<meta property=\"og:description\" content=\"PySpark Broadcast and Accumulator, PySpark Broadcast variable, PySpark Accumulator variable, example of PySpark Broadcast, PySpark Broadcast RDD\" \/>\n<meta property=\"og:url\" content=\"https:\/\/data-flair.training\/blogs\/pyspark-broadcast-and-accumulator\/\" \/>\n<meta property=\"og:site_name\" content=\"DataFlair\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DataFlairWS\/\" \/>\n<meta 
property=\"article:published_time\" content=\"2018-06-23T04:20:16+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2021-05-12T05:39:11+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-Broadcast-Accumulator-01-1.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"DataFlair Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:site\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"DataFlair Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"PySpark Broadcast and Accumulator With Examples - DataFlair","description":"PySpark Broadcast and Accumulator, PySpark Broadcast variable, PySpark Accumulator variable, example of PySpark Broadcast, PySpark Broadcast RDD","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/data-flair.training\/blogs\/pyspark-broadcast-and-accumulator\/","og_locale":"en_US","og_type":"article","og_title":"PySpark Broadcast and Accumulator With Examples - DataFlair","og_description":"PySpark Broadcast and Accumulator, PySpark Broadcast variable, PySpark Accumulator variable, example of PySpark Broadcast, PySpark Broadcast RDD","og_url":"https:\/\/data-flair.training\/blogs\/pyspark-broadcast-and-accumulator\/","og_site_name":"DataFlair","article_publisher":"https:\/\/www.facebook.com\/DataFlairWS\/","article_published_time":"2018-06-23T04:20:16+00:00","article_modified_time":"2021-05-12T05:39:11+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-Broadcast-Accumulator-01-1.jpg","type":"image\/jpeg"}],"author":"DataFlair Team","twitter_card":"summary_large_image","twitter_creator":"@DataFlairWS","twitter_site":"@DataFlairWS","twitter_misc":{"Written by":"DataFlair Team","Est. 
reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/data-flair.training\/blogs\/pyspark-broadcast-and-accumulator\/#article","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/pyspark-broadcast-and-accumulator\/"},"author":{"name":"DataFlair Team","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/2c58ecb4f73a39f0ef993f1ddfcd7b89"},"headline":"PySpark Broadcast and Accumulator With Examples","datePublished":"2018-06-23T04:20:16+00:00","dateModified":"2021-05-12T05:39:11+00:00","mainEntityOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/pyspark-broadcast-and-accumulator\/"},"wordCount":466,"commentCount":2,"publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/pyspark-broadcast-and-accumulator\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-Broadcast-Accumulator-01-1.jpg","keywords":["PySpark Accumualtor","PySpark Accumulator example","PySpark Broadcast dataframe","PySpark Broadcast example","PySpark Broadcast RDD","PySpark Broadcast variable"],"articleSection":["PySpark Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/data-flair.training\/blogs\/pyspark-broadcast-and-accumulator\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/data-flair.training\/blogs\/pyspark-broadcast-and-accumulator\/","url":"https:\/\/data-flair.training\/blogs\/pyspark-broadcast-and-accumulator\/","name":"PySpark Broadcast and Accumulator With Examples - 
DataFlair","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/#website"},"primaryImageOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/pyspark-broadcast-and-accumulator\/#primaryimage"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/pyspark-broadcast-and-accumulator\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-Broadcast-Accumulator-01-1.jpg","datePublished":"2018-06-23T04:20:16+00:00","dateModified":"2021-05-12T05:39:11+00:00","description":"PySpark Broadcast and Accumulator, PySpark Broadcast variable, PySpark Accumulator variable, example of PySpark Broadcast, PySpark Broadcast RDD","breadcrumb":{"@id":"https:\/\/data-flair.training\/blogs\/pyspark-broadcast-and-accumulator\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/data-flair.training\/blogs\/pyspark-broadcast-and-accumulator\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/pyspark-broadcast-and-accumulator\/#primaryimage","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-Broadcast-Accumulator-01-1.jpg","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-Broadcast-Accumulator-01-1.jpg","width":1200,"height":628,"caption":"PySpark Broadcast and Accumulator With Examples"},{"@type":"BreadcrumbList","@id":"https:\/\/data-flair.training\/blogs\/pyspark-broadcast-and-accumulator\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Blog Home","item":"https:\/\/data-flair.training\/blogs\/"},{"@type":"ListItem","position":2,"name":"PySpark Tutorials","item":"https:\/\/data-flair.training\/blogs\/category\/pyspark\/"},{"@type":"ListItem","position":3,"name":"PySpark Broadcast and Accumulator With 
Examples"}]},{"@type":"WebSite","@id":"https:\/\/data-flair.training\/blogs\/#website","url":"https:\/\/data-flair.training\/blogs\/","name":"DataFlair","description":"Learn Today. Lead Tomorrow.","publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/data-flair.training\/blogs\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/data-flair.training\/blogs\/#organization","name":"DataFlair","url":"https:\/\/data-flair.training\/blogs\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","width":106,"height":48,"caption":"DataFlair"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DataFlairWS\/","https:\/\/x.com\/DataFlairWS","https:\/\/www.linkedin.com\/company\/dataflair-web-services-pvt-ltd\/","https:\/\/www.youtube.com\/user\/DataFlairWS"]},{"@type":"Person","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/2c58ecb4f73a39f0ef993f1ddfcd7b89","name":"DataFlair Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","caption":"DataFlair Team"},"description":"The 
DataFlair Team provides industry-driven content on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. Our expert educators focus on delivering value-packed, easy-to-follow resources for tech enthusiasts and professionals.","url":"https:\/\/data-flair.training\/blogs\/author\/dfteam2\/"}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/18844","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/comments?post=18844"}],"version-history":[{"count":6,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/18844\/revisions"}],"predecessor-version":[{"id":94290,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/18844\/revisions\/94290"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media\/19431"}],"wp:attachment":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media?parent=18844"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/categories?post=18844"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/tags?post=18844"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}