

{"id":18660,"date":"2018-06-22T04:12:07","date_gmt":"2018-06-22T04:12:07","guid":{"rendered":"https:\/\/data-flair.training\/blogs\/?p=18660"},"modified":"2021-05-12T11:09:12","modified_gmt":"2021-05-12T05:39:12","slug":"pyspark-sparkcontext","status":"publish","type":"post","link":"https:\/\/data-flair.training\/blogs\/pyspark-sparkcontext\/","title":{"rendered":"PySpark SparkContext With Examples and Parameters"},"content":{"rendered":"<p>In our last article, we looked at <strong>PySpark Pros and Cons<\/strong>. In this <strong>PySpark tutorial<\/strong>, we will learn the concept of the PySpark <strong>SparkContext<\/strong>. We will go through its parameters and then work through PySpark SparkContext examples to understand it in depth.<\/p>\n<p>So, let&#8217;s start with the PySpark SparkContext.<\/p>\n<h2>What is SparkContext in PySpark?<\/h2>\n<p><span style=\"font-weight: 400\">In simple words, SparkContext is the entry point to any <strong>Spark<\/strong> functionality. When we run a Spark application, a driver program starts; it contains the main function, and the SparkContext is initialized there. <\/span><\/p>\n<p><span style=\"font-weight: 400\">The driver program then runs the operations inside the executors on the worker nodes.<\/span><\/p>\n<p><span style=\"font-weight: 400\">In addition, SparkContext uses Py4J to launch a <strong>JVM<\/strong> and then creates a JavaSparkContext. 
However, the PySpark shell makes a SparkContext available as \u2018sc\u2019 by default, so creating a new SparkContext there won&#8217;t work.<\/span><\/p>\n<div id=\"attachment_18662\" style=\"width: 869px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-SparkContext.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-18662\" class=\"wp-image-18662 size-full\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-SparkContext.png\" alt=\"PySpark- SparkContext\" width=\"859\" height=\"1080\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-SparkContext.png 859w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-SparkContext-119x150.png 119w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-SparkContext-239x300.png 239w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-SparkContext-768x966.png 768w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-SparkContext-814x1024.png 814w\" sizes=\"auto, (max-width: 859px) 100vw, 859px\" \/><\/a><p id=\"caption-attachment-18662\" class=\"wp-caption-text\">What is PySpark SparkContext<\/p><\/div>\n<p><span style=\"font-weight: 400\">The following code block shows the signature of the PySpark SparkContext class and the parameters it can take:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\">class pyspark.SparkContext (\r\n  master = None,\r\n  appName = None,\r\n  sparkHome = None,\r\n  pyFiles = None,\r\n  environment = None,\r\n  batchSize = 0,\r\n  serializer = PickleSerializer(),\r\n  conf = None,\r\n  gateway = None,\r\n  jsc = None,\r\n  profiler_cls = &lt;class 'pyspark.profiler.BasicProfiler'&gt;\r\n)<\/pre>\n<h2><span style=\"font-weight: 400\">Parameters in PySpark 
SparkContext<\/span><\/h2>\n<p><span style=\"font-weight: 400\">Here are all the parameters of a SparkContext in PySpark:<\/span><\/p>\n<h3>a. master<\/h3>\n<p><span style=\"font-weight: 400\">The URL of the cluster to connect to.<\/span><\/p>\n<h3>b. appName<\/h3>\n<p><span style=\"font-weight: 400\">The name of your job.<\/span><\/p>\n<h3>c. sparkHome<\/h3>\n<p><span style=\"font-weight: 400\">The <strong>Spark installation<\/strong> directory.<\/span><\/p>\n<h3>d. pyFiles<\/h3>\n<p><span style=\"font-weight: 400\">The .zip or .py files to send to the cluster and add to the PYTHONPATH.<\/span><\/p>\n<h3>e. environment<\/h3>\n<p><span style=\"font-weight: 400\">Environment variables for the worker nodes.<\/span><\/p>\n<h3>f. batchSize<\/h3>\n<p><span style=\"font-weight: 400\">The number of <strong>Python objects<\/strong> represented as a single <strong>Java object<\/strong>. Set 1 to disable batching, 0 to automatically choose the batch size based on object sizes, or -1 to use an unlimited batch size.<\/span><\/p>\n<h3>g. serializer<\/h3>\n<p><span style=\"font-weight: 400\">The serializer used for RDDs.<\/span><\/p>\n<h3>h. conf<\/h3>\n<p><span style=\"font-weight: 400\">A SparkConf object used to set all the Spark properties.<\/span><\/p>\n<h3>i. gateway<\/h3>\n<p><span style=\"font-weight: 400\">Use an existing gateway and JVM; otherwise, a new JVM is initialized.<\/span><\/p>\n<h3>j. jsc<\/h3>\n<p><span style=\"font-weight: 400\">The JavaSparkContext instance.<\/span><\/p>\n<h3>k. profiler_cls<\/h3>\n<p><span style=\"font-weight: 400\">A custom Profiler class used to do profiling. 
The default is pyspark.profiler.BasicProfiler.<\/span><\/p>\n<p><span style=\"font-weight: 400\">Among the above parameters, master and appName are the most commonly used. The first two lines of a PySpark program usually look like this:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\">from pyspark import SparkContext\r\nsc = SparkContext(\"local\", \"First App1\")<\/pre>\n<h2><span style=\"font-weight: 400\">SparkContext Example \u2013 PySpark Shell<\/span><\/h2>\n<p><span style=\"font-weight: 400\">Now that we have covered the basics of the PySpark SparkContext, let\u2019s understand it with an example. Here we will count the number of lines containing the character &#8216;x&#8217; and the number of lines containing the character &#8216;y&#8217; in the README.md file. <\/span><\/p>\n<p><span style=\"font-weight: 400\">For instance, if a file has 5 lines and 3 of them contain the character &#8216;x&#8217;, the output will be \u2192 Lines with x: 3. The same is done for the character \u2018y\u2019.<\/span><\/p>\n<p><span style=\"font-weight: 400\">Note that in the following PySpark SparkContext example we do not create a SparkContext object, because Spark automatically creates a SparkContext object named sc when the PySpark shell starts.<br \/>\n<\/span><\/p>\n<p><span style=\"font-weight: 400\">If you try to create another SparkContext object, a ValueError is raised whose message begins &#8220;Cannot run multiple SparkContexts at once&#8221;.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\">&gt;&gt;&gt; logFile = \"file:\/\/\/home\/hadoop\/spark-2.1.0-bin-hadoop2.7\/README.md\"\r\n&gt;&gt;&gt; logData = sc.textFile(logFile).cache()\r\n&gt;&gt;&gt; numXs = logData.filter(lambda s: 'x' in s).count()\r\n&gt;&gt;&gt; numYs = logData.filter(lambda s: 'y' in s).count()\r\n&gt;&gt;&gt; print(\"Lines with x: %i, lines with y: %i\" % (numXs, numYs))\r\nLines with x: 62, lines with y: 30<\/pre>\n<h2><span 
style=\"font-weight: 400\">SparkContext Example &#8211; Python Program<\/span><\/h2>\n<p><span style=\"font-weight: 400\">Next, let\u2019s run the same example as a <strong>Python<\/strong> program. Create a <strong>Python file <\/strong>named firstapp1.py and enter the following code in it.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\"># firstapp1.py\r\nfrom pyspark import SparkContext\r\n\r\n# Path to the file to scan; adjust it to your own Spark installation\r\nlogFile = \"file:\/\/\/home\/hadoop\/spark-2.1.0-bin-hadoop2.7\/README.md\"\r\n# Create the SparkContext with a local master and the application name \"first app\"\r\nsc = SparkContext(\"local\", \"first app\")\r\n# Read the file as an RDD of lines and cache it, since it is scanned twice\r\nlogData = sc.textFile(logFile).cache()\r\n# Count the lines containing 'x' and the lines containing 'y'\r\nnumXs = logData.filter(lambda s: 'x' in s).count()\r\nnumYs = logData.filter(lambda s: 'y' in s).count()\r\nprint(\"Lines with x: %i, lines with y: %i\" % (numXs, numYs))<\/pre>\n<p>To run this Python file, execute the following command in the terminal.<br \/>\nIt gives the same <strong>output<\/strong> as above:<\/p>\n<pre class=\"EnlighterJSRAW\">$SPARK_HOME\/bin\/spark-submit firstapp1.py<\/pre>\n<p><b>Output: Lines with x: 62, lines with y: 30<\/b><\/p>\n<h2><span style=\"font-weight: 400\">Conclusion<\/span><\/h2>\n<p>Hence, we have seen the concept of the PySpark SparkContext, along with all of its parameters.<\/p>\n<p>We have also worked through PySpark SparkContext examples to understand it well. If you have any doubts, feel free to ask in the comments, and we will respond.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In our last article, we looked at PySpark Pros and Cons. In this PySpark tutorial, we will learn the concept of PySpark SparkContext. Moreover, we will see SparkContext parameters. 
Apart from its Parameters, we will&#46;&#46;&#46;<\/p>\n","protected":false},"author":6,"featured_media":19398,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[44],"tags":[10301,10322,10324,10328,10330,13155,13161],"class_list":["post-18660","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-pyspark","tag-pyspark-example","tag-pyspark-spark-context","tag-pyspark-sparkcontext-example","tag-pyspark-tutorial","tag-pyspark-sparkcontext","tag-sparkcontext","tag-sparkcontext-in-pyspark"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.8 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>PySpark SparkContext With Examples and Parameters - DataFlair<\/title>\n<meta name=\"description\" content=\"PySpark Sparkcontext tutorial, What is SparkContext, Parameters, SparkContext Example,PySpark Example, PySpark Shell, Python Program\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/data-flair.training\/blogs\/pyspark-sparkcontext\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"PySpark SparkContext With Examples and Parameters - DataFlair\" \/>\n<meta property=\"og:description\" content=\"PySpark Sparkcontext tutorial, What is SparkContext, Parameters, SparkContext Example,PySpark Example, PySpark Shell, Python Program\" \/>\n<meta property=\"og:url\" content=\"https:\/\/data-flair.training\/blogs\/pyspark-sparkcontext\/\" \/>\n<meta property=\"og:site_name\" content=\"DataFlair\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DataFlairWS\/\" \/>\n<meta property=\"article:published_time\" content=\"2018-06-22T04:12:07+00:00\" \/>\n<meta property=\"article:modified_time\" 
content=\"2021-05-12T05:39:12+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-SparkContext-01.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"DataFlair Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:site\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"DataFlair Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"PySpark SparkContext With Examples and Parameters - DataFlair","description":"PySpark Sparkcontext tutorial, What is SparkContext, Parameters, SparkContext Example,PySpark Example, PySpark Shell, Python Program","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/data-flair.training\/blogs\/pyspark-sparkcontext\/","og_locale":"en_US","og_type":"article","og_title":"PySpark SparkContext With Examples and Parameters - DataFlair","og_description":"PySpark Sparkcontext tutorial, What is SparkContext, Parameters, SparkContext Example,PySpark Example, PySpark Shell, Python 
Program","og_url":"https:\/\/data-flair.training\/blogs\/pyspark-sparkcontext\/","og_site_name":"DataFlair","article_publisher":"https:\/\/www.facebook.com\/DataFlairWS\/","article_published_time":"2018-06-22T04:12:07+00:00","article_modified_time":"2021-05-12T05:39:12+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-SparkContext-01.jpg","type":"image\/jpeg"}],"author":"DataFlair Team","twitter_card":"summary_large_image","twitter_creator":"@DataFlairWS","twitter_site":"@DataFlairWS","twitter_misc":{"Written by":"DataFlair Team","Est. reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/data-flair.training\/blogs\/pyspark-sparkcontext\/#article","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/pyspark-sparkcontext\/"},"author":{"name":"DataFlair Team","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/2c58ecb4f73a39f0ef993f1ddfcd7b89"},"headline":"PySpark SparkContext With Examples and Parameters","datePublished":"2018-06-22T04:12:07+00:00","dateModified":"2021-05-12T05:39:12+00:00","mainEntityOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/pyspark-sparkcontext\/"},"wordCount":646,"commentCount":0,"publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/pyspark-sparkcontext\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-SparkContext-01.jpg","keywords":["Pyspark example","PySpark Spark context","PySpark SparkContext example","PySpark tutorial","PySpark- SparkContext","SparkContext","SparkContext in PySpark"],"articleSection":["PySpark 
Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/data-flair.training\/blogs\/pyspark-sparkcontext\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/data-flair.training\/blogs\/pyspark-sparkcontext\/","url":"https:\/\/data-flair.training\/blogs\/pyspark-sparkcontext\/","name":"PySpark SparkContext With Examples and Parameters - DataFlair","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/#website"},"primaryImageOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/pyspark-sparkcontext\/#primaryimage"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/pyspark-sparkcontext\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-SparkContext-01.jpg","datePublished":"2018-06-22T04:12:07+00:00","dateModified":"2021-05-12T05:39:12+00:00","description":"PySpark Sparkcontext tutorial, What is SparkContext, Parameters, SparkContext Example,PySpark Example, PySpark Shell, Python Program","breadcrumb":{"@id":"https:\/\/data-flair.training\/blogs\/pyspark-sparkcontext\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/data-flair.training\/blogs\/pyspark-sparkcontext\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/pyspark-sparkcontext\/#primaryimage","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-SparkContext-01.jpg","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-SparkContext-01.jpg","width":1200,"height":628,"caption":"PySpark SparkContext With Examples and Parameters"},{"@type":"BreadcrumbList","@id":"https:\/\/data-flair.training\/blogs\/pyspark-sparkcontext\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Blog Home","item":"https:\/\/data-flair.training\/blogs\/"},{"@type":"ListItem","position":2,"name":"PySpark 
Tutorials","item":"https:\/\/data-flair.training\/blogs\/category\/pyspark\/"},{"@type":"ListItem","position":3,"name":"PySpark SparkContext With Examples and Parameters"}]},{"@type":"WebSite","@id":"https:\/\/data-flair.training\/blogs\/#website","url":"https:\/\/data-flair.training\/blogs\/","name":"DataFlair","description":"Learn Today. Lead Tomorrow.","publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/data-flair.training\/blogs\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/data-flair.training\/blogs\/#organization","name":"DataFlair","url":"https:\/\/data-flair.training\/blogs\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","width":106,"height":48,"caption":"DataFlair"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DataFlairWS\/","https:\/\/x.com\/DataFlairWS","https:\/\/www.linkedin.com\/company\/dataflair-web-services-pvt-ltd\/","https:\/\/www.youtube.com\/user\/DataFlairWS"]},{"@type":"Person","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/2c58ecb4f73a39f0ef993f1ddfcd7b89","name":"DataFlair 
Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","caption":"DataFlair Team"},"description":"The DataFlair Team provides industry-driven content on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. Our expert educators focus on delivering value-packed, easy-to-follow resources for tech enthusiasts and professionals.","url":"https:\/\/data-flair.training\/blogs\/author\/dfteam2\/"}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/18660","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/comments?post=18660"}],"version-history":[{"count":6,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/18660\/revisions"}],"predecessor-version":[{"id":94295,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/18660\/revisions\/94295"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media\/19398"}],"wp:attachment":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media?parent=18660"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/categories?post=18660"
},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/tags?post=18660"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}