

{"id":18577,"date":"2018-06-21T04:00:45","date_gmt":"2018-06-20T22:30:45","guid":{"rendered":"https:\/\/data-flair.training\/blogs\/?p=18577"},"modified":"2025-04-14T17:48:53","modified_gmt":"2025-04-14T12:18:53","slug":"pyspark-tutorial","status":"publish","type":"post","link":"https:\/\/data-flair.training\/blogs\/pyspark-tutorial\/","title":{"rendered":"PySpark Tutorial &#8211; Why PySpark is Gaining Hype among Data Scientists?"},"content":{"rendered":"<p>In this PySpark tutorial, we will take you on a journey through various aspects of the PySpark framework. By the end of this journey, you will have gained expertise in PySpark.<\/p>\n<p>The PySpark framework is rapidly gaining popularity in the data science field. Spark is a very useful tool for data scientists to translate research code into production code, and PySpark makes this process easily accessible.<\/p>\n<p>Without wasting any time, let&#8217;s start with our PySpark tutorial.<\/p>\n<h3><span style=\"font-weight: 400;\">What is PySpark?<\/span><\/h3>\n<p>PySpark, released by the Apache Spark community, is a Python API for Spark. With PySpark, you can easily work with RDDs in Python. The Py4J library makes this possible.<\/p>\n<p>The PySpark framework has several features:<\/p>\n<p>1. Faster processing than other frameworks.<\/p>\n<p>2. Real-time computations and low latency due to in-memory processing.<\/p>\n<p>3. Polyglot, which means it is compatible with several languages such as Java, Python, Scala, and R.<\/p>\n<p>4. Powerful caching and efficient disk persistence.<\/p>\n<p>5. 
Deployment can be performed on Hadoop through YARN.<\/p>\n<h3><span style=\"font-weight: 400;\">Audience for PySpark Tutorial<\/span><\/h3>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Professionals who aspire to a career in programming, as well as those who want to perform real-time processing with a framework, can benefit from this PySpark tutorial.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Those who want to learn PySpark along with its modules and submodules should also follow this PySpark tutorial.<\/span><\/li>\n<\/ul>\n<h3><span style=\"font-weight: 400;\">Prerequisites to PySpark<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">We assume that readers already have basic knowledge of programming languages and frameworks before learning PySpark. A sound knowledge of Spark, Hadoop, the Scala programming language, HDFS, and Python is also recommended.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Factors about PySpark API<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">There are a few key differences between the Python and Scala APIs, which we will discuss in this PySpark tutorial:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Since Python is dynamically typed, PySpark RDDs can easily hold objects of multiple types.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">PySpark doesn&#8217;t support some API calls, such as lookup, or non-text input files. However, these features will be added in future releases.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">In PySpark, RDDs support the same methods as their Scala counterparts, but they take <strong>Python functions<\/strong> and return Python collection types. 
Short functions can be passed to RDD methods using Python\u2019s lambda syntax:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\">logData = sc.textFile(logFile).cache()\r\nerrors = logData.filter(lambda line: \"ERROR\" in line)<\/pre>\n<p><span style=\"font-weight: 400;\">Functions defined with the def keyword can also be passed. This is useful for longer functions that cannot be expressed using lambda:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\">def is_error(line):\r\n    return \"ERROR\" in line\r\nerrors = logData.filter(is_error)<\/pre>\n<p>Functions can access objects in enclosing scopes; however, modifications made to those objects within RDD methods will not be propagated back:<\/p>\n<pre class=\"EnlighterJSRAW\">error_keywords = [\"Exception\", \"Error\"]\r\ndef is_error(line):\r\n    return any(keyword in line for keyword in error_keywords)\r\nerrors = logData.filter(is_error)<\/pre>\n<p>PySpark also fully supports interactive use. To launch an interactive shell, simply run .\/bin\/pyspark.<\/p>\n<h3><span style=\"font-weight: 400;\">Installing and Configuring PySpark<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">PySpark requires Python 2.6 or higher. PySpark applications are executed using a standard CPython interpreter in order to support <strong>Python modules<\/strong> that use C extensions.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In addition, PySpark requires python to be available on the system PATH and uses it to run programs by default. 
However, an alternate Python executable may be specified by setting the PYSPARK_PYTHON environment variable in conf\/spark-env.sh (or .cmd on Windows).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">All of PySpark\u2019s library dependencies, including Py4J, are bundled with PySpark.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Standalone PySpark applications must be run using the bin\/pyspark script, which automatically configures the Java and Python environment using the settings in conf\/spark-env.sh or .cmd. The script also automatically adds the bin\/pyspark package to the PYTHONPATH.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Interactive Use of PySpark<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">The bin\/pyspark script launches a Python interpreter that is configured to run PySpark applications. To use PySpark interactively, first build Spark, then launch it directly from the command line without any options:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\">$ sbt\/sbt assembly\r\n$ .\/bin\/pyspark<\/pre>\n<p><span style=\"font-weight: 400;\">The Python shell can be used to explore data interactively and is a simple way to learn the API:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\">words = sc.textFile(\"\/usr\/share\/dict\/words\")\r\nwords.filter(lambda w: w.startswith(\"spar\")).take(5)\r\n[u'spar', u'sparable', u'sparada', u'sparadrap', u'sparagrass']\r\nhelp(pyspark) # Show all pyspark functions<\/pre>\n<p><span style=\"font-weight: 400;\">By default, the bin\/pyspark shell creates a SparkContext that runs applications locally on a single core. To connect to a non-local cluster, or to use multiple cores, set the MASTER environment variable. 
<\/span><\/p>\n<p><strong>For example:<\/strong><\/p>\n<p><span style=\"font-weight: 400;\">To use the bin\/pyspark shell with a standalone <strong>Spark cluster<\/strong>:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\">$ MASTER=spark:\/\/IP:PORT .\/bin\/pyspark<\/pre>\n<p><span style=\"font-weight: 400;\">Or, to use four cores on the local machine:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">$ MASTER=local[4] .\/bin\/pyspark<\/pre>\n<h3><span style=\"font-weight: 400;\">IPython<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">PySpark can also be launched in IPython, an enhanced Python interpreter. PySpark works with IPython 1.0.0 and later. To use IPython, set the IPYTHON variable to 1 when running bin\/pyspark:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\">$ IPYTHON=1 .\/bin\/pyspark<\/pre>\n<p><span style=\"font-weight: 400;\">In addition, we can customize the ipython command by setting IPYTHON_OPTS.<\/span><\/p>\n<p><strong>For example:<\/strong><\/p>\n<p><span style=\"font-weight: 400;\">To launch the IPython Notebook with PyLab graphing support:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\">$ IPYTHON_OPTS=\"notebook --pylab inline\" .\/bin\/pyspark<\/pre>\n<p><span style=\"font-weight: 400;\">IPython also works on a cluster or on multiple cores if the MASTER environment variable is set.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Standalone Programs<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">PySpark can also be used from standalone <strong>Python scripts<\/strong> by creating a SparkContext in the script and running the script using bin\/pyspark. 
<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Let us see how to write a standalone application using the Python API (PySpark).<\/span><\/p>\n<p><strong>For example:<\/strong><\/p>\n<p><span style=\"font-weight: 400;\">Here we create a simple Spark application, SimpleApp1.py:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\">\"\"\"SimpleApp1.py\"\"\"\r\nfrom pyspark import SparkContext\r\nlogFile = \"$YOUR_SPARK_HOME\/README.md\"  # Should be some file on your system\r\nsc = SparkContext(\"local\", \"Simple App1\")\r\nlogData = sc.textFile(logFile).cache()\r\nnumAs = logData.filter(lambda s: 'a' in s).count()\r\nnumBs = logData.filter(lambda s: 'b' in s).count()\r\nprint(\"Lines with a: %i, lines with b: %i\" % (numAs, numBs))<\/pre>\n<p><span style=\"font-weight: 400;\">This program counts the number of lines containing \u2018a\u2019 and the number of lines containing \u2018b\u2019 in a text file. Note that you need to replace $YOUR_SPARK_HOME with <strong>Spark&#8217;s installation<\/strong> location. As with the Scala and Java examples, we use a SparkContext to create RDDs.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Now, we can run this application using the bin\/pyspark script:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\">$ cd $SPARK_HOME\r\n$ .\/bin\/pyspark SimpleApp1.py\r\n...\r\nLines with a: 46, lines with b: 23<\/pre>\n<p><span style=\"font-weight: 400;\">We can deploy code dependencies by listing them in the pyFiles option of the SparkContext constructor:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\">from pyspark import SparkContext\r\nsc = SparkContext(\"local\", \"App Name\", pyFiles=['MyFile.py', 'lib.zip', 'app.egg'])<\/pre>\n<p><span style=\"font-weight: 400;\">All the files listed here are added to the PYTHONPATH and shipped to remote worker machines. 
Code dependencies can also be added to an existing SparkContext using its addPyFile() method.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Configuration properties can be set by passing a SparkConf object to SparkContext:<\/span><\/p>\n<pre class=\"EnlighterJSRAW\">from pyspark import SparkConf, SparkContext\r\nconf = (SparkConf()\r\n        .setMaster(\"local\")\r\n        .setAppName(\"My app\")\r\n        .set(\"spark.executor.memory\", \"1g\"))\r\nsc = SparkContext(conf=conf)<\/pre>\n<h3>Comparison: Python vs Scala<\/h3>\n<p><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/Python-vs-Scala-01.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-19366 size-full\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/Python-vs-Scala-01.jpg\" alt=\"PySpark Tutorial\" width=\"1200\" height=\"628\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/Python-vs-Scala-01.jpg 1200w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/Python-vs-Scala-01-150x79.jpg 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/Python-vs-Scala-01-300x157.jpg 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/Python-vs-Scala-01-768x402.jpg 768w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/Python-vs-Scala-01-1024x536.jpg 1024w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><\/a><\/p>\n<ul>\n<li><strong>Performance<\/strong><\/li>\n<\/ul>\n<p><strong>Python-<\/strong> In terms of performance, it is slower than Scala.<\/p>\n<p><strong>Scala-<\/strong> Scala can be up to 10 times faster than Python.<\/p>\n<ul>\n<li><strong>Type Safety<\/strong><\/li>\n<\/ul>\n<p><strong>Python-<\/strong> It is a dynamically typed language.<\/p>\n<p><strong>Scala-<\/strong> It is a statically typed 
language.<\/p>\n<ul>\n<li><strong>Ease of Use<\/strong><\/li>\n<\/ul>\n<p><strong>Python-<\/strong> Comparatively, it is less verbose and easier to use.<\/p>\n<p><strong>Scala-<\/strong> It is a highly verbose language.<\/p>\n<ul>\n<li><strong>Advanced Features<\/strong><\/li>\n<\/ul>\n<p><strong>Python-<\/strong> For <strong>machine learning<\/strong> and natural language processing, Python offers a wealth of data science tools and libraries that Scala lacks.<\/p>\n<p><strong>Scala-<\/strong> It offers advanced language features such as existential types, macros, and implicits, but it still lacks tools for visualization and local data transformation.<\/p>\n<p>PySpark has many uses, such as interactive data analysis, data manipulation, structured streaming, and ETL processing.<\/p>\n<h3>Summary<\/h3>\n<p><span style=\"font-weight: 400;\">Python has a rich set of libraries, which is why the majority of data scientists and analytics experts use it today. Integrating Python with Spark is therefore a boon to them.<\/span><\/p>\n<p>We covered the PySpark framework, which brings Python support to Spark. We also discussed what PySpark is, its uses, and its installation and configuration.<\/p>\n<p>For more articles on PySpark, keep visiting DataFlair. If you have any doubts about this PySpark tutorial, ask in the comment section.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>With this PySpark tutorial, we will take you to a beautiful journey which will involve various aspects of PySpark framework. 
And, we assure you that by the end of this journey, you will gain&#46;&#46;&#46;<\/p>\n","protected":false},"author":6,"featured_media":19376,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[44],"tags":[19711,19710,10328,19709,10925],"class_list":["post-18577","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-pyspark","tag-pyspark-installation","tag-pyspark-prerequisites","tag-pyspark-tutorial","tag-pyspark-uses","tag-python-vs-scala"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>PySpark Tutorial - Why PySpark is Gaining Hype among Data Scientists? - DataFlair<\/title>\n<meta name=\"description\" content=\"PySpark tutorial - Learn PySpark API factors, PySpark uses, installation, IPython, Standalone programs, Python vs Scala. Learn for free!\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/data-flair.training\/blogs\/pyspark-tutorial\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"PySpark Tutorial - Why PySpark is Gaining Hype among Data Scientists? - DataFlair\" \/>\n<meta property=\"og:description\" content=\"PySpark tutorial - Learn PySpark API factors, PySpark uses, installation, IPython, Standalone programs, Python vs Scala. 
Learn for free!\" \/>\n<meta property=\"og:url\" content=\"https:\/\/data-flair.training\/blogs\/pyspark-tutorial\/\" \/>\n<meta property=\"og:site_name\" content=\"DataFlair\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DataFlairWS\/\" \/>\n<meta property=\"article:published_time\" content=\"2018-06-20T22:30:45+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-14T12:18:53+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-Tutorial-For-Beginners-01-1.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"DataFlair Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:site\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"DataFlair Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 minutes\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"PySpark Tutorial - Why PySpark is Gaining Hype among Data Scientists? - DataFlair","description":"PySpark tutorial - Learn PySpark API factors, PySpark uses, installation, IPython, Standalone programs, Python vs Scala. Learn for free!","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/data-flair.training\/blogs\/pyspark-tutorial\/","og_locale":"en_US","og_type":"article","og_title":"PySpark Tutorial - Why PySpark is Gaining Hype among Data Scientists? 
- DataFlair","og_description":"PySpark tutorial - Learn PySpark API factors, PySpark uses, installation, IPython, Standalone programs, Python vs Scala. Learn for free!","og_url":"https:\/\/data-flair.training\/blogs\/pyspark-tutorial\/","og_site_name":"DataFlair","article_publisher":"https:\/\/www.facebook.com\/DataFlairWS\/","article_published_time":"2018-06-20T22:30:45+00:00","article_modified_time":"2025-04-14T12:18:53+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-Tutorial-For-Beginners-01-1.jpg","type":"image\/jpeg"}],"author":"DataFlair Team","twitter_card":"summary_large_image","twitter_creator":"@DataFlairWS","twitter_site":"@DataFlairWS","twitter_misc":{"Written by":"DataFlair Team","Est. reading time":"6 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/data-flair.training\/blogs\/pyspark-tutorial\/#article","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/pyspark-tutorial\/"},"author":{"name":"DataFlair Team","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/2c58ecb4f73a39f0ef993f1ddfcd7b89"},"headline":"PySpark Tutorial &#8211; Why PySpark is Gaining Hype among Data Scientists?","datePublished":"2018-06-20T22:30:45+00:00","dateModified":"2025-04-14T12:18:53+00:00","mainEntityOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/pyspark-tutorial\/"},"wordCount":1168,"commentCount":7,"publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/pyspark-tutorial\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-Tutorial-For-Beginners-01-1.jpg","keywords":["PySpark Installation","PySpark Prerequisites","PySpark tutorial","PySpark Uses","python vs scala"],"articleSection":["PySpark 
Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/data-flair.training\/blogs\/pyspark-tutorial\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/data-flair.training\/blogs\/pyspark-tutorial\/","url":"https:\/\/data-flair.training\/blogs\/pyspark-tutorial\/","name":"PySpark Tutorial - Why PySpark is Gaining Hype among Data Scientists? - DataFlair","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/#website"},"primaryImageOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/pyspark-tutorial\/#primaryimage"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/pyspark-tutorial\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-Tutorial-For-Beginners-01-1.jpg","datePublished":"2018-06-20T22:30:45+00:00","dateModified":"2025-04-14T12:18:53+00:00","description":"PySpark tutorial - Learn PySpark API factors, PySpark uses, installation, IPython, Standalone programs, Python vs Scala. 
Learn for free!","breadcrumb":{"@id":"https:\/\/data-flair.training\/blogs\/pyspark-tutorial\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/data-flair.training\/blogs\/pyspark-tutorial\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/pyspark-tutorial\/#primaryimage","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-Tutorial-For-Beginners-01-1.jpg","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/06\/PySpark-Tutorial-For-Beginners-01-1.jpg","width":1200,"height":628,"caption":"PySpark Tutorial For Beginners - Learn PySpark"},{"@type":"BreadcrumbList","@id":"https:\/\/data-flair.training\/blogs\/pyspark-tutorial\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Blog Home","item":"https:\/\/data-flair.training\/blogs\/"},{"@type":"ListItem","position":2,"name":"PySpark Tutorials","item":"https:\/\/data-flair.training\/blogs\/category\/pyspark\/"},{"@type":"ListItem","position":3,"name":"PySpark Tutorial &#8211; Why PySpark is Gaining Hype among Data Scientists?"}]},{"@type":"WebSite","@id":"https:\/\/data-flair.training\/blogs\/#website","url":"https:\/\/data-flair.training\/blogs\/","name":"DataFlair","description":"Learn Today. 
Lead Tomorrow.","publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/data-flair.training\/blogs\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/data-flair.training\/blogs\/#organization","name":"DataFlair","url":"https:\/\/data-flair.training\/blogs\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","width":106,"height":48,"caption":"DataFlair"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DataFlairWS\/","https:\/\/x.com\/DataFlairWS","https:\/\/www.linkedin.com\/company\/dataflair-web-services-pvt-ltd\/","https:\/\/www.youtube.com\/user\/DataFlairWS"]},{"@type":"Person","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/2c58ecb4f73a39f0ef993f1ddfcd7b89","name":"DataFlair Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","caption":"DataFlair Team"},"description":"The DataFlair Team provides industry-driven content on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. 
Our expert educators focus on delivering value-packed, easy-to-follow resources for tech enthusiasts and professionals.","url":"https:\/\/data-flair.training\/blogs\/author\/dfteam2\/"}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/18577","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/comments?post=18577"}],"version-history":[{"count":12,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/18577\/revisions"}],"predecessor-version":[{"id":144848,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/18577\/revisions\/144848"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media\/19376"}],"wp:attachment":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media?parent=18577"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/categories?post=18577"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/tags?post=18577"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}