PySpark Tutorial For Beginners – Learn PySpark in 7 Min.

1. Objective – PySpark Tutorial

Today, we will start our new journey with the PySpark tutorial. In this PySpark tutorial, we will see what PySpark is and how it works. Moreover, we will learn the key differences in the Python API, how to install PySpark, and PySpark configurations. Also, we will discuss the uses of PySpark. Along with this, we will see example programs in PySpark and compare Python vs Scala.

PySpark is the Spark Python API, which exposes the Spark programming model to Python. In other words, it is the tool that lets Python work with Spark. There is much more to learn about it, so in this article, we will discuss all of its aspects in depth.
So, let’s start PySpark Tutorial.


2. What is PySpark?

While learning PySpark, we should know that Spark itself is written in the Scala programming language. However, the Apache Spark community released a tool, PySpark, to support Python with Spark. With PySpark, we can also work with RDDs in the Python programming language. This is made possible by a library called Py4J.

In addition, it provides a PySpark shell. Its main goal is to link the Python API to the Spark core, and it also initializes the Spark context. Since Python has a rich library ecosystem, the majority of data scientists and analytics experts use Python nowadays. Hence, integrating Python with Spark is a boon to them.

3. PySpark Tutorial – Audience

  • Professionals who aspire to make a career in programming, and those who want to work with a real-time processing framework, can go for this PySpark tutorial.
  • Also, those who want to learn PySpark along with its several modules and submodules should follow this PySpark tutorial.

4. PySpark Tutorial – Prerequisites

We assume that, before starting this PySpark tutorial, readers already have basic knowledge of a programming language and a processing framework. It is also recommended to have sound knowledge of Spark, Hadoop, the Scala programming language, HDFS, and Python.

5. Key Differences in the Python API

There are a few key differences between the Python and Scala APIs:

  • Since Python is dynamically typed, RDDs can easily hold objects of multiple types.
  • PySpark does not yet support a few API calls, such as lookup and non-text input files; these may be added in future releases.

RDDs in PySpark support the same methods as their Scala counterparts, but they take Python functions and return Python collection types as results. Short functions can be passed to RDD methods using Python's lambda syntax:

logData = sc.textFile(logFile).cache()
errors = logData.filter(lambda line: "ERROR" in line)
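The same predicate can be tried locally on plain Python data before running it on an RDD. This is a minimal sketch; the sample log lines below are hypothetical stand-ins for logData:

```python
# Hypothetical sample lines standing in for logData
log_lines = [
    "INFO starting job",
    "ERROR failed to connect",
    "INFO job finished",
    "ERROR timeout waiting for executor",
]

# The same lambda-style predicate, applied locally
errors = [line for line in log_lines if "ERROR" in line]
print(errors)  # the two ERROR lines
```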

Functions defined in PySpark with the def keyword can be passed just as easily. This is very useful for longer functions that can't be expressed using a lambda:

def is_error(line):
    return "ERROR" in line
errors = logData.filter(is_error)

Moreover, functions can access objects in enclosing scopes; however, modifications to those objects made within RDD methods will not be propagated back:


error_keywords = ["Exception", "Error"]
def is_error(line):
    return any(keyword in line for keyword in error_keywords)
errors = logData.filter(is_error)
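The closure behavior can be checked locally as well. The sketch below uses hypothetical log lines; note the caveat from above in the comment:

```python
error_keywords = ["Exception", "Error"]

def is_error(line):
    # Closure: reads error_keywords from the enclosing scope.
    # Caveat: on a cluster, mutating error_keywords inside this
    # function would NOT be propagated back to the driver.
    return any(keyword in line for keyword in error_keywords)

# Hypothetical sample lines for local testing
lines = [
    "java.io.IOException: disk full",   # matches "Exception"
    "Error: connection reset",          # matches "Error"
    "job completed successfully",
]
flagged = [line for line in lines if is_error(line)]
print(flagged)
```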

PySpark fully supports interactive use; to launch an interactive shell, simply run ./bin/pyspark.

6. Installing and Configuring PySpark

PySpark requires Python 2.6 or higher. PySpark applications are executed using a standard CPython interpreter in order to support Python modules that use C extensions.

In addition, PySpark requires Python to be available on the system PATH and uses that interpreter to run programs by default. An alternate Python executable may be specified by setting the PYSPARK_PYTHON environment variable in conf/spark-env.sh (or .cmd on Windows).
All of PySpark's library dependencies, including Py4J, are bundled with PySpark.
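For instance, a minimal spark-env.sh entry might look like the following; the interpreter path here is only an assumption for illustration:

```shell
# conf/spark-env.sh
# Point PySpark at a specific Python interpreter (example path)
export PYSPARK_PYTHON=/usr/bin/python2.7
```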

Further, standalone PySpark applications must be run using the bin/pyspark script, which automatically configures the Java and Python environments using the settings in conf/spark-env.sh or .cmd. The script also automatically adds the PySpark package to the PYTHONPATH.


7. Interactive Use of PySpark

The bin/pyspark script launches a Python interpreter to run PySpark applications. To use PySpark interactively, first build Spark, then launch the script directly from the command line without any options:

$ sbt/sbt assembly
$ ./bin/pyspark

The Python shell can be used to explore data interactively, and it is a simple way to learn the API:

words = sc.textFile("/usr/share/dict/words")
words.filter(lambda w: w.startswith("spar")).take(5)
[u'spar', u'sparable', u'sparada', u'sparadrap', u'sparagrass']
help(pyspark) # Show all pyspark functions
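The startswith filter can likewise be tried on a plain Python list first; the word list below is a hypothetical stand-in for /usr/share/dict/words:

```python
# Hypothetical word list standing in for /usr/share/dict/words
words = ["spar", "sparable", "spark", "apple", "sparrow"]

# Mirror filter(...).take(5) with a comprehension and a slice
matches = [w for w in words if w.startswith("spar")][:5]
print(matches)
```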

By default, the bin/pyspark shell creates a SparkContext that runs applications locally on a single core. To connect to a non-local cluster, or to use multiple cores, set the MASTER environment variable.

For example, to use the bin/pyspark shell with a standalone Spark cluster:

$ MASTER=spark://IP:PORT ./bin/pyspark

Or, to use four cores on the local machine:

$ MASTER=local[4] ./bin/pyspark

8. PySpark Tutorial – IPython

We can also launch PySpark in IPython, an enhanced Python interpreter. PySpark works with IPython 1.0.0 and later. To use IPython, set the IPYTHON variable to 1 when running bin/pyspark:


$ IPYTHON=1 ./bin/pyspark

In addition, by setting IPYTHON_OPTS, we can customize the ipython command used to launch the shell.

For example, to launch the IPython Notebook with PyLab graphing support:

$ IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark

If we set the MASTER environment variable, IPython also works on a cluster or with multiple cores.

9. PySpark Tutorial – Standalone Programs

We can use PySpark from standalone Python scripts by creating a SparkContext in the script and running it with bin/pyspark. Let's see how to write a standalone application using the Python API (PySpark).


For example, here we create a simple Spark application, SimpleApp1.py:

"""SimpleApp1.py"""
from pyspark import SparkContext
logFile = "$YOUR_SPARK_HOME/README.md"  # Should be some file on your system
sc = SparkContext("local", "Simple App1")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

This program simply counts the number of lines containing 'a' and the number containing 'b' in a text file. Note that we need to replace $YOUR_SPARK_HOME with Spark's installation location. As with the Scala and Java examples, we use a SparkContext to create RDDs.
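The counting logic itself can be sketched in plain Python, independent of Spark; the sample lines below are hypothetical stand-ins for the README contents:

```python
# Hypothetical sample lines standing in for the README contents
lines = ["apache spark", "big data", "hello world"]

# Count lines containing 'a' and lines containing 'b',
# as the two filter/count chains above do
numAs = sum(1 for s in lines if 'a' in s)
numBs = sum(1 for s in lines if 'b' in s)
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
```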

Now, using the bin/pyspark script, we can run this application:

$ cd $SPARK_HOME
$ ./bin/pyspark SimpleApp1.py
...
Lines with a: 46, lines with b: 23

We can deploy code dependencies by listing them in the pyFiles option of the SparkContext constructor:

from pyspark import SparkContext
sc = SparkContext("local", "App Name", pyFiles=['MyFile.py', 'lib.zip', 'app.egg'])

All the files listed here will be added to the PYTHONPATH and shipped to remote worker machines. Code dependencies can also be added to an existing SparkContext with its addPyFile() method.

Configuration properties can also be set by passing a SparkConf object to SparkContext:


from pyspark import SparkConf, SparkContext
conf = (SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)

10. Comparison: Python vs Scala


  • Performance

Python- In terms of performance, Python is slower than Scala.

Scala- Scala is up to 10 times faster than Python.

  • Type Safety

Python- It is a dynamically typed language.

Scala- It is a statically typed language.

  • Ease of Use

Python- Comparatively, it is less verbose and easier to use.

Scala- It is a more verbose language.

  • Advanced Features

Python- It has rich data science tools and libraries for machine learning and natural language processing, which Scala lacks.

Scala- It has advanced features such as existential types, macros, and implicits, but it still lacks good visualization tools and local data transformations.

So, this was all in the PySpark tutorial. We hope you liked our explanation.

11. Conclusion: PySpark Tutorial

Hence, in this PySpark tutorial, we saw PySpark, the tool that supports Python with Spark. Moreover, we discussed the meaning of PySpark, its uses, and its installation and configuration. For more articles on PySpark, keep visiting DataFlair. Still, if you have any doubts about this PySpark tutorial, ask in the comments.

