How can we create RDD in Apache Spark?

Viewing 1 reply thread
  • Author
    • #6403
      DataFlair Team

      List the ways of creating RDDs in Spark.
      Describe how RDDs are created in Apache Spark.

    • #6404
      DataFlair Team

      A Resilient Distributed Dataset (RDD) is Spark’s core abstraction:
      an immutable (read-only), fault-tolerant, distributed collection of objects.
      Each RDD is divided into logical partitions,
      which may be computed on different nodes of the cluster.
      RDDs may contain any type of Python, Java, or Scala objects, including user-defined classes.

      There are three ways to create an RDD in Apache Spark:
      1. By parallelizing an existing collection of objects
      2. By loading an external dataset
      3. From an existing Apache Spark RDD (via transformations)

      1. Using parallelized collection

      RDDs are generally created by parallelizing an existing collection
      i.e. by taking an existing collection in the program and passing
      it to SparkContext’s parallelize() method.

      scala> val data = Array(1, 2, 3, 4, 5)
      scala> val dataRDD = sc.parallelize(data)
      scala> dataRDD.count
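      As a sketch of a detail not shown above: parallelize() also accepts an optional
      second argument, numSlices, that controls how many partitions the collection is
      split into (the value 4 here is just for illustration). Run in spark-shell,
      where sc (the SparkContext) is already available:

      ```scala
      // parallelize an existing collection, asking for 4 partitions
      val data = Seq(1, 2, 3, 4, 5)
      val dataRDD = sc.parallelize(data, numSlices = 4)
      // getNumPartitions reports how the RDD is split across the cluster
      println(dataRDD.getNumPartitions) // 4
      ```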

      2. External Datasets

      In Spark, a distributed dataset can be formed from any storage source supported by Hadoop,
      such as the local file system, HDFS, Cassandra, or HBase.

      val dataRDD = spark.read.textFile("F:/BigData/DataFlair/Spark/Posts.xml").rdd
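      Alternatively, the classic RDD API reads a text file directly through the
      SparkContext with sc.textFile(), which returns an RDD[String] with one element
      per line. A minimal sketch in spark-shell (the path and minPartitions value are
      illustrative):

      ```scala
      // sc.textFile(path, minPartitions) creates an RDD from a Hadoop-supported source
      val linesRDD = sc.textFile("F:/BigData/DataFlair/Spark/Posts.xml", minPartitions = 2)
      // count() triggers the read and returns the number of lines
      println(linesRDD.count())
      ```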

      3. Creating RDD from existing RDD

      A transformation is the way to create an RDD from an already existing RDD.

      A transformation acts as a function that takes an RDD as input and produces another, resultant RDD.
      The input RDD does not get changed, since RDDs are immutable.
      Some of the transformations applied on RDDs are: filter, map, flatMap.

      val dataRDD = spark.read.textFile("F:/Mritunjay/BigData/DataFlair/Spark/Posts.xml").rdd

      val resultRDD = dataRDD.filter(line => line.trim().startsWith("<row"))
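      The other transformations mentioned above can be sketched the same way, using a
      small parallelized collection instead of the XML file (the sample strings here
      are assumptions for illustration):

      ```scala
      val lines = sc.parallelize(Seq("hello world", "apache spark"))
      // map: exactly one output element per input element
      val lengths = lines.map(line => line.length)
      // flatMap: zero or more output elements per input element
      val tokens = lines.flatMap(line => line.split(" "))
      println(tokens.collect().mkString(", ")) // hello, world, apache, spark
      ```

      Like filter, both map and flatMap are lazy: nothing is computed until an action
      such as collect() or count() is called.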

      To study the above-mentioned methods in detail, follow the link: How to Create RDDs in Apache Spark?
