How to create RDD in Apache Spark?

    • #5294
      DataFlair Team
      Spectator

      How to create RDD in Apache Spark?

    • #5296
      DataFlair Team
      Spectator

      There are three ways to create an RDD:
      (1) by parallelizing a collection in the driver program,
      (2) by loading an external dataset, or
      (3) by transforming an already existing RDD.


      Create an RDD by parallelizing a collection:

      Parallelized collections are created by calling SparkContext's parallelize() method on an existing collection in the driver program.

      val data = Array(1, 2, 3, 4, 5)
      val rdd = sc.parallelize(data)

      OR 

      val myList = sc.parallelize((1 to 1000).toList, 5) // 5 is the number of partitions
      (If we do not specify the number of partitions, Spark uses the default from spark.default.parallelism.)
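
      As a quick check, an RDD's getNumPartitions method reports how many partitions were created. A minimal sketch, assuming a live SparkContext named sc as in spark-shell:

      println(myList.getNumPartitions) // prints 5
      println(sc.parallelize(1 to 1000).getNumPartitions) // default: spark.default.parallelism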

      Create an RDD by loading an external dataset:

      In Spark, a distributed dataset can be formed from any data source supported by Hadoop, including the local file system, HDFS, Cassandra, HBase, etc. Here the data is loaded from an external dataset. To create a text-file RDD, we can use SparkContext's textFile method, which takes the URL of the file and reads it as a collection of lines. The URL can be a local path on the machine or an hdfs://, s3n://, etc. URI. Alternatively, use SparkSession.read to access an instance of DataFrameReader, which supports many file formats:
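
      For example, a minimal sketch of the SparkContext.textFile route (the path below is just a placeholder):

      val lines = sc.textFile("hdfs://path/of/text/file")
      println(lines.count()) // number of lines in the file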

      i) csv (String path)

      import org.apache.spark.sql.SparkSession

      object DataFormat {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder.appName("AvgAnsTime").master("local").getOrCreate()
          // each line of the CSV becomes a Row in the resulting RDD[Row]
          val dataRDD = spark.read.csv("path/of/csv/file").rdd
        }
      }
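
      If the CSV file has a header row, DataFrameReader's option method can tell Spark to use it for column names (a small variant of the snippet above; the path is still a placeholder):

      val dataRDD = spark.read.option("header", "true").csv("path/of/csv/file").rdd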

      ii) json (String path)

      val dataRDD = spark.read.json("path/of/json/file").rdd

      iii) textFile (String path)

      val dataRDD = spark.read.textFile("path/of/text/file").rdd


      Creating an RDD from an existing RDD:

      A transformation produces a new RDD from an existing one (RDDs themselves are immutable), so applying a transformation is the way to create an RDD from an already existing RDD.

      val words = spark.sparkContext.parallelize(Seq("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"))
      // pair each word with its first character
      val wordPair = words.map(w => (w.charAt(0), w))
      wordPair.foreach(println)
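
      The same pattern works with any transformation. For instance, a minimal sketch deriving a filtered RDD from the words RDD above:

      val shortWords = words.filter(_.length <= 3) // keep words of at most 3 letters
      shortWords.collect().foreach(println) // the, fox, the, dog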

      For a detailed description of creating RDDs, read How to create RDD in Apache Spark.
