Resilient Distributed Dataset (RDD) is Spark's core abstraction.
It is an immutable (read-only) distributed collection of objects.
The data in each RDD is divided into logical partitions,
which may be computed on different nodes of the cluster.
RDDs may contain any type of Python, Java, or Scala objects, including user-defined classes.
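For example, an RDD can hold instances of a user-defined class. A minimal sketch in the Scala shell (the Person case class is purely illustrative, and it uses the parallelize() method described below):
scala> case class Person(name: String, age: Int)
scala> val peopleRDD = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))
scala> peopleRDD.map(_.name).collect()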
There are three ways to create an RDD in Apache Spark:
1. By parallelizing (distributing) an existing collection of objects
2. By loading an external dataset
3. From existing Apache Spark RDDs
1. Using a parallelized collection
RDDs are generally created by parallelizing an existing collection
i.e. by taking an existing collection in the program and passing
it to SparkContext’s parallelize() method.
scala> val data = Array(1, 2, 3, 4, 5)
scala> val dataRDD = sc.parallelize(data)
scala> dataRDD.count
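parallelize() also accepts an optional second argument specifying the number of partitions; as a minimal sketch (the partition count of 4 is chosen arbitrarily for illustration):
scala> val dataRDD4 = sc.parallelize(data, 4)
scala> dataRDD4.getNumPartitions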
2. External Datasets
In Spark, a distributed dataset can be formed from any data source supported by Hadoop.
val dataRDD = spark.read.textFile("F:/BigData/DataFlair/Spark/Posts.xml").rdd
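The same can be done with SparkContext's textFile() method, which accepts any Hadoop-supported URI; the paths below are placeholders used only for illustration:
val localRDD = sc.textFile("file:///path/to/data.txt")              // local file system
val hdfsRDD = sc.textFile("hdfs://namenode:9000/path/to/data.txt")  // HDFS
localRDD.count()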
3. Creating an RDD from an existing RDD
A transformation is the way to create an RDD from an already existing RDD.
A transformation acts as a function that takes in an RDD and produces another, resultant RDD.
The input RDD is not changed, because RDDs are immutable.
Some of the transformations applied on an RDD are filter, map, and flatMap, as shown in the sketches below.
val dataRDD = spark.read.textFile("F:/Mritunjay/BigData/DataFlair/Spark/Posts.xml").rdd
val resultRDD = dataRDD.filter(line => line.trim().startsWith("<row"))
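Likewise, map and flatMap each produce a new RDD from an existing one; a minimal sketch continuing from the resultRDD above:
val lengthRDD = resultRDD.map(line => line.length)         // map: exactly one output element per input element
val wordRDD = resultRDD.flatMap(line => line.split(" "))   // flatMap: zero or more output elements per input element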
To study the above-mentioned methods in detail, follow the link: How to Create RDDs in Apache Spark?