Explain the term paired RDD in Apache Spark.
Introduction
A paired RDD is a distributed collection of data stored as key-value pairs. It is a special form of Resilient Distributed Dataset (RDD), so it supports all the standard RDD operations plus additional operations that work on the key of each pair. Spark provides many transformations for paired RDDs, which are very useful for use cases that require sorting, grouping, joining, or aggregating values by key.
Commonly used operations on paired RDDs are: groupByKey(), reduceByKey(), countByKey(), join(), etc.
Creation of a paired RDD:
val pRDD: RDD[(String, Int)] = sc.textFile("path_of_your_file")
  .flatMap(line => line.split(" "))
  .map(word => (word, word.length))
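To illustrate the operations listed earlier, here is a short sketch that applies reduceByKey() and countByKey() to a word-length paired RDD. It assumes a running SparkContext `sc` (e.g. from spark-shell) and uses a small in-memory collection instead of a file, so the data shown is purely illustrative:

```scala
import org.apache.spark.rdd.RDD

// Assumes an existing SparkContext `sc` (e.g. provided by spark-shell).
// Build a small paired RDD directly, so the example does not depend on a file.
val words: RDD[(String, Int)] = sc.parallelize(Seq("spark", "rdd", "spark"))
  .map(word => (word, word.length))

// reduceByKey: merges the values of each key using the given function.
val totalLengths: RDD[(String, Int)] = words.reduceByKey(_ + _)

// countByKey: an action that returns a Map on the driver with the
// number of pairs for each key (here, "spark" appears twice).
val counts: scala.collection.Map[String, Long] = words.countByKey()
```

reduceByKey() is usually preferred over groupByKey() followed by a manual reduce, because it combines values on each partition before shuffling.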
We can also use the substring method: if we have a file where each line contains an id and some other details, we can create a paired RDD with the id as the key and the other details as the value.
val pRDD2: RDD[(Int, String)] = sc.textFile("path_of_your_file")
  .keyBy(line => line.substring(1, 5).trim.toInt)
  .mapValues(line => line.substring(10, 30).trim)
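Once two paired RDDs share the same key type, they can be combined with join(). A minimal sketch, again assuming a SparkContext `sc` and using made-up ids and values for illustration:

```scala
import org.apache.spark.rdd.RDD

// Assumes an existing SparkContext `sc`.
val names: RDD[(Int, String)] = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val scores: RDD[(Int, Int)] = sc.parallelize(Seq((1, 90), (2, 75), (3, 60)))

// join is an inner join on the key: only ids present in both RDDs
// appear in the result, paired as (key, (leftValue, rightValue)).
val joined: RDD[(Int, (String, Int))] = names.join(scores)
// id 3 is dropped because it has no matching entry in `names`.
```

For outer-join semantics, paired RDDs also provide leftOuterJoin(), rightOuterJoin(), and fullOuterJoin().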