> join() is a transformation.
> It’s in class org.apache.spark.rdd.PairRDDFunctions.
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
Return an RDD containing all pairs of elements with matching keys in this and other.
Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in this and (k, v2) is in other. Performs a hash join across the cluster.
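Because the join shuffles data across the cluster, PairRDDFunctions also offers join overloads that take a partition count or an explicit Partitioner. A minimal spark-shell sketch (the tiny RDDs here are made up for illustration):
import org.apache.spark.HashPartitioner

val left = sc.parallelize(Seq(("a", 1), ("b", 2)))
val right = sc.parallelize(Seq(("a", 10), ("b", 20)))

// Overload taking the number of partitions for the shuffled result.
val byCount = left.join(right, 4)

// Overload taking an explicit Partitioner.
val byPartitioner = left.join(right, new HashPartitioner(4))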
From: http://data-flair.training/blogs/rdd-transformations-actions-apis-apache-spark/#213_Join
It joins two datasets. When called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin; a short sketch of these follows Example 2 below.
Example 1:
val rdd1 = sc.parallelize(Seq(("m",55),("m",56),("e",57),("e",58),("s",59),("s",54)))
val rdd2 = sc.parallelize(Seq(("m",60),("m",65),("s",61),("s",62),("h",63),("h",64)))
val joinrdd = rdd1.join(rdd2)
joinrdd.collect
Output:
Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)), (m,(56,65)), (s,(59,61)), (s,(59,62)), (s,(54,61)), (s,(54,62)))
Note that the keys "e" and "h" are absent: join is an inner join, so keys that appear in only one RDD are dropped.
Example 2:
val myrdd1 = sc.parallelize(Seq((1,2),(3,4),(3,6)))
val myrdd2 = sc.parallelize(Seq((3,9)))
val myjoinedrdd = myrdd1.join(myrdd2)
myjoinedrdd.collect
Output:
Array[(Int, (Int, Int))] = Array((3,(4,9)), (3,(6,9)))
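For the outer joins mentioned earlier, the missing side is wrapped in an Option. A minimal sketch reusing rdd1 and rdd2 from Example 1 (element order in the collected array may vary across runs):
val leftjoined = rdd1.leftOuterJoin(rdd2)
leftjoined.collect
// Keys only in rdd1 ("e") survive with None on the right, e.g.:
// Array[(String, (Int, Option[Int]))] = Array((e,(57,None)), (e,(58,None)), (m,(55,Some(60))), ...)

val fulljoined = rdd1.fullOuterJoin(rdd2)
// fullOuterJoin keeps keys from both sides: RDD[(String, (Option[Int], Option[Int]))],
// so the rdd2-only key "h" also appears, as (h,(None,Some(63))) and (h,(None,Some(64))).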