Explain join() operation

    • #5101
      DataFlair Team
      Spectator

      Explain join() operation

    • #5104
      DataFlair Team
      Spectator

      > join() is a transformation.
      > It is defined in org.apache.spark.rdd.PairRDDFunctions

      def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

      Return an RDD containing all pairs of elements with matching keys in this and other.
      Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in this and (k, v2) is in other. Performs a hash join across the cluster.
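      To see what the hash join computes, here is a minimal local sketch in plain Scala (not Spark): build a hash table from one side keyed by K, then probe it with each pair from the other side. The object and method names are ours, purely for illustration; Spark does the same thing per partition, after shuffling both RDDs by key.

      ```scala
      // Local model of join() semantics: a hash join on key-value pairs.
      // `left` plays the role of `this`, `right` of `other` (names are ours).
      object HashJoinSketch {
        def hashJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] = {
          // Build a hash table from the right side: key -> all its values.
          val table: Map[K, Seq[W]] =
            right.groupBy(_._1).map { case (k, ps) => k -> ps.map(_._2) }
          // Probe with each left pair; emit one output pair per matching value,
          // so keys repeated on both sides produce a cross product of matches.
          for {
            (k, v) <- left
            w <- table.getOrElse(k, Seq.empty)
          } yield (k, (v, w))
        }

        def main(args: Array[String]): Unit = {
          val out = hashJoin(Seq(("m", 55), ("e", 57)), Seq(("m", 60), ("m", 65)))
          println(out) // List((m,(55,60)), (m,(55,65))) -- "e" has no match, so it is dropped
        }
      }
      ```

      Note that keys present on only one side ("e" above) disappear entirely; that is what makes this an inner join.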

      From:
      http://data-flair.training/blogs/rdd-transformations-actions-apis-apache-spark/#213_Join

      It joins two datasets. When called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

      Example1:

      val rdd1 = sc.parallelize(Seq(("m",55),("m",56),("e",57),("e",58),("s",59),("s",54)))
      val rdd2 = sc.parallelize(Seq(("m",60),("m",65),("s",61),("s",62),("h",63),("h",64)))
      val joinrdd = rdd1.join(rdd2)
      joinrdd.collect

      Output:

      Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)), (m,(56,65)), (s,(59,61)), (s,(59,62)), (s,(54,61)), (s,(54,62)))

      Example2:

      val myrdd1 = sc.parallelize(Seq((1,2),(3,4),(3,6)))
      val myrdd2 = sc.parallelize(Seq((3,9)))
      val myjoinedrdd = myrdd1.join(myrdd2)
      myjoinedrdd.collect

      Output:
      Array[(Int, (Int, Int))] = Array((3,(4,9)), (3,(6,9)))
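      The outer-join variants mentioned above keep unmatched keys instead of dropping them, wrapping the possibly-missing side in an Option. The following is a hedged local sketch of leftOuterJoin semantics in plain Scala (our own names, not the Spark API), mirroring the (K, (V, Option[W])) shape Spark returns:

      ```scala
      // Local model of leftOuterJoin() semantics: every left pair survives;
      // the right value is Some(w) on a match, None otherwise.
      object OuterJoinSketch {
        def leftOuterJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] = {
          val table: Map[K, Seq[W]] =
            right.groupBy(_._1).map { case (k, ps) => k -> ps.map(_._2) }
          left.flatMap { case (k, v) =>
            table.get(k) match {
              case Some(ws) => ws.map(w => (k, (v, Some(w): Option[W])))
              case None     => Seq((k, (v, None: Option[W])))
            }
          }
        }

        def main(args: Array[String]): Unit = {
          val out = leftOuterJoin(Seq(("m", 55), ("e", 57)), Seq(("m", 60)))
          println(out) // List((m,(55,Some(60))), (e,(57,None)))
        }
      }
      ```

      rightOuterJoin is the mirror image (Option on the left side), and fullOuterJoin keeps keys from both sides, with Option on each.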
