This topic contains 1 reply, has 1 voice, and was last updated by  dfbdteam5 9 months, 4 weeks ago.

Viewing 2 posts - 1 through 2 (of 2 total)
  • Author
    Posts
  • #5101

    dfbdteam5
    Moderator

    Explain join() operation

    #5104

    dfbdteam5
    Moderator

    > join() is a transformation.
    > It is defined in org.apache.spark.rdd.PairRDDFunctions

    def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

    Returns an RDD containing all pairs of elements with matching keys in this RDD and other.
    Each pair of elements is returned as a (k, (v1, v2)) tuple, where (k, v1) is in this RDD and (k, v2) is in other. It performs a hash join across the cluster.

    From: http://data-flair.training/blogs/rdd-transformations-actions-apis-apache-spark/#213_Join

    It joins two datasets. When called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V, W)) pairs containing all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
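    The inner-join semantics described above (one output pair per matching key combination, unmatched keys dropped) can be sketched in plain Scala without a Spark cluster. Note that hashJoin here is a hypothetical helper written for illustration, not a Spark API:

    ```scala
    // Plain-Scala sketch of hash-join semantics: build a hash map on one side,
    // then probe it with the other. `hashJoin` is illustrative, not part of Spark.
    def hashJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] = {
      // Index the right side by key
      val rightByKey: Map[K, Seq[W]] =
        right.groupBy(_._1).map { case (k, pairs) => k -> pairs.map(_._2) }
      // For each left pair, emit one output pair per matching right value
      for {
        (k, v) <- left
        w <- rightByKey.getOrElse(k, Seq.empty)
      } yield (k, (v, w))
    }

    val left  = Seq(("m", 55), ("m", 56), ("e", 57))
    val right = Seq(("m", 60), ("m", 65), ("h", 63))
    println(hashJoin(left, right))
    // "m" appears twice on each side, so it yields 2 x 2 = 4 pairs;
    // "e" and "h" have no match on the other side and are dropped.
    ```

    This mirrors why Example 1 below produces four pairs each for "m" and "s" but none for "e" or "h".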

    Example 1:

    val rdd1 = sc.parallelize(Seq(("m",55),("m",56),("e",57),("e",58),("s",59),("s",54)))
    val rdd2 = sc.parallelize(Seq(("m",60),("m",65),("s",61),("s",62),("h",63),("h",64)))
    val joinrdd = rdd1.join(rdd2)
    joinrdd.collect

    Output:

    Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)), (m,(56,65)), (s,(59,61)), (s,(59,62)), (s,(54,61)), (s,(54,62)))

    Example 2:

    val myrdd1 = sc.parallelize(Seq((1,2),(3,4),(3,6)))
    val myrdd2 = sc.parallelize(Seq((3,9)))
    val myjoinedrdd = myrdd1.join(myrdd2)
    myjoinedrdd.collect

    Output:
    Array[(Int, (Int, Int))] = Array((3,(4,9)), (3,(6,9)))
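    For comparison with the inner joins above, the leftOuterJoin variant mentioned earlier keeps every left-side key and marks missing right-side matches with an Option. Here is a plain-Scala sketch of that behavior (an illustrative helper, not Spark's implementation):

    ```scala
    // Plain-Scala sketch of leftOuterJoin semantics: every left key survives;
    // unmatched keys carry None on the right side. Illustrative only.
    def leftOuterJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] = {
      val rightByKey: Map[K, Seq[W]] =
        right.groupBy(_._1).map { case (k, pairs) => k -> pairs.map(_._2) }
      left.flatMap { case (k, v) =>
        rightByKey.get(k) match {
          case Some(ws) => ws.map(w => (k, (v, Some(w)))) // one row per match
          case None     => Seq((k, (v, None)))            // keep unmatched left keys
        }
      }
    }

    // Same data as Example 2: key 1 has no match in the right dataset,
    // so an inner join drops it, but leftOuterJoin keeps it with None.
    println(leftOuterJoin(Seq((1, 2), (3, 4), (3, 6)), Seq((3, 9))))
    ```

    rightOuterJoin is symmetric (the Option wraps the left value), and fullOuterJoin wraps both sides in Option.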
