This topic contains 3 replies, has 1 voice, and was last updated by  dfbdteam5 8 months ago.

Viewing 4 posts - 1 through 4 (of 4 total)
  • Author
    Posts
  • #6470

    dfbdteam5
    Moderator

    Explain sortByKey() operation

    #6471

    dfbdteam5
    Moderator

    sortByKey() is a transformation.
    > It returns an RDD sorted by key.
    > Sorting can be done in (1) ascending, (2) descending, or (3) custom order.
    From:
    http://data-flair.training/blogs/rdd-transformations-actions-apis-apache-spark/#212_SortByKey
    It works with any key type K that has an implicit Ordering[K] in scope. Ordering objects already exist for all of the standard primitive types. Users can also define their own orderings for custom types, or override the default ordering; the implicit ordering in the closest scope is used.

    When called on a dataset of (K, V) pairs where K is ordered, it returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified by the ascending argument.
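    The implicit-Ordering resolution described above is the same mechanism that drives plain Scala collection sorts, so it can be demonstrated without a cluster. A minimal sketch (plain Scala, not Spark code):

```scala
// sortByKey() resolves an implicit Ordering[K] for the key type K.
// Plain Scala's sortBy uses the identical mechanism.
val pairs = Seq(("India", 91), ("USA", 1), ("Brazil", 55))

// Default: an Ordering[String] already exists for standard types.
val ascending = pairs.sortBy(_._1)

// Passing an explicit Ordering overrides the default, e.g. reverse order.
val descending = pairs.sortBy(_._1)(Ordering[String].reverse)
```

    The closest Ordering in scope wins, which is why defining your own implicit Ordering changes how sortByKey() (or sortBy) arranges the keys.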

    Example :

    val rdd1 = sc.parallelize(Seq(("India",91),("USA",1),("Brazil",55),("Greece",30),("China",86),("Sweden",46),("Turkey",90),("Nepal",977)))
    val rdd2 = rdd1.sortByKey()
    rdd2.collect()

    Output:
    Array[(String, Int)] = Array((Brazil,55), (China,86), (Greece,30), (India,91), (Nepal,977), (Sweden,46), (Turkey,90), (USA,1))

    val rdd1 = sc.parallelize(Seq(("India",91),("USA",1),("Brazil",55),("Greece",30),("China",86),("Sweden",46),("Turkey",90),("Nepal",977)))
    val rdd2 = rdd1.sortByKey(false)
    rdd2.collect()

    Output:
    Array[(String, Int)] = Array((USA,1), (Turkey,90), (Sweden,46), (Nepal,977), (India,91), (Greece,30), (China,86), (Brazil,55))
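    The third option, custom sorting, means supplying your own implicit Ordering for the key type. A sketch of the effect using plain Scala's sortBy (sortByKey() would pick up the same implicit if it were the closest one in scope):

```scala
// Custom ordering: sort keys by length rather than lexicographically.
implicit val byLength: Ordering[String] = Ordering.by((s: String) => s.length)

val pairs = Seq(("India", 91), ("USA", 1), ("Brazil", 55), ("Greece", 30))
val sorted = pairs.sortBy(_._1)  // picks up byLength implicitly
```

    With this ordering, "USA" (3 letters) comes first, and equal-length keys keep their original relative order because the sort is stable.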

    #6472

    dfbdteam5
    Moderator

    One more point on the sortByKey() operation: its result is a range-partitioned RDD.

    You can check this with rdd2.partitioner, which returns an Option[Partitioner]; for a sorted RDD it holds a RangePartitioner.

    Using the partitioner concept, we can avoid shuffling data across the network. This matters when performing operations like sortByKey(), join(), cogroup(), etc. Different operations use different partitioners (hash-based, range-based, or a custom partitioner).
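    The idea behind range partitioning can be sketched without Spark: each key goes to the partition whose key range contains it, so partitions themselves end up in sorted order. This toy uses fixed, hypothetical boundaries, whereas Spark's RangePartitioner chooses boundaries by sampling the RDD:

```scala
// Toy range partitioner: assign a key to the partition whose range
// contains it, given sorted boundary keys.
def rangePartition(key: String, boundaries: Seq[String]): Int = {
  val i = boundaries.indexWhere(b => key.compareTo(b) <= 0)
  if (i == -1) boundaries.length else i  // past the last boundary -> last partition
}

val boundaries = Seq("G", "N")  // 3 partitions: keys <= "G", ("G","N"], > "N"
val keys = Seq("Brazil", "India", "USA")
val partitions = keys.map(k => rangePartition(k, boundaries))
```

    Because keys in partition 0 are all smaller than keys in partition 1, and so on, concatenating the sorted partitions yields a globally sorted result without a further shuffle.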

    To learn all the transformation operations, follow link: RDD Operations-Transformation & Action with Example

    #6473

    dfbdteam5
    Moderator

    The DAG starts evaluating when we call sortByKey(), even though we have not called an action. In general, the DAG is evaluated only when an action is called, but sortByKey() triggers a partial evaluation in order to compute the partition boundaries for the range partitioner (it samples the data to find them). Don't assume we get the result when we call the sortByKey() API, though: it still returns an RDD (a pair RDD), and the full sort runs only when an action is called. You can see this evaluation in the Spark web UI. Of all the transformations available in Spark, I think this is the only API that triggers DAG evaluation this way; for the remaining transformations, the DAG is not evaluated until we call an action.

