Explain sortByKey() operation

Viewing 3 reply threads
  • Author
    Posts
    • #6470
      DataFlair Team
      Moderator

      Explain sortByKey() operation

    • #6471
      DataFlair Team
      Moderator

      sortByKey() is a transformation.
      > It returns an RDD sorted by key.
      > Sorting can be (1) ascending, (2) descending, or (3) custom.
      From:
      http://data-flair.training/blogs/rdd-transformations-actions-apis-apache-spark/#212_SortByKey
      sortByKey() works with any key type K that has an implicit Ordering[K] in scope. Ordering objects already exist for all of the standard primitive types. Users can also define their own orderings for custom types, or override the default ordering. The implicit ordering in the closest scope is used.

      When called on a dataset of (K, V) pairs where K has an ordering, it returns a dataset of (K, V) pairs sorted by key in ascending or descending order, as specified by the boolean ascending argument.

      Example:

      val rdd1 = sc.parallelize(Seq(("India",91),("USA",1),("Brazil",55),("Greece",30),("China",86),("Sweden",46),("Turkey",90),("Nepal",977)))
      val rdd2 = rdd1.sortByKey()
      rdd2.collect()

      Output:
      Array[(String, Int)] = Array((Brazil,55), (China,86), (Greece,30), (India,91), (Nepal,977), (Sweden,46), (Turkey,90), (USA,1))

      Example (descending order, passing false for the ascending argument):

      val rdd1 = sc.parallelize(Seq(("India",91),("USA",1),("Brazil",55),("Greece",30),("China",86),("Sweden",46),("Turkey",90),("Nepal",977)))
      val rdd2 = rdd1.sortByKey(false)
      rdd2.collect()

      Output:
      Array[(String, Int)] = Array((USA,1), (Turkey,90), (Sweden,46), (Nepal,977), (India,91), (Greece,30), (China,86), (Brazil,55))
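
      For the custom sorting case, here is a minimal sketch of supplying your own implicit Ordering. The ordering itself (shorter keys first, ties broken alphabetically) and the names CustomKeyOrdering, byLengthThenName and sortCountries are made up for illustration; a local-mode SparkContext is assumed.

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}

      object CustomKeyOrdering {
        // Hypothetical ordering for illustration: shorter keys first, ties broken alphabetically.
        val byLengthThenName: Ordering[String] = new Ordering[String] {
          override def compare(a: String, b: String): Int = {
            val byLength = Integer.compare(a.length, b.length)
            if (byLength != 0) byLength else a.compareTo(b)
          }
        }

        def sortCountries(sc: SparkContext): Array[(String, Int)] = {
          // An implicit in local scope is picked up by sortByKey() instead of the default Ordering[String].
          implicit val keyOrdering: Ordering[String] = byLengthThenName
          val rdd1 = sc.parallelize(Seq(("India", 91), ("USA", 1), ("Brazil", 55), ("Greece", 30), ("China", 86)))
          rdd1.sortByKey().collect()
        }

        def main(args: Array[String]): Unit = {
          // Assumption: local mode with two threads, just for trying this out.
          val sc = new SparkContext(new SparkConf().setAppName("custom-key-ordering").setMaster("local[2]"))
          try sortCountries(sc).foreach(println)
          finally sc.stop()
        }
      }
      ```

      Passing false to sortByKey() with the same implicit in scope should simply reverse this custom order.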

    • #6472
      DataFlair Team
      Moderator

      One more point about sortByKey(): its result is a range-partitioned RDD.

      You can check this with rdd2.partitioner, which returns an Option[Partitioner]; for the result of sortByKey() it holds a RangePartitioner object.
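
      A small sketch of that check, assuming a local-mode SparkContext. The object and method names (PartitionerCheck, partitioners) are just for illustration.

      ```scala
      import org.apache.spark.{Partitioner, RangePartitioner, SparkConf, SparkContext}

      object PartitionerCheck {
        // Returns the partitioner of the input RDD and of the sorted RDD.
        def partitioners(sc: SparkContext): (Option[Partitioner], Option[Partitioner]) = {
          val rdd1 = sc.parallelize(Seq(("India", 91), ("USA", 1), ("Brazil", 55), ("Greece", 30)), 2)
          val rdd2 = rdd1.sortByKey()
          (rdd1.partitioner, rdd2.partitioner)
        }

        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("partitioner-check").setMaster("local[2]"))
          try {
            val (before, after) = partitioners(sc)
            println(s"before sortByKey: $before")
            // Check the concrete partitioner type rather than relying on its toString.
            println(s"after sortByKey is range-partitioned: ${after.exists(_.isInstanceOf[RangePartitioner[_, _]])}")
          } finally sc.stop()
        }
      }
      ```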

      A partitioner matters for operations that shuffle data by key, such as sortByKey(), join(), and cogroup(). When an RDD already has the partitioner an operation needs, Spark can skip reshuffling that data across the network. Different operations use different partitioners (hash-based, range-based, or custom).
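
      To illustrate the shuffle-avoidance point, here is a rough sketch using reduceByKey() rather than sortByKey(). It inspects the lineage to see whether a shuffle dependency was added. The names PrePartitioning and shuffleNeeded are hypothetical, and local mode is assumed.

      ```scala
      import org.apache.spark.{HashPartitioner, ShuffleDependency, SparkConf, SparkContext}

      object PrePartitioning {
        // Returns (shuffle needed without pre-partitioning, shuffle needed with pre-partitioning).
        def shuffleNeeded(sc: SparkContext): (Boolean, Boolean) = {
          val pairs = sc.parallelize(Seq(("India", 91), ("USA", 1), ("India", 9), ("USA", 2)), 2)
          val partitioner = new HashPartitioner(2)

          // No partitioner on the input, so reduceByKey has to shuffle.
          val plain = pairs.reduceByKey(partitioner, _ + _)

          // The input is already partitioned the way reduceByKey wants, so it can reuse that layout.
          val prePartitioned = pairs.partitionBy(partitioner)
          val reused = prePartitioned.reduceByKey(partitioner, _ + _)

          def shuffles(deps: Seq[org.apache.spark.Dependency[_]]): Boolean =
            deps.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])

          (shuffles(plain.dependencies), shuffles(reused.dependencies))
        }

        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("pre-partitioning").setMaster("local[2]"))
          try println(shuffleNeeded(sc))
          finally sc.stop()
        }
      }
      ```

      Note that partitionBy() itself shuffles once; the benefit shows up when the partitioned RDD is reused by several key-based operations.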

      To learn all the transformation operations, follow link: RDD Operations-Transformation & Action with Example

    • #6473
      DataFlair Team
      Moderator

      Calling sortByKey() starts a Spark job even though no action has been called. In general, the DAG is evaluated only when an action is invoked. sortByKey() is an exception because it builds a RangePartitioner, which samples the keys to work out the range boundaries for the output partitions. This does not mean the sorted result is computed at that point: sortByKey() still returns a pair RDD, and the sort itself runs only when an action is called. I have seen this job in the Spark web UI. It is often described as the only transformation that triggers a job, but it is not strictly the only one: sortBy() is implemented on top of sortByKey() and behaves the same way. For most other transformations, nothing is evaluated until an action is called.
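
      One way to observe this without the web UI is to count records with an accumulator inside a map() placed before the sort. This is only a sketch: it assumes Spark 2.x or later (for longAccumulator), local mode, and more than one partition. The names SortByKeySampling and recordsReadBeforeAction are made up.

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}

      object SortByKeySampling {
        // Returns the number of records read (after map, after sortByKey), with no action called.
        def recordsReadBeforeAction(sc: SparkContext): (Long, Long) = {
          val seen = sc.longAccumulator("records read")
          val mapped = sc
            .parallelize(Seq(("India", 91), ("USA", 1), ("Brazil", 55), ("Greece", 30)), 2)
            .map { kv => seen.add(1); kv }

          val afterMap = seen.sum      // map() alone is lazy, so nothing has been read yet

          val sorted = mapped.sortByKey()
          val afterSort = seen.sum     // the sampling job has already read the input

          (afterMap, afterSort)
        }

        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("sortbykey-sampling").setMaster("local[2]"))
          try println(recordsReadBeforeAction(sc))
          finally sc.stop()
        }
      }
      ```

      The second number being greater than zero shows that the input was read when sortByKey() was called. Calling collect() afterwards reads it again to do the actual sort.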
