September 20, 2018 at 10:26 pm #6470
Explain sortByKey() operation
September 20, 2018 at 10:27 pm #6471
> sortByKey() is a transformation.
> It returns an RDD sorted by Key.
> Sorting can be (1) ascending, (2) descending, or (3) based on a custom ordering.
sortByKey() works with any key type K that has an implicit Ordering[K] in scope. Ordering objects already exist for all of the standard primitive types. Users can also define their own orderings for custom types, or override the default ordering. The implicit ordering in the closest scope is the one that gets used.
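To make the "closest scope" rule concrete, here is a minimal sketch that overrides the default String ordering with a case-insensitive one. The names CaseInsensitiveOrdering, CustomOrderingSketch and run are made up for illustration, and the local[2] master is only an assumption so the example can run on a single machine with Spark on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical ordering for this sketch: compares keys while ignoring case.
object CaseInsensitiveOrdering extends Ordering[String] {
  def compare(a: String, b: String): Int =
    a.toLowerCase.compareTo(b.toLowerCase)
}

object CustomOrderingSketch {
  // Sorts a few sample pairs by key using the case-insensitive ordering.
  def run(): Seq[(String, Int)] = {
    // Assumption: a local master, so no cluster is needed.
    val conf = new SparkConf().setMaster("local[2]").setAppName("custom-ordering-sketch")
    val sc = SparkContext.getOrCreate(conf)
    try {
      val rdd = sc.parallelize(Seq(("india", 91), ("USA", 1), ("brazil", 55), ("Greece", 30)))

      // A local implicit is closer in scope than the built-in Ordering[String],
      // so sortByKey() uses it instead of the default.
      implicit val keyOrdering: Ordering[String] = CaseInsensitiveOrdering

      rdd.sortByKey().collect().toSeq
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    // Keys come back as: brazil, Greece, india, USA
    println(run().mkString(", "))
}
```

With the default ordering the upper-case keys would sort first (Greece, USA, brazil, india), so the difference in the output shows which ordering was picked up.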
When called on a dataset of (K, V) pairs where K has an Ordering, sortByKey() returns a dataset of (K, V) pairs sorted by key in ascending or descending order, as specified by the boolean ascending argument.
val rdd1 = sc.parallelize(Seq(("India",91),("USA",1),("Brazil",55),("Greece",30),("China",86),("Sweden",46),("Turkey",90),("Nepal",977)))
val rdd2 = rdd1.sortByKey()
rdd2.collect()
Array[(String, Int)] = Array((Brazil,55), (China,86), (Greece,30), (India,91), (Nepal,977), (Sweden,46), (Turkey,90), (USA,1))
val rdd1 = sc.parallelize(Seq(("India",91),("USA",1),("Brazil",55),("Greece",30),("China",86),("Sweden",46),("Turkey",90),("Nepal",977)))
val rdd2 = rdd1.sortByKey(false)
rdd2.collect()
Array[(String, Int)] = Array((USA,1), (Turkey,90), (Sweden,46), (Nepal,977), (India,91), (Greece,30), (China,86), (Brazil,55))
September 20, 2018 at 10:27 pm #6472
One more point on sortByKey(): its result is a range-partitioned RDD.
You can check this with rdd2.partitioner, which returns an Option[Partitioner]; here it holds a RangePartitioner.
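A quick way to verify the partitioner is sketched below. The object and method names are hypothetical, and the local[2] master is an assumption so the check can run without a cluster.

```scala
import org.apache.spark.{RangePartitioner, SparkConf, SparkContext}

object PartitionerCheckSketch {
  // Returns true when the input RDD has no partitioner and the sorted RDD
  // reports a RangePartitioner.
  def sortedIsRangePartitioned(): Boolean = {
    // Assumption: a local master, so no cluster is needed.
    val conf = new SparkConf().setMaster("local[2]").setAppName("partitioner-check-sketch")
    val sc = SparkContext.getOrCreate(conf)
    try {
      val rdd1 = sc.parallelize(
        Seq(("India", 91), ("USA", 1), ("Brazil", 55), ("Greece", 30)), numSlices = 2)
      val rdd2 = rdd1.sortByKey()

      // An RDD built with parallelize carries no partitioner (None);
      // the sorted RDD carries Some(RangePartitioner).
      rdd1.partitioner.isEmpty &&
        rdd2.partitioner.exists(_.isInstanceOf[RangePartitioner[_, _]])
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(sortedIsRangePartitioned())
}
```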
Knowing how an RDD is partitioned lets Spark avoid shuffling data across the network. This matters for key-based operations such as sortByKey(), join(), and cogroup(): when the inputs already share the same partitioner, Spark does not need to shuffle them again. Different operations use different partitioners (hash-based, range-based, or custom).
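As an illustration of reusing a partitioner, the sketch below partitions two RDDs with the same HashPartitioner before joining them. The sample data, the partition count of 4, and the names CoPartitionedJoinSketch and run are assumptions made for this example. Note that partitionBy itself shuffles each input once; the point is that the join can then reuse that layout.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CoPartitionedJoinSketch {
  // Returns (whether the join kept the shared partitioner, the joined rows sorted by key).
  def run(): (Boolean, Seq[(String, (Int, String))]) = {
    // Assumption: a local master, so no cluster is needed.
    val conf = new SparkConf().setMaster("local[2]").setAppName("co-partitioned-join-sketch")
    val sc = SparkContext.getOrCreate(conf)
    try {
      val part = new HashPartitioner(4)

      // Both inputs are laid out with the same partitioner up front.
      val codes = sc.parallelize(Seq(("India", 91), ("USA", 1), ("Brazil", 55)))
        .partitionBy(part)
      val capitals = sc.parallelize(
        Seq(("India", "New Delhi"), ("USA", "Washington"), ("Brazil", "Brasilia")))
        .partitionBy(part)

      // Matching keys already sit in matching partitions, so the join
      // can reuse the existing layout instead of shuffling again.
      val joined = codes.join(capitals)
      val keptPartitioner = joined.partitioner == Some(part)

      (keptPartitioner, joined.collect().sortBy(_._1).toSeq)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (keptPartitioner, rows) = run()
    println(s"join kept the shared partitioner: $keptPartitioner")
    rows.foreach(println)
  }
}
```

If the same RDD is joined repeatedly, partitioning it once and caching it is the usual way to get the benefit.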
To learn all the transformation operations, follow link: RDD Operations-Transformation & Action with Example
September 20, 2018 at 10:27 pm #6473
The DAG starts being evaluated as soon as we call sortByKey, even if we never call an action. In general, the DAG is evaluated only when an action is called. sortByKey is different because it builds a RangePartitioner, which has to sample the keys to work out the range boundaries for the output partitions, and that sampling runs as a job. This does not mean we get the sorted result when we call sortByKey; the sort itself still waits for an action. I have seen this job show up in the Web UI. The return type of sortByKey is still an RDD, namely a pair RDD. It is one of the few transformations that triggers evaluation this way (sortBy, which is built on sortByKey, behaves the same). For most other transformations, the DAG is not evaluated until we call an action.
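One way to observe this without the Web UI is to count records with an accumulator placed before the sort. This is only a sketch: the names EagerSortByKeySketch and recordsTouched are made up, the local[2] master and the 4 input partitions are assumptions, and the exact count after sortByKey is deliberately not asserted because it depends on how the sampling runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object EagerSortByKeySketch {
  // Returns (records read before sortByKey, records read right after calling it).
  // No action is ever called on the RDDs.
  def recordsTouched(): (Long, Long) = {
    // Assumption: a local master, so no cluster is needed.
    val conf = new SparkConf().setMaster("local[2]").setAppName("eager-sortbykey-sketch")
    val sc = SparkContext.getOrCreate(conf)
    try {
      val touched = sc.longAccumulator("records touched")

      val pairs = sc.parallelize(
        Seq(("India", 91), ("USA", 1), ("Brazil", 55), ("Greece", 30),
            ("China", 86), ("Sweden", 46), ("Turkey", 90), ("Nepal", 977)),
        numSlices = 4
      ).map { kv => touched.add(1); kv }

      // map() is lazy, so nothing has been read yet.
      val before = touched.sum

      // No action here, yet building the range partitioner samples the keys,
      // which runs a job over the upstream map().
      val sorted = pairs.sortByKey()

      (before, touched.sum)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (before, after) = recordsTouched()
    println(s"records read before sortByKey: $before, after: $after")
  }
}
```

A non-zero second value shows that the upstream map already ran, even though collect() or any other action was never called.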