September 20, 2018 at 10:26 pm · #6470 · DataFlair Team (Spectator)
Explain sortByKey() operation
September 20, 2018 at 10:27 pm · #6471 · DataFlair Team (Spectator)
> sortByKey() is a transformation.
> It returns an RDD sorted by Key.
> Sorting can be ascending, descending, or based on a custom Ordering.
From:
http://data-flair.training/blogs/rdd-transformations-actions-apis-apache-spark/#212_SortByKey
It works with any key type K that has an implicit Ordering[K] in scope. Ordering objects already exist for all of the standard primitive types. Users can also define their own orderings for custom types, or to override the default ordering. The implicit ordering that is in the closest scope will be used.
When called on a dataset of (K, V) pairs where K is Ordered, sortByKey() returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the ascending argument.
Example:
val rdd1 = sc.parallelize(Seq(("India",91),("USA",1),("Brazil",55),("Greece",30),("China",86),("Sweden",46),("Turkey",90),("Nepal",977)))
val rdd2 = rdd1.sortByKey()
rdd2.collect()
Output:
Array[(String, Int)] = Array((Brazil,55), (China,86), (Greece,30), (India,91), (Nepal,977), (Sweden,46), (Turkey,90), (USA,1))
val rdd1 = sc.parallelize(Seq(("India",91),("USA",1),("Brazil",55),("Greece",30),("China",86),("Sweden",46),("Turkey",90),("Nepal",977)))
val rdd2 = rdd1.sortByKey(false)
rdd2.collect()
Output:
Array[(String, Int)] = Array((USA,1), (Turkey,90), (Sweden,46), (Nepal,977), (India,91), (Greece,30), (China,86), (Brazil,55))
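The mechanism behind the ascending/descending choice is the implicit Ordering[K] described above. A minimal plain-Scala sketch of that mechanism (no Spark needed, since a local sortBy resolves Ordering[K] the same way sortByKey() does):

```scala
// The same country/code pairs as the RDD example above, as a local Seq.
val pairs = Seq(("India", 91), ("USA", 1), ("Brazil", 55), ("Greece", 30))

// Default: the implicit Ordering[String] in scope sorts keys ascending,
// which is what sortByKey() does for K = String.
val asc = pairs.sortBy(_._1)

// Passing a reversed Ordering sorts descending — the local analogue of
// sortByKey(false), or of overriding the implicit ordering for a custom type.
val desc = pairs.sortBy(_._1)(Ordering[String].reverse)

println(asc)   // ascending by key
println(desc)  // descending by key
```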
September 20, 2018 at 10:27 pm · #6472 · DataFlair Team (Spectator)
One more point on the sortByKey() operation: its result is a range-partitioned RDD. You can check this with rdd2.partitioner, which returns an Option[Partitioner] holding a RangePartitioner object.
Using the partitioner concept, Spark can avoid shuffling data across the network when later operations reuse the same partitioning. Partitioning matters for operations such as sortByKey(), join(), cogroup(), etc. Different operations use different partitioners (hash-based, range-based, or custom).
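To make "range-partitioned" concrete, here is a simplified plain-Scala model (not Spark's actual RangePartitioner; the boundary values are purely illustrative): each key goes to the first partition whose upper boundary is at or above it, so every partition holds a contiguous slice of the sorted key space.

```scala
// Hypothetical upper boundaries for 3 partitions:
// partition 0: keys <= "G"; partition 1: "G" < keys <= "S"; partition 2: keys > "S"
val boundaries = Seq("G", "S")

// Assign a key to the first partition whose boundary is >= the key;
// keys past the last boundary fall into the final partition.
def rangePartition(key: String): Int = {
  val idx = boundaries.indexWhere(b => key <= b)
  if (idx >= 0) idx else boundaries.length
}

// The keys from the example above land in contiguous, sorted ranges —
// which is why collect() after sortByKey() comes back globally ordered.
val keys = Seq("Brazil", "China", "Greece", "India", "Nepal", "Sweden", "Turkey", "USA")
val assignment = keys.map(k => k -> rangePartition(k))
println(assignment)
```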
To learn all the transformation operations, follow the link: RDD Operations-Transformation & Action with Example
September 20, 2018 at 10:27 pm · #6473 · DataFlair Team (Spectator)
The DAG starts evaluating when we call sortByKey(), even though we have not called an action. In general, the DAG is evaluated only when we call an action, but sortByKey() eagerly starts a job in order to compute the range boundaries for its partitions. Don't assume you get the final result just by calling sortByKey(); you can observe this extra job in the Spark web UI. The return type of sortByKey() is an RDD of key-value pairs (a pair RDD). Of all the transformations available in Spark, I think this is the only one that triggers a job; for the remaining transformations, the DAG won't be evaluated until we call an action.
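The reason for that eager job: before it can range-partition, sortByKey() must know the key distribution in order to pick partition boundaries, so Spark samples the keys up front, and that sampling runs as a job. A rough plain-Scala model of the idea (not Spark's actual sampling code; here the "sample" is simply all the keys):

```scala
val keys = Seq("India", "USA", "Brazil", "Greece", "China", "Sweden", "Turkey", "Nepal")
val numPartitions = 3

// Sort the sampled keys, then take evenly spaced elements as the
// upper boundaries separating the numPartitions key ranges.
val sample = keys.sorted
val boundaries = (1 until numPartitions).map(i => sample(i * sample.length / numPartitions))
println(boundaries)  // two boundary keys splitting the key space into 3 ranges
```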