By Default, how many partitions are created in RDD in Apache Spark?
September 20, 2018 at 12:57 pm #4957 - DataFlair Team (Spectator)
- By default, Spark creates one partition for each block of the file (for HDFS).
- The default HDFS block size is 64 MB (Hadoop version 1) / 128 MB (Hadoop version 2).
- However, one can explicitly specify the number of partitions to be created.
Example 1:
- Number of partitions is not specified:
val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt")
Example 2:
- The following code creates an RDD with 10 partitions, since we explicitly specify the number of partitions:
val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt", 10)
One can query the number of partitions in the following way:
rdd1.partitions.length or rdd1.getNumPartitions
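For instance, here is a minimal spark-shell sketch of both queries, using the illustrative names rddDefault and rddTen for the two RDDs from the examples above (the printed values depend on the file size and block size):
// Minimal spark-shell sketch; `sc` is the SparkContext provided by the shell.
val rddDefault = sc.textFile("/home/hdadmin/wc-data.txt")      // default partitioning
val rddTen     = sc.textFile("/home/hdadmin/wc-data.txt", 10)  // at least 10 partitions requested
// Both expressions return the current number of partitions of an RDD.
println(rddDefault.partitions.length)   // typically one partition per HDFS block
println(rddTen.getNumPartitions)        // 10 (or more, depending on the input splits)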
- The best-case scenario is to create the RDD so that:
number of cores in the cluster = number of partitions
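As a rough sketch of that rule of thumb (illustrative only; sc.defaultParallelism is used here as a stand-in for the total number of cores available to the application):
// In a cluster, sc.defaultParallelism is usually the total number of executor cores;
// in local mode it is the number of local cores.
val targetPartitions = sc.defaultParallelism
val rdd = sc.textFile("/home/hdadmin/wc-data.txt", targetPartitions)
println(rdd.getNumPartitions)   // at least targetPartitions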
September 20, 2018 at 12:57 pm #4958 - DataFlair Team (Spectator)
val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt")
Consider that wc-data.txt is 1280 MB in size and the default block size is 128 MB. There will be 10 blocks created and 10 default partitions (1 per block).
For better performance, we can increase the number of partitions per block. The code below creates 20 partitions over the 10 blocks (2 partitions per block). Performance can improve, but we need to make sure that each node in the cluster has at least 2 cores so the extra partitions can run in parallel.
val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt", 20)
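As a sketch, one can verify the resulting partition count, or alternatively repartition an RDD created with the default partitioning (rddDefault and rdd20 are illustrative names; the exact counts depend on how the input splits are computed):
// Verify the partition count of rdd1 defined above.
println(rdd1.getNumPartitions)     // roughly 20 for a 1280 MB file with 128 MB blocks
// Alternative: take an RDD with the default partitioning and repartition it (involves a shuffle).
val rddDefault = sc.textFile("/home/hdadmin/wc-data.txt")   // ~10 partitions (1 per block)
val rdd20 = rddDefault.repartition(20)
println(rdd20.getNumPartitions)    // 20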