
This topic contains 2 replies, has 1 voice, and was last updated by  dfbdteam5 8 months, 1 week ago.

Viewing 3 posts - 1 through 3 (of 3 total)
  • #6493


    By Default, how many partitions are created in RDD in Apache Spark ?


    • By default, Spark creates one partition for each block of the input file (for HDFS).
    • The default HDFS block size is 64 MB (Hadoop 1.x) or 128 MB (Hadoop 2.x).
    • However, one can explicitly specify the number of partitions to be created.


    • When the number of partitions is not specified, Spark uses the default:
    val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt")


    • The following code creates an RDD with 10 partitions, since we explicitly specify the number of partitions:
    val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt", 10)

    One can query the number of partitions in the following way:

    rdd1.getNumPartitions

    • In the best-case scenario, we should create the RDD such that:

      number of cores in cluster = number of partitions


    val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt")

    Consider that wc-data.txt is 1280 MB in size and the default block size is 128 MB. So 10 blocks will be created, and hence 10 default partitions (1 per block).
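The arithmetic above can be sketched as follows. This is a minimal illustration, assuming the file size and block size given in the example:

```scala
// Hypothetical values taken from the example above.
val fileSizeMB  = 1280   // size of wc-data.txt
val blockSizeMB = 128    // default HDFS block size (Hadoop 2.x)

// One HDFS block per 128 MB, rounding up for any partial final block.
val numBlocks = math.ceil(fileSizeMB.toDouble / blockSizeMB).toInt

// By default, Spark creates one partition per block, so this is 10.
val defaultPartitions = numBlocks
```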

    For better performance, we can increase the number of partitions per block. The code below creates 20 partitions from the 10 blocks (2 partitions per block). Performance will improve, provided the cluster has enough cores to process the partitions in parallel (at least 2 cores per block's worth of work).

    val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt", 20)
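Putting the pieces together, here is a minimal sketch, assuming a live SparkContext `sc` and the same example file path used above, of checking and changing the partition count:

```scala
// Assumes a running SparkContext `sc`; the path is the example path from this thread.
val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt", 20)

// Query the current number of partitions (20 in this example).
println(rdd1.getNumPartitions)

// The partition count can also be changed after creation:
val fewer = rdd1.coalesce(5)     // reduce partitions, avoiding a full shuffle
val more  = rdd1.repartition(40) // reshuffle the data into more partitions
```

`coalesce` is preferred when only reducing the partition count, since it avoids a full shuffle; `repartition` always shuffles.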


You must be logged in to reply to this topic.