By Default, how many partitions are created in RDD in Apache Spark ?

Live instructor-led & Self-paced Online Certification Training Courses (Big Data, Hadoop, Spark) Forums Apache Hadoop By Default, how many partitions are created in RDD in Apache Spark ?

Viewing 1 reply thread
  • Author
    Posts
    • #4957
      DataFlair Team
      Participant
        <li style=”list-style-type: none”>
      • By Default, Spark creates one Partition for each block of the file (For HDFS)
      • Default block size for HDFS block is 64 MB (Hadoop Version 1) / 128 MB (Hadoop Version 2).
      • However, one can explicitly specify the number of partitions to be created.

      Example1:

        <li style=”list-style-type: none”>
      • No Partition is not specified
      val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt")

      Example2:

        <li style=”list-style-type: none”>
      • Following code create the RDD of 10 partitions, since we specify the no. of partitions.
      val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt", 10)

      One can query about the number of partitions in following way :

      rdd1.partitions.length
      
      OR
      
      rdd1.getNumPartitions
      • Best case Scenario is that we should make RDD in following way:

        numbers of cores in Cluster = no. of partitions
    • #4958
      DataFlair Team
      Participant

      val rdd1 = sc.textFile(“/home/hdadmin/wc-data.txt”)

      Consider the size of wc-data.txt is of 1280 MB and Default block size is 128 MB. So there will be 10 blocks created and 10 default partitions(1 per block).

      For a better performance, we can increase the number of partitions on each block. Below code will create 20 partitions on 10 blocks(2 partitions/block). Performance will be improved but need to make sure that each cluster is running on 2 cores minimum.

      val rdd1 = sc.textFile(“/home/hdadmin/wc-data.txt”, 20)

Viewing 1 reply thread
  • You must be logged in to reply to this topic.