By default, how many partitions are created in an RDD in Apache Spark?

    #4957

    dfbdteam3
    Moderator
    • By default, Spark creates one partition for each block of the file (for HDFS).
    • The default HDFS block size is 64 MB (Hadoop version 1) / 128 MB (Hadoop version 2).
    • However, one can explicitly specify the number of partitions to be created.

    Example 1:

    • Number of partitions is not specified:
    val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt")

    Example 2:

    • The following code creates an RDD with 10 partitions, since we specify the number of partitions:
    val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt", 10)

    One can query the number of partitions in the following way:

    rdd1.partitions.length
    
    OR
    
    rdd1.getNumPartitions
    • Best-case scenario: we should create the RDD such that (see the sketch below)

      number of cores in the cluster = number of partitions
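
    A minimal sketch of this, assuming the same hypothetical file path as above: sc.defaultParallelism reports the parallelism Spark has available (normally the total number of executor cores), so passing it as the requested number of partitions roughly lines partitions up with cores.

    // Hypothetical sketch: request as many partitions as Spark's default parallelism,
    // which normally equals the total number of executor cores.
    val cores = sc.defaultParallelism
    val rdd1  = sc.textFile("/home/hdadmin/wc-data.txt", cores)

    // textFile treats the second argument as a minimum, so the actual
    // partition count can be higher for large files.
    println(rdd1.getNumPartitions)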
    #4958

    dfbdteam3
    Moderator

    val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt")

    Consider that wc-data.txt is 1280 MB in size and the default block size is 128 MB. There will then be 10 blocks created and 10 default partitions (1 per block).

    For better performance, we can increase the number of partitions per block. The code below creates 20 partitions over the 10 blocks (2 partitions per block). Performance improves, but we need to make sure the cluster has enough cores available to process the extra partitions in parallel.

    val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt", 20)
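
    As a quick check (the 1280 MB file is hypothetical, so actual numbers will vary with your data and cluster), one can compare the partition counts directly, or repartition an existing RDD instead of re-reading the file:

    // Hypothetical check for the 1280 MB file described above.
    val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt")      // ~10 partitions (1 per block)
    val rdd2 = sc.textFile("/home/hdadmin/wc-data.txt", 20)  // at least 20 partitions
    println(rdd1.getNumPartitions)
    println(rdd2.getNumPartitions)

    // Alternatively, reshape an existing RDD without re-reading the file.
    val rdd3 = rdd1.repartition(20)
    println(rdd3.getNumPartitions)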
