RDD Partition

    • #4712
      DataFlair Team
      Spectator

      How can we split a single HDFS block into multiple RDD partitions?

    • #4715
      DataFlair Team
      Spectator

      When we create an RDD from a file stored in HDFS, for example:
      data = context.textFile("/user/dataflair/file-name")
      Spark creates one partition per HDFS block by default. For example, a 1280 MB file stored with a 128 MB block size occupies 10 HDFS blocks, so 10 partitions are created.
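
The block arithmetic above can be checked with a few lines of plain Python. This is only a sketch of the calculation, not Spark or HDFS code; the helper name `hdfs_block_count` and the `MB` constant are hypothetical and introduced just for illustration.

```python
MB = 1024 * 1024  # bytes per megabyte (illustrative constant)


def hdfs_block_count(file_size_bytes: int, block_size_bytes: int = 128 * MB) -> int:
    """Return how many HDFS blocks a file of the given size occupies.

    The last block may be only partly filled, so we round up.
    """
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    # Integer ceiling division avoids floating-point rounding.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


if __name__ == "__main__":
    # The example from the post: 1280 MB file, 128 MB blocks.
    print(hdfs_block_count(1280 * MB))
```

With the default partitioning described above, the number of partitions for a single file would be expected to match this block count.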

      If you want more partitions than blocks, you can specify the number of partitions when creating the RDD:

      data = context.textFile("/user/dataflair/file-name", 20)
      Here the second argument is the minimum number of partitions, so Spark creates 20 partitions for this file, i.e. 2 partitions for each block.
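
To see why asking for 20 partitions yields 2 per block, here is a rough model of the split-size rule that Hadoop's file input format is generally understood to apply when `textFile` is given a partition count. It is a sketch under assumptions not stated in the post: a single, uncompressed, splittable file, a minimum split size of 1 byte, and a 10% slack on the last split. The function name `estimate_partitions` and its parameters are hypothetical.

```python
MB = 1024 * 1024   # bytes per megabyte (illustrative constant)
SPLIT_SLOP = 1.1   # assumed slack: the last split may be up to 10% larger


def estimate_partitions(file_size: int,
                        block_size: int = 128 * MB,
                        min_partitions: int = 2,
                        min_split_size: int = 1) -> int:
    """Estimate how many input splits (partitions) one file produces.

    Simplified model: split_size = max(min_split_size, min(goal_size, block_size)),
    where goal_size = file_size / min_partitions.
    """
    if file_size <= 0:
        raise ValueError("this sketch only models non-empty files")
    goal_size = file_size // max(min_partitions, 1)
    split_size = max(min_split_size, min(goal_size, block_size))

    count = 0
    remaining = file_size
    # Cut full-size splits while clearly more than one split's worth remains.
    while remaining / split_size > SPLIT_SLOP:
        count += 1
        remaining -= split_size
    # Whatever is left becomes the final split.
    if remaining > 0:
        count += 1
    return count


if __name__ == "__main__":
    size = 1280 * MB
    for requested in (2, 5, 20, 40):
        print(requested, estimate_partitions(size, min_partitions=requested))
```

Under this model, requesting fewer partitions than there are blocks still gives one partition per block, because the split size is capped at the block size. Requesting more shrinks the split size below the block size, which is how a single block ends up divided across several partitions.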

      NOTE: It is often recommended to have more partitions than blocks, because more partitions let more tasks run in parallel, which can improve performance.
