By default, how many partitions are created in an RDD in Apache Spark?

    #4957

    dfbdteam3
    Moderator
    • By default, Spark creates one partition for each block of the file (for HDFS).
    • The default HDFS block size is 64 MB (Hadoop version 1) / 128 MB (Hadoop version 2).
    • However, one can explicitly specify the number of partitions to be created.

    Example 1:

    • Number of partitions is not specified:
    val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt")

    Example 2:

    • The following code creates an RDD with 10 partitions, since we specify the number of partitions:
    val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt", 10)

    One can query the number of partitions in the following way:

    rdd1.partitions.length
    
    OR
    
    rdd1.getNumPartitions
    • Best-case scenario: we should create the RDD such that (see the sketch below)

      number of cores in the cluster = number of partitions
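
    A minimal sketch of this, assuming the same hypothetical file path as above: sc.defaultParallelism reports the parallelism Spark has available (normally the total number of executor cores), so passing it as the requested number of partitions roughly lines partitions up with cores.

    // Hypothetical sketch: request as many partitions as Spark's default parallelism,
    // which normally equals the total number of executor cores.
    val cores = sc.defaultParallelism
    val rdd1  = sc.textFile("/home/hdadmin/wc-data.txt", cores)

    // textFile treats the second argument as a minimum, so the actual
    // partition count can be higher for large files.
    println(rdd1.getNumPartitions)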
    #4958

    dfbdteam3
    Moderator

    val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt")

    Consider that wc-data.txt is 1280 MB in size and the default block size is 128 MB. There will then be 10 blocks created and 10 default partitions (1 per block).

    For better performance, we can increase the number of partitions per block. The code below creates 20 partitions over the 10 blocks (2 partitions per block). Performance improves, but we need to make sure the cluster has enough cores available to process the extra partitions in parallel.

    val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt", 20)
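
    As a quick check (the 1280 MB file is hypothetical, so actual numbers will vary with your data and cluster), one can compare the partition counts directly, or repartition an existing RDD instead of re-reading the file:

    // Hypothetical check for the 1280 MB file described above.
    val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt")      // ~10 partitions (1 per block)
    val rdd2 = sc.textFile("/home/hdadmin/wc-data.txt", 20)  // at least 20 partitions
    println(rdd1.getNumPartitions)
    println(rdd2.getNumPartitions)

    // Alternatively, reshape an existing RDD without re-reading the file.
    val rdd3 = rdd1.repartition(20)
    println(rdd3.getNumPartitions)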
