Live instructor-led & Self-paced Online Certification Training Courses (Big Data, Hadoop, Spark) › Forums › Hadoop › Explain the process of spilling in MapReduce?
September 20, 2018 at 2:11 pm #5078
What is the need of spilling in Hadoop MapReduce?
What is spill in MapReduce?September 20, 2018 at 2:11 pm #5079
Mapper task processes each input record (from RecordReader) and generates a key-value pair. The Mapper does not store its output on HDFS. Thus, this is temporary data and writing on HDFS will create unnecessary multiple copies. The Mapper writes its output into the circular memory buffer (RAM). Since, the size of the buffer is 100 MB by default, which we can change by using
Now, Spilling is a process of copying the data from the memory buffer to disc. It takes place when the content of the buffer reaches a certain threshold size. By default, a background thread starts spilling the contents after 80% of the buffer size has filled. Therefore, for a 100 MB size buffer, the spilling will start after the content of the buffer reach a size of 80MB.
Spilling is required because a checkpoint is required from which you can restart the reducers jobs. Checkpoint use those spilled records in case of a reduce task failure.September 20, 2018 at 2:11 pm #5080
After map phase intermediate values are written to the circular buffer in memory, whose size is defined by the parameter mapreduce.task.io.sort.mb (defaults to 100MB) – it is the total amount of memory allowed for the map output to occupy. If you do not fit into this amount, your data would be spilled to the disk. Be aware that with the change of the default dfs.blocksize from 64MB to 128MB even the simplest identity mapper will spill to disk, because map output buffer by default is smaller than the input split size.spilling is performed in a separate thread. It is started when the circular buffer is filled by mapreduce.map.sort.spill.percent percent, by default 0.8, which with the default size of the buffer gives 80MB. By default if your map task for a single input split outputs more than 80MB of data, this data would be spilled to local disks. Spilling works in a separate thread allowing mapper to continue functioning and processing input data while spilling happens. Mapper function is stopped only if the mapper processing rate is greater that spilling rate, so that the memory buffer got 100% full – in this case mapper function will be blocked and wait for the spill thread to free up some space.
Spilling thread writes the data to the file on the local drive of the server where the mapper function is called. The directory to write is determined based on the mapreduce.job.local.dir setting, which contains a list of the directories to be used by the MR jobs on the cluster for temporary data. One directory out of this list is chosen in a round robin fashion.
You must be logged in to reply to this topic.