September 20, 2018 at 4:14 pm #5798
Suppose Hadoop spawned 100 tasks for a job and one of the tasks failed. What will Hadoop do?
September 20, 2018 at 4:14 pm #5800
If a task fails, the application master is notified of the failure; it marks the task attempt as failed and frees the container. The application master then reschedules the task, preferring a different node. This repeats until the task has failed four times (the number can be configured via the mapreduce.map.maxattempts and mapreduce.reduce.maxattempts properties). If the task fails four times, the whole job is marked as failed.
In some cases a small number of task failures can be tolerated. For such jobs we can set the maximum percentage of tasks allowed to fail (via the mapreduce.map.failures.maxpercent property, and mapreduce.reduce.failures.maxpercent for reduce tasks); Hadoop uses this threshold when deciding whether to mark the job as succeeded or failed.
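For reference, these limits can be set per job or in mapred-site.xml; the property names below are the YARN-era (MRv2) ones, and the values shown are only illustrative:

```xml
<!-- Maximum attempts per task before it is marked failed (default 4) -->
<property>
  <name>mapreduce.map.maxattempts</name>
  <value>4</value>
</property>
<property>
  <name>mapreduce.reduce.maxattempts</name>
  <value>4</value>
</property>
<!-- Percentage of map tasks allowed to fail without failing the job -->
<property>
  <name>mapreduce.map.failures.maxpercent</name>
  <value>5</value>
</property>
```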
September 20, 2018 at 4:14 pm #5802
If a task fails, Hadoop detects the failed task and reschedules a replacement on a healthy machine.
It terminates the job only if the task fails more than four times, which is the default limit and can be changed.
The number of attempts can be configured using the mapred.map.max.attempts property (mapreduce.map.maxattempts in newer releases).
September 20, 2018 at 4:15 pm #5804
In the real world, user code is buggy, processes crash, and machines fail. One of the
major benefits of using Hadoop is its ability to handle such failures and still allow your job to complete.
So let us first understand the failure cases and their causes.
Task Failure: (There are many cases wherein the tasks may fail)
Case 1: Child Task Failure
This happens when user code in the map or reduce task throws a runtime exception.
If this happens, the child JVM reports the error back to its parent tasktracker, before it exits.
The error ultimately makes it into the user logs. The tasktracker marks the task attempt as
failed, freeing up a slot to run another task.
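As an illustration (a plain-Python sketch, not Hadoop API code), here is hypothetical user code throwing a runtime exception on a bad record; in a real job this exception would propagate out of the map task, be reported to the tasktracker, logged, and cause the attempt to be marked failed:

```python
def parse_temperature(record: str) -> int:
    """Map-side user code: parse a hypothetical 'station,temperature' record.

    Raises ValueError on malformed input -- the kind of runtime
    error that would fail a task attempt in a real MapReduce job.
    """
    station, temp = record.split(",")  # ValueError if no comma
    return int(temp)                   # ValueError if not an integer

# A well-formed record parses fine; a malformed one raises,
# which in Hadoop would mark the task attempt as failed.
print(parse_temperature("011990-99999,22"))  # -> 22
```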
Case 2: For Streaming Task failures
For Streaming tasks, if the Streaming process exits with a nonzero exit code, it is marked as failed.
This behavior is governed by the stream.non.zero.exit.is.failure property
(the default is true).
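A minimal sketch of a Streaming mapper, assuming a tab-separated record format (both the format and the script are hypothetical): it exits with a nonzero status on a malformed line, which Streaming treats as task failure when stream.non.zero.exit.is.failure is left at its default of true.

```python
#!/usr/bin/env python3
"""Hypothetical Hadoop Streaming mapper.

Emits 'key<TAB>1' for each two-field input line; exits with a
nonzero status on a malformed line, which Streaming treats as a
task failure when stream.non.zero.exit.is.failure=true (default).
"""
import sys

def main(lines) -> int:
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            # Nonzero exit code => task attempt marked failed.
            return 1
        print(f"{parts[0]}\t1")
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.stdin))
```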
Case 3: Sudden failure or exit of the task in JVM due to JVM bug
Another failure mode is the sudden exit of the child JVM—perhaps there is a JVM bug
that causes the JVM to exit for a particular set of circumstances exposed by the Map-
Reduce user code. In this case, the tasktracker notices that the process has exited, and
marks the attempt as failed.
Case 4: Hanging tasks
These are dealt with differently. The tasktracker notices that it hasn’t received
a progress update for a while and proceeds to mark the task as failed. The child JVM
process is then automatically killed.
The timeout period after which tasks are considered failed is normally 10 minutes. It can be configured on a per-job basis (or a cluster basis) by setting the mapred.task.timeout property to a value in milliseconds.
Setting the timeout to zero disables it, so long-running tasks are never marked as failed. In this case, a hanging task will never free up its slot, and over time the cluster may slow down as a result. This approach should be avoided; instead, make sure that tasks report progress periodically.
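For reference, the per-job setting looks like this (value in milliseconds; in newer releases the same property is named mapreduce.task.timeout):

```xml
<!-- Mark a task as failed if it reports no progress for 10 minutes -->
<property>
  <name>mapred.task.timeout</name>
  <value>600000</value>
</property>
```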
Hadoop’s way to handle such task failures:
When the jobtracker is notified of a task attempt that has failed (by the tasktracker’s heartbeat call) it will reschedule execution of the task. The jobtracker will try to avoid rescheduling the task on a tasktracker where it has previously failed. Furthermore, if a task fails more than four times, it will not be retried further. This value is configurable: the maximum number of attempts to run a task is controlled by the mapred.map.max.attempts property for map tasks, and mapred.reduce.max.attempts for reduce tasks. By default, if any task fails more than four times (or whatever the maximum number of attempts is configured to), the whole job fails.
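The rescheduling policy above can be sketched as plain Python (a toy model, not Hadoop code; the function names, the node list, and the runs_ok_on callback are all illustrative):

```python
def schedule_task(nodes, runs_ok_on, max_attempts=4):
    """Toy model of the jobtracker's retry policy.

    Tries the task up to max_attempts times, avoiding nodes where
    it has already failed; runs_ok_on(node) -> bool stands in for
    actually executing the task attempt. Returns the node the task
    succeeded on, or None if the whole job would be marked failed.
    """
    failed_on = set()
    for attempt in range(max_attempts):
        # Prefer a node where this task has not previously failed.
        candidates = [n for n in nodes if n not in failed_on] or nodes
        node = candidates[0]
        if runs_ok_on(node):
            return node
        failed_on.add(node)
    return None  # max_attempts failures => job fails

# Task only succeeds on node "c": two failed attempts, then success.
print(schedule_task(["a", "b", "c"], lambda n: n == "c"))  # -> c
```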
For some applications it is undesirable to abort the job if a few tasks fail, as it may be possible to use the results of the job despite some failures. In this case, the maximum percentage of tasks that are allowed to fail without triggering job failure can be set for the job. Map tasks and reduce tasks are controlled independently, using the mapred.max.map.failures.percent and mapred.max.reduce.failures.percent properties.
A task attempt may also be killed, which is different from it failing. A task attempt may be killed because it is a speculative duplicate, or because the tasktracker it was running on failed, and the jobtracker marked all the task attempts running on it as killed. Killed task attempts do not count against the number of attempts to run the task (as set by mapred.map.max.attempts and mapred.reduce.max.attempts), since it wasn’t the task’s fault that an attempt was killed.
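To make the killed-versus-failed distinction concrete, a hypothetical tally (again a toy model, not Hadoop code):

```python
def job_fails(attempt_outcomes, max_attempts=4):
    """Toy model: KILLED attempts (speculative duplicates, or attempts
    lost with a failed tasktracker) do not count toward max_attempts;
    only FAILED attempts do."""
    failed = sum(1 for outcome in attempt_outcomes if outcome == "FAILED")
    return failed >= max_attempts

# Three failures plus two killed attempts: still under the limit.
print(job_fails(["FAILED", "KILLED", "FAILED", "KILLED", "FAILED"]))  # -> False
```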
Users may also kill or fail task attempts using the web UI or the command line (type hadoop job to see the options). Jobs may also be killed by the same mechanisms.