RAID configuration on Hadoop cluster

This topic has 1 reply, 1 voice, and was last updated 5 years, 7 months ago by DataFlair Team.


Author

Posts
- September 20, 2018 at 11:55 am #4716
  
  DataFlair Team
  Spectator
  
  While planning or deploying a Hadoop cluster, should we configure RAID on the cluster nodes? RAID is popular in RDBMS deployments, where it is used to replicate data for reliability.
- September 20, 2018 at 11:56 am #4717
  
  DataFlair Team
  Spectator
  
  You can, but configuring RAID on the cluster nodes would largely be a waste. The primary reason is that HDFS provides its own replication mechanism (remember the replication factor), so each block is already stored on several nodes.
  
  It is worth mentioning here that around 30-40% of the disk space should be reserved for intermediate data, e.g. MapReduce intermediate outputs and other OS-related temporary activity.
  
  For DataNodes: RAID is not required, because Hadoop provides fault tolerance itself by replicating the data across nodes in a locality-aware fashion.
  
  For NameNodes: The NameNode can be a point of failure, because it holds the metadata in memory and also routinely writes it to disk. If the NameNode goes down and its disk crashes, we may be in serious trouble. To mitigate this risk, we can configure RAID on the NameNodes.
  
  Summary
  RAID is a good thing and can be leveraged on NameNodes, but on DataNodes it would be overkill: slow and expensive.
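
To make the cost argument above concrete, here is a rough back-of-the-envelope sketch in Python. It is only an illustration under stated assumptions: the function name, the cluster size (10 nodes, 12 disks of 4 TB each), and the 35% reservation (the midpoint of the 30-40% mentioned above) are hypothetical, and the replication factor of 3 is the HDFS default. RAID 1 is modelled simply as halving the raw capacity.

```python
def usable_hdfs_capacity_tb(nodes, disks_per_node, disk_tb,
                            replication=3, reserved_fraction=0.35,
                            raid_efficiency=1.0):
    """Estimate how many TB of unique data a cluster can hold.

    All parameters are illustrative assumptions:
    - replication: HDFS replication factor (3 is the HDFS default)
    - reserved_fraction: share of disk kept for intermediate/OS data
    - raid_efficiency: fraction of raw capacity left after RAID
      (1.0 = no RAID / JBOD, 0.5 = RAID 1 mirroring)
    """
    raw_tb = nodes * disks_per_node * disk_tb
    after_raid_tb = raw_tb * raid_efficiency
    after_reserve_tb = after_raid_tb * (1 - reserved_fraction)
    # Every block is stored `replication` times across the cluster.
    return after_reserve_tb / replication


# Hypothetical cluster: 10 DataNodes, 12 disks each, 4 TB per disk.
jbod_tb = usable_hdfs_capacity_tb(10, 12, 4)
raid1_tb = usable_hdfs_capacity_tb(10, 12, 4, raid_efficiency=0.5)

print(f"No RAID on DataNodes: {jbod_tb:.1f} TB of unique data")
print(f"RAID 1 on DataNodes:  {raid1_tb:.1f} TB of unique data")
```

With these assumed numbers, the cluster holds roughly 104 TB of unique data without RAID and roughly 52 TB with RAID 1, since mirroring on top of HDFS replication stores every block six times instead of three.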