RAID configuration on Hadoop cluster

  • Author
    Posts
    • #4716
      DataFlair Team
      Spectator

      While planning or deploying a Hadoop cluster, should we configure RAID on the cluster nodes? RAID is popular in RDBMS deployments, where it is used to replicate data for reliability.

    • #4717
      DataFlair Team
      Spectator

      You can, but it would be a waste to configure RAID on cluster nodes. The primary reason is that HDFS provides its own replication mechanism (remember the replication factor).
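
      To make the "waste" concrete, here is a minimal sketch of the storage overhead. It assumes, purely for illustration, that RAID on a DataNode would mean RAID 1 mirroring (two copies of every disk) and that the replication factor is 3, the HDFS default. The function names are hypothetical and not part of any Hadoop API; other RAID levels would have a different overhead.

```python
def physical_copies(replication_factor: int, raid_mirror_copies: int = 1) -> int:
    """Number of physical copies of each HDFS block that end up on disk.

    raid_mirror_copies is 1 for plain disks (JBOD) and 2 for RAID 1 mirroring
    (an assumption made for this illustration).
    """
    return replication_factor * raid_mirror_copies


def usable_fraction(replication_factor: int, raid_mirror_copies: int = 1) -> float:
    """Fraction of raw disk space that holds unique data."""
    return 1 / physical_copies(replication_factor, raid_mirror_copies)


if __name__ == "__main__":
    # HDFS replication only: every block is stored 3 times.
    print(physical_copies(3), usable_fraction(3))
    # HDFS replication on top of RAID 1: every block is stored 6 times.
    print(physical_copies(3, 2), usable_fraction(3, 2))
```

      Under these assumptions, mirroring doubles the number of copies of every block while HDFS already tolerates the loss of a disk or a whole node.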

      It is worth mentioning here that around 30-40% of the disk space should be reserved for intermediate data, e.g. MapReduce intermediate outputs and other OS-related temporary files.
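
      As a rough sizing sketch, the reserve and the replication factor can be combined into a capacity estimate. The cluster size below (10 nodes with 12 x 4 TB disks each) and the function name are hypothetical; the 35% reserve is simply the midpoint of the 30-40% range above, and 3 is the default HDFS replication factor.

```python
def logical_hdfs_capacity_tb(raw_tb: float,
                             reserve_fraction: float = 0.35,
                             replication_factor: int = 3) -> float:
    """Estimate how much unique data fits in HDFS.

    raw_tb             total raw disk capacity of all DataNodes, in TB
    reserve_fraction   share of disk kept free for intermediate data
    replication_factor number of copies HDFS keeps of each block
    """
    if not 0 <= reserve_fraction < 1:
        raise ValueError("reserve_fraction must be in [0, 1)")
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    hdfs_raw_tb = raw_tb * (1 - reserve_fraction)  # space left for HDFS blocks
    return hdfs_raw_tb / replication_factor        # unique data after replication


if __name__ == "__main__":
    raw_tb = 10 * 12 * 4  # hypothetical cluster: 10 nodes, 12 disks of 4 TB each
    print(round(logical_hdfs_capacity_tb(raw_tb), 1), "TB of unique data")
```

      For this hypothetical cluster, 480 TB of raw disk works out to roughly 104 TB of unique data.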

      For DataNodes: RAID is not required, because Hadoop provides fault tolerance itself and replicates the data in a locality-aware fashion.

      For NameNodes: the NameNode can be a single point of failure, because it holds the metadata in memory and also routinely writes it to disk. If the NameNode goes down and its disk crashes, we may be in serious trouble. To mitigate this risk, we can configure RAID on the NameNode.

      Summary
      RAID is a good thing and can be leveraged on NameNodes, but on DataNodes it would be overkill, slow, and expensive.
