RAID configuration on Hadoop cluster
September 20, 2018 at 11:55 am · #4716 · DataFlair Team (Spectator)
While planning or deploying a Hadoop cluster, should we configure RAID on the cluster nodes? RAID is popular in RDBMS deployments, where it is used to provide data redundancy for reliability.
September 20, 2018 at 11:56 am · #4717 · DataFlair Team (Spectator)
You can, but configuring RAID on the cluster nodes would be a waste. The primary reason is that HDFS provides its own replication mechanism (remember the replication factor).
It is also worth mentioning that around 30-40% of the disk space should be reserved for intermediate data, e.g. MapReduce intermediate outputs and other OS-related temporary activity.
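To put rough numbers on this, here is a back-of-the-envelope sketch in Python. The function name usable_hdfs_capacity_tb, the 35% reserve (the midpoint of the range above), and the example cluster size are all illustrative assumptions, not figures from a real deployment. A replication factor of 3 is the HDFS default.

```python
def usable_hdfs_capacity_tb(raw_tb: float,
                            replication: int = 3,
                            reserved_fraction: float = 0.35) -> float:
    """Estimate how much unique data (in TB) a cluster can hold.

    raw_tb            -- total raw disk across all DataNodes
    replication       -- HDFS replication factor (3 is the default)
    reserved_fraction -- share of disk kept free for intermediate data
                         (assumed 0.35 here, from the 30-40% rule of thumb)
    """
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    if not 0 <= reserved_fraction < 1:
        raise ValueError("reserved_fraction must be in [0, 1)")
    # Space left for HDFS blocks after setting aside the reserve.
    available_for_hdfs = raw_tb * (1 - reserved_fraction)
    # Every block is stored `replication` times.
    return available_for_hdfs / replication


# Hypothetical cluster: 10 DataNodes x 12 disks x 4 TB = 480 TB raw.
raw = 10 * 12 * 4
print(f"{usable_hdfs_capacity_tb(raw):.1f} TB of unique data")
```

With these assumed inputs, only about a fifth of the raw disk ends up holding unique data, which is why adding another layer of redundancy underneath is hard to justify.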
For DataNodes: RAID is not required, because Hadoop provides fault tolerance itself by replicating the data across nodes in a locality-aware (rack-aware) fashion.
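As a rough illustration of why RAID on DataNodes is wasteful, the sketch below compares how much unique data a single node can hold when a RAID level is stacked underneath HDFS replication. The names RAID_EFFICIENCY and unique_data_tb, and the 12 x 4 TB node, are hypothetical. The efficiency figures are the standard ones: two-way mirroring keeps half the raw space, and RAID 5 gives up one disk's worth to parity.

```python
# Fraction of raw capacity left usable by each disk layout, for n disks.
RAID_EFFICIENCY = {
    "jbod":   lambda n: 1.0,          # plain disks, no RAID
    "raid1":  lambda n: 0.5,          # two-way mirror
    "raid10": lambda n: 0.5,          # striped two-way mirrors
    "raid5":  lambda n: (n - 1) / n,  # one disk's worth of parity
}


def unique_data_tb(disks: int, disk_tb: float, layout: str,
                   replication: int = 3) -> float:
    """Unique data (TB) one node contributes after RAID and HDFS replication."""
    if layout not in RAID_EFFICIENCY:
        raise ValueError(f"unknown layout: {layout}")
    usable = disks * disk_tb * RAID_EFFICIENCY[layout](disks)
    return usable / replication


# Hypothetical DataNode with 12 disks of 4 TB each.
for layout in ("jbod", "raid5", "raid1", "raid10"):
    print(f"{layout:7s} {unique_data_tb(12, 4, layout):6.2f} TB")
```

With mirroring under a replication factor of 3, each block effectively exists six times, so the node holds half as much unique data as it would with plain disks, while HDFS already covers the loss of a disk or a whole node.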
For NameNodes: The NameNode can be a single point of failure, because it holds the metadata in memory and also routinely writes it to disk. If the NameNode goes down and its disk crashes, we may be in serious trouble. To mitigate this risk, we can configure RAID on the NameNode.
Summary
RAID is a good thing and can be leveraged on NameNodes, but on DataNodes it would be overkill: slow and expensive.