Should we use RAID with Hadoop?

Viewing 2 reply threads
  • Author
    Posts
    • #6125
      DataFlair Team
      Spectator

      As we know, Hadoop handles replication at the application level. Should we still use RAID with Hadoop?

      What are the deciding factors?

    • #6127
      DataFlair Team
      Spectator

      HDFS clusters do not benefit from using RAID for DataNode storage. The redundancy that RAID provides is not needed, because HDFS already handles it by replicating each block on different DataNodes.

      RAID striping (RAID 0), which is commonly used to increase performance, turns out to be slower than the JBOD (Just a Bunch Of Disks) configuration used by HDFS, which round-robins blocks across all disks. This is because in RAID 0, read/write operations are limited by the slowest disk in the array. In JBOD, disk operations are independent, so the average speed of operations is greater than that of the slowest disk.
      If a disk fails in a JBOD configuration, HDFS can continue to operate without it, whereas in a striped RAID 0 array the failure of a single disk makes the whole array unavailable.
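
      The slowest-disk argument above can be sketched with a small model. This is only an illustration under simplifying assumptions: the disk speeds and sizes are made-up numbers, the function names are hypothetical, and real arrays are affected by caching, controllers, and workload mix.

```python
def striped_throughput(disk_speeds_mb_s):
    """RAID 0-style striping: each request is spread over all disks,
    so every disk effectively runs at the pace of the slowest one."""
    return min(disk_speeds_mb_s) * len(disk_speeds_mb_s)


def jbod_throughput(disk_speeds_mb_s):
    """JBOD: disks serve independent requests, so each disk
    contributes its own speed to the aggregate."""
    return sum(disk_speeds_mb_s)


def capacity_after_failure(disk_sizes_tb, failed_index, striped):
    """Capacity still reachable after one disk fails.
    A striped array loses everything; JBOD loses only the failed disk."""
    if striped:
        return 0
    return sum(size for i, size in enumerate(disk_sizes_tb) if i != failed_index)


# Hypothetical node with four disks, one of which is degraded.
speeds = [150, 140, 145, 90]   # MB/s, assumed values
sizes = [4, 4, 4, 4]           # TB, assumed values

print(striped_throughput(speeds))   # 360
print(jbod_throughput(speeds))      # 525
print(capacity_after_failure(sizes, failed_index=3, striped=True))    # 0
print(capacity_after_failure(sizes, failed_index=3, striped=False))   # 12
```

      In this model a single slow disk drags the striped array down to 4 x 90 MB/s, while JBOD still gets the full speed of the three healthy disks.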

      RAID is recommended for the NameNode's disks, to protect its metadata against corruption.

    • #6129
      DataFlair Team
      Spectator

      HDFS itself takes care of fault tolerance and avoids data loss, because redundant copies of the data are available on multiple DataNodes. There is no need to use RAID for data storage in HDFS. Using RAID makes the Hadoop deployment more expensive, offers less usable storage, and can also be slower, depending on the RAID configuration.
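
      To put rough numbers on the "less storage" point, here is a minimal sketch. It assumes HDFS replication stays at its default of 3 on top of whatever RAID level is used; the function names, the 48 TB figure, and the array sizes are hypothetical, and filesystem overhead and reserved space are ignored.

```python
def raid_efficiency(raid_level, disks_per_array):
    """Fraction of raw capacity left after RAID redundancy."""
    if raid_level in ("jbod", "raid0"):
        return 1.0                                       # no redundancy
    if raid_level == "raid1":
        return 0.5                                       # two-way mirror
    if raid_level == "raid5":
        return (disks_per_array - 1) / disks_per_array   # one parity disk
    if raid_level == "raid6":
        return (disks_per_array - 2) / disks_per_array   # two parity disks
    raise ValueError(f"unsupported RAID level: {raid_level}")


def usable_hdfs_tb(raw_tb, raid_level, disks_per_array, replication=3):
    """Unique data that fits once RAID overhead and HDFS replication
    are both paid for."""
    return raw_tb * raid_efficiency(raid_level, disks_per_array) / replication


raw = 48  # TB of raw disk, assumed value
print(usable_hdfs_tb(raw, "jbod", 4))    # 16.0
print(usable_hdfs_tb(raw, "raid5", 4))   # 12.0
print(usable_hdfs_tb(raw, "raid1", 2))   # 8.0
```

      Under these assumptions the RAID redundancy is paid for in addition to HDFS replication, which is why the same raw disks hold less data.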

      Since the NameNode is a single point of failure in HDFS, we can use RAID on the NameNode machine, as it requires a more reliable hardware setup.
