Why RAID is not used with Hadoop HDFS

    • #6323
      DataFlair Team
      Spectator

      Why is RAID not used with Hadoop HDFS?
      What are the pros and cons of RAID with HDFS?

    • #6325
      DataFlair Team
      Spectator

      Let us first understand what RAID is. RAID (Redundant Array of Independent Disks) is a data storage technique widely used in the IT industry to improve fault tolerance and performance. RAID combines multiple disk drives into a single logical unit for data redundancy, better performance, or both. There are several standard RAID levels, and the way data is distributed across the drives depends on the level. The most commonly used levels are RAID 0, RAID 1 and RAID 5.

      In RAID 0, data is striped across multiple disk drives: it is divided into small blocks (64 KB or more) that are distributed across the disks.
      RAID 1 uses mirroring: an exact copy of the data on one disk is stored on another disk.
      In RAID 5, data and parity (used to recover data after a disk failure) are striped across the disks.
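
      To make the parity idea concrete, here is a minimal sketch in Python. It builds a single stripe with one XOR parity chunk and rebuilds a lost chunk from the survivors. The function names are made up for illustration, and the parity rotation that real RAID 5 performs across disks is left out.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_stripe(data, num_data_disks):
    """Split data into one chunk per data disk and append an XOR parity chunk.

    Simplification: a single stripe with the parity on the last "disk".
    Real RAID 5 uses many stripes and rotates the parity position.
    """
    chunk_size = -(-len(data) // num_data_disks)  # ceiling division
    padded = data.ljust(chunk_size * num_data_disks, b"\0")
    chunks = [padded[i * chunk_size:(i + 1) * chunk_size]
              for i in range(num_data_disks)]
    return chunks + [xor_blocks(chunks)]

def rebuild(stripe, failed_index):
    """Recreate the chunk of a failed disk by XOR-ing all surviving chunks."""
    survivors = [c for i, c in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)

stripe = make_stripe(b"HDFS stores blocks on datanodes", 3)
lost = 1  # pretend the second disk failed
assert rebuild(stripe, lost) == stripe[lost]
```

      Because XOR is its own inverse, any one chunk (data or parity) can be rebuilt from the others, which is why such an array survives a single disk failure but not two.
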
      The pros of RAID are:

      Performance improvement: Since data is striped across the disks, reads and writes can be done simultaneously on multiple disks, which improves disk I/O times.
      Fault tolerance: In RAID 1, data is stored redundantly on two or more disks, so if one disk fails the others keep running and the data stays available. In RAID 5, the parity striped across the disks allows the data to be recreated after a disk error or disk failure, so the data is not lost.
      The cons of RAID are:

      There is no fault tolerance in RAID 0. If one of the disks fails, the entire array is affected. When a single disk eventually fails, all the data on the server is lost, forcing you to reformat all the disks and wait for HDFS to repopulate the server with new data.
      RAID does not make the individual disks more reliable. Disks tend to get slower as they age and start producing read errors. A slow disk is a sign that it should be replaced.
      Lagging performance: A RAID array delivers data at the rate of the slowest disk in the array, and disk speeds can vary by up to 20%.
      HDFS has mechanisms similar to RAID built in. HDFS splits files into chunks (called blocks), which are replicated across multiple datanodes and stored on their local filesystems. Datanodes usually have multiple disks that are mounted individually, a configuration called JBOD (Just a Bunch of Disks), which treats each disk separately. A datanode should distribute its blocks across all its disks / local filesystems.
      The advantages are:

      Fault-tolerance: If a disk or node goes down, other replicas are available on different data nodes and disks.
      High sequential read/write performance: By splitting a file into multiple blocks and storing them on different nodes (and different disks), a file can be read in parallel by concurrently accessing multiple disks (on different nodes). Each disk can read data at its full bandwidth, and its read operations do not interfere with other disks. If the cluster is well utilized, all disks will be spinning at full speed, delivering the maximum sequential read performance.
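
      The block splitting and per-disk distribution described above can be sketched as follows. This is a simplified model of one datanode's local placement only, not how HDFS itself is implemented. The 128 MB block size and the mount points `/data/1` to `/data/3` are assumptions chosen for the example.

```python
from itertools import cycle

def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is split into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

def place_round_robin(num_blocks, disks):
    """Assign block indices to disks in round-robin order."""
    placement = {disk: [] for disk in disks}
    for block_id, disk in zip(range(num_blocks), cycle(disks)):
        placement[disk].append(block_id)
    return placement

MB = 1024 * 1024
blocks = split_into_blocks(700 * MB, 128 * MB)  # assumed block size: 128 MB
disks = ["/data/1", "/data/2", "/data/3"]       # hypothetical JBOD mount points
print(len(blocks), place_round_robin(len(blocks), disks))
```

      Each disk ends up with its own independent share of the blocks, so a reader of one block touches exactly one disk and the other disks stay free to serve other reads.
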
      Here is a quote from Tom White’s book Hadoop: The Definitive Guide, 3rd Edition

      HDFS clusters do not benefit from using RAID (Redundant Array of Independent Disks) for datanode storage (although RAID is recommended for the namenode’s disks to protect against corruption of its metadata). The redundancy that RAID provides is not needed, since HDFS handles it by replication between nodes.
      Furthermore, RAID striping (RAID 0), which is commonly used to increase performance, turns out to be slower than the JBOD (Just a Bunch Of Disks) configuration used by HDFS, which round-robins HDFS blocks between all disks. This is because RAID 0 read and write operations are limited by the speed of the slowest disk in the RAID array. In JBOD, disk operations are independent, so the average speed of operations is greater than that of the slowest disk. Disk performance often shows considerable variation in practice, even for disks of the same model. In some benchmarking carried out on a Yahoo! cluster (http://markmail.org/message/xmzc45zi25htr7ry) JBOD performed 10% faster than RAID 0 in one test (Gridmix) and 30% better in another (HDFS write throughput).
      Finally, if a disk fails in a JBOD configuration, HDFS can continue to operate without the failed disk, whereas with RAID, failure of a single disk causes the whole array (and hence the node) to become unavailable.
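
      The "slowest disk" argument in the quote can be illustrated with a small back-of-the-envelope model. It is a deliberately simplified sketch: it assumes a striped array runs every disk at the speed of its slowest member, while independent disks each contribute their own speed. The disk speeds are made-up numbers, not benchmark results.

```python
def raid0_throughput(disk_speeds):
    """Striped array: every stripe waits for the slowest disk, so each
    disk effectively runs at the minimum speed (simplified model)."""
    return min(disk_speeds) * len(disk_speeds)

def jbod_throughput(disk_speeds):
    """Independent disks: each one contributes its own speed."""
    return sum(disk_speeds)

speeds = [100, 110, 120, 90]      # MB/s, made-up values for four disks
print(raid0_throughput(speeds))   # 4 disks x 90 MB/s = 360
print(jbod_throughput(speeds))    # 100 + 110 + 120 + 90 = 420
```

      In this model the two configurations only tie when all disks run at exactly the same speed. Any variation between disks favours JBOD.
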

      Since HDFS takes care of fault tolerance and “striped” reading, there is no need to use RAID underneath HDFS. Using RAID will only be more expensive, offer less storage, and also be slower (depending on the RAID level).
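
      The "less storage" point is simple arithmetic, sketched below. The cluster size, disk size, and a replication factor of 3 are assumptions picked for the example. `raid_efficiency` is a hypothetical parameter standing for the fraction of raw capacity a RAID level leaves usable (1.0 for JBOD, 0.5 for two-way RAID 1 mirroring, (n - 1) / n for RAID 5 over n disks).

```python
def usable_capacity_tb(nodes, disks_per_node, disk_tb,
                       replication=3, raid_efficiency=1.0):
    """Usable HDFS capacity in TB after RAID overhead and HDFS replication."""
    raw = nodes * disks_per_node * disk_tb
    return raw * raid_efficiency / replication

# Hypothetical cluster: 10 datanodes, 12 disks of 4 TB each (480 TB raw).
jbod = usable_capacity_tb(10, 12, 4)                            # 160.0
raid1 = usable_capacity_tb(10, 12, 4, raid_efficiency=0.5)      # 80.0
raid5 = usable_capacity_tb(10, 12, 4, raid_efficiency=11 / 12)  # about 146.7
print(jbod, raid1, raid5)
```

      Since HDFS replication already pays for redundancy once, any RAID level with redundancy pays for it a second time on the same hardware.
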

      Since the NameNode (master) is a single-point-of-failure in HDFS, it requires a more reliable hardware setup. Therefore, the use of RAID is recommended on namenodes and JBOD is strongly recommended for DataNodes (slaves).

    • #6326
      DataFlair Team
      Spectator

      Definition of RAID:
      RAID (redundant array of independent disks) is a data storage virtualization technology that combines multiple physical disk drive components into a single logical unit for the purposes of data redundancy, performance improvement, or both.

      Let's consider what RAID can offer to Hadoop:

      Replication (data stored on different disks).
      Fault tolerance (not in RAID 0).

      Now let's compare this with HDFS:

      HDFS splits files into blocks, which are then replicated. This provides fault tolerance, performance improvement through parallel processing, and high availability.

      So using RAID won’t add any additional advantage to HDFS.

      Let's assume RAID 0 is used with the objective of increasing performance. When it is compared with JBOD, the configuration used by HDFS, RAID 0 turns out to be slower.
      This is because, in RAID 0, read and write operations are limited by the slowest disk in the array. In JBOD, these operations are independent, so the average speed is higher than that of the slowest disk.
      One more reason RAID is not recommended: if a single disk in a RAID 0 array goes down, the whole array (and hence the node) becomes unavailable, whereas with JBOD, HDFS can operate without the failed disk.
