Comparison Between Hadoop 2.x and Hadoop 3.x


1. Objective

In this Hadoop tutorial, we will compare Hadoop 2.x and Hadoop 3.x. What new features were added in Hadoop version 3? Are Hadoop 2 programs compatible with Hadoop 3? What are the differences between Hadoop 2 and Hadoop 3? We hope that this feature-wise comparison of Hadoop 2 and Hadoop 3 will help you answer these questions.

Detailed feature-wise comparison between Apache Hadoop 2.x and Hadoop 3.x.

2. Feature-wise Comparison Between Hadoop 2.x and Hadoop 3.x

This section compares Hadoop 2.x and Hadoop 3.x across 22 features. Let us now discuss each feature one by one.

2.1. License

  • Hadoop 2.x – Apache 2.0, Open Source
  • Hadoop 3.x – Apache 2.0, Open Source

2.2. Minimum supported version of Java

  • Hadoop 2.x – The minimum supported version of Java is Java 7.
  • Hadoop 3.x – The minimum supported version of Java is Java 8.

2.3. Fault Tolerance

  • Hadoop 2.x – Fault tolerance is handled by replication, which wastes storage space.
  • Hadoop 3.x – Fault tolerance can also be handled by erasure coding.
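
To give a feel for how erasure coding recovers lost data without keeping full copies, here is a minimal sketch. It uses simple XOR parity, which is not the scheme HDFS uses (HDFS erasure coding is based on Reed-Solomon codes); XOR is swapped in only because it shows the same idea in a few lines. The block contents and the helper name xor_blocks are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks of equal size.
data_blocks = [b"hadoop", b"block2", b"block3"]

# One parity block protects all three data blocks.
parity = xor_blocks(data_blocks)

# Simulate losing the second block, then rebuild it from the survivors and the parity.
survivors = [data_blocks[0], data_blocks[2], parity]
recovered = xor_blocks(survivors)
```

With XOR parity a single lost block can be rebuilt; Reed-Solomon codes generalize this so that several lost blocks can be rebuilt from several parity blocks.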

2.4. Data Balancing

  • Hadoop 2.x – Uses the HDFS balancer for data balancing.
  • Hadoop 3.x – Uses the intra-DataNode balancer for data balancing, which is invoked via the HDFS disk balancer CLI.

2.5. Storage Scheme

  • Hadoop 2.x – Uses a 3x replication scheme.
  • Hadoop 3.x – Supports erasure coding in HDFS.

2.6. Storage Overhead

  • Hadoop 2.x – HDFS has 200% overhead in storage space.
  • Hadoop 3.x – Storage overhead is only 50%.

2.7. Storage Overhead Example

  • Hadoop 2.x – If there are 6 blocks, 18 blocks of space will be occupied because of the replication scheme.
  • Hadoop 3.x – If there are 6 blocks, 9 blocks of space will be occupied: 6 data blocks and 3 parity blocks.
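
The arithmetic behind the 200% and 50% figures can be sketched as follows. The function names are hypothetical, and the erasure coding case assumes a layout of 6 data units plus 3 parity units per stripe, matching the example above.

```python
import math


def replication_footprint(data_blocks, replicas=3):
    """Total blocks stored and overhead percentage under N-way replication."""
    total = data_blocks * replicas
    overhead_pct = 100 * (total - data_blocks) / data_blocks
    return total, overhead_pct


def erasure_coding_footprint(data_blocks, data_units=6, parity_units=3):
    """Total blocks stored and overhead percentage with data+parity striping."""
    # Each group of `data_units` data blocks gets `parity_units` parity blocks.
    stripes = math.ceil(data_blocks / data_units)
    parity_blocks = stripes * parity_units
    total = data_blocks + parity_blocks
    overhead_pct = 100 * parity_blocks / data_blocks
    return total, overhead_pct


print(replication_footprint(6))     # (18, 200.0)
print(erasure_coding_footprint(6))  # (9, 50.0)
```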

2.8. YARN Timeline Service

  • Hadoop 2.x – Uses an old timeline service which has scalability issues.
  • Hadoop 3.x – Introduces timeline service v2, which improves the scalability and reliability of the timeline service.

2.9. Default Ports Range

  • Hadoop 2.x – In Hadoop 2.0, some default ports fall within the Linux ephemeral port range, so a service can fail to bind at startup if another application is already using its port.
  • Hadoop 3.x – In Hadoop 3.0, these ports have been moved out of the ephemeral range.
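
As a rough illustration, the sketch below checks a few well-known default ports against the Linux ephemeral range. It assumes the common Linux default of 32768 to 60999 (the actual range is configurable per system), and the port table lists only a handful of services with their usual Hadoop 2.x and Hadoop 3.x defaults; the names DEFAULT_PORTS and in_ephemeral_range are made up for this example.

```python
# Assumed Linux default ephemeral port range (32768-60999); a system may override it.
EPHEMERAL_RANGE = range(32768, 61000)

# Service -> (Hadoop 2.x default port, Hadoop 3.x default port)
DEFAULT_PORTS = {
    "NameNode HTTP UI": (50070, 9870),
    "Secondary NameNode HTTP": (50090, 9868),
    "DataNode data transfer": (50010, 9866),
    "DataNode IPC": (50020, 9867),
    "DataNode HTTP UI": (50075, 9864),
}


def in_ephemeral_range(port):
    """Return True if the port may be handed out by the OS as an ephemeral port."""
    return port in EPHEMERAL_RANGE


for service, (v2_port, v3_port) in DEFAULT_PORTS.items():
    print(f"{service}: 2.x port {v2_port} ephemeral={in_ephemeral_range(v2_port)}, "
          f"3.x port {v3_port} ephemeral={in_ephemeral_range(v3_port)}")
```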

2.10. Tools

  • Hadoop 2.x – Uses Hive, Pig, Tez, Hama, Giraph, and other Hadoop tools.
  • Hadoop 3.x – Hive, Pig, Tez, Hama, Giraph, and other Hadoop tools are available.

Learn Apache Hadoop Ecosystem Components in detail.

2.11. Compatible File System

  • Hadoop 2.x – HDFS (the default file system); the FTP file system, which stores all its data on remotely accessible FTP servers; the Amazon S3 (Simple Storage Service) file system; and the Windows Azure Storage Blobs (WASB) file system.
  • Hadoop 3.x – Supports all of the above as well as the Microsoft Azure Data Lake file system.

2.12. Datanode Resources

  • Hadoop 2.x – DataNode resources are not dedicated to MapReduce; we can use them for other applications.
  • Hadoop 3.x – Here also, DataNode resources can be used for other applications.

2.13. MR API Compatibility

  • Hadoop 2.x – The MR API is compatible with Hadoop 1.x, so Hadoop 1.x programs can execute on Hadoop 2.x.
  • Hadoop 3.x – Here also, the MR API is compatible, so Hadoop 1.x programs can execute on Hadoop 3.x.

2.14. Support for Microsoft Windows

  • Hadoop 2.x – It can be deployed on Windows.
  • Hadoop 3.x – It also supports Microsoft Windows.

2.15. Slots/Container

  • Hadoop 2.x – Hadoop 1 works on the concept of slots, but Hadoop 2.x works on the concept of containers. Within a container, we can run generic tasks.
  • Hadoop 3.x – It also works on the concept of containers.

2.16. Single Point of Failure

  • Hadoop 2.x – Has features to overcome the SPOF, so whenever the NameNode fails, it recovers automatically.
  • Hadoop 3.x – Has features to overcome the SPOF, so whenever the NameNode fails, it recovers automatically with no manual intervention needed.

2.17. HDFS Federation

  • Hadoop 2.x – In Hadoop 1.0, a single NameNode manages the entire namespace, but in Hadoop 2.0, multiple NameNodes manage multiple namespaces.
  • Hadoop 3.x – Hadoop 3.x also has multiple NameNodes for multiple namespaces.

Learn Hadoop HDFS Federation in detail.

2.18. Scalability

  • Hadoop 2.x – We can scale up to 10,000 nodes per cluster.
  • Hadoop 3.x – Better scalability: we can scale to more than 10,000 nodes per cluster.

2.19. Faster Access to Data

  • Hadoop 2.x – DataNode caching gives us fast access to the data.
  • Hadoop 3.x – Here also, DataNode caching gives us fast access to the data.

2.20. HDFS Snapshot

  • Hadoop 2.x – Hadoop 2 adds support for snapshots. They provide disaster recovery and protection against user error.
  • Hadoop 3.x – Hadoop 3 also supports the snapshot feature.

2.21. Platform

  • Hadoop 2.x – Can serve as a platform for a wide variety of data analytics; it is possible to run event processing, streaming, and real-time operations.
  • Hadoop 3.x – Here also, it is possible to run event processing, streaming, and real-time operations on top of YARN.

2.22. Cluster Resource Management

  • Hadoop 2.x – Uses YARN for cluster resource management. It improves scalability, high availability, and multi-tenancy.
  • Hadoop 3.x – Also uses YARN for cluster resource management, with all the same features.

3. Conclusion

Now that we have discussed 22 important points of comparison between Hadoop 2.x and Hadoop 3.x, we can decide which of Hadoop 2 and Hadoop 3 is the better choice for installation. We are providing both Hadoop 2.x installation on Ubuntu and Hadoop 3.x installation on Ubuntu for your convenience.

If you liked this blog or found any other difference between Hadoop 2.x and Hadoop 3.x, please let us know by leaving a comment below.
