Comparison Between Hadoop 2.x vs Hadoop 3.x

1. Objective

In this Hadoop tutorial, we compare Hadoop 2.x with Hadoop 3.x. What new features were added in Hadoop 3? Are Hadoop 2 programs compatible with Hadoop 3? What are the differences between Hadoop 2 and Hadoop 3? We hope this feature-wise comparison helps you answer these questions.

2. Feature-wise Comparison Between Hadoop 2.x vs Hadoop 3.x

This section compares Hadoop 2.x and Hadoop 3.x across 22 features. Let us discuss each feature one by one.

i. License

Hadoop 2.x – Apache 2.0, Open Source
Hadoop 3.x – Apache 2.0, Open Source

ii. Minimum supported version of Java

Hadoop 2.x – The minimum supported version of Java is Java 7.
Hadoop 3.x – The minimum supported version of Java is Java 8.

iii. Fault Tolerance

Hadoop 2.x – Fault tolerance is handled by replication, which costs extra storage space.
Hadoop 3.x – Fault tolerance can also be handled by erasure coding. Replication is still supported and remains the default.
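
To make the idea of erasure coding concrete, here is a minimal sketch that uses a single XOR parity block. This is a deliberate simplification: it tolerates the loss of only one block, whereas the 6 data + 3 parity layout discussed later in this article relies on a stronger code. The function name xor_blocks and the sample data are assumptions made for illustration only.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Two data blocks plus one XOR parity block (a 2 data + 1 parity layout).
data_blocks = [b"hadoop-2", b"hadoop-3"]
parity = xor_blocks(data_blocks)

# Simulate losing the first data block, then rebuild it from the survivors.
recovered = xor_blocks([data_blocks[1], parity])
print(recovered == data_blocks[0])
```

The lost block is rebuilt from the surviving data block and the parity block, so no full copy of the data has to be stored.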

iv. Data Balancing

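Hadoop 2.x – Uses the HDFS balancer to balance data across DataNodes.
Hadoop 3.x – In addition to the HDFS balancer, provides an intra-DataNode balancer, which evens out data across the disks of a single DataNode and is invoked via the HDFS disk balancer CLI (the hdfs diskbalancer command).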
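
The goal of intra-DataNode balancing can be sketched as follows: bring every disk of one DataNode to the same utilization ratio. This is not the disk balancer's actual planning algorithm, only an illustration of what such a plan aims for. The function name plan_intra_node_moves, the mount points, and the sizes are hypothetical.

```python
def plan_intra_node_moves(volumes):
    """Return, per volume, how many GB it should gain (+) or give up (-)
    so that all volumes on one DataNode end up equally utilized.

    volumes maps a mount point to a (used_gb, capacity_gb) tuple.
    """
    total_used = sum(used for used, _ in volumes.values())
    total_capacity = sum(capacity for _, capacity in volumes.values())
    target_ratio = total_used / total_capacity
    return {
        name: round(target_ratio * capacity - used, 2)
        for name, (used, capacity) in volumes.items()
    }


# Hypothetical DataNode with one nearly full disk and one newly added disk.
disks = {"/data1": (900, 1000), "/data2": (300, 1000), "/data3": (0, 1000)}
plan = plan_intra_node_moves(disks)
for mount_point, delta_gb in plan.items():
    print(mount_point, delta_gb)
```

A negative value means the disk should give up data and a positive value means it should receive data. The values sum to zero because data only moves between disks of the same node.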

v. Storage Scheme

Hadoop 2.x – Uses a 3x replication scheme.
Hadoop 3.x – Adds support for erasure coding in HDFS.

vi. Storage Overhead

Hadoop 2.x – HDFS has 200% storage overhead with 3x replication.
Hadoop 3.x – Storage overhead is only 50% with erasure coding (for example, 3 parity blocks for every 6 data blocks).

vii. Storage Overhead Example

Hadoop 2.x – 6 blocks of data occupy the space of 18 blocks because of the 3x replication scheme.
Hadoop 3.x – 6 blocks of data occupy the space of 9 blocks: 6 data blocks and 3 parity blocks.
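
The arithmetic behind these figures can be reproduced with a short sketch. It assumes a replication factor of 3 and an erasure coding layout of 6 data units plus 3 parity units, and it models parity at whole-block granularity, which is a simplification. The function names are chosen for illustration.

```python
def raw_blocks_replication(data_blocks, replication_factor=3):
    """Blocks stored on disk when every data block is fully replicated."""
    return data_blocks * replication_factor


def raw_blocks_erasure_coding(data_blocks, data_units=6, parity_units=3):
    """Blocks stored on disk when each group of data_units data blocks
    gets parity_units parity blocks (simplified block-level model)."""
    groups = -(-data_blocks // data_units)  # ceiling division
    return data_blocks + groups * parity_units


def overhead_percent(data_blocks, raw_blocks):
    """Extra storage as a percentage of the original data size."""
    return 100 * (raw_blocks - data_blocks) / data_blocks


replicated = raw_blocks_replication(6)
erasure_coded = raw_blocks_erasure_coding(6)
print(replicated, overhead_percent(6, replicated))        # 18 200.0
print(erasure_coded, overhead_percent(6, erasure_coded))  # 9 50.0
```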

viii. YARN Timeline Service

Hadoop 2.x – Uses the older timeline service, which has scalability issues.
Hadoop 3.x – Introduces timeline service v2, which improves the scalability and reliability of the timeline service.

ix. Default Ports Range

Hadoop 2.x – Some default ports fall within the Linux ephemeral port range, so services can fail to bind to them at startup.
Hadoop 3.x – These ports have been moved out of the ephemeral range.
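
The following sketch checks a few default ports against the ephemeral range. Two assumptions are worth stating. The range 32768-60999 is the usual Linux default and can differ on a given host. The port numbers are what we understand the HDFS defaults to be in each release line, so verify them against the hdfs-default.xml of your release before relying on them.

```python
# Assumed Linux default ephemeral range: 32768-60999 (configurable per host).
EPHEMERAL_RANGE = range(32768, 61000)

# Service -> (assumed Hadoop 2.x default port, assumed Hadoop 3.x default port).
DEFAULT_PORTS = {
    "NameNode HTTP UI": (50070, 9870),
    "Secondary NameNode HTTP UI": (50090, 9868),
    "DataNode data transfer": (50010, 9866),
    "DataNode IPC": (50020, 9867),
    "DataNode HTTP UI": (50075, 9864),
}


def in_ephemeral_range(port):
    """True if the OS might hand this port out to an outgoing connection."""
    return port in EPHEMERAL_RANGE


for service, (hadoop2_port, hadoop3_port) in DEFAULT_PORTS.items():
    print(
        service,
        hadoop2_port, in_ephemeral_range(hadoop2_port),
        hadoop3_port, in_ephemeral_range(hadoop3_port),
    )
```

A port inside the ephemeral range may already be in use by an unrelated outgoing connection when the daemon starts, which is why the bind can fail.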

x. Tools

Hadoop 2.x – Works with Hive, Pig, Tez, Hama, Giraph and other Hadoop tools.
Hadoop 3.x – Hive, Pig, Tez, Hama, Giraph and other Hadoop tools are available as well.

xi. Compatible File System

Hadoop 2.x – HDFS (the default file system); the FTP file system, which stores all its data on remotely accessible FTP servers; the Amazon S3 (Simple Storage Service) file system; and the Windows Azure Storage Blobs (WASB) file system.
Hadoop 3.x – Supports all of the above as well as the Microsoft Azure Data Lake file system.

xii. Datanode Resources

Hadoop 2.x – DataNode resources are not dedicated to MapReduce; other applications can use them too.
Hadoop 3.x – DataNode resources can likewise be used by other applications.

xiii. MR API Compatibility

Hadoop 2.x – The MR API is compatible with Hadoop 1.x, so Hadoop 1.x programs can run on Hadoop 2.x.
Hadoop 3.x – The MR API is likewise compatible, so Hadoop 1.x programs can run on Hadoop 3.x.

xiv. Support for Microsoft Windows

Hadoop 2.x – Can be deployed on Windows.
Hadoop 3.x – Also supports Microsoft Windows.

xv. Slots/Container

Hadoop 2.x – Hadoop 1 works on the concept of slots, but Hadoop 2.x works on the concept of containers. A container can run any generic task.
Hadoop 3.x – Also works on the concept of containers.

xvi. Single Point of Failure

Hadoop 2.x – Has features to overcome the single point of failure (SPOF): whenever the NameNode fails, it recovers automatically.
Hadoop 3.x – Has features to overcome the SPOF: whenever the NameNode fails, it recovers automatically, with no manual intervention needed.

xvii. HDFS Federation

Hadoop 2.x – Hadoop 1.0 has only a single NameNode to manage the entire namespace, but Hadoop 2.0 supports multiple NameNodes for multiple namespaces.
Hadoop 3.x – Also has multiple NameNodes for multiple namespaces.
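
With several NameNodes, each owning its own namespace, a client needs some way to find the NameNode responsible for a given path. The sketch below illustrates one way to do this with a longest-prefix mount table. It is only a model of the idea, not Hadoop's implementation, and the host names, mount points, and the function name resolve_namenode are all hypothetical.

```python
# Hypothetical mount table: namespace root -> NameNode that manages it.
MOUNT_TABLE = {
    "/user": "hdfs://nn1.example.com:8020",
    "/data": "hdfs://nn2.example.com:8020",
    "/data/logs": "hdfs://nn3.example.com:8020",
}


def resolve_namenode(path, mount_table=MOUNT_TABLE):
    """Return the NameNode for the longest mount point that contains path."""
    matches = [
        mount for mount in mount_table
        if path == mount or path.startswith(mount + "/")
    ]
    if not matches:
        raise KeyError(f"no namespace is mounted for {path}")
    return mount_table[max(matches, key=len)]


print(resolve_namenode("/user/rohan/report.csv"))
print(resolve_namenode("/data/sales/2019"))
print(resolve_namenode("/data/logs/app.log"))
```

Because each NameNode manages only its own part of the namespace, adding a namespace does not add load to the existing NameNodes.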

xviii. Scalability

Hadoop 2.x – Can scale up to 10,000 nodes per cluster.
Hadoop 3.x – Better scalability: can scale to more than 10,000 nodes per cluster.

xix. Faster Access to Data

Hadoop 2.x – DataNode caching provides fast access to data.
Hadoop 3.x – DataNode caching likewise provides fast access to data.

xx. HDFS Snapshot

Hadoop 2.x – Adds support for snapshots, which provide disaster recovery and protection against user error.
Hadoop 3.x – Also supports the snapshot feature.

xxi. Platform

Hadoop 2.x – Can serve as a platform for a wide variety of data analytics; it is possible to run event processing, streaming, and real-time operations.
Hadoop 3.x – It is likewise possible to run event processing, streaming, and real-time operations on top of YARN.

xxii. Cluster Resource Management

Hadoop 2.x – Uses YARN for cluster resource management, which improves scalability, high availability, and multi-tenancy.
Hadoop 3.x – Uses YARN for cluster resource management, with all the same features.

3. Conclusion

We have compared Hadoop 2.x and Hadoop 3.x across 22 features, so you can now decide which of the two better suits your installation. We provide guides for both Hadoop 2.x installation on Ubuntu and Hadoop 3.x installation on Ubuntu for your convenience.
If you liked this blog or have found any other difference between Hadoop 2.x and Hadoop 3.x, please let us know by leaving a comment below.

Rohan says:
December 16, 2017 at 11:50 am
Does the Hadoop 3.0 still support the 3-way data duplication (replication) in Hadoop 2.x?
Thanks.
- GK says:
  December 15, 2019 at 12:16 am
  Yes, 3.x does support 3-way replication. It is enabled by default, and EC is disabled. You can enable EC through configuration keys.
Mmagdy says:
April 2, 2018 at 1:55 pm
very useful article
Kingsley Sikot says:
January 5, 2019 at 11:32 pm
Great job! This differences answered many questions i had.
- DataFlair Team says:
  January 9, 2019 at 2:51 pm
  Hi Kingsley,
  Thanks for the appreciation of our Hadoop 2.x vs Hadoop 3.x tutorial. We have a great variety of Apache Hadoop tutorials. We hope you are referring to them as well.
  Keep Sharing and Keep Learning
  Regards,
  DataFlair
Madhan says:
February 19, 2019 at 5:00 pm
The article is crystal clear. I appreciate it.
- DataFlair Team says:
  February 20, 2019 at 11:36 am
  Hello Madhan,
  We are glad our readers like our Hadoop 2.x vs Hadoop 3.x tutorial. We have more tutorials on Hadoop; please refer to them as well.
  Keep Visiting DataFlair
Rakesh A N says:
January 20, 2020 at 6:10 pm
How does production queue concept different in Hadoop 3.x compared to Hadoop 2.x ?
shubham says:
February 15, 2021 at 8:18 pm
is hadoop 3.x version available ?
