In this Hadoop tutorial, we will discuss the features, characteristics and design principles of Hadoop. In this blog we will also discuss the assumptions around which Hadoop is built.
To install and configure Hadoop, follow this installation guide.
2. Hadoop Features and Characteristics
Apache Hadoop is the most popular and powerful big data tool. Hadoop provides the world's most reliable storage layer (HDFS), a batch processing engine (MapReduce) and a resource management layer (YARN). In this section of the Hadoop tutorial, we will discuss the important Hadoop features given below.
2.1. Open Source
Apache Hadoop is an open source project, which means its code can be modified according to business requirements.
2.2. Distributed Processing
As data is stored in a distributed manner in HDFS across the cluster, data is processed in parallel on a cluster of nodes.
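As a minimal sketch of this idea (assuming a running HDFS cluster is reachable from the classpath configuration; the class name and the file path passed in args[0] are hypothetical), the HDFS Java API can show how a file's blocks are spread across the nodes of the cluster:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path(args[0]);              // e.g. /user/hadoop/data/big.log (hypothetical)
    FileStatus status = fs.getFileStatus(file);

    // Each block is reported with the hosts that store a replica of it,
    // which is what allows the blocks to be processed in parallel across nodes.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + String.join(",", block.getHosts()));
    }
    fs.close();
  }
}
```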
2.3. Fault Tolerance
By default, 3 replicas of each block are stored across the cluster in Hadoop, and this can be changed as per the requirement. So if any node goes down, data on that node can easily be recovered from other nodes. Failures of nodes or tasks are recovered automatically by the framework. This is how Hadoop is fault tolerant.
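As a rough sketch (assuming a running cluster; the class name and file path are hypothetical), the replication factor can be set cluster-wide through the dfs.replication property, which normally lives in hdfs-site.xml, or changed per file through the FileSystem API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // dfs.replication is the cluster-wide default replication factor; 3 is Hadoop's default.
    conf.set("dfs.replication", "3");

    FileSystem fs = FileSystem.get(conf);

    // The replication factor can also be changed per file after it has been written.
    Path file = new Path(args[0]);        // e.g. /user/hadoop/data/important.csv (hypothetical)
    fs.setReplication(file, (short) 5);   // ask the NameNode to keep 5 replicas of each block

    System.out.println("Replication of " + file + " is now "
        + fs.getFileStatus(file).getReplication());
    fs.close();
  }
}
```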
2.4. Reliability
Due to replication of data in the cluster, data is stored reliably on the cluster of machines despite machine failures. Even if your machine goes down, your data will still be stored reliably.
2.5. High Availability
Data is highly available and accessible despite hardware failure, due to the multiple copies of data. If a machine or a piece of hardware crashes, the data can be accessed from another path.
2.6. Scalability
Hadoop is highly scalable in the way new hardware can easily be added to the nodes. It also provides horizontal scalability, which means new nodes can be added on the fly without any downtime.
2.7. Economic
Apache Hadoop is not very expensive, as it runs on a cluster of commodity hardware. We do not need any specialized machine for it. Hadoop also provides huge cost savings, as it is very easy to add more nodes on the fly. So if requirements grow, you can add nodes without any downtime and without much pre-planning.
2.8. Easy to use
The client does not need to deal with distributed computing; the framework takes care of all of it. So Hadoop is easy to use.
2.9. Data Locality
Hadoop works on the data locality principle, which states: move computation to data instead of data to computation. When a client submits a MapReduce algorithm, the algorithm is moved to the data in the cluster rather than bringing the data to the location where the algorithm was submitted and then processing it.
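The sketch below, assuming a configured cluster and hypothetical input/output paths, shows a minimal MapReduce driver. The key call is job.setJarByClass(), which ships the application jar (the "algorithm") to the cluster so that YARN can schedule map tasks on the nodes that already hold the input blocks:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DataLocalityDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "data locality demo");

    // The application jar is shipped to the cluster; YARN then tries to run
    // each map task on a node that already stores the HDFS block it will
    // process, instead of pulling the data to the client.
    job.setJarByClass(DataLocalityDemo.class);

    // The default (identity) Mapper and Reducer are enough to show the flow.
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/hadoop/input  (hypothetical)
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/hadoop/output (hypothetical)

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

After the job finishes, the job counters report how many map tasks ran data-local versus rack-local, which makes the effect of data locality visible.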
3. Hadoop Assumptions
Hadoop is written with large clusters of computers in mind and is built around the following assumptions:
- Hardware may fail (as commodity hardware can be used).
- Processing will be run in batches. Thus there is an emphasis on high throughput as opposed to low latency.
- Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size.
- Applications need a write-once-read-many access model (see the sketch after this list).
- Moving Computation is Cheaper than Moving Data.
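As a rough illustration of the write-once-read-many model (assuming a running HDFS cluster; the path below is hypothetical), a file is written and closed once and can then be read any number of times; HDFS files are not edited in place:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceReadManyDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/tmp/worm-demo.txt");   // hypothetical path

    // Write the file once and close it.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeBytes("written once\n");
    }

    // After that, the same file can be read any number of times.
    try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
      System.out.println(in.readLine());
    }
    fs.close();
  }
}
```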
4. Hadoop Design Principles
Below are the design principles on which Hadoop works:
a) System shall manage and heal itself
- Automatically and transparently route around failure (Fault Tolerant)
- Speculatively execute redundant tasks if certain nodes are detected to be slow
b) Performance shall scale linearly
- Proportional change in capacity with resource change (Scalability)
c) Computation should move to data
- Lower latency, lower bandwidth (Data Locality)
d) Simple core, modular and extensible (Economical)