What is Data locality in Hadoop MapReduce


1. Objective

In this Data Locality tutorial, we will learn what data locality means in Hadoop, why Hadoop needs it, how Hadoop exploits it, the categories of data locality, how data locality can be optimized, and the advantages it brings.


2. Data locality in Hadoop

Let us understand what data locality is in MapReduce.

A major cost in a Hadoop cluster is cross-switch network traffic caused by moving huge volumes of data. Data locality addresses this problem. It refers to the ability to move the computation to the node where the data actually resides, instead of moving large amounts of data to the computation. This minimizes network congestion and increases the overall throughput of the system.

In Hadoop, datasets are stored in HDFS. Each dataset is divided into blocks, which are stored across the datanodes of the Hadoop cluster. When a user runs a MapReduce job, the framework learns from the NameNode which datanodes hold the blocks of the input, and the job scheduler then tries to run each map task on a datanode that holds the data for that task.
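
The idea of running a task where its block is stored can be sketched as follows. This is a minimal illustration, not Hadoop's actual scheduler: the block IDs, node names, and the `pick_node` helper are all hypothetical.

```python
# Hypothetical block-to-datanode map, similar in spirit to the block
# locations that HDFS keeps for each file (three replicas per block here).
block_locations = {
    "blk_1": ["node1", "node2", "node3"],
    "blk_2": ["node2", "node4", "node5"],
    "blk_3": ["node1", "node4", "node6"],
}


def pick_node(block_id, free_nodes, locations):
    """Prefer a free node that stores a replica of the block.

    Falls back to any free node (data must then travel over the network),
    or returns None when no node is free.
    """
    for node in locations[block_id]:
        if node in free_nodes:
            return node
    # Deterministic fallback so the example is reproducible.
    return min(free_nodes) if free_nodes else None


if __name__ == "__main__":
    print(pick_node("blk_1", {"node2", "node6"}, block_locations))
    print(pick_node("blk_2", {"node1", "node6"}, block_locations))
```

In the first call a free node holds a replica, so the computation goes to the data. In the second call no free node holds one, so the data would have to be moved.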

3. Requirements for Hadoop data locality

To get the full benefit of data locality, the system architecture needs to satisfy the following conditions:

  • First, the cluster should have an appropriate topology, and the Hadoop code must be able to read data locally.
  • Second, Hadoop must be aware of the topology of the nodes where tasks are executed, and it must know where the data is located.
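
One common way to make Hadoop aware of the cluster topology is a rack-awareness script, registered through the `net.topology.script.file.name` property in `core-site.xml`. Hadoop calls the script with one or more host addresses as arguments and reads back one rack path per address. The sketch below assumes a hard-coded, hypothetical address table; a real script would usually read the mapping from a file.

```python
#!/usr/bin/env python3
import sys

# Hypothetical address-to-rack table (illustrative values only).
RACKS = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}

# Rack path used for hosts that are not in the table.
DEFAULT_RACK = "/default-rack"


def resolve_rack(host):
    """Return the rack path for a host, or the default rack if unknown."""
    return RACKS.get(host, DEFAULT_RACK)


def main(argv):
    # Emit one rack path per argument, in the same order as the arguments.
    for host in argv:
        print(resolve_rack(host))


if __name__ == "__main__":
    main(sys.argv[1:])
```

Two hosts that resolve to the same rack path are treated as rack neighbors, which is what lets the scheduler fall back from the same node to the same rack.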

4. Categories of Data locality in Hadoop

Hadoop data locality falls into the following categories:

a. Data local data locality in Hadoop

When the data is located on the same node as the mapper working on it, this is known as data local data locality. In this case, the data is as close to the computation as it can be. This is the most preferred scenario.

b. Intra-Rack data locality in Hadoop

It is not always possible to execute the mapper on the datanode that holds the data, because of resource constraints. In such a case, it is preferred to run the mapper on a different node in the same rack.

c. Inter-rack data locality in Hadoop

Sometimes it is not possible to execute the mapper even on a different node in the same rack, because of resource constraints. In such a case, the mapper is executed on a node in a different rack. This is the least preferred scenario.
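
The three categories can be expressed as a small classification function. This is only a sketch: the node names, the `rack_of` mapping, and the `locality_level` helper are hypothetical and do not mirror Hadoop's internal classes.

```python
def locality_level(task_node, replica_nodes, rack_of):
    """Classify where a mapper runs relative to the replicas of its block."""
    if task_node in replica_nodes:
        return "data local"
    task_rack = rack_of[task_node]
    if any(rack_of[node] == task_rack for node in replica_nodes):
        return "intra-rack"
    return "inter-rack"


# Hypothetical cluster: two racks with two nodes each.
rack_of = {
    "node1": "/rack1",
    "node2": "/rack1",
    "node3": "/rack2",
    "node4": "/rack2",
}

if __name__ == "__main__":
    replicas = ["node1"]
    for node in ("node1", "node2", "node3"):
        print(node, locality_level(node, replicas, rack_of))
```

The checks run from the most preferred category to the least preferred one, matching the order of the categories above.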

5. Hadoop Data locality optimization

Data locality is the main advantage of Hadoop MapReduce, since map code is executed on the same datanode where the data resides. In practice, however, this is not always achieved, for reasons such as speculative execution in Hadoop, heterogeneous clusters, data distribution and placement, and the data layout and input splitter.

These challenges become more prevalent in large clusters: the more datanodes and data there are, the lower the locality tends to be. Large clusters also tend not to be completely homogeneous. Some nodes are newer and faster than others, which puts the data-to-compute ratio out of balance. With speculative execution, a task may run on a node where the data is not local, simply because that node has spare compute power. The data layout or placement and the input splitter in use can also be a root cause. Non-local data processing puts a strain on the network, which limits scalability, so the network becomes the bottleneck.

We can improve data locality by first detecting which jobs have a data locality problem or degrade over time. Solving the problem is more complex and involves changing the data placement and data layout, using a different scheduler, or simply changing the number of mapper and reducer slots for a job. We then have to verify whether a new execution of the same workload has a better data locality ratio.
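
A data locality ratio can be derived from the job counters that MapReduce reports for each job: `TOTAL_LAUNCHED_MAPS`, `DATA_LOCAL_MAPS`, `RACK_LOCAL_MAPS`, and `OTHER_LOCAL_MAPS`. The sketch below assumes the counters have already been collected into a plain dictionary; the `locality_ratios` helper and the sample values are hypothetical.

```python
LOCALITY_COUNTERS = ("DATA_LOCAL_MAPS", "RACK_LOCAL_MAPS", "OTHER_LOCAL_MAPS")


def locality_ratios(counters):
    """Return the share of launched map tasks at each locality level."""
    total = counters.get("TOTAL_LAUNCHED_MAPS", 0)
    if total == 0:
        return {}
    return {name: counters.get(name, 0) / total for name in LOCALITY_COUNTERS}


if __name__ == "__main__":
    # Made-up counter values for two runs of the same workload.
    before = {"TOTAL_LAUNCHED_MAPS": 200, "DATA_LOCAL_MAPS": 120,
              "RACK_LOCAL_MAPS": 60, "OTHER_LOCAL_MAPS": 20}
    after = {"TOTAL_LAUNCHED_MAPS": 200, "DATA_LOCAL_MAPS": 150,
             "RACK_LOCAL_MAPS": 40, "OTHER_LOCAL_MAPS": 10}
    print(locality_ratios(before)["DATA_LOCAL_MAPS"])
    print(locality_ratios(after)["DATA_LOCAL_MAPS"])
```

Comparing the data-local share before and after a change to placement, layout, or scheduler is one way to check whether the change helped.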

6. Advantages of data locality in Hadoop

a. Faster Execution

With data locality, the program is moved to the node where the data resides instead of moving large amounts of data to the program, which makes Hadoop faster. The program is typically far smaller than the data it processes, so moving the data would make network transfer the bottleneck.
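
A rough calculation illustrates the difference. The figures below are assumptions chosen for illustration: a 128 MB HDFS block, a 10 MB job program, and a 1 Gbit/s link with no other traffic or protocol overhead.

```python
def transfer_seconds(size_mb, bandwidth_mbit_per_s):
    """Idealized time to send size_mb megabytes over a link (no overhead)."""
    return size_mb * 8 / bandwidth_mbit_per_s


if __name__ == "__main__":
    link = 1000        # assumed link speed in Mbit/s
    block_mb = 128     # assumed HDFS block size in MB
    program_mb = 10    # assumed size of the job program in MB

    print("one block:", transfer_seconds(block_mb, link), "s")
    print("program:  ", transfer_seconds(program_mb, link), "s")
```

Under these assumptions a single block takes about a second to move, and a job reads many blocks, while the program is shipped in a fraction of that time.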

b. High Throughput

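Because most map tasks read their input from the node they run on rather than over the network, network congestion is reduced, which increases the overall throughput of the system.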
