What is Data Locality in Hadoop?
- This topic has 2 replies, 1 voice, and was last updated 5 years, 7 months ago by DataFlair Team.
September 20, 2018 at 1:58 pm #5035 DataFlair Team (Spectator)
What is data locality? Why is data locality needed in Hadoop MapReduce?
-
September 20, 2018 at 1:59 pm #5037 DataFlair Team (Spectator)
Data locality in Hadoop means moving computation close to the data rather than moving data towards the computation. Hadoop stores data in HDFS, which splits files into blocks and distributes them among various datanodes. When a MapReduce job is submitted, it is divided into map tasks and reduce tasks. A map task is assigned to a datanode according to the availability of the data, i.e. the task goes to a datanode that stores the data on its local disk or is close to it. Placing computation near the data in this way reduces network traffic, which helps achieve high throughput and faster job execution.
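To make the block splitting concrete, here is a minimal sketch in Python. It is not HDFS code: the function name split_into_blocks is made up for illustration, and the 128 MB block size is an assumption (block size is configurable per cluster). It only shows how a file ends up as several blocks, each of which can be handed to a map task.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB  # assumed block size; real clusters may configure a different value


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


# A hypothetical 300 MB file becomes two full blocks and one partial block.
blocks = split_into_blocks(300 * MB)
for offset, length in blocks:
    print(f"offset={offset // MB} MB, length={length // MB} MB")
```

Under these assumptions the 300 MB file yields three blocks, so roughly three map tasks could run in parallel, each ideally on a node that holds its block.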
1. Data Local
If a map task executes on a node that stores the input block to be processed, it is called data local. No data has to cross the network.
2. Intra-Rack
It is not always possible to run a map task on the node where the data is located, for example because of resource constraints on that node. In that case, the mapper runs on another machine in the same rack, so the data needs to be moved between nodes within the rack.
3. Inter-Rack
In certain cases intra-rack locality is not possible either. The mapper then executes on a node in a different rack, and the data needs to be copied across racks from the node that stores it to the node that runs the mapper.
Map tasks read data from the input blocks and generate intermediate results. Since map tasks work on blocks from HDFS and are data-parallel, data locality is important for better performance and faster execution of the map tasks.
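The three categories above can be sketched as a small classification function. This is only an illustration in Python, not Hadoop's own code: locality_level, the node names, and the rack_of mapping are all hypothetical.

```python
def locality_level(task_node, replica_nodes, rack_of):
    """Classify where a map task runs relative to the replicas of its input block."""
    if task_node in replica_nodes:
        return "data-local"  # the block is on the task's own node
    task_rack = rack_of[task_node]
    if any(rack_of[node] == task_rack for node in replica_nodes):
        return "intra-rack"  # the block is on another node in the same rack
    return "inter-rack"  # the block has to come from a different rack


# Hypothetical cluster: two racks with two nodes each.
rack_of = {"node1": "rackA", "node2": "rackA", "node3": "rackB", "node4": "rackB"}
replicas = {"node1"}  # assume the input block is stored only on node1

levels = {node: locality_level(node, replicas, rack_of) for node in rack_of}
print(levels)
```

In this made-up cluster, a task on node1 is data local, a task on node2 is intra-rack, and a task on node3 or node4 is inter-rack.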
Follow the link for more detail: Data Locality in Hadoop
-
September 20, 2018 at 1:59 pm #5039 DataFlair Team (Spectator)
Data locality refers to moving the mapper code to the node where the data resides, rather than moving the data towards the computation. When the amount of data is huge, moving it is expensive and can also cause network congestion.
Hadoop data locality falls into the following categories:
1) Data-local data locality in Hadoop - The data is present on the same node on which the mapper is running.
2) Intra-rack data locality in Hadoop - It is not always possible to run the mapper on the node that holds the data. In that case, the mapper runs on a different node in the same rack.
3) Inter-rack data locality in Hadoop - Sometimes it is not possible to execute the mapper on a different node in the same rack due to resource constraints. In that case, the mapper executes on a node in a different rack.
Note: Inter-rack data locality in Hadoop is the last option. For more detail follow: Data Locality in Hadoop
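The "last option" ordering can be sketched as a simple preference search. This is a simplified illustration in Python, not how the actual Hadoop scheduler is implemented: pick_node, the node names, and the notion of a list of free nodes are assumptions made for the example.

```python
def pick_node(replica_nodes, free_nodes, rack_of):
    """Pick a free node for a map task, preferring placements closer to the data."""
    # 1) Data local: a free node that already stores the block.
    for node in free_nodes:
        if node in replica_nodes:
            return node, "data-local"
    # 2) Intra-rack: a free node in a rack that holds a replica.
    replica_racks = {rack_of[node] for node in replica_nodes}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "intra-rack"
    # 3) Inter-rack: any remaining free node, used only as the last option.
    for node in free_nodes:
        return node, "inter-rack"
    return None, None  # no free capacity anywhere


# Hypothetical cluster and block placement.
rack_of = {"node1": "rackA", "node2": "rackA", "node3": "rackB", "node4": "rackB"}
replicas = {"node1"}

print(pick_node(replicas, ["node4", "node1"], rack_of))
print(pick_node(replicas, ["node3", "node2"], rack_of))
print(pick_node(replicas, ["node3", "node4"], rack_of))
```

With the block on node1, the function chooses node1 when it is free, falls back to node2 in the same rack when node1 is busy, and only picks a node in rackB when nothing in rackA is available.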