What is Data Locality in Hadoop?

Viewing 2 reply threads
  • Author
    Posts
    • #5035
      DataFlair Team
      Spectator

      What is data locality? Why is data locality needed in Hadoop MapReduce?

    • #5037
      DataFlair Team
      Spectator

      Data locality in Hadoop means moving the computation close to the data rather than moving the data to the computation. Hadoop stores data in HDFS, which splits files into blocks and distributes them across the DataNodes. When a MapReduce job is submitted, it is divided into map tasks and reduce tasks. A map task is assigned to a node according to where its input data is stored: the scheduler prefers a node that holds the input block on its local disk or, failing that, a node close to it. Placing computation near the data reduces network traffic, which improves throughput and speeds up job execution. Data locality falls into three categories:
      1. Data Local
      If a map task runs on a node that holds the input block to be processed, the task is data local.
      2. Intra-Rack
      It is not always possible to run a map task on the node where the data is located, because of resource constraints on that node. In that case, the mapper runs on another machine in the same rack, so the data has to be moved between nodes within the rack.
      3. Inter-Rack
      In some cases intra-rack locality is not possible either. The mapper then runs on a node in a different rack, and the data has to be copied across racks from the node that stores it to the node that runs the mapper.

      Map tasks read data from the input blocks and generate intermediate results. Since map tasks work on blocks from HDFS and are data-parallel, data locality is important for better performance and faster execution of the map phase.
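
The three categories above can be sketched as a small classification function. This is only an illustration, not Hadoop's actual scheduler code: the node names, the `rack_of` topology map, and the `classify_locality` helper are all hypothetical.

```python
from enum import Enum


class Locality(Enum):
    DATA_LOCAL = 1   # task runs on a node that stores the block
    INTRA_RACK = 2   # task runs on another node in the same rack
    INTER_RACK = 3   # task runs on a node in a different rack


def classify_locality(task_node, replica_nodes, rack_of):
    """Return the locality category of a map task running on task_node.

    replica_nodes: nodes holding a replica of the task's input block.
    rack_of: dict mapping node name -> rack name (hypothetical topology).
    """
    if task_node in replica_nodes:
        return Locality.DATA_LOCAL
    replica_racks = {rack_of[node] for node in replica_nodes}
    if rack_of[task_node] in replica_racks:
        return Locality.INTRA_RACK
    return Locality.INTER_RACK


# Hypothetical cluster: four nodes spread over three racks.
rack_of = {"node1": "rackA", "node2": "rackA", "node3": "rackB", "node4": "rackC"}
# Assume the input block has replicas on node1 and node3.
replicas = ["node1", "node3"]

print(classify_locality("node1", replicas, rack_of))  # stores a replica
print(classify_locality("node2", replicas, rack_of))  # shares rackA with node1
print(classify_locality("node4", replicas, rack_of))  # rackC holds no replica
```

In this made-up cluster, node1 is data local, node2 is intra-rack, and node4 is inter-rack.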

      Follow the link for more detail: Data Locality in Hadoop

    • #5039
      DataFlair Team
      Spectator

      Data locality refers to moving the mapper code to the node that holds the data rather than moving the data to the computation. When there is a huge amount of data, moving it across the network is slow and can cause network congestion.
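
A rough back-of-the-envelope calculation shows why shipping code is cheaper than shipping data. The figures below (1 TB of input, a 10 MB job JAR, a 1 Gbit/s link) are assumptions chosen for illustration, and the estimate ignores protocol overhead and contention.

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Idealized time to send size_bytes over a link, ignoring overhead."""
    return size_bytes * 8 / link_bits_per_second


ONE_GBPS = 1_000_000_000           # assumed link speed: 1 Gbit/s
INPUT_DATA = 1_000_000_000_000     # assumed input size: 1 TB
JOB_JAR = 10_000_000               # assumed size of the mapper code: 10 MB

data_time = transfer_seconds(INPUT_DATA, ONE_GBPS)
code_time = transfer_seconds(JOB_JAR, ONE_GBPS)

print(f"moving the data: {data_time:.0f} s")   # 8000 s, a bit over two hours
print(f"moving the code: {code_time:.2f} s")   # 0.08 s
```

Under these assumptions, moving the data takes about 100,000 times longer than moving the code, which is the gap that data locality tries to avoid.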

      Hadoop data locality falls into the following categories:

      1) Data local data locality in Hadoop - The data is present on the same node where the mapper is running.

      2) Intra-rack data locality in Hadoop - It is not always possible to run the mapper on the node that holds the data, so in that case the mapper runs on a different node in the same rack.

      3) Inter-rack data locality in Hadoop - Sometimes it is not possible to run the mapper on a different node in the same rack due to resource constraints. In that case, the mapper runs on a node in a different rack.
      Note: Inter-rack data locality in Hadoop is the last option.
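
The preference order described above (same node first, then same rack, then another rack as the last option) can be sketched as a node-selection function. This is a simplified illustration rather than the logic of any real Hadoop scheduler; `pick_node`, the node names, and the `rack_of` map are hypothetical.

```python
def pick_node(free_nodes, replica_nodes, rack_of):
    """Pick a node for a map task, preferring nodes closer to the data.

    free_nodes: nodes that currently have capacity to run a task.
    replica_nodes: nodes holding a replica of the task's input block.
    rack_of: dict mapping node name -> rack name (hypothetical topology).
    Returns None when no node is free.
    """
    replica_set = set(replica_nodes)
    replica_racks = {rack_of[node] for node in replica_nodes}

    def distance(node):
        if node in replica_set:
            return 0          # data local
        if rack_of[node] in replica_racks:
            return 1          # intra-rack
        return 2              # inter-rack, the last option

    if not free_nodes:
        return None
    return min(free_nodes, key=distance)


# Hypothetical cluster and replica placement.
rack_of = {"node1": "rackA", "node2": "rackA", "node3": "rackB", "node4": "rackC"}
replicas = ["node1", "node3"]

# node1 and node3 are busy, so the best remaining choice is the same-rack node2.
print(pick_node(["node4", "node2"], replicas, rack_of))
```

An inter-rack node such as node4 is chosen only when it is the sole node with free capacity.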

      For more detail follow: Data Locality in Hadoop
