Why does Hadoop MapReduce use key-value pairs to process data?

    • #6260
      DataFlair Team
      Spectator

      Why are key-value pairs needed in Hadoop MapReduce?

    • #6261
      DataFlair Team
      Spectator

      Hadoop MapReduce is based on the original Google MapReduce paper, which is built around key-value pairs. The main reason for using key-value pairs in processing is that many real-world tasks are expressible in this model, as the Google MapReduce paper shows.

      MapReduce is a programming model for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
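
      As a minimal sketch of that model, here is the classic word count expressed with the Hadoop Java API (the class and field names below are illustrative, not taken from the paper):

      import java.io.IOException;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;

      public class WordCount {

        // Map: (byte offset, line of text) -> (word, 1) for every word in the line.
        public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
              if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit intermediate (word, 1)
              }
            }
          }
        }

        // Reduce: (word, [1, 1, ...]) -> (word, total count).
        public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
          @Override
          protected void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
              sum += v.get();   // merge all values associated with the same key
            }
            context.write(key, new IntWritable(sum));
          }
        }
      }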

      Another reason is that the key-value programming model is easy to use, even for programmers without experience in parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and load balancing.

      A third reason is that the MapReduce implementation scales to large clusters comprising thousands of machines. It makes efficient use of these machine resources and is therefore suitable for many large computational problems.

      For more detail, follow: key-value pairs in Hadoop

    • #6262
      DataFlair Team
      Spectator

      Hadoop is based on Google's MapReduce concept of key-value pairs, also called tuples.

      The input to the mapper is a (key1, value1) pair, and its output, called the intermediate output, is also a pair, (key2, value2). The intermediate output becomes the input to the reducer task, whose final output is the pair (key3, value3):

      (key1, value1) -> Mapper -> (key2, value2) -> Reducer -> (key3, value3)

      The computation involves a map operation on each record of the input data to produce intermediate key/value pairs, followed by a reduce operation on the values associated with the same key. This is how parallelism and locality optimization are obtained.
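
      As a sketch of where these key/value types are declared in practice (assuming the word-count TokenMapper and SumReducer classes from the previous reply), a minimal Hadoop driver might look like this:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class WordCountDriver {
        public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count");
          job.setJarByClass(WordCountDriver.class);
          job.setMapperClass(WordCount.TokenMapper.class);
          job.setReducerClass(WordCount.SumReducer.class);
          // Declared types of the final (and here also intermediate) pairs:
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }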

      Follow the link for more detail: key-value pairs in Hadoop

    • #6264
      DataFlair Team
      Spectator

      Unlike relational databases, Hadoop is not restricted to structured data; it also deals with semi-structured and unstructured data.
      In a relational database the data is structured, so it is easy to address any piece of data through its columns. Hadoop, however, handles a variety of data, and as a distributed, parallel processing system it needs some way to logically correlate the input supplied by the user.
      Generally, in a MapReduce job, the map phase breaks the data apart in a logical way and the reduce phase aggregates it.
      The output of the map phase has to be given as input to the reducer so that it can aggregate all the data that logically belongs together. Using key-value pairs makes this logical mapping easy.
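
      As a small sketch of that logical mapping (the helper class name InputFormatChoice is hypothetical), this is how a job chooses which (key, value) pairs the framework derives from raw input:

      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

      public class InputFormatChoice {
        public static void configure(Job job, boolean tabSeparated) {
          if (tabSeparated) {
            // Lines laid out as "key<TAB>value" become (Text, Text) pairs.
            job.setInputFormatClass(KeyValueTextInputFormat.class);
          } else {
            // Default for plain text: key = byte offset of the line
            // (LongWritable), value = the line itself (Text). The framework,
            // not the user, imposes this logical pairing on raw text.
            job.setInputFormatClass(TextInputFormat.class);
          }
        }
      }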

      Follow the link for more detail: key-value pairs in Hadoop
