Explain Reduce-side joins?

    • #5654
      DataFlair Team
      Spectator

      Explain Reduce-side joins in Hadoop.
      What are Reduce-side joins in MapReduce? Discuss in detail.

    • #5661
      DataFlair Team
      Spectator

      A reduce-side join is a join in which the actual join operation is performed in the reducer phase. Basically, a reduce-side join takes place in the following manner:

      • Each mapper reads the input data sets that are to be combined on a common column, the join key.
      • The mapper tags each record to indicate which source data set it came from.
      • The mapper emits intermediate key-value pairs in which the key is the join key.
      • After the sorting and shuffling phase, each join key arrives at the reducer together with the list of all values emitted for it.
      • The reducer then joins the values in that list with the key to produce the final aggregated output.
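The tagging step above can be sketched in plain Python (an illustrative sketch only; a real Hadoop mapper would be written against the Java `Mapper` API, and the field name and tag strings here are assumptions):

```python
def tagged_map(record, join_key_field, source_tag):
    """Emit (join_key, (source_tag, record)) so the reducer can
    tell which data set each value came from."""
    return (record[join_key_field], (source_tag, record))

# Tagging one row from each hypothetical data set:
pair = tagged_map({"cust_id": "100001", "name": "Rohith"}, "cust_id", "CUST")
# pair is ("100001", ("CUST", {"cust_id": "100001", "name": "Rohith"}))
```

Because every value carries its source tag, the reducer later knows which values to treat as customer attributes and which as transactions.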

      Just as a join in an RDBMS combines the contents of two tables based on a primary key, a reduce-side join in a MapReduce job combines the outputs of two mappers based on a common key. We must examine the data sets and decide which field will act as that key.

      For example, suppose there are two separate data sets, customer_details and transaction_details, which contain the details of customers and the transaction records of those customers, respectively. If we want to find the total amount each customer spent on products at a supermarket, we need to write one mapper for each data set.

      customer_details:

      Cust ID Name Phone No
      100001 Rohith 1234321543
      100002 Ankith 5498123433
      100003 Pradumnya 4352313344
      ……….. …………. …………….

      transaction_details:

      Trans ID Product Cust ID Amount
      000001 Soap 100001 100
      000002 Chocolate 100001 50
      000003 Ice Cream 100002 200
      000004 Milk 100003 125
      000005 Cheese 100003 400
      ……….. ……….. ……….. …….

      First, we need to write two mappers, one for customer_details and one for transaction_details, both using Cust ID as the key. The values emitted by the customer_details mapper will be the customer names, and the values emitted by the transaction_details mapper will be the transaction amounts for each customer. The outputs of the mappers will then go through the sorting and shuffling phase.
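The two mappers just described can be sketched in plain Python (the comma-separated field layouts are assumptions based on the sample tables; in a real job these would be Java `Mapper` classes reading lines from HDFS):

```python
def customer_mapper(line):
    # Assumed record format: "CustID,Name,PhoneNo".
    # Key is Cust ID; value is the customer's name, tagged with its source.
    cust_id, name, _phone = line.split(",")
    return (cust_id, ("CUST", name))

def transaction_mapper(line):
    # Assumed record format: "TransID,Product,CustID,Amount".
    # Key is Cust ID; value is the transaction amount, tagged with its source.
    _trans_id, _product, cust_id, amount = line.split(",")
    return (cust_id, ("TXN", int(amount)))

customer_mapper("100001,Rohith,1234321543")
# -> ("100001", ("CUST", "Rohith"))
transaction_mapper("000001,Soap,100001,100")
# -> ("100001", ("TXN", 100))
```

Both mappers emit the same key (Cust ID), which is what lets the shuffle phase bring a customer's name and all of that customer's transactions to the same reduce() call.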

      The sorting and shuffling phase will generate a list of values for each key. In other words, it will gather together all the values that share each unique key in the intermediate key-value pairs.
      Now the reducer will start working by calling the reduce() method once for each unique join key (Cust ID) and its corresponding list of values. The reducer will perform the join operation on the values in that list to calculate the desired output. Therefore, the number of reduce() calls made will be equal to the number of unique Cust IDs.

      The final output will be <key, value> pairs in which the customer name is the key and the total amount spent by that customer is the value.
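The whole flow on the sample data above can be sketched end to end in plain Python (a hedged simulation of the shuffle and reduce phases, not Hadoop's actual runtime; the tag strings and helper names are assumptions):

```python
from collections import defaultdict

def shuffle(pairs):
    """Group intermediate (key, value) pairs by key, as the
    sorting and shuffling phase does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_join(cust_id, values):
    """Join the tagged values for one Cust ID: pick out the customer
    name and sum the transaction amounts."""
    name, total = None, 0
    for tag, payload in values:
        if tag == "CUST":
            name = payload
        else:  # "TXN"
            total += payload
    return (name, total)

# Intermediate pairs as the two mappers would emit them for the sample tables:
pairs = [
    ("100001", ("CUST", "Rohith")),
    ("100002", ("CUST", "Ankith")),
    ("100003", ("CUST", "Pradumnya")),
    ("100001", ("TXN", 100)), ("100001", ("TXN", 50)),
    ("100002", ("TXN", 200)),
    ("100003", ("TXN", 125)), ("100003", ("TXN", 400)),
]
result = dict(reduce_join(k, v) for k, v in shuffle(pairs).items())
# result: {"Rohith": 150, "Ankith": 200, "Pradumnya": 525}
```

Note that reduce_join is called once per unique Cust ID, and each call sees only that customer's name and transactions, which is exactly what makes the per-customer total easy to compute.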
