Explain Reduce-side joins?

    • #5654
      DataFlair Team
      Spectator

      Explain Reduce-side joins in Hadoop.
      What are Reduce-side joins in MapReduce? Discuss in detail.

    • #5661
      DataFlair Team
      Spectator

      A reduce-side join is a join in which the actual join operation is performed in the reducer phase. Basically, a reduce-side join takes place in the following manner:

      • Each mapper reads the input data sets that are to be combined on a common column, the join key.
      • The mapper tags each record to indicate which source data set it came from.
      • The mapper emits intermediate key-value pairs in which the key is the join key.
      • After the sorting and shuffling phase, each join key arrives at the reducer together with the list of all values emitted for it.
      • The reducer then joins the values in that list with the key to produce the final aggregated output.
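The tagging step above can be sketched in plain Python (an illustrative sketch only; a real Hadoop mapper would be written against the Java `Mapper` API, and the field name and tag strings here are assumptions):

```python
def tagged_map(record, join_key_field, source_tag):
    """Emit (join_key, (source_tag, record)) so the reducer can
    tell which data set each value came from."""
    return (record[join_key_field], (source_tag, record))

# Tagging one row from each hypothetical data set:
pair = tagged_map({"cust_id": "100001", "name": "Rohith"}, "cust_id", "CUST")
# pair is ("100001", ("CUST", {"cust_id": "100001", "name": "Rohith"}))
```

Because every value carries its source tag, the reducer later knows which values to treat as customer attributes and which as transactions.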

      Just as a join in an RDBMS combines the contents of two tables based on a primary key, a reduce-side join in a MapReduce job combines the outputs of two mappers based on a common key. We must examine the data sets and decide which field will act as that key.

      For example, suppose there are two separate data sets, customer_details and transaction_details, which contain the details of customers and the transaction records of those customers, respectively. If we want to find the total amount each customer spent on products at a supermarket, we need to write one mapper for each data set.

      customer_details:

      Cust ID Name Phone No
      100001 Rohith 1234321543
      100002 Ankith 5498123433
      100003 Pradumnya 4352313344
      ……….. …………. …………….

      transaction_details:

      Trans ID Product Cust ID Amount
      000001 Soap 100001 100
      000002 Chocolate 100001 50
      000003 Ice Cream 100002 200
      000004 Milk 100003 125
      000005 Cheese 100003 400
      ……….. ……….. ……….. …….

      First, we need to write two mappers, one for customer_details and one for transaction_details, both using Cust ID as the key. The values emitted by the customer_details mapper will be the customer names, and the values emitted by the transaction_details mapper will be the transaction amounts for each customer. The outputs of the mappers will then go through the sorting and shuffling phase.
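The two mappers just described can be sketched in plain Python (the comma-separated field layouts are assumptions based on the sample tables; in a real job these would be Java `Mapper` classes reading lines from HDFS):

```python
def customer_mapper(line):
    # Assumed record format: "CustID,Name,PhoneNo".
    # Key is Cust ID; value is the customer's name, tagged with its source.
    cust_id, name, _phone = line.split(",")
    return (cust_id, ("CUST", name))

def transaction_mapper(line):
    # Assumed record format: "TransID,Product,CustID,Amount".
    # Key is Cust ID; value is the transaction amount, tagged with its source.
    _trans_id, _product, cust_id, amount = line.split(",")
    return (cust_id, ("TXN", int(amount)))

customer_mapper("100001,Rohith,1234321543")
# -> ("100001", ("CUST", "Rohith"))
transaction_mapper("000001,Soap,100001,100")
# -> ("100001", ("TXN", 100))
```

Both mappers emit the same key (Cust ID), which is what lets the shuffle phase bring a customer's name and all of that customer's transactions to the same reduce() call.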

      The sorting and shuffling phase will generate a list of values for each key. In other words, it will gather together all the values that share each unique key in the intermediate key-value pairs.
      Now the reducer will start working by calling the reduce() method once for each unique join key (Cust ID) and its corresponding list of values. The reducer will perform the join operation on the values in that list to calculate the desired output. Therefore, the number of reduce() calls made will be equal to the number of unique Cust IDs.

      The final output will be <key, value> pairs in which the customer name is the key and the total amount spent by that customer is the value.
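The whole flow on the sample data above can be sketched end to end in plain Python (a hedged simulation of the shuffle and reduce phases, not Hadoop's actual runtime; the tag strings and helper names are assumptions):

```python
from collections import defaultdict

def shuffle(pairs):
    """Group intermediate (key, value) pairs by key, as the
    sorting and shuffling phase does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_join(cust_id, values):
    """Join the tagged values for one Cust ID: pick out the customer
    name and sum the transaction amounts."""
    name, total = None, 0
    for tag, payload in values:
        if tag == "CUST":
            name = payload
        else:  # "TXN"
            total += payload
    return (name, total)

# Intermediate pairs as the two mappers would emit them for the sample tables:
pairs = [
    ("100001", ("CUST", "Rohith")),
    ("100002", ("CUST", "Ankith")),
    ("100003", ("CUST", "Pradumnya")),
    ("100001", ("TXN", 100)), ("100001", ("TXN", 50)),
    ("100002", ("TXN", 200)),
    ("100003", ("TXN", 125)), ("100003", ("TXN", 400)),
]
result = dict(reduce_join(k, v) for k, v in shuffle(pairs).items())
# result: {"Rohith": 150, "Ankith": 200, "Pradumnya": 525}
```

Note that reduce_join is called once per unique Cust ID, and each call sees only that customer's name and transactions, which is exactly what makes the per-customer total easy to compute.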
