Forums › Apache Hadoop › Explain Reduce-side joins?
September 20, 2018 at 3:52 pm · #5654 · DataFlair Team (Spectator)
Explain Reduce-side joins in Hadoop?
Discuss what Reduce-side joins are in MapReduce, in detail.
September 20, 2018 at 3:53 pm · #5661 · DataFlair Team (Spectator)
The reduce-side join is a process where the join operation is performed in the reducer phase. Basically, it takes place in the following manner:
- The mappers read the input data sets that are to be combined based on a common column, the join key.
- Each mapper processes its input and adds a tag to each record to distinguish records belonging to different sources (data sets or databases).
- The mappers output intermediate key-value pairs whose key is the join key.
- After the sorting and shuffling phase, each key and its list of values is passed to the reducer.
- The reducer joins the values present in the list for each key to produce the final aggregated output.
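The steps above can be sketched in plain Python as a simulation of the map, shuffle, and reduce phases (this is not actual Hadoop API code; the record values and the "CUST"/"TXN" tags are illustrative):

```python
from itertools import groupby

# Hypothetical records from two sources to be joined on customer ID.
customers = [(100001, "Rohith"), (100002, "Ankith")]
transactions = [(1, "Soap", 100001, 100), (2, "Chocolate", 100001, 50)]

# Map phase: each mapper emits (join key, tagged value). The tag records
# which data set the value came from.
mapped = [(cid, ("CUST", name)) for cid, name in customers]
mapped += [(cid, ("TXN", amount)) for _, _, cid, amount in transactions]

# Sort/shuffle phase: group all tagged values by the join key.
mapped.sort(key=lambda kv: kv[0])

# Reduce phase: join the tagged values within each group.
joined = {}
for key, group in groupby(mapped, key=lambda kv: kv[0]):
    values = [v for _, v in group]
    name = next(v for tag, v in values if tag == "CUST")
    amounts = [v for tag, v in values if tag == "TXN"]
    joined[key] = (name, amounts)

print(joined)  # {100001: ('Rohith', [100, 50]), 100002: ('Ankith', [])}
```

The tag is what lets the reducer tell a customer record apart from a transaction record once both land in the same value list.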
Like a join in an RDBMS, which combines the contents of two tables based on a primary key, a reduce-side join in a MapReduce job combines the contents of two mapper outputs based on a common key. We must examine the data sets and decide which field will act as that key.
For example, suppose there are two separate data sets, customer_details and transaction_details, which contain the details of customers and the transaction records of those customers respectively. If we want to find the total amount spent by each customer on products bought at a supermarket, we need to write one mapper for each data set.
customer_details:

Cust ID    Name         Phone No
100001     Rohith       1234321543
100002     Ankith       5498123433
100003     Pradumnya    4352313344
………..     ………….       …………….

transaction_details:
Trans ID   Product      Cust ID   Amount
000001     Soap         100001    100
000002     Chocolate    100001    50
000003     Ice Cream    100002    200
000004     Milk         100003    125
000005     Cheese       100003    400
………..     ………..        ………..    …….

First, we need to write two mappers, one for customer_details and one for transaction_details, with Cust ID as the key in both. The values emitted by the customer_details mapper will be the customers' names, and the values emitted by the transaction_details mapper will be the transaction amounts for each customer. The output of both mappers then goes through the sorting and shuffling phase.
The sorting and shuffling phase generates a list of values for each key. In other words, it puts together all the values corresponding to each unique key in the intermediate key-value pairs.
Now the reducer starts working by calling the reduce() method for each unique join key (Cust ID) and the corresponding list of values. The reducer then performs the join operation on the values present in that list to calculate the desired output. Therefore, the number of reduce() calls equals the number of unique Cust IDs. The final output will be <key, value> pairs where the customer name is the key and the total amount spent by that customer is the value.
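Using the sample data above, the whole job can be simulated in plain Python (a sketch of the join and aggregation logic only, not the Hadoop API; the "name"/"amt" tags are illustrative):

```python
from itertools import groupby

# Sample records from the two data sets shown above.
customer_details = [(100001, "Rohith"), (100002, "Ankith"), (100003, "Pradumnya")]
transaction_details = [
    ("000001", "Soap", 100001, 100),
    ("000002", "Chocolate", 100001, 50),
    ("000003", "Ice Cream", 100002, 200),
    ("000004", "Milk", 100003, 125),
    ("000005", "Cheese", 100003, 400),
]

# Two "mappers": both key their output by Cust ID and tag the source.
intermediate = [(cid, ("name", name)) for cid, name in customer_details]
intermediate += [(cid, ("amt", amt)) for _, _, cid, amt in transaction_details]

# Shuffle/sort groups values by Cust ID; the "reducer" joins each group,
# emitting (customer name, total amount spent) per unique Cust ID.
intermediate.sort(key=lambda kv: kv[0])
result = {}
for cid, group in groupby(intermediate, key=lambda kv: kv[0]):
    values = [v for _, v in group]
    name = next(v for tag, v in values if tag == "name")
    total = sum(v for tag, v in values if tag == "amt")
    result[name] = total

print(result)  # {'Rohith': 150, 'Ankith': 200, 'Pradumnya': 525}
```

Note that reduce() runs once per unique Cust ID, which is why each customer appears exactly once in the final output.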