Output of one MapReduce job as input to another

    • #6038
      DataFlair Team
      Spectator

      I want to run three MapReduce jobs in sequential order: the output of the 1st MapReduce job should be given as input to the second, and the output of the second should be given as input to the third. How do I configure this in an automated manner?

    • #6042
      DataFlair Team
      Spectator

      There are many ways of chaining MapReduce jobs sequentially, i.e., providing the output of the 1st MapReduce job as the input of the 2nd MapReduce job, and so on.

      One way is to use the ControlledJob and JobControl classes provided by Hadoop. In a single driver class you can build multiple jobs that have dependencies on each other.

      The following driver code might help you understand:

      // Imports needed at the top of the driver class file:
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
      import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

      // Assumes jobOneConf, jobTwoConf, jobThreeConf are pre-built Configuration objects
      // and jobOneInput, jobOneOutput, jobTwoOutput, jobThreeOutput are Path objects.
      Job jobOne = Job.getInstance(jobOneConf, "Job-1"); // MR Job-1
      FileInputFormat.addInputPath(jobOne, jobOneInput);
      FileOutputFormat.setOutputPath(jobOne, jobOneOutput);
      ControlledJob jobOneControl = new ControlledJob(jobOneConf);
      jobOneControl.setJob(jobOne);

      Job jobTwo = Job.getInstance(jobTwoConf, "Job-2"); // MR Job-2
      FileInputFormat.addInputPath(jobTwo, jobOneOutput);
      // here we set Job-1's output as Job-2's input
      FileOutputFormat.setOutputPath(jobTwo, jobTwoOutput);
      ControlledJob jobTwoControl = new ControlledJob(jobTwoConf);
      jobTwoControl.setJob(jobTwo);

      Job jobThree = Job.getInstance(jobThreeConf, "Job-3"); // MR Job-3
      FileInputFormat.addInputPath(jobThree, jobTwoOutput);
      // here we set Job-2's output as Job-3's input
      FileOutputFormat.setOutputPath(jobThree, jobThreeOutput);
      ControlledJob jobThreeControl = new ControlledJob(jobThreeConf);
      jobThreeControl.setJob(jobThree);

      JobControl jobControl = new JobControl("Job-Control");
      jobControl.addJob(jobOneControl);
      jobControl.addJob(jobTwoControl);
      jobTwoControl.addDependingJob(jobOneControl);
      // this dependency makes Job-2 wait until Job-1 is done
      jobControl.addJob(jobThreeControl);
      jobThreeControl.addDependingJob(jobTwoControl);
      // this dependency makes Job-3 wait until Job-2 is done

      // JobControl implements Runnable, so run it in its own thread
      // and stop it once all jobs have finished
      Thread jobControlThread = new Thread(jobControl);
      jobControlThread.start();
      while (!jobControl.allFinished()) {
          Thread.sleep(500);
      }
      jobControl.stop();
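
      Note that JobControl submits Job-2 only after Job-1 completes successfully, and Job-3 only after Job-2; if a job in the chain fails, its depending jobs are not run. After the loop you can call jobControl.getFailedJobList() to see which jobs, if any, failed.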
    • #6044
      DataFlair Team
      Spectator

      There are many ways you can do it.

      (1) Cascading jobs

      Create a JobConf object "job1" for the first job and set all its parameters, with "input" as the input directory and "temp" as the output directory. Execute this job:

      JobClient.runJob(job1);

      Immediately below it, create a JobConf object "job2" for the second job and set all its parameters, with "temp" as the input directory and "output" as the output directory. Execute this job:

      JobClient.runJob(job2);
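
      A minimal sketch of this cascading approach follows, assuming the old org.apache.hadoop.mapred API; FirstMapper, FirstReducer, SecondMapper and SecondReducer are placeholders for your own classes, and the key/value types and the "input"/"temp"/"output" paths are illustrative only:

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.FileInputFormat;
      import org.apache.hadoop.mapred.FileOutputFormat;
      import org.apache.hadoop.mapred.JobClient;
      import org.apache.hadoop.mapred.JobConf;

      public class CascadeDriver {
        public static void main(String[] args) throws Exception {
          // Job 1: reads "input", writes intermediate results to "temp"
          JobConf job1 = new JobConf(CascadeDriver.class);
          job1.setJobName("job-1");
          job1.setMapperClass(FirstMapper.class);      // placeholder mapper
          job1.setReducerClass(FirstReducer.class);    // placeholder reducer
          job1.setOutputKeyClass(Text.class);
          job1.setOutputValueClass(IntWritable.class);
          FileInputFormat.setInputPaths(job1, new Path("input"));
          FileOutputFormat.setOutputPath(job1, new Path("temp"));
          JobClient.runJob(job1);                      // blocks until job-1 finishes

          // Job 2: reads job-1's output from "temp", writes final results to "output"
          JobConf job2 = new JobConf(CascadeDriver.class);
          job2.setJobName("job-2");
          job2.setMapperClass(SecondMapper.class);     // placeholder mapper
          job2.setReducerClass(SecondReducer.class);   // placeholder reducer
          job2.setOutputKeyClass(Text.class);
          job2.setOutputValueClass(IntWritable.class);
          FileInputFormat.setInputPaths(job2, new Path("temp"));
          FileOutputFormat.setOutputPath(job2, new Path("output"));
          JobClient.runJob(job2);                      // runs only after job-1 has finished
        }
      }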
      (2) Create two JobConf objects and set all the parameters in them just like in (1), except that you don't use JobClient.runJob().

      Then create two Job objects with jobconfs as parameters:

      Job job1 = new Job(jobconf1);
      Job job2 = new Job(jobconf2);
      Using a JobControl object, you specify the job dependencies and then run the jobs:

      JobControl jbcntrl = new JobControl("jbcntrl");
      jbcntrl.addJob(job1);
      jbcntrl.addJob(job2);
      job2.addDependingJob(job1);
      jbcntrl.run();
      (3) If you need a structure somewhat like Map+ | Reduce | Map*, you can use the ChainMapper and ChainReducer classes that come with Hadoop version 0.19 and onwards.
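
      A rough sketch of the ChainMapper/ChainReducer approach is below, again assuming the old org.apache.hadoop.mapred API; AMap, BMap, CMap and AReduce are placeholder Mapper/Reducer classes, and the key/value types and paths are illustrative only. Note that the whole chain runs as a single MapReduce job:

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.FileInputFormat;
      import org.apache.hadoop.mapred.FileOutputFormat;
      import org.apache.hadoop.mapred.JobClient;
      import org.apache.hadoop.mapred.JobConf;
      import org.apache.hadoop.mapred.lib.ChainMapper;
      import org.apache.hadoop.mapred.lib.ChainReducer;

      public class ChainDriver {
        public static void main(String[] args) throws Exception {
          JobConf conf = new JobConf(ChainDriver.class);
          conf.setJobName("chain-job");
          FileInputFormat.setInputPaths(conf, new Path("input"));
          FileOutputFormat.setOutputPath(conf, new Path("output"));

          // Map+ : two mappers run back to back on the map side
          ChainMapper.addMapper(conf, AMap.class, LongWritable.class, Text.class,
              Text.class, Text.class, true, new JobConf(false));
          ChainMapper.addMapper(conf, BMap.class, Text.class, Text.class,
              Text.class, Text.class, true, new JobConf(false));

          // Reduce : a single reducer
          ChainReducer.setReducer(conf, AReduce.class, Text.class, Text.class,
              Text.class, Text.class, true, new JobConf(false));

          // Map* : an optional mapper applied to the reducer's output
          ChainReducer.addMapper(conf, CMap.class, Text.class, Text.class,
              Text.class, Text.class, true, new JobConf(false));

          JobClient.runJob(conf);
        }
      }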
