Output of one MapReduce job as input to another
This topic has 2 replies, 1 voice, and was last updated 5 years, 7 months ago by DataFlair Team.
September 20, 2018 at 5:00 pm · #6038 · DataFlair Team (Spectator)
I want to run three MapReduce jobs in sequential order: the output of the first MapReduce job should be given as input to the second, and the output of the second should be given as input to the third. How can I configure this in an automated manner?
September 20, 2018 at 5:00 pm · #6042 · DataFlair Team (Spectator)
There are several ways to run sequential MapReduce jobs, i.e., to provide the output of the first MapReduce job as the input of the second, and so on.
One way is to use the ControlledJob and JobControl classes provided by Hadoop. In a single driver class you can build multiple jobs that have dependencies on each other.
The following driver-class code might help you understand:
// MR Job 1
Job jobOne = new Job(jobOneConf, "Job-1");
FileInputFormat.addInputPath(jobOne, jobOneInput);
FileOutputFormat.setOutputPath(jobOne, jobOneOutput);
ControlledJob jobOneControl = new ControlledJob(jobOneConf);
jobOneControl.setJob(jobOne);

// MR Job 2
Job jobTwo = new Job(jobTwoConf, "Job-2");
FileInputFormat.addInputPath(jobTwo, jobOneOutput); // here we set Job 1's output as Job 2's input
FileOutputFormat.setOutputPath(jobTwo, jobTwoOutput);
ControlledJob jobTwoControl = new ControlledJob(jobTwoConf);
jobTwoControl.setJob(jobTwo);

// MR Job 3
Job jobThree = new Job(jobThreeConf, "Job-3");
FileInputFormat.addInputPath(jobThree, jobTwoOutput); // here we set Job 2's output as Job 3's input
FileOutputFormat.setOutputPath(jobThree, jobThreeOutput);
ControlledJob jobThreeControl = new ControlledJob(jobThreeConf);
jobThreeControl.setJob(jobThree);

JobControl jobControl = new JobControl("Job-Control");
jobControl.addJob(jobOneControl);
jobControl.addJob(jobTwoControl);
jobTwoControl.addDependingJob(jobOneControl); // this dependency makes Job 2 wait until Job 1 is done
jobControl.addJob(jobThreeControl);
jobThreeControl.addDependingJob(jobTwoControl); // this dependency makes Job 3 wait until Job 2 is done

// JobControl implements Runnable; run it in its own thread,
// poll until all jobs finish, then stop the control thread
Thread jobControlThread = new Thread(jobControl);
jobControlThread.start();
while (!jobControl.allFinished()) {
    Thread.sleep(500);
}
jobControl.stop();
September 20, 2018 at 5:00 pm · #6044 · DataFlair Team (Spectator)
There are many ways you can do it.
(1) Cascading jobs
Create the JobConf object "job1" for the first job and set all the parameters, with "input" as the input directory and "temp" as the output directory. Execute this job:
JobClient.runJob(job1);
Immediately below it, create the JobConf object "job2" for the second job and set all the parameters, with "temp" as the input directory and "output" as the output directory. Execute this job: JobClient.runJob(job2);
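A minimal sketch of this cascading pattern with the old mapred API might look as follows. The class name CascadeDriver and the directory names are placeholders from the description above, and the mapper/reducer setup is elided since it depends on your jobs:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class CascadeDriver {
    public static void main(String[] args) throws Exception {
        // First job: reads "input", writes intermediate results to "temp"
        JobConf job1 = new JobConf(CascadeDriver.class);
        job1.setJobName("job-1");
        FileInputFormat.setInputPaths(job1, new Path("input"));
        FileOutputFormat.setOutputPath(job1, new Path("temp"));
        // job1.setMapperClass(...); job1.setReducerClass(...);

        JobClient.runJob(job1); // blocks until job 1 finishes

        // Second job: reads job 1's output from "temp", writes to "output"
        JobConf job2 = new JobConf(CascadeDriver.class);
        job2.setJobName("job-2");
        FileInputFormat.setInputPaths(job2, new Path("temp"));
        FileOutputFormat.setOutputPath(job2, new Path("output"));
        // job2.setMapperClass(...); job2.setReducerClass(...);

        JobClient.runJob(job2); // runs only after job 1 has completed
    }
}
```

Because JobClient.runJob() is synchronous, simply calling it twice in sequence is enough to chain the jobs.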
(2) Create two JobConf objects and set all the parameters in them, just as in (1), except that you don't call JobClient.runJob(). Then create two Job objects (org.apache.hadoop.mapred.jobcontrol.Job) with the JobConfs as parameters:
Job job1 = new Job(jobconf1);
Job job2 = new Job(jobconf2);
Using a JobControl object, you specify the job dependencies and then run the jobs:
JobControl jbcntrl = new JobControl("jbcntrl");
jbcntrl.addJob(job1);
jbcntrl.addJob(job2);
job2.addDependingJob(job1);
jbcntrl.run();
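Putting the fragments of (2) together, a driver sketch with the old-API jobcontrol classes could look like this; the class name ControlledCascadeDriver and the paths are illustrative placeholders:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class ControlledCascadeDriver {
    public static void main(String[] args) throws Exception {
        JobConf jobconf1 = new JobConf(ControlledCascadeDriver.class);
        FileInputFormat.setInputPaths(jobconf1, new Path("input"));
        FileOutputFormat.setOutputPath(jobconf1, new Path("temp"));

        JobConf jobconf2 = new JobConf(ControlledCascadeDriver.class);
        FileInputFormat.setInputPaths(jobconf2, new Path("temp")); // job 1's output
        FileOutputFormat.setOutputPath(jobconf2, new Path("output"));

        Job job1 = new Job(jobconf1);
        Job job2 = new Job(jobconf2);
        job2.addDependingJob(job1); // job2 starts only after job1 succeeds

        JobControl jbcntrl = new JobControl("jbcntrl");
        jbcntrl.addJob(job1);
        jbcntrl.addJob(job2);

        // JobControl is a Runnable; run it in a thread and poll for completion
        Thread t = new Thread(jbcntrl);
        t.start();
        while (!jbcntrl.allFinished()) {
            Thread.sleep(500);
        }
        jbcntrl.stop();
    }
}
```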
(3) If you need a structure somewhat like Map+ | Reduce | Map*, you can use the ChainMapper and ChainReducer classes, available in Hadoop 0.19 and later.
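A sketch of such a Map+ | Reduce | Map* pipeline with the old-API ChainMapper/ChainReducer might look as follows; the pass-through mapper and reducer here are trivial stand-ins for your own classes:

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class ChainDriver {
    // trivial pass-through mapper used at each chain stage (illustrative only)
    public static class PassMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, LongWritable, Text> {
        public void map(LongWritable key, Text value,
                        OutputCollector<LongWritable, Text> out, Reporter r)
                throws IOException {
            out.collect(key, value);
        }
    }

    // trivial pass-through reducer (illustrative only)
    public static class PassReducer extends MapReduceBase
            implements Reducer<LongWritable, Text, LongWritable, Text> {
        public void reduce(LongWritable key, Iterator<Text> values,
                           OutputCollector<LongWritable, Text> out, Reporter r)
                throws IOException {
            while (values.hasNext()) out.collect(key, values.next());
        }
    }

    public static void main(String[] args) {
        JobConf conf = new JobConf(ChainDriver.class);

        // Map+ : two mappers run back to back inside the same map task
        ChainMapper.addMapper(conf, PassMapper.class, LongWritable.class,
                Text.class, LongWritable.class, Text.class, true, new JobConf(false));
        ChainMapper.addMapper(conf, PassMapper.class, LongWritable.class,
                Text.class, LongWritable.class, Text.class, true, new JobConf(false));

        // Reduce : the single reducer of the pipeline
        ChainReducer.setReducer(conf, PassReducer.class, LongWritable.class,
                Text.class, LongWritable.class, Text.class, true, new JobConf(false));

        // Map* : a mapper that post-processes the reducer's output
        ChainReducer.addMapper(conf, PassMapper.class, LongWritable.class,
                Text.class, LongWritable.class, Text.class, true, new JobConf(false));

        // conf is then submitted as one job, e.g. JobClient.runJob(conf);
    }
}
```

The whole chain runs as a single MapReduce job, so the intermediate records between chained mappers never hit HDFS.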