Output of one MapReduce job as input to another

    • #6038
      DataFlair Team
      Spectator

      I want to run three MapReduce jobs in sequential order: the output of the 1st MapReduce job should be given as input to the second, and the output of the second should be given as input to the third. How do I configure this in an automated manner?

    • #6042
      DataFlair Team
      Spectator

      There are many ways of chaining MapReduce jobs sequentially, i.e., providing the output of the 1st MapReduce job as the input of the 2nd MapReduce job, and so on.

      One way is to use the ControlledJob and JobControl classes provided by Hadoop. In a single driver class you can build multiple jobs that have dependencies on each other.

      The following driver code might help you understand:

      // Imports needed at the top of the driver class file:
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
      import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

      // Assumes jobOneConf, jobTwoConf, jobThreeConf are pre-built Configuration objects
      // and jobOneInput, jobOneOutput, jobTwoOutput, jobThreeOutput are Path objects.
      Job jobOne = Job.getInstance(jobOneConf, "Job-1"); // MR Job-1
      FileInputFormat.addInputPath(jobOne, jobOneInput);
      FileOutputFormat.setOutputPath(jobOne, jobOneOutput);
      ControlledJob jobOneControl = new ControlledJob(jobOneConf);
      jobOneControl.setJob(jobOne);

      Job jobTwo = Job.getInstance(jobTwoConf, "Job-2"); // MR Job-2
      FileInputFormat.addInputPath(jobTwo, jobOneOutput);
      // here we set Job-1's output as Job-2's input
      FileOutputFormat.setOutputPath(jobTwo, jobTwoOutput);
      ControlledJob jobTwoControl = new ControlledJob(jobTwoConf);
      jobTwoControl.setJob(jobTwo);

      Job jobThree = Job.getInstance(jobThreeConf, "Job-3"); // MR Job-3
      FileInputFormat.addInputPath(jobThree, jobTwoOutput);
      // here we set Job-2's output as Job-3's input
      FileOutputFormat.setOutputPath(jobThree, jobThreeOutput);
      ControlledJob jobThreeControl = new ControlledJob(jobThreeConf);
      jobThreeControl.setJob(jobThree);

      JobControl jobControl = new JobControl("Job-Control");
      jobControl.addJob(jobOneControl);
      jobControl.addJob(jobTwoControl);
      jobTwoControl.addDependingJob(jobOneControl);
      // this dependency makes Job-2 wait until Job-1 is done
      jobControl.addJob(jobThreeControl);
      jobThreeControl.addDependingJob(jobTwoControl);
      // this dependency makes Job-3 wait until Job-2 is done

      // JobControl implements Runnable, so run it in its own thread
      // and stop it once all jobs have finished
      Thread jobControlThread = new Thread(jobControl);
      jobControlThread.start();
      while (!jobControl.allFinished()) {
          Thread.sleep(500);
      }
      jobControl.stop();
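
      Note that JobControl submits Job-2 only after Job-1 completes successfully, and Job-3 only after Job-2; if a job in the chain fails, its depending jobs are not run. After the loop you can call jobControl.getFailedJobList() to see which jobs, if any, failed.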
    • #6044
      DataFlair Team
      Spectator

      There are many ways you can do it.

      (1) Cascading jobs

      Create a JobConf object "job1" for the first job and set all its parameters, with "input" as the input directory and "temp" as the output directory. Execute this job:

      JobClient.runJob(job1);

      Immediately below it, create a JobConf object "job2" for the second job and set all its parameters, with "temp" as the input directory and "output" as the output directory. Execute this job:

      JobClient.runJob(job2);
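
      A minimal sketch of this cascading approach follows, assuming the old org.apache.hadoop.mapred API; FirstMapper, FirstReducer, SecondMapper and SecondReducer are placeholders for your own classes, and the key/value types and the "input"/"temp"/"output" paths are illustrative only:

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.FileInputFormat;
      import org.apache.hadoop.mapred.FileOutputFormat;
      import org.apache.hadoop.mapred.JobClient;
      import org.apache.hadoop.mapred.JobConf;

      public class CascadeDriver {
        public static void main(String[] args) throws Exception {
          // Job 1: reads "input", writes intermediate results to "temp"
          JobConf job1 = new JobConf(CascadeDriver.class);
          job1.setJobName("job-1");
          job1.setMapperClass(FirstMapper.class);      // placeholder mapper
          job1.setReducerClass(FirstReducer.class);    // placeholder reducer
          job1.setOutputKeyClass(Text.class);
          job1.setOutputValueClass(IntWritable.class);
          FileInputFormat.setInputPaths(job1, new Path("input"));
          FileOutputFormat.setOutputPath(job1, new Path("temp"));
          JobClient.runJob(job1);                      // blocks until job-1 finishes

          // Job 2: reads job-1's output from "temp", writes final results to "output"
          JobConf job2 = new JobConf(CascadeDriver.class);
          job2.setJobName("job-2");
          job2.setMapperClass(SecondMapper.class);     // placeholder mapper
          job2.setReducerClass(SecondReducer.class);   // placeholder reducer
          job2.setOutputKeyClass(Text.class);
          job2.setOutputValueClass(IntWritable.class);
          FileInputFormat.setInputPaths(job2, new Path("temp"));
          FileOutputFormat.setOutputPath(job2, new Path("output"));
          JobClient.runJob(job2);                      // runs only after job-1 has finished
        }
      }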
      (2) Create two JobConf objects and set all the parameters in them just like in (1), except that you don't use JobClient.runJob().

      Then create two Job objects with jobconfs as parameters:

      Job job1 = new Job(jobconf1);
      Job job2 = new Job(jobconf2);
      Using a JobControl object, you specify the job dependencies and then run the jobs:

      JobControl jbcntrl = new JobControl("jbcntrl");
      jbcntrl.addJob(job1);
      jbcntrl.addJob(job2);
      job2.addDependingJob(job1);
      jbcntrl.run();
      (3) If you need a structure somewhat like Map+ | Reduce | Map*, you can use the ChainMapper and ChainReducer classes that come with Hadoop version 0.19 and onwards.
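
      A rough sketch of the ChainMapper/ChainReducer approach is below, again assuming the old org.apache.hadoop.mapred API; AMap, BMap, CMap and AReduce are placeholder Mapper/Reducer classes, and the key/value types and paths are illustrative only. Note that the whole chain runs as a single MapReduce job:

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.FileInputFormat;
      import org.apache.hadoop.mapred.FileOutputFormat;
      import org.apache.hadoop.mapred.JobClient;
      import org.apache.hadoop.mapred.JobConf;
      import org.apache.hadoop.mapred.lib.ChainMapper;
      import org.apache.hadoop.mapred.lib.ChainReducer;

      public class ChainDriver {
        public static void main(String[] args) throws Exception {
          JobConf conf = new JobConf(ChainDriver.class);
          conf.setJobName("chain-job");
          FileInputFormat.setInputPaths(conf, new Path("input"));
          FileOutputFormat.setOutputPath(conf, new Path("output"));

          // Map+ : two mappers run back to back on the map side
          ChainMapper.addMapper(conf, AMap.class, LongWritable.class, Text.class,
              Text.class, Text.class, true, new JobConf(false));
          ChainMapper.addMapper(conf, BMap.class, Text.class, Text.class,
              Text.class, Text.class, true, new JobConf(false));

          // Reduce : a single reducer
          ChainReducer.setReducer(conf, AReduce.class, Text.class, Text.class,
              Text.class, Text.class, true, new JobConf(false));

          // Map* : an optional mapper applied to the reducer's output
          ChainReducer.addMapper(conf, CMap.class, Text.class, Text.class,
              Text.class, Text.class, true, new JobConf(false));

          JobClient.runJob(conf);
        }
      }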
