Explain Hadoop Streaming.

      What is Hadoop Streaming?

      By default, the MapReduce framework is written in Java and supports writing MapReduce programs in the Java programming language, but Hadoop also provides an API for writing MapReduce programs in languages other than Java.

      Hadoop Streaming is a utility that ships with the Hadoop distribution and allows users to write MapReduce programs in any programming or scripting language that can read from standard input (stdin) and write to standard output (stdout). For example:

      $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
          -input myInputDirs \
          -output myOutputDir \
          -mapper /bin/cat \
          -reducer /bin/wc
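
      In this example, /bin/cat acts as an identity mapper that passes each input line through unchanged, and /bin/wc acts as the reducer, producing the line, word, and character counts of the mapper output.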

      Both mapper and reducer scripts read the input from standard input and emit the output to standard output. The utility will create a Map/Reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.

      When a script is specified for mappers, each mapper task launches the script as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds those lines to the standard input (STDIN) of the process. In the meantime, the mapper collects the line-oriented outputs from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, then the entire line is considered the key and the value is null. However, this can be customized as needed.
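
      For illustration, a streaming mapper can be written in any language that handles stdin and stdout; the short Python sketch below (the file name mapper.py is only an assumed example, not something shipped with Hadoop) emits one tab-separated word/count pair for every word of its input:

      #!/usr/bin/env python3
      # mapper.py - a minimal word-count mapper sketch for Hadoop Streaming.
      # Reads raw text lines from STDIN and writes "word<TAB>1" to STDOUT,
      # which the streaming framework interprets as key/value pairs.
      import sys

      for line in sys.stdin:
          for word in line.strip().split():
              print(word + "\t1")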

      When a script is specified for reducers, each reducer task launches the script as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds those lines to the standard input (STDIN) of the process. In the meantime, the reducer collects the line-oriented outputs from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized as per specific requirements.
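
      A matching reducer sketch (again assuming a hypothetical file name, reducer.py) reads the tab-separated pairs from STDIN and sums the counts per key. Because the framework sorts the mapper output by key before it reaches the reducer, lines with the same key arrive consecutively, so a single pass is enough:

      #!/usr/bin/env python3
      # reducer.py - a minimal word-count reducer sketch for Hadoop Streaming.
      # Assumes every input line is "word<TAB>count" and that the lines are
      # already sorted by key, which the framework guarantees before reduce.
      import sys

      current_word = None
      current_count = 0

      for line in sys.stdin:
          word, _, count = line.rstrip("\n").partition("\t")
          if word == current_word:
              current_count += int(count)
          else:
              # A new key starts; flush the finished one first.
              if current_word is not None:
                  print(current_word + "\t" + str(current_count))
              current_word = word
              current_count = int(count)

      # Flush the last key.
      if current_word is not None:
          print(current_word + "\t" + str(current_count))

      Such scripts would then be supplied to hadoop-streaming.jar through the same -mapper and -reducer options shown above, in place of /bin/cat and /bin/wc (and shipped to the cluster nodes, for example with the -files option).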
