50 MapReduce Interview Questions and Answers


Objective

It is difficult to pass the Hadoop interview as it is fast and growing technology. To get you through this tough path the interview questions will serve as the backbone. this blog contains the commonly asked Hadoop MapReduce interview questions and answers.

These 50 MapReduce Interview Questions are framed by keeping in mind the need of an era, and the trending pattern of the interview that is being followed by the companies. The interview questions of Hadoop MapReduce are dedicatedly framed by the company experts. To help you to reach your goal.

All the best!!!!

Top 50 Hadoop MapReduce Interview Questions and Answers for Hadoop Jobs.

Top Hadoop MapReduce Interview Questions and Answers

1) What is MapReduce in Hadoop?

Hadoop MapReduce is the data processing layer. It is the framework for writing applications that process the vast amount of data stored in the HDFS.

It processes a huge amount of data in parallel by dividing the job into a set of independent tasks (sub-job). In Hadoop, MapReduce works by breaking the processing into phases: Map and Reduce.

  • Map – It is the first phase of processing. In which we specify all the complex logic/business rules/costly code. The map takes a set of data and converts it into another set of data. It also breaks individual elements into tuples (key-value pairs).
  • Reduce – It is the second phase of processing. In which we specify light-weight processing like aggregation/summation. The output from the map is the input to Reducer. Then, Reducer combines tuples (key-value) based on the key. And then, modifies the value of the key accordingly.

Read about Hadoop MapReduce in Detail.

2) What is the need of MapReduce in Hadoop?

In Hadoop, when we have stored the data in HDFS, how to process this data is the first question that arises? Transferring all this data to a central node for processing is not going to work. And we will have to wait forever for the data to transfer over the network. Google faced this same problem with its Distributed Goggle File System (GFS). It solved this problem using a MapReduce data processing model.

Challenges before MapReduce

  • Time-consuming – By using single machine we cannot analyze the data (terabytes) as it will take a lot of time.
  • Costly – All the data (terabytes) in one server or as database cluster which is very expensive. And also hard to manage.

MapReduce overcome these challenges

  • Time-efficient – If we want to analyze the data. We can write the analysis code in Map function. And the integration code in Reduce function and execute it. Thus, this MapReduce code will go to every machine which has a part of our data and executes on that specific part. Hence instead of moving terabytes of data, we just move kilobytes of code. So this type of movement is time-efficient.
  • Cost-efficient – It distributes the data over multiple low config machines.

Hadoop MapReduce Job Execution Flow Diagram.

3) What is Mapper in Hadoop?

Mapper task processes each input record (from RecordReader) and generates a key-value pair. This key-value pairs generated by mapper is completely different from the input pair. The Mapper store intermediate-output on the local disk. Thus, it does not store its output on HDFS. It is temporary data and writing on HDFS will create unnecessary multiple copies. Mapper only understands key-value pairs of data. So before passing data to the mapper, it, first converts the data into key-value pairs.

Mapper only understands key-value pairs of data. So before passing data to the mapper, it, first converts the data into key-value pairs. InputSplit and RecordReader convert data into key-value pairs. Input split is the logical representation of data. RecordReader communicates with the InputSplit and converts the data into Kay-value pairs. Hence

  • Key is a reference to the input value.
  • Value is the data set on which to operate.

Number of maps depends on the total size of the input. i.e. the total number of blocks of the input files. Mapper= {(total data size)/ (input split size)} If data size= 1 Tb and input split size= 100 MB Hence, Mapper= (1000*1000)/100= 10,000

Mapper= {(total data size)/ (input split size)} If data size= 1 Tb and input split size= 100 MB Hence, Mapper= (1000*1000)/100= 10,000

If data size= 1 Tb and input split size= 100 MB HenceMapper= (1000*1000)/100= 10,000

Mapper= (1000*1000)/100= 10,000

Read about Mapper in detail.

4) What is Reducer in Hadoop?

Reducer takes the output of the Mapper (intermediate key-value pair) as the input. After that, it runs a reduce function on each of them to generate the output. Thus the output of the reducer is the final output, which it stored in HDFS. Usually, in Reducer, we do aggregation or summation sort of computation. Reducer has three primary phases-

  • Shuffle- The framework, fetches the relevant partition of the output of all the Mappers for each reducer via HTTP.
  • Sort- The framework groups Reducers inputs by the key in this Phase. Shuffle and sort phases occur simultaneously.
  • Reduce- After shuffling and sorting, reduce task aggregates the key-value pairs. In this phase, call the reduce (Object, Iterator, OutputCollector, Reporter) method for each <key, (list of values)> pair in the grouped inputs.

With the help of Job.setNumreduceTasks(int) the user set the number of reducers for the job.
Hence, right number of reducers is 0.95 or 1.75 multiplied by (<no. of nodes>*<no. of maximum container per node>)

Read about Reducer in detail.

5) How to set mappers and reducers for MapReduce jobs?

One can configure JobConf to set number of mappers and reducers.

  • For Mapper – job.setNumMaptasks()
  • For Reducer – job.setNumreduceTasks()

6) What is the key- value pair in Hadoop MapReduce?

Hadoop MapReduce implements a data model, which represents data as key-value pairs. Both input and output to MapReduce Framework should be in Key-value pairs only. In Hadoop, if a schema is static we can directly work on the column instead of key-value. But, the schema is not static we will work on keys and values. Keys and values are not the intrinsic properties of the data. But the user analyzing the data chooses a key-value pair.

A Key-value pair in Hadoop MapReduce generate in following way:

  • InputSplit – It is the logical representation of data. InputSplit represents the data which individual Mapper will process.
  • RecordReader – It converts the split into records which are in form of Key-value pairs. That is suitable for reading by the mapper.
    By Default RecordReader uses TextInputFormat for converting data into a key-value pair.
  • Key – It is the byte offset of the beginning of the line within the file.
  • Value – It is the contents of the line, excluding line terminators. For

For Example, file content is- on the top of the crumpetty Tree

Key- 0

Value- on the top of the crumpetty Tree

Read about MapReduce Key-value pair in detail.

7) What is the need of key-value pair to process the data in MapReduce?

Hadoop MapReduce works on unstructured and semi-structured data apart from structured data. One can read the Structured data like the ones stored in RDBMS by columns.

But handling unstructured data is feasible using key-value pairs. The very core idea of MapReduce work on the basis of these pairs. Framework map data into a collection of key-value pairs by mapper and reducer on all the pairs with the same key.

In most of the computations- Map operation applies on each logical “record” in our input. This computes a set of intermediate key-value pairs. Then apply reduce operation on all the values that share the same key. This combines the derived data properly.

Thus, we can say that key-value pairs are the best solution to work on data problems on MapReduce.

8) What are the most common InputFormats in Hadoop?

In Hadoop, Input files store the data for a MapReduce job. Input files which stores data typically reside in HDFS. Thus, in MapReduce, InputFormat defines how these input files split and read. InputFormat creates InputSplit.

Most common InputFormat are:

  • FileInputFormat – For all file-based InputFormat it is the base class . It also specifies input directory where data files are present. FileInputFormat also read all files. And, then divides these files into one or more InputSplits.
  • TextInputFormat – It is the default InputFormat of MapReduce. It uses each line of each input file as a separate record. Thus, performs no parsing.
    Key- byte offset.
    Value- It is the contents of the line, excluding line terminators.
  • KeyValueTextInputFormat – It also treats each line of input as a separate record. But the main difference is that TextInputFormat treats entire line as the value. While the KeyValueTextInputFormat breaks the line itself into key and value by the tab character (‘/t’).
    Key- Everything up to tab character.
    Value- Remaining part of the line after tab character.
  • SequenceFileInputFormat – It reads sequence files.
    Key & Value- Both are user-defined.

Read about Mapreduce InputFormat in detail.

9) Explain InputSplit in Hadoop?

InputFormat creates InputSplit. InputSplit is the logical representation of data. Further Hadoop framework divides InputSplit into records. Then mapper will process each record. The size of split is approximately equal to HDFS block size (128 MB). In MapReduce program, Inputsplit is user defined. So, the user can control split size based on the size of data.

InputSplit in mapreduce has a length in bytes. It also has set of storage locations (hostname strings). It use storage location to place map tasks as close to split’s data as possible. According to the inputslit size each Map tasks process. So that the largest one gets processed first, this minimize the job runtime. In MapReduce, important thing is that InputSplit is just a reference to the data, not contain input data.

By calling ‘getSplit()’ client who is running job calculate the split for the job . And then send to the application master and it will use their storage location to schedule map tasks. And that will process them on the cluster. In MapReduce, split is send to the createRecordReader() method. It will create RecordReader for the split in mapreduce job. Then RecordReader generate record (key-value pair). Then it passes to the map function.

Read about MapReduce InputSplit in detail. 

10) Explain the difference between a MapReduce InputSplit and HDFS block.

By definition-

  • Block – It is the smallest unit of data that the file system store. In general, FileSystem store data as a collection of blocks. In a similar way, HDFS stores each file as blocks, and distributes it across the Hadoop cluster.
  • InputSplit – InputSplit represents the data which individual Mapper will process. Further split divides into records. Each record (which is a key-value pair) will be processed by the map.

Size-

  • Block – The default size of the HDFS block is 128 MB which is configured as per our requirement. All blocks of the file are of the same size except the last block. The last Block can be of same size or smaller. In Hadoop, the files split into 128 MB blocks and then stored into Hadoop Filesystem.
  • InputSplit – Split size is approximately equal to block size, by default.

Data representation-

  • Block – It is the physical representation of data.
  • InputSplit – It is the logical representation of data. Thus, during data processing in MapReduce program or other processing techniques use InputSplit. In MapReduce, important thing is that InputSplit does not contain the input data. Hence, it is just a reference to the data.

Read more differences between MapReduce InputSplit and HDFS block.

11) What is the purpose of RecordReader in hadoop?

RecordReader in Hadoop uses the data within the boundaries, defined by InputSplit. It creates key-value pairs for the mapper. The “start” is the byte position in the file. Thus at ‘start the RecordReader should start generating key-value pairs. And the “end” is where it should stop reading records.

RecordReader in MapReduce job load data from its source. And then, converts the data into key-value pairs suitable for reading by the mapper. RecordReader communicates with the InputSplit until it does not read the complete file. The MapReduce framework defines RecordReader instance by the InputFormat. By, default RecordReader also uses TextInputFormat for converting data into key-value pairs.

TextInputFormat provides 2 types of RecordReader : LineRecordReader and SequenceFileRecordReader.

LineRecordReader in Hadoop is the default RecordReader that TextInputFormat provides. Hence, each line of the input file is the new value and the key is byte offset.

SequenceFileRecordReader in Hadoop reads data specified by the header of a sequence file.

Read about MapReduce RecordReder in detail.

12) What is Combiner in Hadoop?

In MapReduce job, Mapper Generate large chunks of intermediate data. Then pass it to reduce for further processing. All this leads to enormous network congestion. Hadoop MapReduce framework provides a function known as Combiner. It plays a key role in reducing network congestion.

The Combiner in Hadoop is also known as Mini-reducer that performs local aggregation on the mapper’s output. This reduces the data transfer between mapper and reducer and increases the efficiency.

There is no guarantee of execution of Ccombiner in Hadoop i.e. Hadoop may or may not execute a combiner. Also if required it may execute it more than 1 times. Hence, your MapReduce jobs should not depend on the Combiners execution.

Read about MapReduce Combiner in detail.

13) Explain about the partitioning, shuffle and sort phase in MapReduce?

Partitioning Phase – Partitioning specifies that all the values for each key are grouped together. Then make sure that all the values of a single key go on the same Reducer. Thus allows even distribution of the map output over the Reducer.

Shuffle Phase – It is the process by which the system sorts the key-value output of the map tasks. After that it transfer to the reducer.

Sort Phase – Mapper generate the intermediate key-value pair. Before starting of Reducer, map reduce framework sort these key-value pairs by the keys. It also helps reducer to easily distinguish when a new reduce task should start. Thus saves time for the reducer.

Read about Shuffling and Sorting in detail.

14) What does a “MapReduce Partitioner” do?

Partitioner comes int the picture, if we are working on more than one reducer. Partitioner controls the partitioning of the keys of the intermediate map-outputs. By hash function, the key (or a subset of the key) is used to derive the partition. Partitioning specifies that all the values for each key grouped together. And it make sure that all the values of single key goes on the same reducer. Thus allowing even distribution of the map output over the reducers. It redirects the Mappers output to the reducers by determining which reducer is responsible for the particular key.

The total number of partitioner is equal to the number of Reducer. Partitioner in Hadoop will divide the data according to the number of reducers. Thus, single reducer process the data from the single partitioner.

Read about MapReduce Partitioner in detail.

15) If no custom partitioner is defined in Hadoop then how is data partitioned before it is sent to the reducer?

So, Hadoop MapReduce by default uses ‘HashPartitioner’.

It uses the hashCode() method to determine, to which partition a given (key, value) pair will be sent. HashPartitioner also has a method called getPartition.

HashPartitioner also takes key.hashCode() & integer>MAX_VALUE. It takes these code to finds the modulus using the number of reduce tasks. Suppose there are 10 reduce tasks, then getPartition will return values 0 through 9 for all keys.

Public class HashPartitioner<k, v>extends Partitioner<k, v> 
{ 
Public int getpartitioner(k key, v value, int numreduceTasks) 
{ 
Return (key.hashCode() & Integer.Max_VALUE) % numreduceTasks; 
} 
}

16) How to write a custom partitioner for a Hadoop MapReduce job?

It stores the results uniformly across different reducers, based on the user condition.

By setting a Partitioner to partition by the key, we can guarantee that records for the same key will go the same reducer. It also ensures that only one reducer receives all the records for that particular key.

By the following steps, we can write Custom partitioner for a Hadoop MapReduce job:

  • Create a new class that extends Partitioner Class.
  • Then, Override method getPartition, in the wrapper that runs in the MapReduce.
  • By using method set Partitioner class, add the custom partitioner to the job. Or add the custom partitioner to the job as config file.

17) What is shuffling and sorting in Hadoop MapReduce?

Shuffling and Sorting takes place after the completion of map task. Shuffle and sort phase in Hadoop occurs simultaneously.

  • Shuffling- Shuffling is the process by which the system sorts the key-value output of the map tasks and transfer it to the reducer. Shuffle phase is important for reducers, otherwise, they would not have any input. As shuffling can start even before the map phase has finished. So this saves some time and completes the task in lesser time.
  • Sorting- Mapper generate the intermediate key-value pair. Before starting of reducer, mapreduce framework sort these key-value pair by the keys. It also helps reducer to easily distinguish when a new reduce task should start. Thus saves time for the reducer.

Shuffling and sorting are not performed at all if you specify zero reducer (setNumReduceTasks(0))

Read about Shuffling and Sorting in detail.

18) Why aggregation cannot be done in Mapper?

Mapper task processes each input record (From RecordReader) and generates a key-value pair. The Mapper store intermediate-output on the local disk.
We cannot perform aggregation in mapper because:

  • Sorting takes place only on the Reducer function. Thus there is no provision for sorting in the mapper function. Without sorting aggregation is not possible.
  • To perform aggregation, we need the output of all the Mapper function. Thus, which may not be possible to collect in the map phase. Because mappers may be running on different machines where the data blocks are present.
  • If we will try to perform aggregation of data at mapper, it requires communication between all mapper functions. Which may be running on different machines. Thus, this will consume high network bandwidth and can cause network bottlenecking.

19) Explain map-only job?

MapReduce is the data processing layer of Hadoop. It is the framework for writing applications that process the vast amount of data stored in the HDFS. It processes the huge amount of data in parallel by dividing the job into a set of independent tasks (sub-job). In Hadoop, MapReduce have 2 phases of processing: Map and Reduce.

In Map phase we specify all the complex logic/business rules/costly code. Map takes a set of data and converts it into another set of data. It also break individual elements into tuples (key-value pairs). In Reduce phase we specify light-weight processing like aggregation/summation. Reduce takes the output from the map as input. After that it combines tuples (key-value) based on the key. And then, modifies the value of the key accordingly.

Consider a case where we just need to perform the operation and no aggregation required. Thus, in such case, we will prefer “Map-Only job” in Hadoop. In Map-Only job, the map does all task with its InputSplit and the reducer do no job. Map output is the final output.

This we can achieve by setting job.setNumreduceTasks(0) in the configuration in a driver. This will make a number of reducer 0 and thus only mapper will be doing the complete task.

Read about map-only job in Hadoop Mapreduce in detail.

20) What is SequenceFileInputFormat in Hadoop MapReduce?

SequenceFileInputFormat is an InputFormat which reads sequence files. Sequence files are binary files that stores sequences of binary key-value pairs. These files are block-compress. Thus, Sequence files provide direct serialization and deserialization of several arbitrary data types.
Here Key and value- Both are user-defined.

SequenceFileAsTextInputFormat is variant of SequenceFileInputFormat. It converts the sequence file’s key value to text objects. Hence, by calling ‘tostring()’ it performs conversion on the keys and values. Thus, this InputFormat makes sequence files suitable input for streaming.

SequenceFileAsBinaryInputFormat is variant of SequenceFileInputFormat. Hence, by using this we can extract the sequence file’s keys and values as an opaque binary object.

21) What is KeyValueTextInputFormat in Hadoop?

KeyValueTextInputFormat- It treats each line of input as a separate record. It breaks the line itself into key and value. Thus, it uses the tab character (‘/t’) to break the line into a key-value pair.
Key- Everything up to tab character.
Value- Remaining part of the line after tab character.

Consider the following input file, where → represents a (horizontal) tab character:

But→ his face you could not see

Account→ of his beaver hat Hence,

Output:

Key- But
Value- his face you could not see
Key- Account
Value- of his beaver hat

22) Differentiate Reducer and Combiner in Hadoop MapReduce?

Combiner- The combiner is Mini-Reducer that perform local reduce task. It run on the Map output and produces the output to reducer input. Combiner is usually used for network optimization.

Reducer- Reducer takes a set of an intermediate key-value pair produced by the mapper as the input. Then runs a reduce function on each of them to generate the output. An output of the reducer is the final output.

  • Unlike a reducer, the combiner has a limitation . i.e. the input or output key and value types must match the output types of the mapper.
  • Combiners can operate only on a subset of keys and values . i.e. combiners can execute on functions that are commutative.
  • Combiner functions take input from a single mapper. While reducers can take data from multiple mappers as a result of partitioning.

23) Explain the process of spilling in MapReduce?

Map task processes each input record (from RecordReader) and generates a key-value pair. The Mapper does not store its output on HDFS. Thus, this is temporary data and writing on HDFS will create unnecessary multiple copies. The Mapper writes its output into the circular memory buffer (RAM). Size of the buffer is 100 MB by default. We can also change it by using mapreduce.task.io.sort.mb property.

Now, spilling is a process of copying the data from the memory buffer to disc. It takes place when the content of the buffer reaches a certain threshold size. So, background thread by default starts spilling the contents after 80% of the buffer size has filled. Therefore, for a 100 MB size buffer, the spilling will start after the content of the buffer reach a size of 80MB.

24) What happen if number of reducer is set to 0 in Hadoop?

If we set the number of reducer to 0:

  • Then no reducer will execute and no aggregation will take place.
  • In such case we will prefer “Map-only job” in Hadoop. In map-Only job, the map does all task with its InputSplit and the reducer do no job. Map output is the final output.

In between map and reduce phases there is key, sort, and shuffle phase. Sort and shuffle phase are responsible for sorting the keys in ascending order. Then grouping values based on same keys. This phase is very expensive. If reduce phase is not required we should avoid it. Avoiding reduce phase would eliminate sort and shuffle phase as well. This also saves network congestion. As in shuffling an output of mapper travels to reducer,when data size is huge, large data travel to reducer.

25) What is Speculative Execution in Hadoop?

MapReduce breaks jobs into tasks and run these tasks parallely rather than sequentially. Thus reduces execution time. This model of execution is sensitive to slow tasks as they slow down the overall execution of a job. There are various reasons for the slowdown of tasks like hardware degradation. But it may be difficult to detect causes since the tasks still complete successfully. Although it takes more time than the expected time.

Hadoop framework doesn’t try to fix and diagnose slow running task. It tries to detect them and run backup tasks for them. This process is called Speculative execution in Hadoop. These backup tasks are called Speculative tasks in Hadoop.

First of all Hadoop framework launch all the tasks for the job in Hadoop MapReduce. Then it launch speculative tasks for those tasks that have been running for some time (one minute). And the task that have not made any much progress, on average, as compared with other tasks from the job.

If the original task completes before the speculative task. Then it will kill speculative task . On the other hand, it will kill the original task if the speculative task finishes before it.

Read about Speculative Execution in detail.

26) What counter in Hadoop MapReduce?

Counters in MapReduce are useful Channel for gathering statistics about the MapReduce job. Statistics like for quality control or for application-level. They are also useful for problem diagnosis.

Counters validate that:

  • Number of bytes read and write within map/reduce job is correct or not
  • The number of tasks launches and successfully run in map/reduce job is correct or not.
  • The amount of CPU and memory consumed is appropriate for our job and cluster nodes.

There are two types of counters:

  • Built-In Counters – In Hadoop there are some built-In counters for every job. These report various metrics, like, there are counters for the number of bytes and records. Thus, this allows us to confirm that it consume the expected amount of input. Also make sure that it produce the expected amount of output.
  • User-Defined Counters – Hadoop MapReduce permits user code to define a set of counters. These are then increased as desired in the mapper or reducer. For example, in Java, use ‘enum’ to define counters.

Read about Counters in detail.

27) How to submit extra files(jars,static files) for MapReduce job during runtime in Hadoop?

MapReduce framework provides Distributed Cache to caches files needed by the applications. It can cache read-only text files, archives, jar files etc.
An application which needs to use distributed cache to distribute a file should make sure that the files are available on URLs.
URLs can be either hdfs:// or http://.

Now, if the file is present on the hdfs:// or http://urls. Then, user mentions it to be cache file to distribute. This framework will copy the cache file on all the nodes before starting of tasks on those nodes. The files are only copied once per job. Applications should not modify those files.

28) What is TextInputFormat in Hadoop?

TextInputFormat is the default InputFormat. It treats each line of the input file as a separate record. For unformatted data or line-based records like log files, TextInputFormat is useful. By default, RecordReader also uses TextInputFormat for converting data into key-value pairs. So,

  • Key- It is the byte offset of the beginning of the line.
  • Value- It is the contents of the line, excluding line terminators.

File content is- on the top of the building

so,
Key- 0

Value- on the top of the building

TextInputFormat also provides below 2 types of RecordReader- 

  • LineRecordReader
  • SequenceFileRecordReader

29) How many Mappers run for a MapReduce job?

Number of mappers depends on 2 factors:

  • Amount of data we want to process along with block size. It is driven by the number of inputsplit. If we have block size of 128 MB and we expect 10TB of input data, we will have 82,000 maps. Ultimately InputFormat determines the number of maps.
  • Configuration of the slave i.e. number of core and RAM available on slave. The right number of map/node can between 10-100. Hadoop framework should give 1 to 1.5 cores of processor for each mapper. For a 15 core processor, 10 mappers can run.

In MapReduce job, by changing block size we can control number of Mappers . By Changing block size the number of inputsplit increases or decreases.
By using the JobConf’s conf.setNumMapTasks(int num) we can increase the number of map task.

Mapper= {(total data size)/ (input split size)}
If data size= 1 Tb and input split size= 100 MB
Mapper= (1000*1000)/100= 10,000

30) How many Reducers run for a MapReduce job?

With the help of Job.setNumreduceTasks(int) the user set the number of reduces for the job. To set the right number of reducesrs use the below formula:

0.95 Or 1.75 multiplied by (<no. of nodes> * <no. of maximum container per node>).

As the map finishes, all the reducers can launch immediately and start transferring map output with 0.95. With 1.75, faster nodes finsihes first round of reduces and launch second wave of reduces .
With the increase of number of reducers:

  • Load balancing increases.
  • Cost of failures decreases.
  • Framework overhead increases.

31) How to sort intermediate output based on values in MapReduce?

Hadoop MapReduce automatically sorts key-value pair generated by the mapper. Sorting takes place on the basis of keys. Thus, to sort intermediate output based on values we need to use secondary sorting.

There are two possible approaches:

  • First, using reducer, reducer reads and buffers all the value for a given key. Then, do an in-reducer sort on all the values. Reducer will receive all the values for a given key (huge list of values), this cause reducer to run out of memory. Thus, this approach can work well if number of values is small.
  • Second, using MapReduce paradigm. It sort the reducer input values, by creating a composite key” (using value to key conversion approach) . i.e.by adding a part of, or entire value to, the natural key to achieve sorting technique. This approach is scalable and will not generate out of, memory errors.

We need custom partitioner to make that all the data with same key (composite key with the value) . So, data goes to the same reducer and custom comparator. In Custom comparator the data grouped by the natural key once it arrives at the reducer.

32) What is purpose of RecordWriter in Hadoop?

Reducer takes mapper output (intermediate key-value pair) as an input. Then, it runs a reducer function on them to generate output (zero or more key-value pair). So, the output of the reducer is the final output.

RecordWriter writes these output key-value pair from the Reducer phase to output files. OutputFormat determines, how RecordWriter writes these key-value pairs in Output files. Hadoop provides OutputFormat instances which help to write files on the in HDFS or local disk.

33) What are the most common OutputFormat in Hadoop?

Reducer takes mapper output as input and produces output (zero or more key-value pair). RecordWriter writes these output key-value pair from the Reducer phase to output files. So, OutputFormat determines, how RecordWriter writes these key-value pairs in Output files.

FileOutputFormat.setOutputpath() method used to set the output directory. So, every Reducer writes a separate in a common output directory.

Most common OutputFormat are:

  • TextOutputFormat – It is the default OutputFormat in MapReduce. TextOutputFormat writes key-value pairs on individual lines of text files. Keys and values of this format can be of any type. Because TextOutputFormat turns them to string by calling toString() on them.
  • SequenceFileOutputFormat – This OutputFormat writes sequences files for its output. It is also used between MapReduce jobs.
  • SequenceFileAsBinaryOutputFormat – It is another form of SequenceFileInputFormat. which writes keys and values to sequence file in binary format.
  • DBOutputFormat – We use this for writing to relational databases and HBase. Then, it sends the reduce output to a SQL table. It accepts key-value pairs, where the key has a type extending DBwritable.

Read about outputFormat in detail.

34) What is LazyOutputFormat in Hadoop?

FileOutputFormat subclasses will create output files (part-r-nnnn), even if they are empty. Some applications prefer not to create empty files, which is where LazyOutputFormat helps.

LazyOutputFormat is a wrapper OutputFormat. It make sure that the output file should create only when it emit its first record for a given partition.

To use LazyOutputFormat, call its SetOutputFormatClass() method with the JobConf.

To enable LazyOutputFormat, streaming and pipes supports a – lazyOutput option.

35) How to handle record boundaries in Text files or Sequence files in MapReduce InputSplits?

InputSplit’s RecordReader in MapReduce will “start” and “end” at a record boundary.

In SequenceFile, every 2k bytes has a 20 bytes sync mark between the records. And, the sync marks between the records allow the RecordReader to seek to the start of the InputSplit. It contains a file, length and offset. It also find the first sync mark after the start of the split. And, the RecordReader continues processing records until it reaches the first sync mark after the end of the split.

Similarly, Text files use newlines instead of sync marks to handle record boundaries.

36) What are the main configuration parameters in a MapReduce program?

The main configuration parameters are:

  • Input format of data.
  • Job’s input locations in the distributed file system.
  • Output format of data.
  • Job’s output location in the distributed file system.
  • JAR file containing the mapper, reducer and driver classes
  • Class containing the map function.
  • Class containing the reduce function..

37) Is it mandatory to set input and output type/format in MapReduce?

No, it is mandatory.

Hadoop cluster, by default, takes the input and the output format as ‘text’.

TextInputFormat – MapReduce default InputFormat is TextInputFormat. It treats each line of each input file as a separate record and also performs no parsing. For unformatted data or line-based records like log files, TextInputFormat is also useful. By default, RecordReader also uses TextInputFormat for converting data into key-value pairs.

TextOutputFormat- MapReduce default OutputFormat is TextOutputFormat. It also writes (key, value) pairs on individual lines of text files. Its keys and values can be of any type.

38) What is Identity Mapper?

Identity Mapper is the default Mapper provided by Hadoop. When MapReduce program has not defined any mapper class then Identity mapper runs. It simply passes the input key-value pair for the reducer phase. Identity Mapper does not perform computation and calculations on the input data. So, it only writes the input data into output.

The class name is org.apache.hadoop.mapred.lib.IdentityMapper

39) What is Identity reducer?

Identity Reducer is the default Reducer provided by Hadoop. When MapReduce program has not defined any mapper class then Identity mapper runs. It does not mean that the reduce step will not take place. It will take place and related sorting and shuffling will also take place. But there will be no aggregation. So you can use identity reducer if you want to sort your data that is coming from the map but don’t care for any grouping.

40) What is Chain Mapper?

We can use multiple Mapper classes within a single Map task by using Chain Mapper class. The Mapper classes invoked in a chained (or piped) fashion. The output of the first becomes the input of the second, and so on until the last mapper. The Hadoop framework write output of the last mapper to the task’s output.

The key benefit of this feature is that the Mappers in the chain do not need to be aware that they execute in a chain. And, this enables having reusable specialized Mappers. We can combine these mappers to perform composite operations within a single task in Hadoop.

Hadoop framework take Special care when create chains. The key/values output by a Mapper are valid for the following mapper in the chain.

The class name is org.apache.hadoop.mapred.lib.ChainMapper

41) What are the core methods of a Reducer?

Reducer process the output the mapper. After processing the data, it also produces a new set of output, which it stores in HDFS. And, the core methods of a Reducer are:

  • setup()- Various parameters like the input data size, distributed cache, heap size, etc this method configure. Function Definition- public void setup (context)
  • reduce() – Reducer call this method once per key with the associated reduce task. Function Definition- public void reduce (key, value, context)
  • cleanup() – Reducer call this method only once at the end of reduce task for clearing all the temporary files. Function Definition- public void cleanup (context)

42) What are the parameters of mappers and reducers?

The parameters for Mappers are:

  • LongWritable(input)
  • text (input)
  • text (intermediate output)
  • IntWritable (intermediate output)

The parameters for Reducers are:

  • text (intermediate output)
  • IntWritable (intermediate output)
  • text (final output)
  • IntWritable (final output)

43) What is the difference between TextinputFormat and KeyValueTextInputFormat class?

TextInputFormat – It is the default InputFormat. It treats each line of the input file as a separate record. For unformatted data or line-based records like log files, TextInputFormat is also useful. So,

  • Key- It is byte offset of the beginning of the line within the file.
  • Value- It is the contents of the line, excluding line terminators.

KeyValueTextInputFormat – It is like TextInputFormat. The reason is it also treats each line of input as a separate record. But the main difference is that TextInputFormat treats entire line as the value. While the KeyValueTextInputFormat breaks the line itself into key and value by the tab character (‘/t’). so,

  • Key- Everything up to tab character.
  • Value- Remaining part of the line after tab character.

For example, consider a file contents as below:

AL#Alabama

AR#Arkansas

FL#Florida

So, TextInputFormat
Key      value
0         AL#Alabama 14
AR#Arkansas 23
FL#Florida

So, KeyValueTextInputFormat
Key      value
AL       Alabama
AR       Arkansas
FL        Florida

44) How is the splitting of file invoked in Hadoop ?

InputFormat is responsible for creating InputSplit, which is the logical representation of data. Further Hadoop framework divides split into records. Then, Mapper process each record (which is a key-value pair).

By running getInputSplit() method Hadoop framework invoke Splitting of file . getInputSplit() method belongs to Input Format class (like FileInputFormat) defined by the user.

45) How many InputSplits will be made by hadoop framework?

InputFormat is responsible for creating InputSplit, which is the logical representation of data. Further Hadoop framework divides split into records. Then, Mapper process each record (which is a key-value pair).

MapReduce system use storage locations to place map tasks as close to split’s data as possible. By default, split size is approximately equal to HDFS block size (128 MB).

For, example the file size is 514 MB,
128MB: 1st block, 128Mb: 2nd block, 128Mb: 3rd block,
128Mb: 4th block, 2Mb: 5th block

So, 5 InputSplit is created based on 5 blocks.

If in case you have any confusion about any Hadoop MapReduce Interview Questions, do let us know by leaving a comment. we will be glad to solve your queries.

46) Explain the usage of Context Object.

With the help of Context Object, Mapper can easily interact with other Hadoop systems. It also helps in updating counters. So counters can report the progress and provide any application-level status updates.
It contains configuration details for the job.

47) When is it not recommended to use MapReduce paradigm for large scale data processing?

For iterative processing use cases it is not suggested to use MapReduce. As it is not cost effective, instead Apache Pig can be used for the same.

48) What is the difference between RDBMS with Hadoop MapReduce?

Size of Data

  • RDBMS- Traditional RDBMS can handle upto gigabytes of data.
  • MapReduce- Hadoop MapReduce can hadnle upto petabytes of data or more.

Updates

  • RDBMS- Read and Write multiple times.
  • MapReduce- Read many times but write once model.

Schema

  • RDBMS- Static Schema that needs to be pre-defined.
  • MapReduce- Has a dynamic schema

Processing Model

  • RDBMS- Supports both batch and interactive processing.
  • MapReduce- Supports only batch processing.

Scalability

  • RDBMS- Non-Linear
  • MapReduce- Linear

49) Define Writable data types in Hadoop MapReduce.

Hadoop reads and writes data in a serialized form in the writable interface. The Writable interface has several classes like Text, IntWritable, LongWriatble, FloatWritable, BooleanWritable. Users are also free to define their personal Writable classes as well.

50) Explain what does the conf.setMapper Class do in MapReduce?

Conf.setMapperclass sets the mapper class. Which includes reading data and also generating a key-value pair out of the mapper.

See Also-

Leave a comment

Your email address will not be published. Required fields are marked *