How is the splitting of a file invoked in Hadoop?


    • #6092
      DataFlair Team
      Spectator

      How is the splitting of a file invoked in Apache Hadoop?

    • #6094
      DataFlair Team
      Spectator

      The input file to be processed is stored on HDFS. The InputFormat component of a MapReduce job divides this file into splits, which are called InputSplits in Hadoop MapReduce.

      The InputFormat component defines how the input file will be split and in which format it will be read.

      So the tasks of InputFormat are as follows:

      1. Select the input file(s) for splitting.
      2. Define the InputSplits for each file.
      3. Provide a factory for RecordReader objects (which actually read the input file).
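
      To make this concrete, below is a minimal sketch of how an InputFormat is set on a job using the classic org.apache.hadoop.mapred API (the same API as the getSplits(JobConf, int) signature discussed further down); the input path /user/hadoop/input is only a placeholder:

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapred.FileInputFormat;
      import org.apache.hadoop.mapred.JobConf;
      import org.apache.hadoop.mapred.TextInputFormat;

      public class InputFormatConfig {
          public static void main(String[] args) {
              // Job configuration using the classic mapred API.
              JobConf conf = new JobConf(InputFormatConfig.class);

              // Tell the job which InputFormat to use; TextInputFormat is a
              // file-based InputFormat that splits files and reads each record
              // as one line of text.
              conf.setInputFormat(TextInputFormat.class);

              // Placeholder input path; FileInputFormat records it in the JobConf
              // so that getSplits() later knows which files to split.
              FileInputFormat.addInputPath(conf, new Path("/user/hadoop/input"));
          }
      }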

      There is a class called FileInputFormat which implements the interface InputFormat<K,V> and is the base class for all file-based InputFormats.
      It provides a generic implementation of getSplits(JobConf, int), which returns the InputSplits for the job's input files.
      Subclasses of FileInputFormat can also override the isSplitable(FileSystem, Path) method to return false, so that input files are not split up and each is processed as a whole by a single Mapper.
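
      For example, a hypothetical subclass that keeps each file whole might look like this (TextInputFormat already extends FileInputFormat, so only isSplitable needs to be overridden):

      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapred.TextInputFormat;

      // Hypothetical subclass for illustration: returning false from
      // isSplitable() means each input file becomes exactly one InputSplit
      // and is therefore processed as a whole by a single Mapper.
      public class WholeFileTextInputFormat extends TextInputFormat {
          @Override
          protected boolean isSplitable(FileSystem fs, Path file) {
              return false; // never split this file
          }
      }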

      getSplits in interface InputFormat<K,V>

      Parameters:
      job – the job configuration
      numSplits – the desired number of splits (a hint)

      Returns:
      an array of InputSplits for the job
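
      As a rough sketch, the splits produced by this method can also be inspected directly (again using the classic mapred API; the input path and the split hint of 4 are placeholders):

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapred.FileInputFormat;
      import org.apache.hadoop.mapred.InputSplit;
      import org.apache.hadoop.mapred.JobConf;
      import org.apache.hadoop.mapred.TextInputFormat;

      public class ShowSplits {
          public static void main(String[] args) throws Exception {
              JobConf conf = new JobConf(ShowSplits.class);
              FileInputFormat.addInputPath(conf, new Path("/user/hadoop/input"));

              TextInputFormat format = new TextInputFormat();
              format.configure(conf);                           // pick up job settings
              InputSplit[] splits = format.getSplits(conf, 4);  // numSplits is only a hint

              for (InputSplit split : splits) {
                  // Each InputSplit reports its length (and the hosts holding its data).
                  System.out.println(split + " length=" + split.getLength());
              }
          }
      }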

      Thus, the splitting of a file is invoked by the getSplits() method in Apache Hadoop.
