How is the splitting of a file invoked in Hadoop?


    • #6092
      DataFlair Team
      Spectator

      How is the splitting of a file invoked in Apache Hadoop?

    • #6094
      DataFlair Team
      Spectator

      The input file to be processed is stored on HDFS. The InputFormat component of a MapReduce job divides this file into splits, which are called InputSplits in Hadoop MapReduce.

      The InputFormat component defines how the input file will be split and in which format it will be read.

      So the tasks of InputFormat are as follows:

      1. Select the input file(s) for splitting.
      2. Define the InputSplits for each file.
      3. Provide a factory for RecordReader objects (which actually read the input file).
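
      To make this concrete, below is a minimal sketch of how an InputFormat is set on a job using the classic org.apache.hadoop.mapred API (the same API as the getSplits(JobConf, int) signature discussed further down); the input path /user/hadoop/input is only a placeholder:

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapred.FileInputFormat;
      import org.apache.hadoop.mapred.JobConf;
      import org.apache.hadoop.mapred.TextInputFormat;

      public class InputFormatConfig {
          public static void main(String[] args) {
              // Job configuration using the classic mapred API.
              JobConf conf = new JobConf(InputFormatConfig.class);

              // Tell the job which InputFormat to use; TextInputFormat is a
              // file-based InputFormat that splits files and reads each record
              // as one line of text.
              conf.setInputFormat(TextInputFormat.class);

              // Placeholder input path; FileInputFormat records it in the JobConf
              // so that getSplits() later knows which files to split.
              FileInputFormat.addInputPath(conf, new Path("/user/hadoop/input"));
          }
      }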

      There is a class called FileInputFormat which implements the interface InputFormat<K,V> and is the base class for all file-based InputFormats.
      It provides a generic implementation of getSplits(JobConf, int), which returns the InputSplits for the job's input files.
      Subclasses of FileInputFormat can also override the isSplitable(FileSystem, Path) method to return false, so that input files are not split up and each is processed as a whole by a single Mapper.
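
      For example, a hypothetical subclass that keeps each file whole might look like this (TextInputFormat already extends FileInputFormat, so only isSplitable needs to be overridden):

      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapred.TextInputFormat;

      // Hypothetical subclass for illustration: returning false from
      // isSplitable() means each input file becomes exactly one InputSplit
      // and is therefore processed as a whole by a single Mapper.
      public class WholeFileTextInputFormat extends TextInputFormat {
          @Override
          protected boolean isSplitable(FileSystem fs, Path file) {
              return false; // never split this file
          }
      }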

      getSplits in interface InputFormat<K,V>

      Parameters:
      job – the job configuration
      numSplits – the desired number of splits (a hint)

      Returns:
      an array of InputSplits for the job
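
      As a rough sketch, the splits produced by this method can also be inspected directly (again using the classic mapred API; the input path and the split hint of 4 are placeholders):

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapred.FileInputFormat;
      import org.apache.hadoop.mapred.InputSplit;
      import org.apache.hadoop.mapred.JobConf;
      import org.apache.hadoop.mapred.TextInputFormat;

      public class ShowSplits {
          public static void main(String[] args) throws Exception {
              JobConf conf = new JobConf(ShowSplits.class);
              FileInputFormat.addInputPath(conf, new Path("/user/hadoop/input"));

              TextInputFormat format = new TextInputFormat();
              format.configure(conf);                           // pick up job settings
              InputSplit[] splits = format.getSplits(conf, 4);  // numSplits is only a hint

              for (InputSplit split : splits) {
                  // Each InputSplit reports its length (and the hosts holding its data).
                  System.out.println(split + " length=" + split.getLength());
              }
          }
      }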

      Thus, the splitting of a file is invoked by the getSplits() method in Apache Hadoop.
