How to decide input split?

    • #4892
      DataFlair Team
      Spectator

      How are input splits decided? On what basis are input splits created?

    • #4893
      DataFlair Team
      Spectator

      InputSplit in Hadoop

      In Hadoop MapReduce, an InputSplit is the logical representation of the data. It represents the unit of work that is processed by a single map task in a MapReduce program.

      In other words, an InputSplit represents the data processed by an individual Mapper. The split is further divided into records, and the mapper processes each record as a key-value pair.
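
      For illustration, here is a minimal Mapper sketch (the class name LineLengthMapper and the emitted output are hypothetical), showing that each record handed to map() is a key-value pair, here the byte offset of a line and the line text:

          import java.io.IOException;
          import org.apache.hadoop.io.IntWritable;
          import org.apache.hadoop.io.LongWritable;
          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.mapreduce.Mapper;

          // The framework calls map() once per record in the InputSplit
          // assigned to this map task.
          public class LineLengthMapper
                  extends Mapper<LongWritable, Text, Text, IntWritable> {

              @Override
              protected void map(LongWritable key, Text value, Context context)
                      throws IOException, InterruptedException {
                  // key   = byte offset of the line within the file
                  // value = the line itself; emit (line, length) as an example
                  context.write(value, new IntWritable(value.getLength()));
              }
          }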

      Every InputSplit also carries storage locations (hostname strings). The MapReduce framework uses these locations to place each map task as close to the split’s data as possible.
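
      As a rough sketch (the job name and the input path argument are only placeholders), the splits produced by an InputFormat and their locations can be inspected through InputFormat.getSplits() and InputSplit.getLocations():

          import java.util.List;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.mapreduce.InputSplit;
          import org.apache.hadoop.mapreduce.Job;
          import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
          import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

          public class SplitLocations {
              public static void main(String[] args) throws Exception {
                  Job job = Job.getInstance(new Configuration(), "split-locations");
                  FileInputFormat.addInputPath(job, new Path(args[0]));

                  // Ask the InputFormat for the logical splits of the input
                  List<InputSplit> splits = new TextInputFormat().getSplits(job);
                  for (InputSplit split : splits) {
                      // getLocations() returns hostnames used for data locality
                      System.out.println(split.getLength() + " bytes on "
                              + String.join(", ", split.getLocations()));
                  }
              }
          }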

      InputSplits are created by an InputFormat (the InputFormat creates the splits and divides them into records), so as a user we do not need to deal with InputSplit directly. By default, FileInputFormat breaks a file into 128 MB chunks (the same size as HDFS blocks). We can control this value by setting the mapred.min.split.size parameter in mapred-site.xml, or by overriding it in the Job object used to submit a particular MapReduce job. We can also control how a file is broken up into splits by writing a custom InputFormat, as sketched below.
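
      As a sketch of per-job control (rather than editing mapred-site.xml), FileInputFormat exposes helpers that set the minimum and maximum split size on the Job; the 256 MB and 512 MB values below are arbitrary examples:

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.mapreduce.Job;
          import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

          public class SplitSizeConfig {
              public static void main(String[] args) throws Exception {
                  Job job = Job.getInstance(new Configuration(), "custom-split-size");

                  // Raise the minimum split size to 256 MB so small blocks are
                  // combined into fewer, larger splits (and fewer map tasks).
                  FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
                  // Optionally cap the maximum split size as well.
                  FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);
              }
          }

      These helpers set the split-size properties on the job configuration, which is the per-job equivalent of changing mapred.min.split.size cluster-wide.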

      To learn more about InputSplit, follow the link: InputSplit in Hadoop MapReduce
