How to handle record boundaries in Text/Sequence files in MapReduce InputSplits
-
-
How are record boundaries handled in Text files or Sequence files when MapReduce creates InputSplits? In Hadoop, what happens when a split boundary falls in the middle of a record in a text file?
-
Hadoop uses your InputFormat and RecordReader to (1) create splits and (2) parse the data within each split into records (key/value objects) that are passed to the mapper. If an InputSplit (which your InputFormat creates) doesn't align exactly with an HDFS block boundary, Hadoop's FileInputFormat (and the formats that extend it) will Do The Right Thing(tm): the reader for one split keeps reading past the end of its split, performing a partial network read of the first few bytes of the next block to complete the record it started, while the reader for the next split skips the partial record at its start. The source of TextInputFormat (which extends FileInputFormat) and its LineRecordReader is where this logic lives.
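To make the rule concrete, here is a small, self-contained model of that behavior (no Hadoop dependency; the class name `SplitLineReader` and the method are hypothetical, not the real LineRecordReader API). The convention it models: a record belongs to the split that contains its first byte, so each reader skips a partial first line and reads past the end of its split to finish the last record it started.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of how a line-oriented RecordReader assigns
// records to splits. Not Hadoop's actual code -- a model of the rule.
public class SplitLineReader {

    /** Return the lines of `data` that belong to the split [start, end). */
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;

        // If the split does not begin at a record boundary, the partial
        // first line belongs to the previous split: skip past its newline.
        if (start != 0 && data[start - 1] != '\n') {
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // step over the newline
        }

        // Read whole lines whose first byte lies inside this split,
        // continuing beyond `end` when a line straddles the boundary --
        // this models the partial read into the next HDFS block.
        while (pos < end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            records.add(new String(data, lineStart, pos - lineStart));
            pos++; // step over the newline (or EOF)
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes();
        // Split the file at byte 8, which falls inside "bravo".
        System.out.println(readSplit(data, 0, 8));   // [alpha, bravo]
        System.out.println(readSplit(data, 8, 20));  // [charlie]
    }
}
```

Note how "bravo" is read entirely by the first split's reader even though it extends past byte 8, and the second split's reader skips those same bytes, so every record is consumed exactly once across contiguous splits.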