The Hadoop Distributed File System (HDFS) is a distributed file system designed mainly for batch processing of large volumes of data. The default block size in HDFS is 128 MB. When files are significantly smaller than the block size, efficiency degrades.
There are two main reasons small files are produced:
Files may be pieces of a larger logical file. Because HDFS only recently gained support for appends, such unbounded files have traditionally been saved by writing them in chunks into HDFS.
Other files are inherently small and cannot be combined into one larger file, e.g. a large corpus of images where each image is a distinct file.
The small file problem is two-fold. 1) Small file problem in HDFS:
HDFS cannot efficiently handle a large number of files that are much smaller than the block size. Reading many small files involves lots of seeks and lots of hopping from datanode to datanode, which in turn makes data processing inefficient.
In the namenode's memory, every file, directory, and block in HDFS is represented as an object, each occupying roughly 150 bytes. If we consider 10 million small files, each using its own block, that is about 20 million objects (one file object plus one block object per file), or roughly 3 gigabytes of memory. Given the hardware limitations of a single namenode, scaling much beyond this level is a problem: with a lot of files, the memory required to store the metadata is high and cannot scale beyond a limit.
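The 3 GB figure can be reproduced with a back-of-the-envelope calculation. The sketch below uses the ~150-bytes-per-object rule of thumb from the text; the function name and parameters are illustrative, not part of any Hadoop API:

```python
# Rough estimate of namenode heap consumed by file metadata.
# Assumption (from the text): each file, directory, and block object
# costs about 150 bytes of namenode memory.
OBJECT_SIZE_BYTES = 150

def namenode_memory_bytes(num_files, blocks_per_file=1):
    """Approximate namenode memory for num_files small files.

    Each file contributes one file object plus its block objects.
    """
    objects = num_files * (1 + blocks_per_file)
    return objects * OBJECT_SIZE_BYTES

# 10 million small files, one block each -> ~20 million objects -> ~3 GB
print(namenode_memory_bytes(10_000_000) / 1e9)  # 3.0
```

The same estimate also shows why consolidating files helps: packing those 10 million files into 10 thousand large files cuts the object count, and hence the namenode memory, by roughly three orders of magnitude.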
2) Small file problem in MapReduce:
In MapReduce, each map task processes one block of data at a time. Many small files mean many blocks, which means many tasks and a lot of bookkeeping by the Application Master. This slows overall cluster performance compared to processing large files.
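The task-count blow-up is easy to quantify. The sketch below assumes the common default of one map task per block (real splits can be tuned with input formats, so treat this as an approximation):

```python
# Compare the number of map tasks (roughly one per block) needed to
# process 1 GB stored as one large file vs. many small files.
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size: 128 MB

def num_map_tasks(file_sizes, block_size=BLOCK_SIZE):
    """Each file occupies ceil(size / block_size) blocks (minimum 1);
    MapReduce typically launches one map task per block."""
    tasks = 0
    for size in file_sizes:
        tasks += max(1, -(-size // block_size))  # ceiling division
    return tasks

one_large = [1024 * 1024 * 1024]    # a single 1 GB file
many_small = [100 * 1024] * 10_000  # 10,000 files of 100 KB each

print(num_map_tasks(one_large))   # 8
print(num_map_tasks(many_small))  # 10000
```

Even though both layouts hold roughly 1 GB of data, the small-file layout launches over a thousand times more tasks, and each task carries fixed scheduling and startup overhead.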
Possible solutions to the small files problem include:
1. HAR (Hadoop Archive) files
2. SequenceFiles
3. HBase (if latency is not an issue), among other options.
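HAR files and SequenceFiles share one underlying idea: pack many small files into a single large container keyed by filename, so HDFS stores a few large files instead of millions of tiny ones. The sketch below is a toy illustration of that pattern only; it is not the real SequenceFile or HAR on-disk format, and the length-prefixed layout is an assumption made for the example:

```python
# Toy illustration of the pack-small-files-into-one-container pattern
# behind SequenceFiles and HAR. NOT the actual Hadoop formats -- each
# entry here is simply a length-prefixed (filename, contents) record.
import io
import struct

def pack(files):
    """files: dict of {name: bytes}. Returns one packed blob."""
    buf = io.BytesIO()
    for name, data in files.items():
        key = name.encode("utf-8")
        # Write key length and value length, then the key and value.
        buf.write(struct.pack(">II", len(key), len(data)))
        buf.write(key)
        buf.write(data)
    return buf.getvalue()

def unpack(blob):
    """Reverse of pack: recover the {name: bytes} mapping."""
    files, off = {}, 0
    while off < len(blob):
        klen, vlen = struct.unpack_from(">II", blob, off)
        off += 8
        name = blob[off:off + klen].decode("utf-8")
        off += klen
        files[name] = blob[off:off + vlen]
        off += vlen
    return files

small_files = {"a.txt": b"hello", "b.txt": b"world"}
blob = pack(small_files)
print(unpack(blob) == small_files)  # True
```

With this layout the namenode tracks one file (the container) rather than one object per small file, which is exactly the metadata saving the techniques above provide; the real formats add block-aligned sync markers, compression, and an index for random access.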