This topic contains 1 reply, has 1 voice, and was last updated by  dfbdteam3 12 months ago.

Viewing 2 posts - 1 through 2 (of 2 total)
  • Author
    Posts
  • #4790

    dfbdteam3
    Moderator

    As we know, when we copy data into HDFS, the files are divided into blocks, and those blocks are stored in a distributed way across HDFS. Who divides these input files? Is there an algorithm for this?

    #4791

    dfbdteam3
    Moderator

    The HDFS client divides the file(s) into blocks. The HDFS client resides on the client node. Below is an algorithm that shows how the “HDFS client” could divide the file(s) into blocks.

    IHDFileSplit.java:

    package com.dataflair.hadoop;

    interface IHDFileSplit {
        void fileSplitAndBlockCreation(String inParamFile, int hdBlockSize);
    }

    HDFileSplitClass.java:

    package com.dataflair.hadoop;

    import java.io.FileInputStream;
    import java.io.IOException;

    public class HDFileSplitClass {
        public static void main(String[] args) {

            // Implemented in a functional style, as a lambda expression
            IHDFileSplit hdFileSplit1 = (inParamFile, hdBlockSize) -> {
                /*
                 * available() returns the number of bytes that can be read
                 * from the stream, which for a local file equals the file
                 * size. We then convert bytes to KB, and KB to MB.
                 */
                double mb = 0;
                try (FileInputStream fis = new FileInputStream(inParamFile)) {
                    mb = fis.available() / (1024.0 * 1024.0);
                } catch (IOException e) {
                    e.printStackTrace();
                }

                int numBlocks = 0;

                if (mb >= hdBlockSize) {
                    numBlocks = (int) (mb / hdBlockSize);
                    /*
                     * numBlocks now holds the number of full blocks we need
                     * to create for the given hdBlockSize.
                     *
                     * To actually create the blocks, we could use the Java
                     * NIO buffer classes (ByteBuffer, etc.).
                     */
                }

                double remainderSize = mb % hdBlockSize;

                if (remainderSize != 0) {
                    // Create one final, smaller block of remainderSize MB
                }
            };

            /*
             * The value 128 below is not fixed. In practice, the block size
             * is configured via the dfs.blocksize property in hdfs-site.xml;
             * that configured value should be read and passed as the second
             * argument to the method call below.
             */
            hdFileSplit1.fileSplitAndBlockCreation("amazonproductlist.txt", 128);
        }
    }
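    As the comment in the code notes, the block size should come from configuration rather than a hard-coded 128. On a real cluster it is set via the dfs.blocksize property in hdfs-site.xml; a minimal fragment (134217728 bytes = 128 MB, the usual default) looks roughly like:

    ```xml
    <!-- hdfs-site.xml (fragment) -->
    <configuration>
      <property>
        <name>dfs.blocksize</name>
        <value>134217728</value> <!-- 128 MB in bytes -->
      </property>
    </configuration>
    ```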

    Example: 
    ————-
    Let’s take an input file of size 500.23 MB and a block size of 128 MB.

    500.23 > 128 is true, so we enter the if statement:

    numBlocks = (int) (500.23 / 128)
    numBlocks = 3

    Here we create 3 blocks, each 128 MB in size, and store the data in them.

    Moving on to the next statement:

    remainderSize = mb % hdBlockSize
    500.23 % 128 = 116.23 MB

    Here we create one more block with the remainderSize value. This logic shows how Hadoop works intelligently: it never wastes space by allocating a full block when the remaining data is smaller than the default or configured block size (as per the hdfs-site.xml configuration file).
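    The arithmetic above can be checked with a small standalone sketch (the class name and the 500.23 MB figure are just the example values from this post, not part of HDFS itself):

    ```java
    public class BlockMathDemo {
        public static void main(String[] args) {
            double fileSizeMb = 500.23; // example input file size
            int blockSizeMb = 128;      // example block size

            // Number of full-size blocks
            int numBlocks = (int) (fileSizeMb / blockSizeMb);
            // Size of the final, smaller block
            double remainderSize = fileSizeMb % blockSizeMb;

            System.out.println("Full blocks: " + numBlocks);            // 3
            System.out.printf("Remainder:   %.2f MB%n", remainderSize); // 116.23

            // Total blocks, counting the remainder block
            int totalBlocks = numBlocks + (remainderSize != 0 ? 1 : 0);
            System.out.println("Total blocks: " + totalBlocks);         // 4
        }
    }
    ```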

    One more note: if the file contains binary data, an InputStream is a good choice; if it contains character data, a Reader is better, and so on. Just to showcase the algorithm, I have used an InputStream to read the data and determine the file size.
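    A side note on getting the size: available() happens to work for local files, but its contract only guarantees the number of bytes readable without blocking, not the total file length. A sketch of a more reliable approach using java.nio.file.Files.size() (the file name is the same hypothetical one as above):

    ```java
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class FileSizeDemo {
        public static void main(String[] args) throws IOException {
            // Files.size() reads the length from file metadata, so it works
            // the same whether the content is binary or character data.
            long bytes = Files.size(Paths.get("amazonproductlist.txt"));
            double mb = bytes / (1024.0 * 1024.0);
            System.out.printf("File size: %.2f MB%n", mb);
        }
    }
    ```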

