This topic contains 1 reply, has 1 voice, and was last updated by  dfbdteam3 12 months ago.

Viewing 2 posts - 1 through 2 (of 2 total)
  • Author
    Posts
  • #4790

    dfbdteam3
    Moderator

    As we know, when we copy data into HDFS, the files are divided into blocks, and those blocks are stored in a distributed way across HDFS. Who divides these input files? Is there an algorithm for this?

    #4791

    dfbdteam3
    Moderator

    The HDFS client divides the file(s) into blocks. The HDFS client resides on the client node. Below is an algorithm that shows how the “HDFS client” could divide the file(s) into blocks.

    IHDFileSplit.java:

    package com.dataflair.hadoop;

    interface IHDFileSplit {
        void fileSplitAndBlockCreation(String inParamFile, int hdBlockSize);
    }

    HDFileSplitClass.java:

    package com.dataflair.hadoop;

    import java.io.FileInputStream;
    import java.io.IOException;

    public class HDFileSplitClass {
        public static void main(String[] args) {

            // Implemented in a functional style, as a lambda expression
            IHDFileSplit hdFileSplit1 = (inParamFile, hdBlockSize) -> {
                /*
                 * available() returns the number of bytes that can be read
                 * from the stream, which for a local file equals the file
                 * size. We then convert bytes to KB, and KB to MB.
                 */
                double mb = 0;
                try (FileInputStream fis = new FileInputStream(inParamFile)) {
                    mb = fis.available() / (1024.0 * 1024.0);
                } catch (IOException e) {
                    e.printStackTrace();
                }

                int numBlocks = 0;

                if (mb >= hdBlockSize) {
                    numBlocks = (int) (mb / hdBlockSize);
                    /*
                     * numBlocks now holds the number of full blocks we need
                     * to create for the given hdBlockSize.
                     *
                     * To actually create the blocks, we could use the Java
                     * NIO buffer classes (ByteBuffer, etc.).
                     */
                }

                double remainderSize = mb % hdBlockSize;

                if (remainderSize != 0) {
                    // Create one final, smaller block of remainderSize MB
                }
            };

            /*
             * The value 128 below is not fixed. In practice, the block size
             * is configured via the dfs.blocksize property in hdfs-site.xml;
             * that configured value should be read and passed as the second
             * argument to the method call below.
             */
            hdFileSplit1.fileSplitAndBlockCreation("amazonproductlist.txt", 128);
        }
    }
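    As the comment in the code notes, the block size should come from configuration rather than a hard-coded 128. On a real cluster it is set via the dfs.blocksize property in hdfs-site.xml; a minimal fragment (134217728 bytes = 128 MB, the usual default) looks roughly like:

    ```xml
    <!-- hdfs-site.xml (fragment) -->
    <configuration>
      <property>
        <name>dfs.blocksize</name>
        <value>134217728</value> <!-- 128 MB in bytes -->
      </property>
    </configuration>
    ```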

    Example: 
    ————-
    Let’s take an input file of size 500.23 MB and a block size of 128 MB.

    500.23 > 128 is true, so we enter the if statement:

    numBlocks = (int) (500.23 / 128)
    numBlocks = 3

    Here we create 3 blocks, each 128 MB in size, and store the data in them.

    Moving on to the next statement:

    remainderSize = mb % hdBlockSize
    500.23 % 128 = 116.23 MB

    Here we create one more block with the remainderSize value. This logic shows how Hadoop works intelligently: it never wastes space by allocating a full block when the remaining data is smaller than the default or configured block size (as per the hdfs-site.xml configuration file).
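    The arithmetic above can be checked with a small standalone sketch (the class name and the 500.23 MB figure are just the example values from this post, not part of HDFS itself):

    ```java
    public class BlockMathDemo {
        public static void main(String[] args) {
            double fileSizeMb = 500.23; // example input file size
            int blockSizeMb = 128;      // example block size

            // Number of full-size blocks
            int numBlocks = (int) (fileSizeMb / blockSizeMb);
            // Size of the final, smaller block
            double remainderSize = fileSizeMb % blockSizeMb;

            System.out.println("Full blocks: " + numBlocks);            // 3
            System.out.printf("Remainder:   %.2f MB%n", remainderSize); // 116.23

            // Total blocks, counting the remainder block
            int totalBlocks = numBlocks + (remainderSize != 0 ? 1 : 0);
            System.out.println("Total blocks: " + totalBlocks);         // 4
        }
    }
    ```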

    One more note: if the file contains binary data, an InputStream is a good choice; if it contains character data, a Reader is better, and so on. Just to showcase the algorithm, I have used an InputStream to read the data and determine the file size.
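    A side note on getting the size: available() happens to work for local files, but its contract only guarantees the number of bytes readable without blocking, not the total file length. A sketch of a more reliable approach using java.nio.file.Files.size() (the file name is the same hypothetical one as above):

    ```java
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class FileSizeDemo {
        public static void main(String[] args) throws IOException {
            // Files.size() reads the length from file metadata, so it works
            // the same whether the content is binary or character data.
            long bytes = Files.size(Paths.get("amazonproductlist.txt"));
            double mb = bytes / (1024.0 * 1024.0);
            System.out.printf("File size: %.2f MB%n", mb);
        }
    }
    ```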

