How is indexing done in Hadoop?

    • #6276
      DataFlair Team
      Spectator

      How is indexing done in HDFS?

    • #6277
      DataFlair Team
      Spectator

      In a distributed file system like HDFS, indexing is different from that of a local file system. Here, indexing and searching of data are done using the memory of the HDFS node where the data resides.

      The generated index files are stored in a folder inside the directory where the actual data resides. Searching is similar to a search on the local file system, except that a RAM directory object is used and the search runs against the index file held in memory.
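      As an illustration of that in-memory approach, here is a minimal sketch using Apache Lucene, which is an assumption on my part: the post does not name a library, but RAMDirectory is a Lucene class (deprecated in recent Lucene releases in favour of ByteBuffersDirectory). The field name "content" and the sample text are made up for the example.

      import org.apache.lucene.analysis.standard.StandardAnalyzer;
      import org.apache.lucene.document.Document;
      import org.apache.lucene.document.Field;
      import org.apache.lucene.document.TextField;
      import org.apache.lucene.index.DirectoryReader;
      import org.apache.lucene.index.IndexWriter;
      import org.apache.lucene.index.IndexWriterConfig;
      import org.apache.lucene.queryparser.classic.QueryParser;
      import org.apache.lucene.search.IndexSearcher;
      import org.apache.lucene.search.ScoreDoc;
      import org.apache.lucene.store.RAMDirectory;

      public class RamDirectoryIndexExample {
          public static void main(String[] args) throws Exception {
              // The index lives in memory, not on the local file system.
              RAMDirectory directory = new RAMDirectory();
              StandardAnalyzer analyzer = new StandardAnalyzer();

              // Write one document into the in-memory index.
              try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(analyzer))) {
                  Document doc = new Document();
                  doc.add(new TextField("content", "sample record stored in an HDFS block", Field.Store.YES));
                  writer.addDocument(doc);
              }

              // Search the in-memory index, just as we would a file-system index.
              try (DirectoryReader reader = DirectoryReader.open(directory)) {
                  IndexSearcher searcher = new IndexSearcher(reader);
                  QueryParser parser = new QueryParser("content", analyzer);
                  for (ScoreDoc hit : searcher.search(parser.parse("record"), 10).scoreDocs) {
                      System.out.println(searcher.doc(hit.doc).get("content"));
                  }
              }
          }
      }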

      Indexing in Hadoop also depends on the block size: the last part of the data stored on HDFS holds the details of where the next part of the block is stored.

    • #6279
      DataFlair Team
      Spectator

      Hadoop stores data in files, and does not index them. To find something, we have to run a MapReduce job going through all the data. Hadoop is efficient where the data is too big for a database. With very large datasets, the cost of regenerating indexes is so high you can’t easily index changing data.

      However, we can use indexing in HDFS in two ways: file-based indexing and InputSplit-based indexing.

      Let's assume we have 2 files to store in HDFS for processing. The first one is 500 MB and the second is around 250 MB. With a split size of about 128 MB, we get 4 InputSplits for the first file and 2 InputSplits for the second.
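      As a quick check of that arithmetic (assuming the default 128 MB split size), the number of InputSplits is just the file size divided by the split size, rounded up:

      public class SplitCountExample {
          // Number of InputSplits = file size divided by split size, rounded up.
          static long splitCount(long fileSizeMb, long splitSizeMb) {
              return (fileSizeMb + splitSizeMb - 1) / splitSizeMb;
          }

          public static void main(String[] args) {
              System.out.println(splitCount(500, 128)); // 4 splits for the 500 MB file
              System.out.println(splitCount(250, 128)); // 2 splits for the 250 MB file
          }
      }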

      We can apply the 2 types of indexing to this case as follows:
      1. With file-based indexing, you end up with 2 files (the full data set here), meaning that your indexed query will be equivalent to a full-scan query.
      2. With InputSplit-based indexing, you end up with 4 InputSplits, so the performance should definitely be better than doing a full-scan query.

      Now, to implement InputSplit-based indexing we need to perform the following steps:

      1. Build an index from your full data set. This can be achieved by writing a MapReduce job that extracts the value we want to index and outputs it together with its InputSplit's MD5 hash (see the mapper sketch after this list).
      2. Get the InputSplit(s) for the indexed value you are looking for. The output of the MapReduce job is a set of reduced files (containing the indices based on InputSplits), which are stored in HDFS.
      3. Execute your actual MapReduce job on the indexed InputSplits only. Hadoop can do this because it retrieves the InputSplits to be used through FileInputFormat.class. We create our own IndexFileInputFormat class extending the default FileInputFormat.class and overriding its getSplits() method. You have to read the file created in the previous step, add all your indexed InputSplits to a list, and then compare this list with the one returned by the superclass. You return to the JobTracker only the InputSplits that were found in your index (see the IndexFileInputFormat sketch after this list).
      4. In the Driver class, we now have to use this IndexFileInputFormat class. We set it as the input format with:
      job.setInputFormatClass(IndexFileInputFormat.class);
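
      Here is a minimal sketch of the index-building job from step 1. It is only an illustration under assumptions the post does not spell out: the records are tab-separated text, the field to index is the first column, and the InputSplit is identified by its file path plus start offset rather than an MD5 hash; the class and field names are hypothetical.

      import java.io.IOException;

      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.lib.input.FileSplit;

      // Emits (indexedValue, splitIdentifier) so the reducer can collect,
      // for every value, the list of InputSplits that contain it.
      public class SplitIndexMapper extends Mapper<LongWritable, Text, Text, Text> {

          @Override
          protected void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              // The field we want to index: here, the first tab-separated column (an assumption).
              String indexedField = value.toString().split("\t")[0];

              // Identify the InputSplit this record came from (path + start offset).
              FileSplit split = (FileSplit) context.getInputSplit();
              String splitId = split.getPath().toString() + ":" + split.getStart();

              context.write(new Text(indexedField), new Text(splitId));
          }
      }

      A simple reducer would then deduplicate the split identifiers for each value and write them to the index file in HDFS (step 2).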
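      And here is a minimal sketch of the custom input format from step 3, under the same assumptions. It extends TextInputFormat (a concrete FileInputFormat) so that a record reader is inherited, and it reads a hypothetical index file whose lines look like "value <TAB> path:offset"; the configuration keys index.lookup.value and index.file.path are made-up names.

      import java.io.BufferedReader;
      import java.io.IOException;
      import java.io.InputStreamReader;
      import java.util.ArrayList;
      import java.util.HashSet;
      import java.util.List;
      import java.util.Set;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.InputSplit;
      import org.apache.hadoop.mapreduce.JobContext;
      import org.apache.hadoop.mapreduce.lib.input.FileSplit;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

      // Returns only the InputSplits that the index file lists for the queried value.
      public class IndexFileInputFormat extends TextInputFormat {

          @Override
          public List<InputSplit> getSplits(JobContext job) throws IOException {
              Configuration conf = job.getConfiguration();
              String lookupValue = conf.get("index.lookup.value");    // hypothetical key
              Path indexPath = new Path(conf.get("index.file.path")); // hypothetical key

              // 1. Load the split identifiers recorded for the queried value.
              Set<String> indexedSplits = new HashSet<>();
              FileSystem fs = indexPath.getFileSystem(conf);
              try (BufferedReader reader =
                       new BufferedReader(new InputStreamReader(fs.open(indexPath)))) {
                  String line;
                  while ((line = reader.readLine()) != null) {
                      String[] parts = line.split("\t");
                      if (parts.length == 2 && parts[0].equals(lookupValue)) {
                          indexedSplits.add(parts[1]);
                      }
                  }
              }

              // 2. Keep only those splits computed by the parent class that appear in the index.
              List<InputSplit> filtered = new ArrayList<>();
              for (InputSplit split : super.getSplits(job)) {
                  FileSplit fileSplit = (FileSplit) split;
                  String splitId = fileSplit.getPath().toString() + ":" + fileSplit.getStart();
                  if (indexedSplits.contains(splitId)) {
                      filtered.add(split);
                  }
              }
              return filtered;
          }
      }

      In the Driver, besides job.setInputFormatClass(IndexFileInputFormat.class), those hypothetical keys would be set with job.getConfiguration().set("index.lookup.value", ...) and job.getConfiguration().set("index.file.path", ...).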
