This topic contains 4 replies, has 1 voice, and was last updated by  dfbdteam3 1 year, 2 months ago.

Viewing 5 posts - 1 through 5 (of 5 total)
  • Author
    Posts
  • #6218

    dfbdteam3
    Moderator

    How is indexing done in Hadoop?

    #6220

    dfbdteam3
    Moderator

    Hadoop emerged as a solution to the “Big Data” problems. It is an open source software framework for distributed storage and distributed processing of large data sets.
    Apache Hadoop does not index data the way a database does. The framework stores each file as a sequence of blocks of the configured block size, and HDFS keeps track of which blocks make up the file and in what order, so a reader that finishes one block knows where the next part of the data is. This block bookkeeping is the basis of HDFS.

    #6221

    dfbdteam3
    Moderator

    Hadoop is basically a batch-oriented parallel processing system that was not designed to work interactively, so it has no built-in index of the kind a database provides. What it does maintain is metadata about the blocks in which each file is stored.

    HDFS blocks are simply fixed-size chunks of a file, 128 MB each by default in Hadoop 2.x and later. DataNodes know only which blocks they hold. The NameNode pieces it all together using an in-memory image of all files, the blocks that make up those files, and where those blocks are stored. Clients get this information from the NameNode.
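
    To make the bookkeeping concrete, here is a minimal sketch of the two mappings described above. It is a toy model, not Hadoop code: the class name `ToyNameNode`, the block id format, and the round-robin placement are all assumptions made for illustration (real HDFS placement is rack-aware).

```python
import math
from dataclasses import dataclass, field

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default block size in Hadoop 2.x and later


@dataclass
class ToyNameNode:
    """Hypothetical stand-in for the NameNode's in-memory image."""
    file_blocks: dict = field(default_factory=dict)      # path -> ordered block ids
    block_locations: dict = field(default_factory=dict)  # block id -> DataNode names

    def add_file(self, path, size_bytes, datanodes, replication=3):
        # Cut the file into fixed-size blocks; only the last one may be shorter.
        n_blocks = math.ceil(size_bytes / BLOCK_SIZE)
        block_ids = [f"{path}#blk_{i}" for i in range(n_blocks)]
        self.file_blocks[path] = block_ids
        for i, block_id in enumerate(block_ids):
            # Round-robin placement, purely for illustration.
            self.block_locations[block_id] = [
                datanodes[(i + r) % len(datanodes)] for r in range(replication)
            ]
        return block_ids

    def locate(self, path):
        # What a client asks for: the ordered blocks of a file and where each lives.
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]


namenode = ToyNameNode()
namenode.add_file("/data/events.log", 300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
for block_id, nodes in namenode.locate("/data/events.log"):
    print(block_id, nodes)
```

    A 300 MB file needs three blocks here (two full blocks and one partial one), and the client learns both the order of the blocks and which nodes to read them from.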

    #6222

    dfbdteam3
    Moderator

    HDFS is built to run on commodity hardware and provides fault tolerance and, most importantly, high-throughput access to application data.

    Memory-based indexing on HDFS is one way to make that data searchable:

    Consider a case where data is located in a distributed file system such as HDFS. We cannot create an index directly on the distributed data. Instead, the following steps are needed:
    1. Copy the data from HDFS to a local file system,
    2. Create an index of the data on the local file system, and
    3. Store the index files back in HDFS.
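
    The three steps above can be sketched as follows. This is only an illustration under stated assumptions: a local folder stands in for the HDFS working directory, the index is a simple word-to-line-numbers map saved as JSON, and the function names and the `index` folder name are hypothetical. On a real cluster, steps 1 and 3 would typically be done with `hdfs dfs -get` and `hdfs dfs -put`.

```python
import json
import shutil
import tempfile
from collections import defaultdict
from pathlib import Path


def build_index(data_file):
    """Map each lowercase word to the sorted line numbers it appears on."""
    index = defaultdict(set)
    with open(data_file, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            for word in line.lower().split():
                index[word].add(line_no)
    return {word: sorted(lines) for word, lines in index.items()}


def index_via_local_copy(hdfs_dir, file_name):
    """hdfs_dir is a local folder standing in for an HDFS working directory."""
    hdfs_dir = Path(hdfs_dir)
    with tempfile.TemporaryDirectory() as local_dir:
        local_dir = Path(local_dir)
        # Step 1: copy the data out of "HDFS" to the local file system.
        local_copy = local_dir / file_name
        shutil.copy(hdfs_dir / file_name, local_copy)
        # Step 2: build the index from the local copy.
        local_index = local_dir / (file_name + ".idx.json")
        local_index.write_text(json.dumps(build_index(local_copy)), encoding="utf-8")
        # Step 3: store the index file back in "HDFS".
        index_dir = hdfs_dir / "index"
        index_dir.mkdir(exist_ok=True)
        stored = index_dir / local_index.name
        shutil.copy(local_index, stored)
    return stored
```

    Every byte of data crosses the network twice in this scheme (once out, and the index back in), which is why it scales poorly as the data grows.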

    The same steps would be required for searches. This approach is time-consuming and suboptimal, so a better option is to index and search the data using the memory of the HDFS node where the data resides.

    Suppose a data file resides in a working directory on HDFS. We create a folder inside that same working directory to hold all of the generated index files.

    To search the data, we search the indexes stored in HDFS. First, we must load the HDFS index files into memory. Once the required index files are in memory, we can run searches directly against them.
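
    A minimal sketch of the load-then-search idea, assuming each index file is a JSON map from word to line numbers (a format invented here for illustration, as is the class name `InMemoryIndex`). It uses a plain dictionary rather than any particular search library, and a local folder stands in for the HDFS index folder.

```python
import json
from pathlib import Path


class InMemoryIndex:
    """Holds every index file from a folder in RAM and answers word lookups."""

    def __init__(self):
        self.postings = {}  # index file name (without extension) -> {word: [line numbers]}

    def load(self, index_dir):
        # Read each index file once; after this, searches touch no storage.
        for index_file in sorted(Path(index_dir).glob("*.json")):
            self.postings[index_file.stem] = json.loads(
                index_file.read_text(encoding="utf-8")
            )
        return self

    def search(self, word):
        # Return (index name, line numbers) for every index that contains the word.
        word = word.lower()
        return [
            (name, index[word])
            for name, index in self.postings.items()
            if word in index
        ]
```

    Typical use would be `InMemoryIndex().load(index_dir).search("hadoop")`. The cost of reading the index files is paid once at load time, and each later search is a dictionary lookup in memory.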

    Distributed file systems like HDFS are a powerful tool for storing and accessing the vast amounts of data available to us today. With memory-based indexing and searching, accessing the data you really want to find amid mountains of data you don’t care about gets a little bit easier.

    #6224

    dfbdteam3
    Moderator

    In a distributed file system like HDFS, indexing is different from indexing on a local file system. Here, indexing and searching are done using the memory of the HDFS node where the data resides.

    The generated index files are stored in a folder in the directory where the actual data resides. Searching is similar to searching on a local file system, except that a RAM directory object is used and the search runs against the index files held in memory.

    HDFS itself stores data in blocks of the configured block size. The NameNode, rather than the blocks themselves, records which blocks make up a file and where the next part of the data is stored.
