How indexing is done in HDFS?
September 20, 2018 at 5:23 pm #6218
How indexing is done in Hadoop?

September 20, 2018 at 5:23 pm #6220
Hadoop emerged as a solution to the "Big Data" problems. It is an open-source software framework for the distributed storage and distributed processing of large data sets.

Apache Hadoop has a unique way of indexing. Because the framework stores data according to the data block size, HDFS keeps storing, with the last part of the data, a pointer to where the next part of the data will be. In fact, this is the basis of HDFS.

September 20, 2018 at 5:24 pm #6221
Hadoop is basically a batch-oriented parallel processing system that was not designed for interactive work.
HDFS blocks are simply chunks of the data, 128 MB each by default. Datanodes only know which blocks they hold. The Namenode pieces it all together using an in-memory image of all files, the blocks that make up those files, and where those blocks are stored. Clients get this information from the Namenode.

September 20, 2018 at 5:24 pm #6222
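As a rough illustration of how a fixed block size carves a file into blocks, the following sketch computes the (offset, length) layout the Namenode would track for a file of a given size. The class name, file size, and method are made up for the example; only the 128 MB default block size comes from the text above.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockLayout {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // HDFS default block size

    // Returns the (offset, length) pair of every block for a file of the given size.
    static List<long[]> blocksFor(long fileSize) {
        List<long[]> blocks = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += BLOCK_SIZE) {
            long length = Math.min(BLOCK_SIZE, fileSize - offset);
            blocks.add(new long[] {offset, length});
        }
        return blocks;
    }

    public static void main(String[] args) {
        long fileSize = 300L * 1024 * 1024; // a hypothetical 300 MB file
        List<long[]> blocks = blocksFor(fileSize);
        System.out.println(blocks.size()); // 3 blocks: 128 MB + 128 MB + 44 MB
        for (long[] b : blocks) {
            System.out.println("offset=" + b[0] + " length=" + b[1]);
        }
    }
}
```

Only the last block is allowed to be shorter than the block size, which is why a 300 MB file occupies three blocks rather than three full 128 MB ones.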
HDFS is built to run on commodity hardware and provides fault tolerance, resource management, and, most importantly, high-throughput access to application data.
This can be achieved very well using memory-based indexing on HDFS.

Consider a case where the data is located in a distributed file system like HDFS. We cannot directly create an index on the distributed data. To do so, the following steps have to be followed:
1. Copy the data from HDFS to the local file system,
2. Create an index of the data on the local file system, and
3. Store the index files back to HDFS.
The same steps would be required for searches. This approach is time-consuming and suboptimal, so the better option is to search the data using the memory of the HDFS node where the data resides.
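As a minimal sketch of step 2 above, an inverted index over a local text file can map each term to the line numbers where it appears. The class name, sample lines, and tokenization rule are assumptions for illustration; in a real pipeline the lines would come from data copied out of HDFS.

```java
import java.util.*;

public class LocalIndexer {
    // Builds a simple inverted index: term -> line numbers where it appears.
    static Map<String, List<Integer>> buildIndex(List<String> lines) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < lines.size(); i++) {
            for (String token : lines.get(i).toLowerCase().split("\\W+")) {
                if (token.isEmpty()) continue;
                index.computeIfAbsent(token, k -> new ArrayList<>()).add(i);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // In practice these lines would be read from a file copied out of HDFS;
        // here we use an in-memory sample instead.
        List<String> lines = List.of("hadoop stores data in blocks",
                                     "the namenode tracks every block");
        Map<String, List<Integer>> index = buildIndex(lines);
        System.out.println(index.get("namenode")); // lines mentioning "namenode"
    }
}
```

The resulting map is what would then be serialized and stored back to HDFS in step 3.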
If we have a data file on HDFS inside a working directory, we need to create a folder inside that same working directory where all the generated index files will be stored.
To search the data, we search the indexes stored in HDFS. First, the HDFS index files must be made available in memory. Once the required index files are in memory, we can perform the search directly on them.
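Once the index is in memory, a lookup reduces to a map query. This is a hedged sketch, not a real HDFS API call: the class name and the shape of the in-memory index (term to postings list) are assumptions carried over from the indexing step described above.

```java
import java.util.*;

public class IndexSearch {
    // Searches an in-memory inverted index (term -> postings list).
    // Returns the postings for a term, or an empty list if the term is absent.
    static List<Integer> search(Map<String, List<Integer>> index, String term) {
        return index.getOrDefault(term.toLowerCase(), Collections.emptyList());
    }

    public static void main(String[] args) {
        // A tiny index that, in practice, would have been read from the
        // HDFS index files into memory before querying.
        Map<String, List<Integer>> index = new HashMap<>();
        index.put("hadoop", List.of(0, 4));
        index.put("block", List.of(2));

        System.out.println(search(index, "hadoop"));  // [0, 4]
        System.out.println(search(index, "missing")); // []
    }
}
```

The point of loading the index into memory first is that each query then costs a hash lookup rather than a round trip to disk or across the network.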
Distributed file systems like HDFS are a powerful tool for storing and accessing the vast amounts of data available to us today. With memory-based indexing and searching, accessing the data you really want to find amid mountains of data you don't care about gets a little bit easier.

September 20, 2018 at 5:24 pm #6224
In a distributed file system like HDFS, indexing is different from that of a local file system. Here, indexing and searching of the data are done using the memory of the HDFS node where the data resides.

The generated index files are stored in a folder in the directory where the actual data resides. Searching is similar to searching in a local file system, but a RAM directory object is used, and the search is done on the index files residing in memory.

Indexing in Hadoop depends on the block size: the last part of the data stored on HDFS contains details about where the next part of the block is stored.