Apache Hadoop is an open-source software framework for distributed storage and distributed processing of extremely large data sets.
Hadoop has three core components:
1. MapReduce (computational processing): MapReduce is the data processing layer of Hadoop. It is a software framework for writing applications that process large amounts of structured and unstructured data stored in the Hadoop Distributed File System (HDFS). It processes the data in parallel by dividing a submitted job into a set of independent tasks.
MapReduce breaks the processing into two phases: Map and Reduce. Map is the first phase, and it typically carries the complex logic, business rules, and costly code. Reduce is the second phase, and it typically does lighter-weight processing such as aggregation or summation over the Map output.
2. HDFS (storage): HDFS stands for Hadoop Distributed File System and is the storage layer of Hadoop. It follows a master-slave pattern. The NameNode acts as the master and stores the file system metadata (the namespace and the location of each block), while the DataNodes act as slaves and store the actual data blocks on their local disks, in parallel across the cluster.
3. YARN (resource allocation): YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop. It allows multiple data processing engines, such as real-time streaming, data science, and batch processing, to handle data stored on a single platform.
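
To make the Map and Reduce phases from item 1 more concrete, here is a minimal word-count sketch in plain Python. It does not use the Hadoop API. The function names (map_phase, shuffle, reduce_phase, run_job) are hypothetical, and the whole job runs in one process, whereas a real job runs its tasks in parallel across the cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: aggregate the values collected for one key (here, a sum)."""
    return key, sum(values)

def run_job(lines):
    # Each line stands in for an input split handled by an independent map task.
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

counts = run_job(["big data big cluster", "data node"])
print(counts)  # {'big': 2, 'data': 2, 'cluster': 1, 'node': 1}
```

The per-record work sits in map_phase and the aggregation sits in reduce_phase, which mirrors the split between the two phases described above.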
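
The master-slave split in HDFS from item 2 can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not how HDFS is implemented. The class and function names, the 4-byte block size, the replication factor of 2, and the round-robin placement are all hypothetical choices for illustration. Real HDFS uses much larger blocks and a rack-aware placement policy.

```python
class DataNode:
    """Slave: stores the actual block contents."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

class NameNode:
    """Master: stores only metadata (which blocks form a file, and where they live)."""
    def __init__(self, datanode_names, replication=2):
        self.datanode_names = datanode_names
        self.replication = replication  # toy value, assumed for this sketch
        self.metadata = {}              # path -> list of (block_id, [datanode names])

    def allocate(self, path, num_blocks):
        count = len(self.datanode_names)
        plan = []
        for index in range(num_blocks):
            # Round-robin placement, a simplification chosen for this sketch.
            targets = [self.datanode_names[(index + r) % count]
                       for r in range(self.replication)]
            plan.append((f"{path}#blk{index}", targets))
        self.metadata[path] = plan
        return plan

def put(namenode, datanodes, path, data, block_size=4):
    """Client write: ask the NameNode for a plan, then send blocks to DataNodes."""
    num_blocks = -(-len(data) // block_size)  # ceiling division
    for index, (block_id, targets) in enumerate(namenode.allocate(path, num_blocks)):
        chunk = data[index * block_size:(index + 1) * block_size]
        for name in targets:
            datanodes[name].blocks[block_id] = chunk

def get(namenode, datanodes, path):
    """Client read: look up block locations, then read each block from its first replica."""
    return b"".join(datanodes[targets[0]].blocks[block_id]
                    for block_id, targets in namenode.metadata[path])

datanodes = {name: DataNode(name) for name in ["dn1", "dn2", "dn3"]}
namenode = NameNode(list(datanodes))
put(namenode, datanodes, "/logs/a.txt", b"hello hdfs")
restored = get(namenode, datanodes, "/logs/a.txt")
```

In this model the NameNode object never holds file bytes, only the block-to-DataNode mapping, while the DataNode objects hold the data itself.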