What are the core components of Apache Hadoop?


  • Author
    Posts
    • #6182
      DataFlair Team
      Spectator

      What are the different components of Hadoop Framework?

    • #6184
      DataFlair Team
      Spectator

      Two Core Components of Hadoop are:

      1. HDFS: Distributed Data Storage Framework of Hadoop
      2. MapReduce: Distributed Data Processing Framework of Hadoop

      HDFS is the storage unit of Hadoop; users can store large datasets in HDFS in a distributed manner. Several replicas of each data block are distributed across different nodes of the cluster for data availability.
      HDFS consists of two components:

      a) NameNode: acts as the master node, storing the metadata that keeps track of the storage cluster (a Secondary NameNode periodically checkpoints this metadata; a true standby NameNode is available with HDFS high availability from Hadoop 2 onward)
      b) DataNode: acts as a slave node, storing the actual blocks of data
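
      To make this concrete, here is a minimal Java sketch, assuming a Hadoop client on the classpath (the NameNode address and file path are hypothetical), that writes a small file to HDFS through the FileSystem client; the NameNode handles the metadata while the DataNodes receive the actual blocks:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsWriteExample {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              // Hypothetical NameNode address; normally picked up from core-site.xml.
              conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
              FileSystem fs = FileSystem.get(conf);

              // The NameNode records the file's metadata; DataNodes store its blocks.
              Path file = new Path("/data/example.txt");
              try (FSDataOutputStream out = fs.create(file)) {
                  out.writeUTF("hello hdfs");
              }
              System.out.println("File exists: " + fs.exists(file));
              fs.close();
          }
      }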

      MapReduce is the processing unit of Hadoop. It is a Java-based system in which the actual data from the HDFS store gets processed. The principle of operation behind MapReduce is that the MAP job sends a processing task to the various nodes holding the data, and the REDUCE job collects all the results into a single output. Scheduling, monitoring, and re-execution of failed tasks are taken care of by the MapReduce framework.
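
      To make the MAP/REDUCE split concrete, here is a minimal word-count sketch written against the Hadoop Java MapReduce API (class names are illustrative): the map tasks emit (word, 1) pairs on the nodes holding the data, and the reduce tasks sum the pairs for each word into a single value.

      import java.io.IOException;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;

      public class WordCount {

          // MAP: runs where the input splits live and emits (word, 1) pairs.
          public static class TokenizerMapper
                  extends Mapper<LongWritable, Text, Text, IntWritable> {
              private static final IntWritable ONE = new IntWritable(1);
              private final Text word = new Text();

              @Override
              protected void map(LongWritable key, Text value, Context context)
                      throws IOException, InterruptedException {
                  for (String token : value.toString().split("\\s+")) {
                      if (!token.isEmpty()) {
                          word.set(token);
                          context.write(word, ONE);
                      }
                  }
              }
          }

          // REDUCE: collects all counts for a word and sums them into one value.
          public static class SumReducer
                  extends Reducer<Text, IntWritable, Text, IntWritable> {
              @Override
              protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                      throws IOException, InterruptedException {
                  int sum = 0;
                  for (IntWritable v : values) {
                      sum += v.get();
                  }
                  context.write(key, new IntWritable(sum));
              }
          }
      }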

      Along with HDFS and MapReduce, there are also Hadoop Common (which provides the Java libraries, utilities, and scripts needed to run Hadoop) and Hadoop YARN (which enables dynamic resource utilization across the cluster).


    • #6186
      DataFlair Team
      Spectator

      The core components of Hadoop are:

      1. HDFS (Hadoop Distributed File System)
      HDFS is the storage layer of Hadoop, which provides storage of very large files across multiple machines. It was derived from the Google File System (GFS). HDFS is highly fault-tolerant, reliable, scalable, and designed to run on low-cost commodity hardware. It divides each file into blocks and stores these blocks on multiple machines; the blocks are replicated for fault tolerance. The block size and replication factor can be specified in HDFS. The defaults are a block size of 64 MB (128 MB from Hadoop 2 onward) and a replication factor of 3.
      HDFS works in a master-slave architecture. An HDFS cluster consists of a master node (NameNode) and slave nodes (DataNodes). The NameNode stores the metadata of HDFS and is responsible for managing all the DataNodes in the cluster. Before Hadoop 2, the NameNode was a single point of failure in an HDFS cluster. DataNodes store the actual data in HDFS and are responsible for block creation, deletion, and replication, based on requests from the NameNode.
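
      As a small sketch of how those settings can be inspected or overridden from a Java client (dfs.replication and dfs.blocksize are the standard HDFS property keys; the NameNode address and file path are hypothetical):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsBlockSettings {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // hypothetical NameNode
              // Client-side overrides; cluster-wide defaults normally live in hdfs-site.xml.
              conf.set("dfs.replication", "3");        // copies kept of each block
              conf.set("dfs.blocksize", "134217728");  // 128 MB block size, in bytes

              FileSystem fs = FileSystem.get(conf);
              Path file = new Path("/data/large-input.txt"); // hypothetical path
              System.out.println("Block size:  " + fs.getDefaultBlockSize(file));
              System.out.println("Replication: " + fs.getDefaultReplication(file));
              fs.close();
          }
      }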

      2. MapReduce
      MapReduce is the processing layer of Hadoop. It is used to process large volumes of data in parallel. MapReduce splits a large data set into independent chunks, which are processed in parallel by map tasks. The output of the map tasks is further processed by the reduce tasks to generate the final output. MapReduce works in key-value pairs. In MapReduce 2 (running on YARN), a ResourceManager runs on the master node and a NodeManager runs on each data node.
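
      A minimal driver sketch showing how the map and reduce tasks are wired into a job with input and output paths (it reuses the TokenizerMapper and SumReducer classes from the word-count sketch earlier in this thread; the paths come from the command line):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class WordCountDriver {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              Job job = Job.getInstance(conf, "word count");
              job.setJarByClass(WordCountDriver.class);

              // Map and reduce classes from the word-count sketch above.
              job.setMapperClass(WordCount.TokenizerMapper.class);
              job.setReducerClass(WordCount.SumReducer.class);

              job.setOutputKeyClass(Text.class);
              job.setOutputValueClass(IntWritable.class);

              // Each input split becomes an independent map task; reducers write here.
              FileInputFormat.addInputPath(job, new Path(args[0]));
              FileOutputFormat.setOutputPath(job, new Path(args[1]));

              System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
      }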

      The other components of Hadoop are,

      1. YARN – YARN stands for Yet Another Resource Negotiator. It manages cluster resources and schedules jobs. YARN consists of a central ResourceManager and a per-node NodeManager.

      2. HIVE – It is a data warehouse infrastructure built on top of Hadoop. It provides an SQL-like language called HiveQL.

      3. PIG – It is a platform for analyzing large data sets. It uses MapReduce to execute its data processing.

      4. FLUME – It is used for collecting, aggregating, and moving large volumes of log data.

      5. Sqoop – It is a tool for bulk data transfer between HDFS and relational databases (RDBMS).

      6. Oozie – It is a workflow scheduler for MapReduce jobs.

      7. HBase – It is a non-relational, distributed database built on top of HDFS. It provides random, real-time access to data; a minimal client sketch follows.
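
      As a rough illustration of that random, real-time access, here is a minimal HBase Java client sketch (the table, column family, and row key names are hypothetical):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.TableName;
      import org.apache.hadoop.hbase.client.Connection;
      import org.apache.hadoop.hbase.client.ConnectionFactory;
      import org.apache.hadoop.hbase.client.Get;
      import org.apache.hadoop.hbase.client.Put;
      import org.apache.hadoop.hbase.client.Result;
      import org.apache.hadoop.hbase.client.Table;
      import org.apache.hadoop.hbase.util.Bytes;

      public class HBaseRandomAccess {
          public static void main(String[] args) throws Exception {
              Configuration conf = HBaseConfiguration.create();
              try (Connection conn = ConnectionFactory.createConnection(conf);
                   Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table

                  // Write one cell keyed by row; HBase serves such point writes in real time.
                  Put put = new Put(Bytes.toBytes("user123"));
                  put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
                  table.put(put);

                  // Random read of the same row by key.
                  Result result = table.get(new Get(Bytes.toBytes("user123")));
                  System.out.println(Bytes.toString(
                          result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
              }
          }
      }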


    • #6187
      DataFlair Team
      Spectator

      Two core components of Hadoop are

      HDFS and MapReduce

      HDFS (Hadoop Distributed File System)
      HDFS is the storage layer of Hadoop, used to store large data sets with a streaming data access pattern on clusters of commodity hardware. HDFS is a highly reliable storage system for this data.
      It works on a master/slave architecture, where the NameNode is the master and the DataNodes are the slaves.

      MapReduce
      MapReduce is a programming model for processing large volumes of data in parallel by dividing the work into a set of independent tasks. MapReduce is also known as the computation or processing layer of Hadoop. It processes the data in two phases, Map and Reduce, which together answer a query over the data stored in HDFS.
      The Map task is responsible for reading data from the input location and, based on the input format, generating key/value pairs (the intermediate output) on the local machine.
      The Reduce task is responsible for processing this intermediate output and generating the final output.
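      For example, for the input line "to be or not to be", the Map phase emits the intermediate pairs (to,1), (be,1), (or,1), (not,1), (to,1), (be,1); after the framework groups them by key, the Reduce phase sums each group and produces (be,2), (not,1), (or,1), (to,2).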

      Other components of hadoop ecosystem are:

      YARN (Yet Another Resource Negotiator): MapReduce running on YARN is also called MapReduce 2.0.
      Unlike the MapReduce 1.0 JobTracker, resource management and job scheduling/monitoring are handled by separate daemons (a ResourceManager plus a per-application ApplicationMaster).

      Ambari: For Management & Monitoring

      PIG: For Scripting

      HIVE: For Query

      Mahout: Machine Learning

      Oozie: Workflow & scheduling

      Zookeeper: Coordination

      HBase: NoSQL database

      Sqoop: Data integration
