What are the Features of Apache Hadoop?

    • #6253
      DataFlair Team
      Spectator

      What are the characteristics of Hadoop?

    • #6255
      DataFlair Team
      Spectator

      Hadoop provides users with a big-data ecosystem capable of handling very large files (ranging from terabytes to petabytes) by dividing them into smaller blocks (typically 64 MB to 128 MB) and storing and processing those blocks on different DataNodes, which are essentially separate machines.
      The system mainly comprises three components:
      HDFS (Hadoop Distributed File System) – a highly reliable, scalable, and flexible distributed file system.
      YARN (Yet Another Resource Negotiator) – manages cluster resources for the MapReduce jobs submitted by users and the data those jobs work on.
      MapReduce Framework – the framework that makes Hadoop a distributed computing system: map tasks run independently on separate nodes, and their outputs are finally aggregated by reduce tasks into a single result (a minimal sketch follows below).
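      As an illustration of that flow, below is a minimal word-count sketch in Java, the classic introductory MapReduce example (input and output paths come from the command line; treat it as a sketch rather than code from any particular distribution): each map task independently emits (word, 1) pairs for its own split of the input, and the reducers aggregate those pairs into the final counts.

      import java.io.IOException;
      import java.util.StringTokenizer;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class WordCount {

        // Each map task processes one input split, independently of all others.
        public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          public void map(Object key, Text value, Context context)
              throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
              word.set(itr.nextToken());
              context.write(word, ONE);  // emit (word, 1)
            }
          }
        }

        // Reduce tasks aggregate the independent map outputs into one result.
        public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
          private final IntWritable result = new IntWritable();

          public void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
              sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
          }
        }

        public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenizerMapper.class);
          job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
          job.setReducerClass(IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }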

      The main characteristics of the Hadoop ecosystem are:
      1) Hadoop can easily run on commodity hardware (although it is recommended to run the master node on a more powerful machine), thus making it economical.
      2) Hadoop can store data on a distributed network by dividing the data into smaller blocks and storing them on separate DataNodes; the details (metadata) of those blocks are kept on the NameNode, i.e. the master node.
      3) Since data is stored in a distributed way, it is also processed in a distributed way: the mapper tasks of the MapReduce framework work on the blocks in parallel and reducer tasks aggregate their outputs, thus providing parallelism.
      4) Hadoop provides a feature called data replication, which means multiple copies of each block are stored on different nodes; if any DataNode fails, the data is simply read from another node, providing fault tolerance (see the sketch after this list).
      5) Hadoop provides vertical scalability (upgrading individual DataNodes) as well as horizontal scalability (adding new nodes to the cluster), making it highly scalable.
      6) Since Hadoop processes large amounts of data, it would be very costly to move that data to the node where the mapper logic resides; instead, Hadoop moves the computational logic to a DataNode that is available and already holds the data the mapper needs (data locality).
      7) Apache, Cloudera, Hortonworks, and others provide open-source distributions of Hadoop, which means it can be modified as needed.
      8) The Hadoop framework itself takes care of many things, such as data replication and rack awareness, making it very easy to use.
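      To make points 2) and 4) concrete, here is a minimal sketch using the standard HDFS Java client API (the NameNode URI and the file path are placeholder assumptions) that writes a file with an explicit block size and replication factor; the NameNode holds the metadata recording which DataNodes store each block:

      import java.net.URI;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsWriteDemo {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Placeholder NameNode address; substitute your cluster's fs.defaultFS.
          FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

          Path path = new Path("/demo/sample.txt");  // hypothetical path
          short replication = 3;                     // three copies of each block
          long blockSize = 128L * 1024 * 1024;       // 128 MB blocks

          // create(path, overwrite, bufferSize, replication, blockSize)
          try (FSDataOutputStream out =
                   fs.create(path, true, 4096, replication, blockSize)) {
            out.writeUTF("hello hadoop");
          }

          // The NameNode keeps this metadata and answers client queries about it.
          System.out.println("replication = " + fs.getFileStatus(path).getReplication());
        }
      }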

      Follow the link for more detail: Hadoop Features

    • #6256
      DataFlair Team
      Spectator

      Hadoop is a framework that allows processing of huge volumes of any type of data in a highly efficient and cost-effective manner through its:

      • Highly distributed data processing engine – MapReduce
      • Highly reliable distributed data storage system – HDFS
      • Resource Management Layer – YARN

      Key features of Apache Hadoop:

      1) Distributed Processing & Storage:
      The framework itself manages distributed processing and distributed storage, leaving only the custom data-processing logic to be built by users. This flexibility is what set Apache Hadoop apart from other distributed systems and made it popular so quickly.

      2) Highly Available & Fault Tolerant:
      Hadoop is highly available and fault tolerant in terms of both data availability and distributed processing. Data stored in HDFS is automatically replicated to two other locations (the default replication factor is 3), so even if one or two systems collapse, the file is still available on at least a third system. The highly distributed MapReduce batch-processing engine handles processing failures caused by hardware or machine faults in the same spirit, by rescheduling failed tasks on healthy nodes.
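      As a small sketch of what that replication looks like from a client (the host name and file path below are placeholders), the HDFS API can list the DataNodes holding each block of a file; if one of those nodes dies, reads simply fall back to another replica:

      import java.net.URI;
      import java.util.Arrays;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.BlockLocation;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class BlockLocationsDemo {
        public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(
              URI.create("hdfs://namenode:8020"), new Configuration());
          FileStatus status = fs.getFileStatus(new Path("/demo/sample.txt"));

          // One BlockLocation per block; each lists every DataNode holding a copy.
          BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
          for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                + " -> replicas on " + Arrays.toString(block.getHosts()));
          }
        }
      }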

      3) Highly & Easily Scalable:
      Both vertical and horizontal scaling are possible. The differentiator, however, is horizontal scaling, where new nodes can easily be added to the cluster on the fly as data volumes or processing needs grow, without altering anything in the existing systems or programs.

      4) Data Reliability:
      Data is stored reliably thanks to replication across the cluster, with multiple copies maintained on different nodes. The framework itself provides mechanisms to ensure data reliability, such as the Block Scanner, Volume Scanner, Directory Scanner, and Disk Checker, so that data integrity and availability are maintained in the face of data corruption and hardware failures. Client-side checksum verification, sketched below, is part of the same machinery.
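      One of those integrity mechanisms is visible from the client side: HDFS keeps a checksum for every data chunk and re-verifies it on each read, reporting corrupt replicas so they can be re-replicated. A minimal sketch (the cluster address and path are placeholder assumptions):

      import java.net.URI;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IOUtils;

      public class ChecksumReadDemo {
        public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(
              URI.create("hdfs://namenode:8020"), new Configuration());

          // Checksum verification is on by default; set explicitly for emphasis.
          fs.setVerifyChecksum(true);

          // Every read below is verified against the stored per-chunk checksums.
          try (FSDataInputStream in = fs.open(new Path("/demo/sample.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
          }
        }
      }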

      5) Robust Ecosystem:
      Hadoop has a very robust ecosystem, well suited to the analytical needs of developers and of small to large organizations. The ecosystem comes with a suite of tools and technologies that makes it suitable for a wide variety of data-processing needs: to name a few, it includes projects such as MapReduce, YARN, Hive, HBase, ZooKeeper, Pig, Flume, and Avro, and many new tools and technologies are being added as the market grows.

      6) Very Cost-Effective:
      Hadoop generates cost benefits by bringing massively parallel computing to commodity servers, resulting in a substantial reduction in the cost per terabyte of storage, which in turn makes it reasonable to model all your data.

      7) Open Source:
      There are no licensing costs to worry about, and the open-source community support is very strong. Hadoop can easily be adapted and built out for custom requirements.

      For more detail, follow Hadoop Features.

    • #6257
      DataFlair Team
      Spectator

      Hadoop is a platform for storing and processing huge volumes of data. What makes Hadoop efficient compared to other distributed file systems is that it is:
      1) Open Source – easily accessible to anyone, as no license is required. There is also no vendor lock-in: being open source makes it easy to change vendors if the support is not good.
      2) Not Bounded by a Single Schema – accepts, stores, and processes data in any format.
      3) Scale-Out Architecture – divides workloads across multiple nodes; nodes can also be added on the fly to improve performance.
      4) Economical – slave nodes can be deployed on commodity hardware.
      5) Data Locality – the logic that processes the data travels to the data instead of the data travelling towards the logic, avoiding the high costs associated with data movement (see the sketch after this list).
      Replication of data also makes the system:
      6) Fault Tolerant – failure of a machine does not affect the system.
      7) Highly Available – data is usually not lost, even if part of the system goes down.
      8) Reliable
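      To illustrate data locality (the input path below is a placeholder), each MapReduce input split carries hints about which hosts hold its data, and the YARN scheduler uses those hints to place map tasks on or near those hosts:

      import java.util.Arrays;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.InputSplit;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

      public class LocalityDemo {
        public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "locality demo");
          FileInputFormat.addInputPath(job, new Path("/demo/logs"));  // placeholder

          // Each split names the hosts holding its block replicas; the scheduler
          // tries to run the corresponding map task on one of them, moving the
          // code to the data rather than the data to the code.
          for (InputSplit split : new TextInputFormat().getSplits(job)) {
            System.out.println(split + " -> " + Arrays.toString(split.getLocations()));
          }
        }
      }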

      For more detail, follow Hadoop Features.
