What is Apache Hadoop?

    • #5353
      DataFlair Team

      What is Hadoop Framework?
      What is Hadoop used for?
      What is Hadoop and what is it used for?

    • #5355
      DataFlair Team

      Hadoop emerged as a solution to “Big Data” problems. It is part of the Apache project sponsored by the Apache Software Foundation (ASF). Hadoop is an open-source software framework for distributed storage and distributed processing of extremely large data sets. Open source means it is freely available, and even its source code can be modified as per our requirements.
      Hadoop makes it possible to run applications on a system with thousands of commodity hardware nodes and to handle thousands of terabytes of data. Its distributed file system provides rapid data transfer rates among nodes and allows the system to continue operating in the event of a node failure.
      Apache Hadoop provides a storage layer (HDFS), a batch processing engine (MapReduce), and a resource management layer (YARN).
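      As a rough illustration of the storage layer, the sketch below reads a file from HDFS through Hadoop's FileSystem API in Java. The NameNode address (hdfs://namenode:9000) and the file path (/data/sample.txt) are placeholders for illustration, not values taken from this discussion; in a real cluster the filesystem URI normally comes from core-site.xml.

      import java.io.InputStream;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IOUtils;

      // Minimal sketch: read a file stored in HDFS via the Hadoop FileSystem API.
      public class HdfsReadExample {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              // Hypothetical NameNode address; usually configured in core-site.xml.
              conf.set("fs.defaultFS", "hdfs://namenode:9000");

              FileSystem fs = FileSystem.get(conf);
              Path file = new Path("/data/sample.txt");   // placeholder path

              // HDFS serves the blocks of this file from the DataNodes that hold them.
              try (InputStream in = fs.open(file)) {
                  IOUtils.copyBytes(in, System.out, 4096, false);
              }
          }
      }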

      Why Hadoop?
      Hadoop is not just a storage system but a platform for both data storage and processing. It is scalable (more nodes can be added on the fly), fault tolerant (even if a node goes down, its data can be processed by another node), and open source (the source code can be modified if required).

      The following characteristics make Hadoop a unique platform:

      • Flexibility to store and mine any type of data, whether it is structured, semi-structured, or unstructured. It is not bound by a single schema.
      • Excels at processing complex data; its scale-out architecture divides workloads across multiple nodes. Its flexible file system also eliminates ETL bottlenecks.
      • Scales economically: as discussed, it can be deployed on commodity hardware, and its open-source nature guards against vendor lock-in.

      Follow the link for more detail: Hadoop Tutorial

    • #5356
      DataFlair Team

      Hadoop is an open-source software framework used for distributed storage and distributed processing of big data sets across clusters built from commodity hardware.
      The core of Hadoop consists of two main components:
      HDFS (Hadoop Distributed File System) and MapReduce.

      1) HDFS: the distributed storage part, where each file is split into 64 MB or 128 MB blocks that are spread across the cluster.

      2) MapReduce: the distributed parallel processing part, where the stored data is processed in parallel across the cluster.

      Here, the code travels to the nodes where the data resides and is executed there, which is known as data locality. Hence the processing is fast; a sketch of such a job follows below.
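      To make the MapReduce part concrete, here is a minimal word-count sketch in Java using the standard Hadoop MapReduce API. The class names and the input/output paths passed on the command line are illustrative only; map tasks are scheduled, where possible, on the nodes that hold the input blocks (data locality), and reduce tasks aggregate the partial counts.

      import java.io.IOException;
      import java.util.StringTokenizer;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      // Classic word-count sketch: mappers emit (word, 1), reducers sum the counts.
      public class WordCount {

          public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
              private static final IntWritable ONE = new IntWritable(1);
              private final Text word = new Text();

              @Override
              protected void map(Object key, Text value, Context context)
                      throws IOException, InterruptedException {
                  StringTokenizer itr = new StringTokenizer(value.toString());
                  while (itr.hasMoreTokens()) {
                      word.set(itr.nextToken());
                      context.write(word, ONE);   // emit (word, 1) for every token
                  }
              }
          }

          public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
              private final IntWritable result = new IntWritable();

              @Override
              protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                      throws IOException, InterruptedException {
                  int sum = 0;
                  for (IntWritable val : values) {
                      sum += val.get();
                  }
                  result.set(sum);
                  context.write(key, result);     // emit (word, total count)
              }
          }

          public static void main(String[] args) throws Exception {
              Job job = Job.getInstance(new Configuration(), "word count");
              job.setJarByClass(WordCount.class);
              job.setMapperClass(TokenizerMapper.class);
              job.setCombinerClass(IntSumReducer.class);
              job.setReducerClass(IntSumReducer.class);
              job.setOutputKeyClass(Text.class);
              job.setOutputValueClass(IntWritable.class);
              FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
              FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
              System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
      }

      Such a job would typically be packaged into a jar and submitted with something like: hadoop jar wordcount.jar WordCount /input /output (jar name and paths hypothetical).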

      Hadoop can scale from a single node to thousands of nodes in a cluster.

      The broader Hadoop ecosystem also includes projects such as Hive, Sqoop, Flume, HBase, etc.

      Follow the link for more detail: Hadoop Tutorial
