This topic contains 2 replies, has 1 voice, and was last updated by  dfbdteam3 1 year, 8 months ago.

Viewing 3 posts - 1 through 3 (of 3 total)


    Why do we need Apache Hadoop?



    Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

    Many organizations use Hadoop for handling Big Data mainly because of its ability to store, manage and analyze vast amounts of structured, semi-structured and unstructured data quickly, reliably, flexibly and at very low cost.
    There are many reasons to believe that Hadoop is the best software for Big Data analytics. Some of the important benefits are:

    Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media (Facebook, Twitter, etc.), e-commerce (Walmart, etc.), online video streaming platforms (YouTube, Dailymotion, etc.) and gadgets (smartphones, smartwatches, etc.), the ability to process large volumes of data in a short period of time is a key consideration for Hadoop.
    High computing power. Hadoop’s distributed computing model processes big data fast. The more computing nodes we use in the Hadoop cluster, the more processing power we will have.
    Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes in the cluster to make sure the distributed computation does not fail. Multiple copies of all data are stored automatically (data blocks are replicated across multiple nodes).
    Flexibility. Unlike traditional relational databases, you don’t have to pre-process data before storing it. We can store as much data as we want and decide how to use it at a later stage. That includes unstructured data like text, images and videos.
    Low cost. Since Hadoop is an open-source framework, it is free, and it uses commodity hardware to store large quantities of data, which makes it very cost-effective.
    Scalability. We can easily grow our system to handle more data simply by adding nodes. Very minimal administration is required.
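    The fault-tolerance point above can be illustrated with a toy sketch. This is plain Python mimicking HDFS-style block replication, not actual Hadoop code: each block is copied to several nodes, so losing a node does not lose data.

    ```python
    # Toy illustration of HDFS-style block replication (not real Hadoop code).
    # A file is split into blocks; each block is copied to several nodes,
    # so every block survives the loss of any single node.

    REPLICATION_FACTOR = 3  # HDFS's default replication factor

    def place_blocks(blocks, nodes, replication=REPLICATION_FACTOR):
        """Assign each block to `replication` distinct nodes, round-robin."""
        placement = {}
        for i, block in enumerate(blocks):
            placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
        return placement

    def readable_blocks(placement, live_nodes):
        """A block is readable if at least one replica sits on a live node."""
        return [b for b, replicas in placement.items()
                if any(n in live_nodes for n in replicas)]

    nodes = ["node1", "node2", "node3", "node4"]
    blocks = ["blk_0", "blk_1", "blk_2", "blk_3"]
    placement = place_blocks(blocks, nodes)

    # Simulate one node failing: every block is still readable,
    # because each has two surviving replicas on other nodes.
    alive = {"node1", "node3", "node4"}  # node2 has failed
    print(readable_blocks(placement, alive))
    ```

    With a replication factor of 3, any two simultaneous node failures in this four-node cluster still leave at least one replica of every block reachable, which is why jobs can simply be rescheduled onto surviving nodes.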
    Hadoop consists of three key components:

    HDFS – The distributed storage layer of Hadoop.
    MapReduce – The data processing layer of Hadoop.
    YARN – The resource management layer of Hadoop.
    These three components, working together, are what deliver Hadoop’s key benefits. Many large web companies, such as Yahoo, Amazon and Facebook, have used Hadoop over huge data sets, creating innovative data products such as online advertising systems and recommendation engines (Hadoop itself grew out of Google’s GFS and MapReduce papers).
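    The MapReduce layer mentioned above can be sketched in miniature. This is plain Python mimicking the map, shuffle and reduce phases for the classic word-count job; real Hadoop jobs would be written against the MapReduce API (typically in Java) and run distributed across the cluster.

    ```python
    from collections import defaultdict

    # Miniature word count in the MapReduce style (illustration only, not
    # the Hadoop API): map emits (key, value) pairs, the shuffle groups
    # values by key, and reduce aggregates each group independently.

    def map_phase(line):
        for word in line.split():
            yield (word.lower(), 1)

    def shuffle(pairs):
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        return (key, sum(values))

    lines = ["Hadoop stores data", "Hadoop processes data"]
    pairs = [kv for line in lines for kv in map_phase(line)]
    counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
    print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
    ```

    The point of the split is that map and reduce calls are independent of each other, so Hadoop can run them in parallel on whichever nodes hold the data, with YARN handing out the containers.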

    Hadoop provides a platform with which enterprise IT can apply Big Data analysis to business problems such as product recommendation, fraud detection, sentiment analysis and many more such use cases, making Hadoop one of the most important software platforms for Big Data analytics.



    In Big Data, Hadoop is the solution, so much so that the open-source data platform has become practically synonymous with the wildly popular term for storing and analyzing huge sets of information.

    Now let’s understand how exactly Hadoop does this.
    Before Hadoop, storage at this scale was expensive, and that cost became the biggest motivator for Hadoop in the market.
    Hadoop lets you store as much data as you want, in whatever form you need, simply by adding more servers/nodes to a Hadoop cluster.
    Each new server (which can be commodity hardware, i.e. x86 machines with relatively small price tags) adds more storage and more processing power to the overall cluster. This makes data storage with Hadoop far less costly than prior methods of data storage.

    So in Hadoop we’re not talking about data storage in terms of archiving it; that’s just putting data onto tape. Companies need to store increasingly large amounts of data (volume) and be able to easily get to it for a wide variety of purposes. Doing this with traditional systems was costly and too complex.

    So what exactly can Hadoop store?
    We can store various datasets (variety) like emails, search results, sales data, inventory data, customer data, etc. All this data arrives at a very fast pace (velocity), and trying to manage it all in a relational database management system (RDBMS) would burn a hole in your pocket.

    Expensive RDBMS-based storage also led to data being siloed within an organization. Sales had its data, marketing had its data, accounting had its own data and so on. Worse, each department may have down-sampled its data based on its own assumptions. That can make it very difficult (and misleading) to use the data for company-wide decisions.

    Hadoop lets companies afford to hold all of their data, not just the down-sampled portions. Fixed assumptions don’t need to be made in advance. All data becomes equal and equally available, so business scenarios can be run with raw data at any time as needed, without limitation or assumption. This is a very big deal, because if no data needs to be thrown away, any data model a company might want to try becomes fair game.

    Hadoop also lets companies store data as it comes in – structured or unstructured – so you don’t have to spend money and time configuring data for relational databases and their rigid tables. Since Hadoop can scale so easily, it can also be the perfect platform to catch all the data coming from multiple sources at once.
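    This “store first, structure later” idea (often called schema-on-read) can be sketched as follows. This is a plain Python toy, not Hadoop itself, and the field names (`user`, `amount`) are hypothetical: raw records of mixed shapes are kept verbatim, and a schema is applied only when a query runs.

    ```python
    import json

    # Schema-on-read sketch (illustration only, field names hypothetical):
    # heterogeneous records are stored exactly as they arrive, and
    # structure is imposed only at query time.

    raw_store = []  # stands in for cheap, append-only distributed storage

    def ingest(record_line):
        raw_store.append(record_line)  # no validation, no fixed table schema

    def query_total_amount():
        """Apply structure at read time; skip records that don't fit it."""
        total = 0.0
        for line in raw_store:
            try:
                rec = json.loads(line)
                total += float(rec["amount"])
            except (ValueError, KeyError):
                continue  # unstructured or irrelevant record: kept, just unused here
        return total

    ingest('{"user": "alice", "amount": 10.5}')
    ingest('free-form log text that fits no schema')
    ingest('{"user": "bob", "amount": 4.5}')
    print(query_total_amount())  # 15.0
    ```

    Contrast this with an RDBMS, where the second record would have been rejected (or forced into a rigid table) at write time; here nothing is thrown away, so a later query with different assumptions can still use it.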

