What is Hadoop?

    • #5790
      DataFlair Team
      Spectator

      What is Hadoop?
      Is it just a database like Oracle or DB2, or is it another NoSQL store?
      Is it just a distributed framework for data processing? What is new in Hadoop compared to older distributed storage technologies like SAN?

    • #5791
      DataFlair Team
      Spectator

      Hadoop is not a database; it is an open-source framework. It has its own storage system, called HDFS, and a framework for processing huge data sets in parallel, called MapReduce.

      1) HDFS [Hadoop Distributed File System]: a reliable, fault-tolerant, highly available storage system. It stores data in a distributed fashion across the cluster, scales out at very low cost, and runs on commodity hardware.

      2) MapReduce: a programming model designed for processing large amounts of data in parallel on commodity hardware.
      It divides a job into smaller independent tasks, each of which runs in parallel across the cluster, as shown in the sketch below.
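
      To make the “smaller independent tasks” idea concrete, here is the classic word-count job from the Hadoop documentation, written against the Java MapReduce API: each mapper emits a (word, 1) pair per token, the framework groups the pairs by word, and each reducer sums the counts. The input and output HDFS paths are passed on the command line.

      import java.io.IOException;
      import java.util.StringTokenizer;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class WordCount {

        // Map step: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
          private final static IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          public void map(Object key, Text value, Context context)
              throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
              word.set(itr.nextToken());
              context.write(word, ONE);
            }
          }
        }

        // Reduce step: sum the counts emitted for each word.
        public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
          private final IntWritable result = new IntWritable();

          @Override
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
              sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
          }
        }

        public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenizerMapper.class);
          job.setCombinerClass(IntSumReducer.class); // local pre-aggregation per mapper
          job.setReducerClass(IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
          FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }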

      Unlike an RDBMS or other traditional databases, Hadoop processes data in batches and takes time rather than giving immediate results. In exchange, it handles volumes of data that such databases cannot.

      A SAN (Storage Area Network) is a specialized, high-speed network that gives servers access to shared storage.

      Hadoop normally runs on local disks, but it can also run well in a shared SAN environment for small to medium-sized clusters, with different cost and performance characteristics.


    • #5792
      DataFlair Team
      Spectator

      Hadoop is an open-source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. The distributed storage of data allows distributed, parallel processing of that data.

      Hadoop vs RDBMS:

      Hadoop can be a good fit if you have a lot of data, you do not yet know what to do with it, and you do not want to lose it. Hadoop does not require any up-front modeling and is very forgiving for people who do not understand their data. In an RDBMS you must always model your data first. There are many more differences.

      Hadoop refers to an ecosystem of software packages, including MapReduce, HDFS, and a whole host of other packages that support importing data into and exporting data from HDFS (the Hadoop Distributed File System). When someone says, “I have a Hadoop cluster,” they generally mean a cluster of machines all running in this ecosystem, with a large distributed file system to support large-scale computation.

      NoSQL refers to non-relational (or at least non-SQL) database solutions such as HBase (also part of the Hadoop ecosystem), Cassandra, MongoDB, Riak, CouchDB, and many others.
      Hadoop – a computing framework
      NoSQL – “Not Only SQL” databases
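
      As a concrete taste of one such store, here is a minimal HBase client sketch that writes and reads a single cell. The table name “users” and column family “info” are hypothetical examples; the table would need to exist already, and an hbase-site.xml pointing at the cluster must be on the classpath.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.TableName;
      import org.apache.hadoop.hbase.client.Connection;
      import org.apache.hadoop.hbase.client.ConnectionFactory;
      import org.apache.hadoop.hbase.client.Get;
      import org.apache.hadoop.hbase.client.Put;
      import org.apache.hadoop.hbase.client.Result;
      import org.apache.hadoop.hbase.client.Table;
      import org.apache.hadoop.hbase.util.Bytes;

      public class HBasePutGet {
        public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
          try (Connection conn = ConnectionFactory.createConnection(conf);
               Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table
            // Store one cell: row key "user1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Fetch the row back by key and read the same cell.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name)); // prints "Alice"
          }
        }
      }

      Note how rows are addressed by key and column rather than by SQL queries; this key/value access pattern is what “Not Only SQL” means in practice.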

      HDFS vs SAN

      HDFS provides an extremely low cost per byte, very high bandwidth to support MapReduce workloads, and rock-solid data reliability.


    • #5793
      DataFlair Team
      Spectator

      Hadoop can be thought of as a set of open-source programs and procedures (meaning essentially that they are free for anyone to use or modify, with a few exceptions) which anyone can use as the “backbone” of their big-data operations.

      The 4 main Modules of Hadoop
      1. Distributed File System
      The two most important modules are the Distributed File System, which allows data to be stored in an easily accessible format across a large number of linked storage devices, and MapReduce, which provides the basic tools for poking around in the data.

      (A “file system” is the method used by a computer to store data so it can be found and used. Normally this is determined by the computer’s operating system; however, a Hadoop cluster uses its own file system which sits “above” the file systems of the host machines, meaning it can be accessed from any computer running a supported OS. A short sketch of reading and writing HDFS programmatically appears after this list of modules.)

      2. MapReduce
      MapReduce is named after the two basic operations this module carries out: reading data and putting it into key/value form suitable for analysis (map), and then aggregating those intermediate results, for example with mathematical operations such as counting or summing (reduce).

      3. Hadoop Common
      The other module is Hadoop Common, which provides the tools (in Java) needed for the user’s computer systems (Windows, Unix or whatever) to read data stored under the Hadoop file system.

      4. YARN
      The final module is YARN, which manages resources of the systems storing the data and running the analysis.
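
      To illustrate the Distributed File System module, here is a minimal sketch that uses Hadoop’s Java FileSystem API to write a file into HDFS and read it back. The path /tmp/hello.txt is a hypothetical example, and the sketch assumes a core-site.xml on the classpath whose fs.defaultFS points at the cluster (e.g. hdfs://namenode:8020).

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration(); // picks up fs.defaultFS from core-site.xml
          FileSystem fs = FileSystem.get(conf);

          Path path = new Path("/tmp/hello.txt"); // hypothetical path, for illustration only

          // Write a small file; HDFS splits files into blocks and
          // replicates each block across DataNodes automatically.
          try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("Hello, HDFS!");
          }

          // Read the file back through the same API.
          try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
          }

          fs.close();
        }
      }

      The same code runs against the local file system if fs.defaultFS is left at its default, which is one way Hadoop’s file system sits “above” the host operating system’s own.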

      An RDBMS works on well-structured data in tables, at small or medium scale, with a database schema defined in an ER model. It is well suited to real-time OLTP processing. However, an RDBMS scales vertically, by adding CPUs or more storage to one machine, which limits how fast it can deliver results, and it becomes unreliable if the main server is down.
      Hadoop, on the other hand, manages both large structured and unstructured data in different formats such as XML, JSON, and plain text, with high fault tolerance. With clusters of many servers scaling horizontally, a Hadoop system’s performance is superior, and it provides faster results from big, unstructured data.

      Hadoop also has pluggable NoSQL modules such as HBase.

      You can run Hadoop with a SAN (storage area network), but you usually shouldn’t. Using traditional enterprise storage like a SAN essentially means giving up on data locality: one of the big promises of Hadoop is that the (smaller) work code moves to the (larger) data.
      With a SAN, this is no longer the case. Data now moves through the network, with the price point determining how fast that data moves.
