What are the disadvantages of Hadoop?

    • #5864
      DataFlair Team
      Spectator

      What are the limitations of Hadoop?
      What are the shortcomings of Hadoop?

    • #5866
      DataFlair Team
      Spectator

      1. Small file problem: Files smaller than the HDFS block size are considered small files, and they cause trouble in two ways (a hedged packing sketch follows at the end of this list).
      NameNode memory: If a large number of small files are stored, the NameNode must read the metadata of every file during startup, causing delay. It also has to track the details of each block reported through the heartbeat signals sent by the DataNodes, so a large number of blocks implies heavy network bandwidth usage.
      2. MapReduce performance issue: A large number of small files means a large amount of random disk I/O, which hurts performance.

      3. Security concerns: Hadoop security is disabled by default, and encryption is missing at the storage and network levels; this can be a problem for firms where security is a primary concern. Moreover, the Hadoop framework is written in Java, a language that is frequently targeted by attackers.

      4. Issues with big data analytics: SQL-style functionality such as subqueries and ‘group by’ analytics is very limited, and Hadoop itself has no notion of a query optimizer. However, ecosystem tools such as Hive and Pig help to overcome these issues.

      5. Processing speed: Map and Reduce tasks take time, and Hadoop supports only batch processing. These drawbacks are addressed by Spark and Flink, which process data in memory, support stream processing, and handle iterative workloads.
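
      As a rough illustration of a common workaround for the small-file problem in point 1, the minimal sketch below packs many small local files into a single SequenceFile, so the NameNode tracks one large file instead of thousands of tiny ones. It uses the standard Hadoop SequenceFile writer API; the class name, output path, and input file names are hypothetical command-line arguments.

      import java.io.File;
      import java.nio.file.Files;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.BytesWritable;
      import org.apache.hadoop.io.IOUtils;
      import org.apache.hadoop.io.SequenceFile;
      import org.apache.hadoop.io.Text;

      // Packs many small local files into one SequenceFile so the NameNode
      // tracks a single large file instead of thousands of tiny ones.
      public class SmallFilePacker {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              Path out = new Path(args[0]); // hypothetical output, e.g. packed.seq
              SequenceFile.Writer writer = null;
              try {
                  writer = SequenceFile.createWriter(conf,
                          SequenceFile.Writer.file(out),
                          SequenceFile.Writer.keyClass(Text.class),
                          SequenceFile.Writer.valueClass(BytesWritable.class));
                  for (int i = 1; i < args.length; i++) {
                      File f = new File(args[i]);
                      // key = original file name, value = raw file contents
                      writer.append(new Text(f.getName()),
                              new BytesWritable(Files.readAllBytes(f.toPath())));
                  }
              } finally {
                  IOUtils.closeStream(writer);
              }
          }
      }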

      For more detail follow: Limitation of Hadoop

    • #5867
      DataFlair Team
      Spectator

      Disadvantages or Limitations of Hadoop:

      Though Hadoop is a very reliable and fault-tolerant framework for storing and processing large data sets in a distributed computing environment, it has some limitations, such as:

      1) Issues in dealing with a large number of small files –
      1.1. There are many disk seeks due to the multiple read/write operations.
      1.2. When there are a large number of small files, the load on the NameNode increases: the NameNode keeps metadata in memory (roughly 150 bytes per file, directory, or block object), so millions of small files consume a disproportionate amount of memory and degrade the NameNode's performance.
      1.3. One mapper processes one block. If a block holds only 10 KB, the JVM start-up required for every mapper dominates the actual processing time. Hence small files increase the overhead of every mapper (a hedged input-format sketch follows at the end of this list).

      2) Slow processing speed, since data is stored and processed in a distributed fashion
      2.1. In Hadoop, data is distributed and processed across the cluster in MapReduce, which adds coordination time and reduces processing speed.
      2.2. The output of each mapper (the intermediate output) is written to local disk, which further reduces performance.

      3) Hadoop cannot process streaming ("fast") or real-time data, as it supports batch processing only.
      3.1. Hadoop supports batch processing only; it does not process streamed data.
      3.2. The MapReduce processing style suits requirements that are mostly static, where you can wait for batch-mode results. However, many applications need analytics on streaming data, for example from sensors on a factory floor, or run multi-step workloads such as machine-learning algorithms that cannot afford to wait for batch processing (a hedged streaming sketch follows at the end of this list).

      4) Inefficient for iterative processing –
      4.1. MapReduce uses coarse-grained tasks to do its work, which are too heavyweight for iterative algorithms.
      4.2. Sites like Twitter and LinkedIn have data that is naturally modeled as a graph, and most graph algorithms traverse the graph one connection per iteration. Since there is a growing need to get answers faster, Hadoop's performance is lackluster here because of its inefficiency in iterative processing.
      4.3. The map phase of MapReduce flushes intermediate data to disk between steps. Combined, these sources of overhead make algorithms that require many fast steps, such as graph analysis, unacceptably slow.

      5) Latency, i.e. delay in processing – MapReduce needs a lot of time to perform its tasks, thereby increasing latency.

      6) No interactive mode – MapReduce is not easy to use, since developers have to hand-code each and every operation, which makes it difficult to work with. MapReduce has no interactive mode, but layering tools such as Hive and Pig on top of it makes working with MapReduce a little easier for adopters.

      7) Security issues – There is no encryption by default, which poses a big risk to data. HDFS supports access control lists (ACLs) and a traditional file-permissions model, and it can be used together with external systems such as Kerberos, LDAP, and Active Directory for authentication.

      8) In Hadoop, MapReduce has no interactive mode for intermediate feedback on queries, whereas Spark has an interactive shell, so developers and users alike can get intermediate feedback for queries and other actions.

      9) No abstraction – MapReduce developers have to hand-code each and every operation. Spark, on the other hand, provides the RDD abstraction, and Flink provides the DataSet abstraction.

      10) Vulnerability – Hadoop is written in Java, a language widely targeted by cyber criminals, which can lead to security breaches.

      11) No caching – In MapReduce the intermediate output is written to local disk; MapReduce cannot cache intermediate data in memory, which diminishes performance.

      12) Lines of code – Hadoop 2.0 has about 120,000 lines of code, whereas Spark has only about 20,000. The larger codebase makes Hadoop bulky compared to Spark.

      13) Uncertain timelines – Hadoop only ensures that a data job completes; it does not guarantee when the job will be completed.
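
      To illustrate the per-mapper JVM overhead described in point 1 above, here is a minimal sketch of a MapReduce job configured with CombineTextInputFormat, which packs many small files into each input split so that one mapper (one JVM) handles many files instead of one each. The identity mapper, the paths, and the 128 MB split size are illustrative assumptions, not a tuned configuration.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class CombineSmallFilesJob {
          public static void main(String[] args) throws Exception {
              Job job = Job.getInstance(new Configuration(), "combine-small-files");
              job.setJarByClass(CombineSmallFilesJob.class);

              // Pack many small files into one split so a single mapper JVM
              // processes many files instead of starting one JVM per file.
              job.setInputFormatClass(CombineTextInputFormat.class);
              CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024); // ~ one HDFS block

              job.setMapperClass(Mapper.class); // identity mapper, for illustration only
              job.setNumReduceTasks(0);
              job.setOutputKeyClass(LongWritable.class);
              job.setOutputValueClass(Text.class);

              FileInputFormat.addInputPath(job, new Path(args[0]));   // directory of small files
              FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
              System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
      }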

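      And to illustrate the streaming gap described in point 3, the sketch below uses the Spark Streaming Java API (Spark 2.x) to count words in records as they arrive over a socket, rather than waiting for complete input files as MapReduce does. The host, port, and batch interval are illustrative assumptions.

      import java.util.Arrays;

      import org.apache.spark.SparkConf;
      import org.apache.spark.streaming.Durations;
      import org.apache.spark.streaming.api.java.JavaDStream;
      import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
      import org.apache.spark.streaming.api.java.JavaStreamingContext;

      public class StreamingWordCount {
          public static void main(String[] args) throws Exception {
              SparkConf conf = new SparkConf().setAppName("streaming-word-count").setMaster("local[2]");
              // Process the incoming stream in small 5-second micro-batches.
              JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

              // Read lines from a socket as they arrive (hypothetical host/port);
              // plain MapReduce cannot do this, since it only sees finished input files.
              JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);
              JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
              words.countByValue().print(); // word counts, printed per micro-batch

              ssc.start();
              ssc.awaitTermination();
          }
      }
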
      For more detail follow: Limitation of Hadoop

    • #5869
      DataFlair Team
      Spectator

      1. Complexity: Hadoop is a collection of big data tools, and it is not easy to set up, maintain, or write programs for. MapReduce is heavyweight and requires knowledge of Java; this has been eased by newer frameworks such as Hive and Pig, and Spark is needed to perform tasks quickly. Real-time processing of data requires deeply integrating several components.

      2. Securing the Hadoop environment: The Hadoop security model is disabled by default, so if the person in charge of the Hadoop environment does not know how to enable it, the sensitive data stored there is at risk. Things are simple if you use only HDFS and MapReduce, where file-system permissions are enough for authorization, but they get complicated once other frameworks are added. Encryption is also missing at the storage and network levels, which is a major concern for government security agencies. MapReduce is written almost entirely in Java, and the open-source codebase is constantly modified to enable new features, which also opens the door to security threats.

      3. Not fit for small data: Hadoop has a high-capacity design, and HDFS lacks the ability to efficiently support reading small quantities of data.

      4. Multiple copies of already big data: Even though the data is divided into blocks for processing, by default three copies of each block are kept across the DataNodes. This is needed for fault tolerance and data locality, but it multiplies the storage hardware required (a hedged sketch of the cost follows at the end of this list).

      5. Very limited SQL functions: The frameworks that help turn Hadoop into a queryable data warehouse lack basic SQL features such as subqueries, ‘group by’ analytics, etc.
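
      As a small illustration of point 4, the sketch below uses the HDFS FileSystem API to show how the replication factor multiplies the raw storage a file consumes. The path is a hypothetical argument, and lowering the replication factor is shown only as a commented-out option, since doing so trades durability for disk space.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      // Shows the raw-storage cost of HDFS replication: a file's on-disk
      // footprint is its length multiplied by its replication factor (3 by default).
      public class ReplicationCost {
          public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());
              Path p = new Path(args[0]); // hypothetical path, e.g. /data/input.csv
              FileStatus st = fs.getFileStatus(p);
              long logical = st.getLen();
              short replication = st.getReplication(); // usually 3 unless overridden
              System.out.printf("logical size: %d bytes, raw storage used: %d bytes%n",
                      logical, logical * replication);

              // Lowering replication saves disk but weakens fault tolerance, e.g.:
              // fs.setReplication(p, (short) 2);
          }
      }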

      For more detail follow: Limitation of Hadoop

    • #5873
      DataFlair Team
      Spectator

      1) Not fit for small data – Hadoop is not suited to small data and does not efficiently support random reading of small files.
      HDFS is designed to store datasets as a small number of large files, not as a large number of small files.

      2) Processing speed – When MapReduce processes large datasets, the work takes a long time because the data is distributed and processed across the cluster, which reduces processing speed.

      3) Efficiency – Hadoop cannot cache intermediate processing output in memory, which hurts efficiency (a hedged caching sketch follows at the end of this list).

      4) Encryption level – Hadoop is missing encryption at the storage and network levels. It supports Kerberos authentication, which is typically hard to manage.

      5) Very limited SQL functions – The frameworks that help turn Hadoop into a queryable data warehouse lack basic SQL features such as subqueries, ‘group by’, etc.
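
      To illustrate the caching contrast from point 3, the sketch below uses the Spark Java API to cache a small dataset in memory and reuse it across several passes, instead of writing and re-reading intermediate results on disk as MapReduce would. The toy data, the local master setting, and the number of passes are illustrative assumptions.

      import java.util.Arrays;
      import java.util.List;

      import org.apache.spark.SparkConf;
      import org.apache.spark.api.java.JavaRDD;
      import org.apache.spark.api.java.JavaSparkContext;

      // Contrast with MapReduce: the dataset is cached in memory once and then
      // reused across several passes without re-reading it from disk each time.
      public class CachedIteration {
          public static void main(String[] args) {
              SparkConf conf = new SparkConf().setAppName("cached-iteration").setMaster("local[*]");
              try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                  List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
                  JavaRDD<Integer> numbers = sc.parallelize(data).cache(); // keep in memory

                  // Each pass below reuses the cached RDD instead of
                  // materializing intermediate results on disk.
                  for (int i = 0; i < 3; i++) {
                      long evens = numbers.filter(n -> n % 2 == 0).count();
                      System.out.println("pass " + i + ": even count = " + evens);
                  }
              }
          }
      }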

      Learn more about: Limitations of Hadoop
