What are the limitations of Apache Spark?

Viewing 3 reply threads
  • Author
    • #6466
      DataFlair Team

      What are the shortcomings of Apache Spark?
      What are the constraints of Apache Spark?

    • #6467
      DataFlair Team

      Nowadays, Apache Spark is considered a next-generation big data tool and is widely used in industry. It does, however, have certain limitations:

      Limitations of Apache Spark:

      1. No File Management System
      Apache Spark has no file management system of its own. It relies on another platform, such as Hadoop or a cloud-based platform, for storage. This is one of the major issues with Apache Spark.

      2. Latency
      Apache Spark has relatively high latency for streaming workloads, because Spark Streaming processes data in micro-batches rather than record by record.

      3. No support for Real-Time Processing
      In Spark Streaming, the arriving live stream of data is divided into batches of a predefined interval, and each batch is treated as a Spark Resilient Distributed Dataset (RDD). These RDDs are then processed with operations such as map, reduce, and join, and the results are returned in batches. This is micro-batch processing: Spark offers near real-time processing of live data, not true real-time processing.
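
      The micro-batch idea can be illustrated without Spark at all. The following is a minimal pure-Python sketch, not Spark's API: the function name `micro_batches` and the sample events are hypothetical, and intervals with no events are simply skipped here.

```python
from collections import defaultdict


def micro_batches(events, interval_s):
    """Group (timestamp_s, value) events into fixed-length batches.

    Batch k holds events with k*interval_s <= timestamp < (k+1)*interval_s.
    Intervals that received no events produce no batch in this sketch.
    """
    batches = defaultdict(list)
    for ts, value in events:
        batches[int(ts // interval_s)].append(value)
    return [batches[k] for k in sorted(batches)]


# Hypothetical event stream: (arrival time in seconds, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]

for batch in micro_batches(events, interval_s=1.0):
    # Each batch is processed as a unit, so a record may wait up to
    # one full interval before its result becomes available.
    print([value.upper() for value in batch])
```

      The wait of up to one batch interval is the reason the processing is described as near real-time.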

      4. Manual Optimization
      Spark jobs have to be optimized manually, and the tuning is specific to each dataset. To get partitioning and caching right in Spark, we need to control both by hand.

      5. Fewer Algorithms
      Spark MLlib lags behind in the number of available algorithms; Tanimoto distance, for example, is not provided.
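
      For context, the Tanimoto distance between two sets (also known as the Jaccard distance) is one minus the ratio of the intersection size to the union size. A minimal pure-Python sketch, assuming set-valued inputs; the function name is hypothetical and this is not an MLlib call:

```python
def tanimoto_distance(a, b):
    """Tanimoto (Jaccard) distance between two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention assumed here: two empty sets are identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Two of the four distinct items are shared.
print(tanimoto_distance({1, 2, 3}, {2, 3, 4}))
```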

      6. Window Criteria
      Spark does not support record-based window criteria for streams. It only has time-based window criteria.
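
      The difference between the two window criteria can be sketched in plain Python. This is an illustration only, with hypothetical function names and sample data; it does not use any Spark API.

```python
def count_windows(records, size):
    """Record-based tumbling windows: a new window every `size` records."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def time_windows(records, length_s):
    """Time-based tumbling windows: records grouped by their timestamp."""
    windows = {}
    for ts, value in records:
        windows.setdefault(int(ts // length_s), []).append((ts, value))
    return [windows[k] for k in sorted(windows)]


# Hypothetical records: (timestamp in seconds, payload).
records = [(0.1, "a"), (0.2, "b"), (0.3, "c"), (5.0, "d")]

# Record-based: always 2 records per window, whatever the timing.
print(count_windows(records, 2))
# Time-based: window sizes vary with the arrival rate.
print(time_windows(records, 1.0))
```

      With a time-based criterion the number of records per window depends on how fast data arrives, which is why it cannot stand in for a record-based one.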

      7. Iterative Processing
      In Spark, data is iterated over in batches, and each iteration is scheduled and executed separately.

      8. Expensive
      When we want cost-efficient processing of big data, the in-memory capability can become a bottleneck, because keeping data in memory is quite expensive. Memory consumption is very high, and it is not handled in a user-friendly manner. Because Apache Spark needs a lot of RAM to run in memory, the cost of running Spark is quite high.

      To learn more about the limitations of Apache Spark, refer to: Limitations of Apache Spark – Ways to Overcome Spark Drawbacks

    • #6468
      DataFlair Team

      1) Spark has no true real-time processing; it offers near real-time processing of live data.
      2) Its in-memory capability can become a bottleneck when it comes to cost-efficient processing of big data.
      3) It does not have its own file management system, so you need to integrate it with Hadoop or another cloud-based data platform.
      4) It consumes a lot of memory, and issues around memory consumption are not handled in a user-friendly manner.
      5) Spark is expensive, because storing large amounts of data in memory is costly.
      6) Manual optimization is required for correct partitioning and caching of data in Spark.
      7) In Spark, data is iterated over in batches, and each iteration is scheduled and executed separately.

    • #6469
      DataFlair Team

      1) Spark offers near real-time processing, not true real-time processing.
      2) Transferring data between an RDBMS and HDFS (in either direction) with Spark is not a mature approach. We can read data from an RDBMS in parallel and represent it as a DataFrame, but the mechanism is rigid: we have to provide a lower bound, an upper bound, and a partition column. (Do not confuse this with Hive partitioning; the Spark JDBC API needs these attributes only to read data in parallel from the RDBMS into a Spark DataFrame.) Sqoop is a more mature way of doing this.
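
      To show what the lower bound, upper bound, and partition column are used for, here is a rough pure-Python sketch of how a numeric range might be split into one WHERE clause per parallel read. It approximates the idea only: the function name is hypothetical, and Spark's own stride arithmetic may differ in detail. Note that in Spark's JDBC reader the bounds only decide the partition stride; they do not filter rows, which is why the first and last clauses below are open-ended.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Split [lower_bound, upper_bound) into per-partition WHERE clauses.

    Simplified illustration, not Spark's exact algorithm.
    """
    if num_partitions < 2:
        raise ValueError("this sketch covers num_partitions >= 2")
    stride = max(1, (upper_bound - lower_bound) // num_partitions)
    # Interior cut points between consecutive partitions.
    cuts = [lower_bound + stride * i for i in range(1, num_partitions)]
    # First partition is open below and also picks up NULLs.
    predicates = [f"{column} < {cuts[0]} OR {column} IS NULL"]
    for lo, hi in zip(cuts, cuts[1:]):
        predicates.append(f"{column} >= {lo} AND {column} < {hi}")
    # Last partition is open above.
    predicates.append(f"{column} >= {cuts[-1]}")
    return predicates


# Hypothetical table with a numeric "id" column ranging roughly 0..100.
for clause in partition_predicates("id", 0, 100, 4):
    print(clause)
```

      If the chosen column is skewed, some of these ranges hold far more rows than others, which is part of why this approach feels rigid.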
