This topic contains 3 replies, has 1 voice, and was last updated by  dfbdteam5 11 months ago.

Viewing 4 posts - 1 through 4 (of 4 total)


    What are the shortcomings of Apache Spark?
    What constraints does Apache Spark have?



    Nowadays, Apache Spark is considered a next-generation big data tool and is widely used in industry. But it has certain limitations. They are:

    Limitations of Apache Spark:

    1. No File Management System
    Apache Spark has no file management system of its own. It relies on another platform, such as Hadoop or a cloud-based platform, for file management. This is one of the major issues with Apache Spark.

    2. Latency
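    Spark has higher latency than engines that process each record as it arrives (Apache Flink, for example), because its streaming work is done in micro-batches (see point 3).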

    3. No support for Real-Time Processing
    In Spark Streaming, the arriving live stream of data is divided into batches of a pre-defined interval, and each batch of data is treated as a Spark Resilient Distributed Dataset (RDD). These RDDs are then processed using operations like map, reduce, join, etc. The results of these operations are returned in batches. Thus, Spark Streaming does micro-batch processing: it is not real-time processing, but near real-time processing of live data.
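
    To make the micro-batch idea concrete, here is a minimal sketch in plain Python (it does not use Spark). The function name `micro_batches`, the sample events, and the one-second interval are all assumptions chosen for illustration. It only shows why a record has to wait for its batch to close before a result appears.

```python
from collections import defaultdict

def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into fixed-interval batches.

    Each event lands in the batch whose interval contains its timestamp,
    which mimics how a micro-batch engine discretizes a live stream.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(value)
    # Return batches in time order as (batch_end_time, values) pairs.
    return [((index + 1) * batch_interval, batches[index])
            for index in sorted(batches)]

# Hypothetical stream: (arrival time in seconds, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.6, "e")]

results = []
for batch_end, values in micro_batches(events, batch_interval=1.0):
    # The result for a batch only becomes available at batch_end, so an
    # event can wait up to one full interval before it is processed.
    results.append((batch_end, [v.upper() for v in values]))

print(results)
```

    The event that arrives at 0.2 s is not reflected in any output until the first batch closes at 1.0 s. That delay is the "near real-time" behaviour described above.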

    4. Manual Optimization
    Spark jobs have to be optimized manually, and the tuning tends to be specific to a given dataset. If we want partitioning and caching in Spark to be correct, we have to control them manually.
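
    One reason partitioning needs manual attention is key skew. The sketch below is plain Python, not Spark code: `partition_sizes`, the sample keys, and the use of CRC32 as the hash are assumptions made for illustration. It shows how simple hash partitioning can place most records in a single partition when one key dominates.

```python
import zlib
from collections import Counter

def partition_sizes(keys, num_partitions):
    """Count records per partition under simple hash partitioning."""
    sizes = Counter()
    for key in keys:
        # crc32 is used instead of hash() so results are stable across runs.
        sizes[zlib.crc32(key.encode("utf-8")) % num_partitions] += 1
    return [sizes[p] for p in range(num_partitions)]

# Hypothetical dataset: one "hot" key accounts for 90% of the records.
keys = ["hot"] * 900 + [f"user{i}" for i in range(100)]

sizes = partition_sizes(keys, num_partitions=4)
largest_share = max(sizes) / len(keys)

print(sizes, largest_share)
```

    All 900 "hot" records hash to the same partition, so one task does at least 90% of the work regardless of how many partitions are configured. Fixing that takes a manual choice such as a different partitioning key.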

    5. Fewer Algorithms
    Spark MLlib lags behind in the number of available algorithms. Tanimoto distance, for example, is not available.

    6. Window Criteria
    Spark does not support record-based window criteria. It only has time-based window criteria.
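
    The difference between the two window types can be sketched in plain Python (again without Spark). The helper names `time_windows` and `count_windows` and the sample events are hypothetical.

```python
def time_windows(events, width):
    """Tumbling time-based windows: group by the `width`-second span
    that contains each event's timestamp."""
    windows = {}
    for timestamp, value in events:
        windows.setdefault(int(timestamp // width), []).append(value)
    return [windows[k] for k in sorted(windows)]

def count_windows(events, size):
    """Tumbling record-based windows: close a window after every
    `size` records, regardless of when they arrived."""
    values = [value for _, value in events]
    return [values[i:i + size] for i in range(0, len(values), size)]

# Hypothetical stream of (timestamp in seconds, value): a burst, then a lull.
events = [(0.1, 1), (0.2, 2), (0.3, 3), (0.4, 4), (5.0, 5), (9.9, 6)]

by_time = time_windows(events, width=5)
by_count = count_windows(events, size=2)

print(by_time)
print(by_count)
```

    With time-based windows the number of records per window depends on the arrival rate. With record-based windows every window holds the same number of records.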

    7. Iterative Processing
    In Spark, the data iterates in batches and each iteration is scheduled and executed separately.

    8. Expensive
    When we want cost-efficient processing of big data, the in-memory capability can become a bottleneck, because keeping data in memory is quite expensive. Memory consumption is very high, and it is not handled in a user-friendly manner. Apache Spark requires lots of RAM to run in-memory, so the cost of running Spark is quite high.
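
    A rough back-of-the-envelope calculation shows how the RAM requirement drives cluster size. Every number below is an assumption for illustration: the dataset size, the overhead factor for deserialized objects, and the usable RAM per node. Real values vary widely with data types and serialization settings.

```python
import math

def nodes_needed(dataset_gb, overhead_factor, usable_ram_per_node_gb):
    """Estimate how many nodes are needed to hold a dataset fully in memory."""
    in_memory_gb = dataset_gb * overhead_factor
    return math.ceil(in_memory_gb / usable_ram_per_node_gb)

# Assumed inputs: 1 TB of raw data, objects taking 3x their raw size
# in memory, and 48 GB of RAM per node actually available for caching.
nodes = nodes_needed(dataset_gb=1000,
                     overhead_factor=3.0,
                     usable_ram_per_node_gb=48)

print(nodes)
```

    Under these assumptions, caching the whole dataset takes dozens of nodes that are sized for memory rather than for disk. This is where the cost comes from.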

    To know more about the limitations of Apache Spark, refer to: Limitations of Apache Spark – Ways to Overcome Spark Drawbacks



    1) No real-time processing; Spark offers only near real-time processing of live data.
    2) Its "in-memory" capability can become a bottleneck when it comes to cost-efficient processing of big data.
    3) It does not have its own file management system, so you need to integrate it with Hadoop or another cloud-based data platform.
    4) It consumes a lot of memory, and issues around memory consumption are not handled in a user-friendly manner.
    5) Spark can be very costly, as storing large amounts of data in memory is expensive.
    6) Manual optimization is required for correct partitioning and caching of data in Spark.
    7) In Spark, the data iterates in batches, and each iteration is scheduled and executed separately.



    1) Spark offers only near real-time processing.
    2) Transferring data between an RDBMS and HDFS (in either direction) with Spark is not a mature approach. We can read the data from the RDBMS in parallel and represent it as a DataFrame, but the way of doing so is very rigid: we have to provide a lower bound, an upper bound, and a partition column. (Do not confuse this with Hive partitioning. For the Spark JDBC API, we have to give these attributes only to read data in parallel from the RDBMS into a Spark DataFrame.) Sqoop is a more mature way of doing this.
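
    To illustrate what the lower bound, upper bound, and partition column are used for, here is a simplified sketch in plain Python of how a numeric range can be split into one WHERE clause per parallel read. It is an approximation for illustration, not Spark's actual code, and the function name `partition_predicates` and the column `id` are hypothetical.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Split [lower_bound, upper_bound) into one predicate per partition.

    The bounds only decide the stride. The first and last predicates are
    left open-ended, so rows outside the bounds are still read.
    """
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        upper = f"{column} < {current}" if i != num_partitions - 1 else None
        if upper is None:
            predicates.append(lower)
        elif lower is None:
            # NULL keys are assigned to the first partition.
            predicates.append(f"{upper} OR {column} IS NULL")
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates

predicates = partition_predicates("id", lower_bound=0, upper_bound=1000,
                                  num_partitions=4)
for predicate in predicates:
    print(predicate)
```

    Each predicate becomes a separate query against the database. If the values of the partition column are unevenly spread between the bounds, the parallel reads are unevenly sized too, and this is part of why the approach feels rigid.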

