What are the limitations of Apache Spark?
September 20, 2018 at 10:25 pm #6466
List the shortcomings of Apache Spark.
What constraints does Apache Spark have?
September 20, 2018 at 10:25 pm #6467
Limitations of Apache Spark:
1. No File Management System
Apache Spark has no file management system of its own. It relies on another platform, such as Hadoop or a cloud-based platform, for file management. This is one of the major issues with Apache Spark.
2. Higher Latency
Because Spark processes data in batches, it has higher latency than engines that handle each record as it arrives.
3. No Support for Real-Time Processing
In Spark Streaming, the incoming live data stream is divided into batches of a predefined interval, and each batch is treated as a Spark Resilient Distributed Dataset (RDD). These RDDs are then processed with operations such as map, reduce, and join, and the results are returned in batches. This is micro-batch processing: Spark Streaming handles live data in near real time, not in true real time.
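To make the micro-batch idea concrete, here is a minimal plain-Python sketch. It is not Spark's API; the function name `micro_batches` and the sample events are made up for illustration. It only shows how arriving records get grouped by a fixed batch interval before anything is processed.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into fixed-length batches.

    Plain-Python illustration of micro-batching; this is NOT Spark's API.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        # An event belongs to the batch whose interval contains its timestamp.
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(value)
    # Return the non-empty batches in time order; each is processed as a unit.
    return [batches[index] for index in sorted(batches)]


# Hypothetical events: (arrival time in seconds, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]

for batch in micro_batches(events, batch_interval=1.0):
    print(len(batch), batch)
```

In this sketch the event that arrives at 0.2 s cannot be handled until its batch closes at 1.0 s, which is why the processing is near real-time rather than real-time.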
4. Manual Optimization
Spark jobs have to be optimized manually, and the tuning that works is specific to a given dataset. If we want partitioning and caching in Spark to be correct, we have to control them ourselves.
5. Fewer Algorithms
Spark MLlib lags behind in the number of available algorithms; for example, it lacks Tanimoto distance.
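For reference, a measure like Tanimoto distance is small enough to write by hand when a library does not provide it. The sketch below is plain Python, not an MLlib function, and the name `tanimoto_distance` is hypothetical. It assumes the common vector form: similarity = a·b / (|a|² + |b|² − a·b), distance = 1 − similarity.

```python
def tanimoto_distance(a, b):
    """Return 1 - Tanimoto similarity for two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a)
    norm_b = sum(y * y for y in b)
    denominator = norm_a + norm_b - dot
    if denominator == 0:
        # Both vectors are all zeros; treat them as identical.
        return 0.0
    return 1.0 - dot / denominator


print(tanimoto_distance([1, 0, 1], [1, 1, 0]))
```

For 0/1 vectors this works out the same as the Jaccard distance between the corresponding sets.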
6. Window Criteria
Spark does not support record-based window criteria; it only has time-based window criteria.
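The difference between the two kinds of window can be sketched in plain Python. This is not Spark code; `time_windows`, `count_windows`, and the sample events are hypothetical names and data used only to show the two criteria side by side.

```python
def time_windows(events, window_seconds):
    """Group (timestamp, value) events into consecutive time-based windows."""
    windows = {}
    for timestamp, value in events:
        # The window is chosen by when the event happened.
        start = int(timestamp // window_seconds) * window_seconds
        windows.setdefault(start, []).append(value)
    return [windows[start] for start in sorted(windows)]


def count_windows(events, window_size):
    """Group events into consecutive windows of a fixed number of records."""
    values = [value for _, value in events]
    # The window is chosen by how many records have arrived, not by time.
    return [values[i:i + window_size] for i in range(0, len(values), window_size)]


# Hypothetical events: (timestamp in seconds, payload).
events = [(0, "a"), (1, "b"), (2, "c"), (10, "d"), (11, "e"), (25, "f")]

print(time_windows(events, window_seconds=10))
print(count_windows(events, window_size=2))
```

With the time-based criterion the window sizes vary with the arrival rate (3, 2, and 1 records here), while the record-based criterion always closes a window after a fixed number of records.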
7. Iterative Processing
In Spark, the data iterates in batches and each iteration is scheduled and executed separately.
8. Expensive
In-memory capability can become a bottleneck when we want cost-efficient processing of big data, because keeping data in memory is quite expensive. Memory consumption is very high, and it is not handled in a user-friendly manner. Apache Spark requires a lot of RAM to run in-memory, so the cost of running Spark is quite high.
To learn more about the limitations of Apache Spark, refer to: Limitations of Apache Spark – Ways to Overcome Spark Drawbacks
September 20, 2018 at 10:25 pm #6468
1) No real-time processing; Spark only offers near real-time processing of live data.
2) Its “in-memory” capability can become a bottleneck when it comes to cost-efficient processing of big data.
3) It does not have its own file management system, so you need to integrate it with Hadoop or another cloud-based data platform.
4) It consumes a lot of memory, and issues around memory consumption are not handled in a user-friendly manner.
5) Spark is costly, because storing a large amount of data in memory is expensive.
6) Manual optimization is required for correct partitioning and caching of data in Spark.
7) In Spark, the data iterates in batches and each iteration is scheduled and executed separately.
September 20, 2018 at 10:25 pm #6469
1) Spark offers near real-time processing, not true real-time processing.
2) Transferring data from an RDBMS to HDFS (and vice versa) using Spark is not a mature approach. We can read the data in parallel from the RDBMS and represent it as a DataFrame, but it is a very rigid way of doing it: we have to provide a lower bound, an upper bound, and a partition column. (Don't think of this partitioning as the same thing as Hive partitioning. For the Spark JDBC API, we have to give these attributes in order to read data in parallel from the RDBMS into a Spark DataFrame.) Sqoop is a more mature way of doing this.
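To show why the bounds and the partition column are needed, here is a rough plain-Python sketch of how a numeric range can be split into one WHERE clause per parallel read. In Spark's JDBC reader the corresponding options are `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions`. The function name `jdbc_partition_predicates` and the column `id` below are hypothetical, and this is a simplified illustration of the idea, not Spark's actual implementation.

```python
def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Split a numeric range into one WHERE clause per partition.

    Simplified sketch of range partitioning for parallel JDBC reads;
    Spark's own logic may differ in detail.
    """
    if num_partitions < 2:
        raise ValueError("a single partition needs no predicate")
    if lower_bound >= upper_bound:
        raise ValueError("lower_bound must be less than upper_bound")

    # The bounds only decide the stride; they do not filter rows out.
    stride = max(1, (upper_bound - lower_bound) // num_partitions)
    boundaries = [lower_bound + stride * i for i in range(1, num_partitions)]

    predicates = []
    for i in range(num_partitions):
        if i == 0:
            # First partition also takes rows below the lower bound and NULLs.
            predicates.append(f"{column} < {boundaries[0]} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition also takes rows above the upper bound.
            predicates.append(f"{column} >= {boundaries[-1]}")
        else:
            predicates.append(
                f"{column} >= {boundaries[i - 1]} AND {column} < {boundaries[i]}"
            )
    return predicates


for predicate in jdbc_partition_predicates("id", 0, 100, 4):
    print(predicate)
```

Each predicate becomes a separate query against the database, so a skewed partition column or badly chosen bounds leave some readers with most of the rows. That is the rigidity described above.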