What are the limitations of Apache Spark?

This topic has 3 replies, 1 voice, and was last updated 5 years, 7 months ago by DataFlair Team.

- September 20, 2018 at 10:25 pm #6466
  
  DataFlair Team
  Spectator
  
  Write the shortcomings of Apache Spark.
  What are the constraints with Apache Spark?
- September 20, 2018 at 10:25 pm #6467
  
  DataFlair Team
  Spectator
  
  Apache Spark is regarded as a next-generation big data tool and is widely used in industry. It does, however, have certain limitations:
  
  Limitations of Apache Spark:
  
  1. No File Management System
  Apache Spark has no file management system of its own. It relies on another platform, such as Hadoop or a cloud-based platform, for file management. This is one of the major issues with Apache Spark.
  
  2. Latency
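  Because Spark Streaming processes data in micro-batches, Apache Spark has higher latency than engines that process records one at a time, such as Apache Flink.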
  
  3. No support for Real-Time Processing
  In Spark Streaming, the arriving live stream of data is divided into batches of a predefined interval, and each batch is treated as a Spark Resilient Distributed Dataset (RDD). These RDDs are then processed with operations such as map, reduce, and join, and the results are returned in batches. Spark Streaming therefore does micro-batch processing: it handles live data in near real time, not in true real time.
  
  4. Manual Optimization
  Spark jobs have to be optimized manually, and the tuning is specific to each dataset. In particular, partitioning and caching must be controlled by hand if they are to be correct.
  
  5. Fewer Algorithms
  Spark MLlib lags behind in the number of available algorithms; Tanimoto distance, for example, is not available.
  
  6. Window Criteria
  Spark does not support record based window criteria. It only has time-based window criteria.
  
  7. Iterative Processing
  In Spark, the data iterates in batches and each iteration is scheduled and executed separately.
  
  8. Expensive
  When cost-efficient processing of big data is the goal, the in-memory capability can become a bottleneck, because keeping data in memory is expensive. Memory consumption is very high, and it is not handled in a user-friendly manner. Apache Spark needs a lot of RAM to run in memory, so the cost of running it is high.
  
  To learn more about the limitations of Apache Spark, see: Limitations of Apache Spark – Ways to Overcome Spark Drawbacks
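
Points 3 and 6 above can be illustrated with a small sketch in plain Python. This is not Spark code and does not use any Spark API; the function names and the sample events are made up for illustration. It only shows the difference between grouping a stream by fixed time intervals (what micro-batching does) and grouping it by record count (the kind of window the post says Spark lacks).

```python
from collections import defaultdict


def time_based_batches(events, interval_s):
    """Group (timestamp_seconds, value) pairs into fixed-length time intervals."""
    batches = defaultdict(list)
    for ts, value in events:
        # Every event that falls in the same interval lands in the same batch.
        batches[int(ts // interval_s)].append(value)
    # Intervals with no events are skipped here to keep the sketch short.
    return [batches[key] for key in sorted(batches)]


def count_based_batches(events, size):
    """Group events into windows of `size` records, ignoring timestamps."""
    values = [value for _, value in events]
    return [values[i:i + size] for i in range(0, len(values), size)]


# Hypothetical sample stream: (arrival time in seconds, payload)
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d"), (3.6, "e")]

print(time_based_batches(events, interval_s=1))  # [['a', 'b'], ['c'], ['d', 'e']]
print(count_based_batches(events, size=2))       # [['a', 'b'], ['c', 'd'], ['e']]
```

In the time-based version, a result for "a" is available only once its one-second interval has closed, which is why this style of processing is called near real-time.
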
- September 20, 2018 at 10:25 pm #6468
  
  DataFlair Team
  Spectator
  
  1) No real-time processing; Spark offers only near real-time processing of live data.
  2) Its "in-memory" capability can become a bottleneck when it comes to cost-efficient processing of big data.
  3) It does not have its own file management system, so you need to integrate it with Hadoop or another cloud-based data platform.
  4) It consumes a lot of memory, and issues around memory consumption are not handled in a user-friendly manner.
  5) Spark is costly, because storing large amounts of data in memory is expensive.
  6) Manual optimization is required for correct partitioning and caching of data in Spark.
  7) In Spark, the data iterates in batches and each iteration is scheduled and executed separately.
- September 20, 2018 at 10:25 pm #6469
  
  DataFlair Team
  Spectator
  
  1) Spark offers only near real-time processing.
  2) Transferring data between an RDBMS and HDFS (in either direction) with Spark is not a mature approach. Data can be read from the RDBMS in parallel and represented as a DataFrame, but the mechanism is rigid: we have to provide a lower bound, an upper bound, and a partition column. (Don't think of this partitioning as the same thing as Hive partitioning; the Spark JDBC API needs these attributes in order to read data in parallel from the RDBMS into a Spark DataFrame.) Sqoop is a more mature way of doing this.
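
To make the lower bound, upper bound, and partition column from the last reply more concrete, here is a simplified sketch in plain Python of how such bounds can be turned into one WHERE clause per parallel query. It is an approximation written for illustration, not Spark's actual implementation, and the function name `partition_predicates` is hypothetical. In Spark itself, the corresponding JDBC read options are `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions`.

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Split the range [lower, upper) into one WHERE clause per partition."""
    if num_partitions <= 1 or upper <= lower:
        return [None]  # a single query with no WHERE clause

    # Width of each partition's slice of the column range (at least 1).
    stride = (upper - lower) // num_partitions or 1

    predicates = []
    bound = lower
    for i in range(num_partitions):
        lo = None if i == 0 else bound
        bound += stride
        hi = None if i == num_partitions - 1 else bound
        if lo is None:
            # First partition also picks up NULLs and values below `lower`.
            predicates.append(f"{column} < {hi} OR {column} IS NULL")
        elif hi is None:
            # Last partition also picks up values at or above `upper`.
            predicates.append(f"{column} >= {lo}")
        else:
            predicates.append(f"{column} >= {lo} AND {column} < {hi}")
    return predicates


for clause in partition_predicates("id", 0, 100, 4):
    print(clause)
```

Note that in this scheme the bounds only decide how the range is sliced; they do not filter rows. If the chosen column is skewed or the bounds are poor guesses, some partitions end up much larger than others, which is part of what makes the approach rigid.
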