Apache Spark is an open-source big data framework. Its expressive APIs let big data professionals run both streaming and batch workloads efficiently. It is a fast, general-purpose data processing engine designed primarily for fast computation. Spark was first developed at UC Berkeley in 2009 and is now an Apache project, often described as “lightning-fast cluster computing”. It distributes data across the machines of a cluster and processes that data in parallel. It covers a wide range of workloads, including batch applications, iterative algorithms, interactive queries, and streaming. Applications can be written in Java, Python, or Scala.
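To make the idea of splitting data into pieces and processing them in parallel concrete, here is a minimal sketch in plain Python. It is not Spark's API: the function names (`split_into_partitions`, `count_words`, `parallel_word_count`) are hypothetical, and threads on one machine stand in for the worker nodes of a cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal the records round-robin into num_partitions lists."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """Per-partition work: count words in this partition's lines only."""
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts


def parallel_word_count(lines, num_partitions=3):
    """Count words by processing each partition independently, then merging."""
    partitions = split_into_partitions(lines, num_partitions)
    # Assumption: threads play the role of cluster workers in this toy model.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    # Merge the per-partition results into a single total.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


lines = ["spark is fast", "spark runs in memory", "memory is fast"]
word_counts = parallel_word_count(lines)
print(word_counts["spark"], word_counts["runs"])
```

Each partition is counted without looking at any other partition, which is what allows the work to be spread across many machines. Only the final merge needs to see all the partial results.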
Spark was developed to overcome the limitations of the MapReduce cluster computing paradigm. Spark keeps intermediate data in memory, whereas MapReduce writes intermediate results to disk and reads them back between steps. Spark also lets you cache data in memory, which benefits iterative algorithms such as those used in machine learning, because they read the same data many times.
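The benefit of caching for iterative algorithms can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions, not Spark code: the `Dataset` class and its `loads` counter are hypothetical, and `_load` merely pretends to be an expensive read from disk.

```python
class Dataset:
    """Hypothetical stand-in for a dataset that is expensive to load."""

    def __init__(self, values):
        self._values = values
        self._cached = None
        self.loads = 0  # number of simulated trips to disk

    def _load(self):
        # Assumption: in a real system this would be a slow disk or network read.
        self.loads += 1
        return list(self._values)

    def cache(self):
        """Load the data once and keep it in memory for later reads."""
        self._cached = self._load()
        return self

    def read(self):
        """Return the in-memory copy if there is one, otherwise reload."""
        return self._cached if self._cached is not None else self._load()


def iterative_mean(dataset, iterations):
    """Gradient-descent-style loop: every pass reads the full dataset."""
    estimate = 0.0
    for _ in range(iterations):
        data = dataset.read()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= 0.5 * gradient
    return estimate


uncached = Dataset([1.0, 2.0, 3.0, 4.0])
iterative_mean(uncached, iterations=10)

cached = Dataset([1.0, 2.0, 3.0, 4.0]).cache()
estimate = iterative_mean(cached, iterations=10)

print(uncached.loads, cached.loads)
```

In this model the uncached dataset is loaded once per iteration, while the cached one is loaded a single time no matter how many passes the algorithm makes. Both runs compute the same estimate; only the number of loads differs.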
Spark is also easier to develop with, because it provides high-level operations for working with data. It supports SQL queries, streaming data, and graph processing. Spark does not need Hadoop to run: it can run on its own and read from and write to other storage systems such as Cassandra and S3. In terms of speed, Spark can run programs up to 100x faster than MapReduce in memory, or up to 10x faster on disk.