How is data represented in Spark?
September 20, 2018 at 11:19 pm #6482
What are the different ways of representing data in Spark?
In how many ways can we represent data in Spark?
September 20, 2018 at 11:19 pm #6483
There are 3 ways to represent data in Apache Spark:
RDD, DataFrame, and Dataset
RDD (Resilient Distributed Dataset): It is the fundamental data structure of Apache Spark. It is an immutable collection of objects, partitioned so that the objects can be computed on different nodes of the cluster. There are three ways to create an RDD: by parallelizing an existing collection in the driver program, by transforming another RDD, or by loading data from stable storage. An RDD provides two types of operations: transformations, which define a new RDD, and actions, which trigger the computation and return a result.
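To make the transformation/action split concrete, here is a small plain-Python sketch. It is only a toy model: `ToyRDD` is a made-up class for illustration and is not Spark's RDD API. The assumption it illustrates is that transformations only record what should happen, and nothing runs until an action is called.

```python
# Toy model of lazy transformations and eager actions.
# ToyRDD is a hypothetical name for illustration; it is NOT Spark's RDD class.
class ToyRDD:
    def __init__(self, source, steps=()):
        self._source = tuple(source)   # immutable input data
        self._steps = tuple(steps)     # recorded transformations (the lineage)

    def map(self, fn):
        # Transformation: returns a new ToyRDD and computes nothing yet.
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also just records the step.
        return ToyRDD(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: replays the recorded steps over the source data.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []

def double(x):
    calls.append(x)        # record that the function actually ran
    return x * 2

numbers = ToyRDD([1, 2, 3, 4])
evens_doubled = numbers.filter(lambda x: x % 2 == 0).map(double)
ran_before_action = list(calls)     # still empty: nothing has executed
result = evens_doubled.collect()    # the action triggers the computation
```

The original `numbers` object is never modified; each transformation returns a new object, which mirrors the immutability described above.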
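DataFrame: A DataFrame is a distributed collection of data arranged in named columns, much like a relational table, and queries on it go through Spark SQL's optimizer. Although the DataFrame is built on top of the RDD, it keeps the RDD's core properties: it is immutable, distributed, and lazily evaluated. DataFrames go beyond RDDs in that Spark manages their memory layout itself and optimizes the query plan before executing it.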
Dataset: It is a strongly typed data structure in Spark SQL that maps to a relational schema. A Dataset offers both compile-time type safety and an object-oriented programming interface. It combines the properties of the DataFrame (schema, optimized execution) and the RDD (typed objects, lambda functions), and so provides a better functional programming interface.
September 20, 2018 at 11:19 pm #6484
There are mainly three ways data can be represented in Spark:
RDD: The RDD was the primary user-facing API in Spark from its inception. At its core, an RDD is an immutable distributed collection of data elements, partitioned across the nodes of the cluster, that can be operated on in parallel through a low-level API offering transformations and actions.
DataFrame: Unlike an RDD, a DataFrame organizes data into named columns, just like a table in a relational database. It is designed to make processing large data sets even easier: a DataFrame lets developers impose a structure on a distributed collection of data, which allows a higher-level abstraction. It also provides a domain-specific language API to manipulate the distributed data.
Dataset: The Dataset API has two distinct characteristics: a strongly typed API and an untyped API. Conceptually, a DataFrame is an alias for Dataset[Row], a collection of generic objects, where a Row is a generic untyped JVM object. A Dataset[T], by contrast, is a collection of strongly typed JVM objects, whose type is dictated by a case class you define in Scala or a class in Java.
September 20, 2018 at 11:20 pm #6485
There are 4 ways to represent data in Spark:
RDD, DataFrame, Dataset, and the latest, GraphFrame (provided by the separate GraphFrames package rather than by core Spark).
RDD (Resilient Distributed Dataset): It is the fundamental data structure of Apache Spark and provides the core abstraction. It is an immutable collection of objects that are computed on different nodes of the cluster. It is resilient (lost partitions can be recomputed from their lineage), lazily evaluated, and statically typed.
Disadvantage: When Spark distributes the data within the cluster, it uses Java serialization by default. Serializing Java and Scala objects is expensive, because both the data and its structure must be sent between nodes (each serialized object carries its class structure as well as its values). There is also the garbage collection overhead that results from creating and destroying individual objects.
RDDs are the lowest-level data structures in Spark. RDDs and their transformations describe how to compute a result (as in MapReduce) rather than declaring what result is wanted (as in SQL). Hence, no query optimization is performed on RDD operations: each operation is executed where it appears in the DAG (i.e. the DAG is not optimized).
DataFrame: It is an immutable, distributed collection of data organised into named columns. DataFrames are not created directly through the Spark context but through the SQL context (SQLContext, or SparkSession since Spark 2.0). The DataFrame introduces a schema to describe the data, which lets Spark manage the schema itself and pass only the data between nodes, far more efficiently than Java serialization. There are also advantages when computing within a single process: Spark can serialize the data into off-heap storage in a binary format and then perform transformations directly on this off-heap memory, avoiding the garbage collection costs of constructing an individual object for each row in the data set. DataFrames also take advantage of the Catalyst optimizer to optimise their query plans and achieve better performance than RDDs.
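As a rough analogy for why "schema once, values only" is cheaper than serializing each object with its structure, here is a plain-Python comparison. It uses Python's `pickle` and `struct` modules as stand-ins; it is not Java serialization and not Spark's actual binary format, and the two-column layout (`id`, `score`) is invented for the example.

```python
import pickle
import struct

# 1000 small records with the same two fields.
rows = [{"id": i, "score": i * 0.5} for i in range(1000)]

# Object-at-a-time serialization: every record carries its own structure
# (here, the field names), loosely analogous to Java serialization.
per_object = [pickle.dumps(row) for row in rows]
per_object_bytes = sum(len(blob) for blob in per_object)

# Schema-based encoding: the structure is stated once, rows hold only values.
schema = struct.Struct("<qd")   # one 64-bit integer, one 64-bit float
packed = b"".join(schema.pack(r["id"], r["score"]) for r in rows)
schema_bytes = len(packed)      # 16 bytes per row

# Values can be read straight out of the buffer without first
# rebuilding a dictionary object for every row.
total_score = sum(score for _id, score in schema.iter_unpack(packed))
```

Comparing `per_object_bytes` with `schema_bytes` shows the size gap. The exact ratio depends on the pickle protocol in use, so treat it as an illustration of the idea, not a measurement of Spark.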
Disadvantage: The main drawback is the lack of compile-time type safety. A DataFrame has an associated schema with column names and types, but that schema is only known at runtime, so the compiler cannot check code against it. Because the code refers to data attributes by name, the compiler cannot catch mistakes, and the user has to cast values to the expected type. If an attribute name is incorrect, the error is only detected at runtime.
Dataset: The Dataset eliminates this drawback of the DataFrame by adding type safety. It also runs on the SQL context and offers a syntax similar to that of the RDD, with lambda expressions. Since a Dataset knows which data type it holds, it can generate encoders to store and operate on the data in Tungsten format (Tungsten is the optimised memory engine used by Spark, which manages direct access to off-heap memory to improve performance).
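The difference between looking up columns by name and working with typed objects can be sketched in plain Python. This is only an analogy: `select` and `Person` are hypothetical names, and Python itself checks nothing at compile time, so the "typed" half relies on a static type checker (for a real Dataset, the Scala or Java compiler plays that role).

```python
from dataclasses import dataclass

# Untyped rows: columns are looked up by name, so a typo surfaces
# only when the code actually runs.
rows = [{"name": "Ann", "age": 34}, {"name": "Bob", "age": 27}]

def select(rows, column):
    # Hypothetical helper for illustration, not a Spark function.
    return [row[column] for row in rows]

try:
    select(rows, "agee")     # misspelled column name
    error = None
except KeyError as exc:
    error = exc              # discovered only at runtime

# Typed records: field names and types are part of the class definition,
# so a static checker can reject `person.agee` before the program runs.
@dataclass(frozen=True)
class Person:
    name: str
    age: int

people = [Person(**row) for row in rows]
ages = [person.age for person in people]
```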
GraphFrame: GraphFrames are dedicated to graph storage and manipulation. The graph data is stored in two distinct frames: 1) the graph vertices, or vertex DataFrame, and 2) the graph edges, or edge DataFrame.
In addition to the DataFrame API, it provides graph operations such as breadth-first search, shortest paths, PageRank, etc.
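To show the "two tables" idea, here is a plain-Python sketch that keeps vertices and edges as two separate lists of rows and runs a breadth-first search over them. It does not use the GraphFrames library. The column names `id`, `src`, and `dst` follow what I understand to be the GraphFrames convention, and `bfs_path` and the sample data are made up for the example.

```python
from collections import deque

# Vertex table: one row per vertex, keyed by "id".
vertices = [{"id": "a", "name": "Alice"}, {"id": "b", "name": "Bob"},
            {"id": "c", "name": "Carol"}, {"id": "d", "name": "Dan"}]
# Edge table: one row per directed edge, with "src" and "dst" vertex ids.
edges = [{"src": "a", "dst": "b"}, {"src": "b", "dst": "c"},
         {"src": "a", "dst": "d"}, {"src": "d", "dst": "c"}]

def bfs_path(edges, start, goal):
    """Return one shortest path (fewest edges) from start to goal, or None."""
    # Build an adjacency list from the edge table.
    neighbours = {}
    for edge in edges:
        neighbours.setdefault(edge["src"], []).append(edge["dst"])
    parent = {start: None}       # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in neighbours.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

path = bfs_path(edges, "a", "c")
names = {v["id"]: v["name"] for v in vertices}
named_path = [names[vertex_id] for vertex_id in path]
```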