GraphX API in Apache Spark: An Introductory Guide


1. Objective – Spark GraphX API

For graphs and graph-parallel computation, Apache Spark has an additional API, GraphX. In this blog, we will learn the whole concept of GraphX API in Spark: how to import Spark and GraphX into a project, the Property Graph, graph operators, and the Pregel API. We will also cover the features and use cases of GraphX API.

So, let’s start the Spark GraphX API tutorial.


2. Introduction to Spark GraphX API

For graphs and graph-parallel computation, Apache Spark has an additional API, GraphX. To simplify graph analytics tasks, it includes a collection of graph algorithms and builders.
In addition, it extends the Spark RDD with a Resilient Distributed Property Graph. The property graph is a directed multigraph, which means it can have multiple edges in parallel. Every vertex and edge has user-defined properties associated with it, and parallel edges allow multiple relationships between the same pair of vertices.

3. Getting Started with GraphX

To get started, you first need to import Spark and GraphX into your project as follows:
import org.apache.spark._
import org.apache.spark.graphx._
// To make some of the examples work we will also need RDD
import org.apache.spark.rdd.RDD
Note: you will also need a SparkContext if you are not using the Spark shell.
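For example, here is a minimal sketch of creating a SparkContext in a standalone application; the application name and master URL are illustrative placeholders, and the org.apache.spark._ import above already provides SparkConf and SparkContext:
val conf = new SparkConf()
  .setAppName("GraphXExample") // hypothetical application name
  .setMaster("local[*]")       // run locally using all available cores
val sc = new SparkContext(conf)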


4. The Property Graph

A property graph is a directed multigraph with user-defined objects attached to each vertex and edge. As a multigraph, it can have multiple parallel edges sharing the same source and destination vertex. Because parallel edges are supported, there can be multiple relationships between the same vertices, which simplifies many modeling scenarios. Each vertex is keyed by a unique 64-bit long identifier (VertexId); GraphX does not impose any ordering constraints on the vertex identifiers. Edges, correspondingly, have source and destination vertex identifiers.
The property graph is parameterized over the vertex (VD) and edge (ED) types, which are the types of the objects associated with each vertex and edge respectively.
Some More Insights
In addition, GraphX optimizes the representation of vertex and edge types: when they are primitive data types, it reduces the in-memory footprint by storing them in specialized arrays.
Like RDDs, property graphs are immutable, distributed, and fault-tolerant. We produce a new graph with the desired changes by changing the values or structure of the original graph, and substantial parts of the original graph are reused in the new one, which reduces the cost of this inherently functional data structure. The graph is partitioned across the executors using a range of vertex partitioning heuristics, and in the event of a failure, each partition of the graph can be recreated on a different machine.
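As an illustration, here is a minimal sketch that builds a small property graph of users and the relationships between them, assuming the imports above and a SparkContext named sc; the names and data are made up for the example:
// Create an RDD for the vertices: (VertexId, (name, occupation))
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Seq((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                     (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
// Create an RDD for the edges: source vertex, destination vertex, relationship
val relationships: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
                     Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
// Default vertex property used for edges that reference missing vertices
val defaultUser = ("John Doe", "Missing")
// Build the property graph
val graph: Graph[(String, String), String] = Graph(users, relationships, defaultUser)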


5. Spark Graph Operators

Just as RDDs have basic operators like map and filter, property graphs also have a collection of basic operators. These operators take user-defined functions (UDFs) and produce new graphs with transformed properties and structure. The core operators, which have optimized implementations, are defined in Graph, while convenient operators expressed as compositions of the core operators are defined in GraphOps. Thanks to Scala implicits, the operators in GraphOps are automatically available as members of Graph.
Example of Spark Graph Operators
To compute in-degree of each vertex (defined in GraphOps):
// The property graph constructed earlier
val graph1: Graph[(String, String), String] = graph
// Use the implicit GraphOps.inDegrees operator
val inDegrees1: VertexRDD[Int] = graph1.inDegrees
The only difference between the core graph operations and GraphOps is the graph representation: each graph representation must provide an implementation of the core operations, while the useful GraphOps operations can be reused across representations.
There is a set of fundamental operators, for example subgraph, joinVertices, and aggregateMessages. Let’s discuss each in detail:


a. Structural Operators

GraphX supports a simple set of commonly used structural operators. The following is a list of the basic structural operators in Spark.

class Graph[VD, ED] {
  def reverse: Graph[VD, ED]
  def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
               vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
  def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
  def groupEdges(merge: (ED, ED) => ED): Graph[VD, ED]
}

i. The reverse operator

It returns a new graph with all the edge directions reversed. This can be very useful when, for example, trying to compute the inverse PageRank.

ii. The subgraph operator

It takes vertex and edge predicates and returns the graph containing only the vertices and edges that satisfy them. We can use the subgraph operator to restrict the graph to the vertices and edges of interest, or to eliminate broken links, as in the sketch below.
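For example, a sketch that removes the placeholder vertices (and any edges touching them) from the graph built earlier:
// Keep only vertices whose occupation is not the "Missing" placeholder;
// edges incident to removed vertices are dropped automatically
val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")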


iii. The mask operator

It constructs a subgraph by returning a graph that contains the vertices and edges that are also found in the input graph. We can use it in conjunction with the subgraph operator to restrict a graph based on the properties in another related graph.
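Building on the subgraph sketch above, for instance:
// Run connected components on the full graph
val ccGraph = graph.connectedComponents()
// Restrict the result to the valid subgraph computed earlier
val validCCGraph = ccGraph.mask(validGraph)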

iv. The groupEdges operator

It merges parallel edges in the multigraph using a user-supplied merge function, as sketched below.
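A small sketch on the running example graph; note that groupEdges assumes parallel edges have been co-located, so the graph is repartitioned with partitionBy first:
import org.apache.spark.graphx.PartitionStrategy
// Repartition so that parallel edges land in the same partition, then merge them
val mergedGraph = graph
  .partitionBy(PartitionStrategy.EdgePartition2D)
  .groupEdges((a, b) => a + ", " + b) // merge the string attributes of parallel edges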

b. Join Operators

Sometimes we need to join data from external collections (RDDs) with graphs. For example, we might have extra user properties that we want to merge with an existing graph, or we might want to pull vertex properties from one graph into another. We can accomplish this with the join operators. There are two join operators: joinVertices and outerJoinVertices.


class Graph[VD, ED] {
  def joinVertices[U](table: RDD[(VertexId, U)])(map: (VertexId, VD, U) => VD)
    : Graph[VD, ED]
  def outerJoinVertices[U, VD2](table: RDD[(VertexId, U)])(map: (VertexId, VD, Option[U]) => VD2)
    : Graph[VD2, ED]
}
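For instance, here is a sketch that uses outerJoinVertices to replace each user’s properties with its out-degree; vertices with no out-edges receive None, so a default of 0 is used:
val outDegrees: VertexRDD[Int] = graph.outDegrees
val degreeGraph: Graph[Int, String] = graph.outerJoinVertices(outDegrees) {
  (id, oldAttr, outDegOpt) => outDegOpt.getOrElse(0) // vertices with no out-edges get 0
}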

c. Aggregate Messages (aggregateMessages)

In GraphX, the core aggregation operation is aggregateMessages. It applies a user-defined sendMsg function to each edge triplet in the graph. Afterwards, it uses the mergeMsg function to aggregate those messages at their destination vertex.

class Graph[VD, ED] {
def aggregateMessages[Msg: ClassTag](
    sendMsg: EdgeContext[VD, ED, Msg] => Unit,
    mergeMsg: (Msg, Msg) => Msg,
    tripletFields: TripletFields = TripletFields.All)
  : VertexRDD[Msg]
}
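As a sketch, the in-degree computation shown earlier can be expressed directly with aggregateMessages:
// Send the message 1 along every edge and sum the messages at each destination vertex
val inDegreeCounts: VertexRDD[Int] = graph.aggregateMessages[Int](
  ctx => ctx.sendToDst(1),   // sendMsg: one message per incoming edge
  (a, b) => a + b,           // mergeMsg: add up the counts
  TripletFields.None         // no vertex or edge attributes are needed
)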

6. Pregel API

Graphs are inherently recursive data structures, since the properties of vertices depend on the properties of their neighbors, which in turn depend on the properties of their neighbors. As a matter of fact, many important graph algorithms iteratively recompute the properties of each vertex until a fixed-point condition is reached. To express these iterative algorithms, GraphX exposes a variant of the Pregel API, a graph-parallel abstraction.
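As an example, the classic single-source shortest paths algorithm can be expressed with the Pregel operator. This is a sketch, assuming a SparkContext named sc; the random graph from GraphX’s GraphGenerators utility is just a stand-in input:
import org.apache.spark.graphx.util.GraphGenerators
// A random graph whose edge attributes hold distances
val distGraph: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
val sourceId: VertexId = 42 // the source vertex
// Initialize all vertices except the source to distance infinity
val initialGraph = distGraph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)
val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist), // vertex program: keep the shorter distance
  triplet => { // send message: propose a shorter path to the destination if one exists
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  (a, b) => math.min(a, b) // merge message: take the minimum of competing distances
)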

7. Spark GraphX Features

The features of Spark GraphX are as follows:

a. Flexibility

With Spark GraphX we can work with both graphs and collections, combining exploratory analysis, ETL (Extract, Transform & Load), and iterative graph computation in a single system. It is possible to view the same data as both graphs and collections, and to transform and join graphs with RDDs. Using the Pregel API, it is also possible to write custom iterative graph algorithms.

b. Speed

It offers performance comparable to the fastest specialized graph processing systems, while retaining Spark’s flexibility, fault tolerance, and ease of use.

c. Growing Algorithm Library

Spark GraphX offers a growing library of graph algorithms to choose from, for example PageRank, connected components, SVD++, strongly connected components, and triangle count, as sketched below.
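For example, a short sketch running two of these library algorithms on the graph built earlier:
// PageRank, iterated until convergence to the given tolerance
val ranks = graph.pageRank(0.0001).vertices
// Number of triangles passing through each vertex
val triangleCounts = graph.triangleCount().vertices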

8. Conclusion – GraphX API in Spark

As a result, we have learned the whole concept of the GraphX API. Moreover, we have covered a brief introduction to its operators and the Pregel API, and we have seen how GraphX simplifies graph analytics tasks in Spark. Hence, the GraphX API is very useful for graphs and graph-parallel computation in Spark.
