GraphX API in Apache Spark: An Introductory Guide

1. Objective

For graphs and graph-parallel computation, Apache Spark has an additional API, GraphX. In this blog, we will learn the whole concept of GraphX API in Spark. We will also learn how to import Spark and GraphX into the project. Moreover, we will understand the concept of Property Graph. Also, we will cover graph operators and Pregel API in detail. In addition, we will also learn features of GraphX. Furthermore, we will also see Use cases of GraphX API.

Spark GraphX API

2. Introduction to Spark GraphX API

For graphs and graph-parallel computation, Apache Spark has an additional API, GraphX. To simplify graph analytics tasks it includes the collection of graph algorithms & builders.

In addition, it extends the Spark RDD with a Resilient Distributed Property Graph. Basically, the property graph is a directed multigraph. It has multiple edges in parallel. Here, every vertex and edge have user-defined properties associated with it. Moreover, parallel edges allow multiple relationships between the same vertices.

3. Getting Started with GraphX

You first need to import Spark and GraphX into your project, To get started, code as Follows:

import org.apache.spark._

import org.apache.spark.graphx._

// To make some of the examples work we will also need RDD

import org.apache.spark.rdd.RDD

Note: you will also need a SparkContext if you are not using the Spark shell.

4. The Property Graph

A directed multigraph with user-defined objects attached to each vertex and edge is a property graph. It is a graph with potentially multiple parallel edges. They are also sharing the same source and destination vertex. Where there can be multiple relationships between the same vertices, they are able to support parallel edges. It also simplifies modeling scenarios. By a unique 64-bit long identifier (VertexId) each vertex is keyed. It does not impose any ordering constraints on the vertex identifiers. Hence, edges have corresponding source and destination vertex identifiers.

Basically, it is parameterized over the vertex (VD) and edge (ED) types. Ultimately, these are the types of the objects those are associated with each vertex and edge respectively.

Some more Insights

In addition, it optimizes the representation of vertex and edge types. While they are primitive data types it reduces the in memory footprint. Even by storing them in specialized arrays.

As same as RDDs, property graphs are also immutable, distributed, and fault-tolerant. we can produce a new graph with the desired changes by changing the values or structure of the graph.  We can reuse substantial parts of the original graph in the new graph. It also reduces the cost of this inherently functional data structure. Using a range of vertex partitioning heuristics, a graph is partitioned across the executors. In the event of a failure, each partition of the graph can be recreated on a different machine.

5. Spark Graph Operators

As same as RDDs basic operations like map, filter, property graphs also have a collection of basic operators.  Those operators take UDFs (user defined functions) and produce new graphs.  Moreover, these are produced with transformed properties and structure. In GraphX, there are some core operators defined that have optimized implementations. While in GraphOps there are some convenient operators that are expressed as compositions of the core operators. Although, Scala operators in GraphOps are automatically available as members of Graph.

For Example

To compute in-degree of each vertex (defined in GraphOps):

val graph1: Graph[(String, String), String]

// Use the implicit GraphOps.inDegrees operator

val inDegrees1: VertexRDD[Int] = graph.inDegrees1

The only difference between core graph operations and GraphOps is different graph representations. In addition, it is must that each graphical representation provides an implementation of the core operations. We can reuse many of the useful GraphOps operations.

There is a set of fundamental operators. For example. Subgraph, joinVertices, and aggregateMessages. Let’s discuss all in detail:

a. Structural Operators

It supports a simple set of commonly used structural operators. The following is a list of the basic structural operators in Spark.

class Graph[VD, ED] {

def reverse: Graph[VD, ED]

def subgraph(epred: EdgeTriplet[VD,ED] => Boolean,

             vpred: (VertexId, VD) => Boolean): Graph[VD, ED]

def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]

def groupEdges(merge: (ED, ED) => ED): Graph[VD,ED]


i. The reverse operator

With all the edge directions reversed, it returns a new graph. when trying to compute the inverse PageRank, this can be very useful.

ii. The subgraph operator

It returns the graph containing only the vertices, by taking vertex and edge predicates. We can use subgraph operator to restrict the graph to the vertices and edges of interest. Also to eliminate broken links.

iii. The mask operator

It constructs a subgraph by returning a graph that contains the vertices and edges that are also found in the input graph. We can use it in conjunction with the subgraph operator to restrict a graph based on the properties in another related graph.

iv. The groupEdges operator

It merges parallel edges in the multigraph.

b. Join Operators

Sometimes it is the major requirement to join data from external collections (RDDs) with graphs. Like it is possible that we have extra user properties that we want to merge with an existing graph. Also, we might want to pull vertex properties from one graph into another. Hence, we can accomplish it by using the join operators. In addition, there are two Join operators, joinvertices, and Outerjoinvertices.

class Graph[VD, ED] {

def joinVertices[U](table: RDD[(VertexId, U)])(map: (VertexId, VD, U) => VD)

  : Graph[VD, ED]

def outerJoinVertices[U, VD2](table: RDD[(VertexId, U)])(map: (VertexId, VD, Option[U]) => VD2)

  : Graph[VD2, ED]


c. Aggregate Messages (aggregateMessages)

In GraphX, the core aggregation operation is aggregateMessages. It applies a user-defined sendMsg function to each edge triplet in the graph. Afterwards, it uses the mergeMsg function to aggregate those messages at their destination vertex.

class Graph[VD, ED] {

def aggregateMessages[Msg: ClassTag](

    sendMsg: EdgeContext[VD, ED, Msg] => Unit,

    mergeMsg: (Msg, Msg) => Msg,

    tripletFields: TripletFields = TripletFields.All)

  : VertexRDD[Msg]


6. Pregel API

Since properties of vertices depend on properties of their neighbors, Graphs are inherently recursive data structures. That, in turn, depend on properties of their neighbors. As a matter of fact, many important graph algorithms iteratively recompute the properties of each vertex. It recomputes until it reaches a fixed-point condition. Basically, To express these iterative algorithms it proposes a graph-parallel abstraction. Hence, there is a variant of the Pregel API, exposed by GraphX.

7. Spark GraphX Features

The features of Spark GraphX are as follows:

a. Flexibility

We can work with both graphs and computations with Spark GraphX. It includes exploratory analysis, ETL (Extract, Transform & Load), the iterative graph in 1 system. It is possible to view the same data as both graphs, collections, transform and join graphs with RDDs. Also using the Pregel API it is possible to write custom iterative graph algorithms.

b. Speed

It offers comparable performance to the fastest specialized graph processing systems. Also comparable with the fastest graph systems. Even while retaining Spark’s flexibility, fault tolerance and ease of use.

c. Growing Algorithm Library

Spark GraphX offers growing library of graph algorithms. Basically, We can choose from it. For example pagerank, connected components, SVD++, strongly connected components and triangle count.

8. Conclusion

As a result, we have learned the whole concept of GraphX API. Moreover, we have covered the brief introduction of Operators and Variant, Pregel API. Also, we have seen how GraphX API in Apache Spark: An Introductory Guide simplifies graph analytics tasks in Spark. Hence, GraphX API is very useful for graphs and graph-parallel computation in Spark.