Apache Spark Map vs FlatMap Operation


1. Objective

In this Apache Spark tutorial, we will compare the Spark map and flatMap operations. Map and flatMap are transformation operations in Spark. The map() operation applies a function to each element of an RDD and returns the results as a new RDD. In the map operation, the developer can define his own custom business logic. flatMap() is similar to map, but it allows returning 0, 1 or more elements from the map function.

In this blog, we will discuss how to perform the map operation on an RDD and how to process data using the flatMap operation. This tutorial also covers what the map and flatMap operations are and the difference between the map() and flatMap() transformations in Apache Spark, with examples in Scala and Java.

 

Comparison between Apache Spark map() and flatMap() operations, with Scala and Java examples.

Before starting with the map and flatMap functions, follow this guide to install and configure Apache Spark so you can practically implement these Spark transformation operations on RDDs.

2. Difference between Spark Map vs FlatMap Operation

This section of the Spark tutorial provides the details of Map vs FlatMap operation in Apache Spark with examples in Scala and Java programming languages.

2.1. Spark Map Transformation

Apache Spark map transformation operation

Map is a transformation operation in Apache Spark. It applies a function to each element of an RDD and returns the result as a new RDD. In the map operation, the developer can define his own custom business logic. The same logic is applied to all the elements of the RDD.

The Spark map function takes one element as input, processes it according to custom code (specified by the developer) and returns one element at a time. Map transforms an RDD of length N into another RDD of length N: the input and output RDDs always have the same number of records.
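The one-to-one property can be sketched with a plain Scala collection as a local stand-in for an RDD (no Spark cluster needed; Scala's own map has the same semantics):

```scala
// Plain Scala List used as a stand-in for an RDD to illustrate map's semantics.
val data = List("spark", "map", "flatmap")

// map applies the function to every element: one output per input.
val upper = data.map(_.toUpperCase)

println(upper)                        // List(SPARK, MAP, FLATMAP)
assert(upper.length == data.length)   // length N in, length N out
```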

2.1.1. Map Transformation Scala Example

a. Create RDD

 val data = spark.read.textFile("INPUT-PATH").rdd 

The above statement will create an RDD named data. Follow this guide to learn more ways to create RDDs in Apache Spark.

b. Map Transformation-1

 val newData = data.map (line => line.toUpperCase() ) 

The above map transformation will convert each record of the RDD to upper case.

c. Map Transformation-2

import scala.xml.XML

val tag = data.map { line =>
  val xml = XML.loadString(line)
  xml.attribute("Tags").get.toString()
}

The above map transformation parses each line as XML and collects the Tags attribute from the XML data. Overall, this map operation converts XML into a structured format. Follow this guide to learn more Spark transformation operations.

2.1.2. Map Transformation Java Example

a. Create RDD

 JavaRDD<String> linesRDD = spark.read().textFile("INPUT-PATH").javaRDD(); 

The above statement will create an RDD named linesRDD.

b. Map Transformation

JavaRDD<String> newData = linesRDD.map(new Function<String, String>() {
  public String call(String s) {
    String result = s.trim().toUpperCase();
    return result;
  }
});

2.2. Spark FlatMap Transformation Operation

Let’s now discuss flatMap() operation in Apache Spark-

Apache Spark flatMap transformation operation

flatMap is a transformation operation. It applies a function to each element of an RDD and returns the results as a new RDD. It is similar to map, but flatMap allows returning 0, 1 or more elements from the map function. In the flatMap operation, a developer can define his own custom business logic. The same logic is applied to all the elements of the RDD.

The flatMap function takes one element as input, processes it according to custom code (specified by the developer) and returns 0 or more elements at a time. flatMap() transforms an RDD of length N into another RDD of length M, where M may be smaller or larger than N.
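The 0-or-more behaviour can be illustrated with a plain Scala collection as a local stand-in for an RDD; an empty line contributes zero words, so the output length differs from the input length:

```scala
// Plain Scala List used as a stand-in for an RDD to illustrate flatMap's semantics.
val lines = List("to be or", "", "not to be")

// Each element expands to zero or more words; the empty line yields nothing.
val words = lines.flatMap(_.split(" ").filter(_.nonEmpty))

println(words)   // List(to, be, or, not, to, be)
// 3 input elements produced 6 output elements (N = 3, M = 6).
```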

2.2.1. FlatMap Transformation Scala Example

 val result = data.flatMap (line => line.split(" ") ) 

The above flatMap transformation will split each line into words. Each word becomes an individual element of the newly created RDD.

2.2.2. FlatMap Transformation Java Example

JavaRDD<String> result = data.flatMap(new FlatMapFunction<String, String>() {
  public Iterator<String> call(String s) {
    return Arrays.asList(s.split(" ")).iterator();
  }
});

The above flatMap transformation will split each line into words. Each word becomes an individual element of the newly created RDD.

3. Conclusion

Hence, from the comparison between Spark map() and flatMap(), it is clear that the map function expresses a one-to-one transformation: it transforms each element of a collection into exactly one element of the resulting collection. The flatMap function expresses a one-to-many transformation: it transforms each element into 0 or more elements.
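Running both transformations on the same input (plain Scala collections as a stand-in for RDDs) makes the contrast concrete: map over split keeps one nested element per line, while flatMap flattens the pieces into individual elements:

```scala
val lines = List("hello world", "spark tutorial")

// map: exactly one output element per input element (a nested list per line).
val mapped = lines.map(_.split(" ").toList)
println(mapped)   // List(List(hello, world), List(spark, tutorial))

// flatMap: each line's words are flattened into a single collection.
val flat = lines.flatMap(_.split(" "))
println(flat)     // List(hello, world, spark, tutorial)
```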

Please leave a comment if you like this post or have any query about Apache Spark map vs flatMap function.
