{"id":1830,"date":"2017-03-30T12:20:58","date_gmt":"2017-03-30T12:20:58","guid":{"rendered":"http:\/\/data-flair.training\/blogs\/?p=1830"},"modified":"2019-12-11T16:38:36","modified_gmt":"2019-12-11T11:08:36","slug":"create-rdds-in-apache-spark","status":"publish","type":"post","link":"https:\/\/data-flair.training\/blogs\/create-rdds-in-apache-spark\/","title":{"rendered":"How to Create RDDs in Apache Spark?"},"content":{"rendered":"<h2>1. Objective<\/h2>\n<p>In this <a href=\"http:\/\/data-flair.training\/blogs\/apache-spark-introduction-comprehensive-tutorial-guide-beginners\/\"><strong>Spark tutorial<\/strong><\/a>, we look at the different ways to create RDDs in Apache Spark. We briefly review what a Spark RDD is and then cover the three ways of creating one: from a parallelized collection, from an external dataset, and from an existing RDD. Each approach is illustrated with an example.<\/p>\n<p><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/03\/ways-to-create-RDDs-in-spark-2.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-73667 size-full\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/03\/ways-to-create-RDDs-in-spark-2.jpg\" alt=\"how to create RDDs in spark \" width=\"802\" height=\"420\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/03\/ways-to-create-RDDs-in-spark-2.jpg 802w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/03\/ways-to-create-RDDs-in-spark-2-150x79.jpg 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/03\/ways-to-create-RDDs-in-spark-2-300x157.jpg 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/03\/ways-to-create-RDDs-in-spark-2-768x402.jpg 768w, 
https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/03\/ways-to-create-RDDs-in-spark-2-520x272.jpg 520w\" sizes=\"auto, (max-width: 802px) 100vw, 802px\" \/><\/a><\/p>\n<h2>2. How to Create RDDs in Apache Spark?<\/h2>\n<p>The <a href=\"http:\/\/data-flair.training\/blogs\/resilient-distributed-datasets-rdd-apache-spark\/\"><strong>Resilient Distributed Dataset (RDD)<\/strong><\/a> is the fundamental data structure of Spark. An RDD is an immutable, fault-tolerant, distributed collection of objects. Each dataset is divided into logical partitions, which can be computed on different nodes of the cluster. An RDD is thus a way of representing a dataset that is distributed across multiple machines and can be operated on in parallel. RDDs are called resilient because a lost partition can always be recomputed.<br \/>\nNow that we have reviewed what an RDD is, let us see how to create Spark RDDs.<br \/>\nThere are three ways to create an RDD in Spark:<\/p>\n<ul>\n<li>Parallelizing an existing collection in the driver program.<\/li>\n<li>Referencing a dataset in an external storage system (e.g. HDFS, HBase, a shared file system).<\/li>\n<li>Creating an RDD from an existing RDD.<\/li>\n<\/ul>\n<p><strong>Learn: <a href=\"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-persistence-caching\/\">RDD Persistence and Caching Mechanism in Apache Spark<\/a><\/strong><br \/>\nLet us look at each of these in detail below:<\/p>\n<h3>i. Parallelized collection (parallelizing)<\/h3>\n<p>When we first learn Spark, <strong><a href=\"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-tutorial\/\">RDDs<\/a><\/strong> are generally created from a parallelized collection, i.e. 
by taking an existing collection in the program and passing it to <a href=\"http:\/\/data-flair.training\/blogs\/learn-apache-spark-sparkcontext\/\"><strong>SparkContext\u2019s<\/strong><\/a> parallelize() method. This approach is common when learning Spark because it lets us quickly create our own RDDs in the Spark shell and run operations on them. It is rarely used outside testing and prototyping, because it requires the entire dataset to be on one machine.<br \/>\nConsider the following example of sortByKey(), in which the data to be sorted comes from a parallelized collection:<br \/>\n[php]val data = spark.sparkContext.parallelize(Seq((&quot;maths&quot;,52),(&quot;english&quot;,75),(&quot;science&quot;,82),(&quot;computer&quot;,65),(&quot;maths&quot;,85)))<br \/>\nval sorted = data.sortByKey()<br \/>\nsorted.foreach(println)[\/php]<br \/>\nAn important parameter for a parallelized collection is the number of partitions the dataset is cut into. Spark runs one task for each partition. Typically you want two to four partitions for each CPU in the <strong><a href=\"https:\/\/data-flair.training\/blogs\/apache-spark-cluster-managers-tutorial\/\">cluster<\/a><\/strong>. Spark normally sets the number of partitions automatically based on the cluster, but we can also set it manually by passing it as the second parameter to parallelize().<br \/>\nFor example, sc.parallelize(data, 10) sets the number of partitions to 10.<br \/>\nConsider one more example, in which we parallelize a collection into three partitions and then reduce them to two with coalesce():<br \/>\n[php]val rdd1 = spark.sparkContext.parallelize(Array(&quot;jan&quot;,&quot;feb&quot;,&quot;mar&quot;,&quot;april&quot;,&quot;may&quot;,&quot;jun&quot;),3)<br \/>\nval result = rdd1.coalesce(2)<br \/>\nresult.foreach(println)[\/php]<\/p>\n<h3>ii. 
External Datasets (Referencing a dataset)<\/h3>\n<p>In Spark, a distributed dataset can be created from any data source supported by Hadoop, including the local file system, <strong><a href=\"http:\/\/data-flair.training\/blogs\/comprehensive-hdfs-guide-introduction-architecture-data-read-write-tutorial\/\">HDFS<\/a><\/strong>, Cassandra, <strong><a href=\"http:\/\/data-flair.training\/blogs\/hbase-tutorial-beginners-guide\/\">HBase<\/a><\/strong> etc. Here the data is loaded from the external source. To create an RDD from a text file, we can use SparkContext\u2019s textFile method. It takes the URI of the file and reads it as a collection of lines. The URI can be a local path on the machine or an hdfs:\/\/, s3n:\/\/, etc. URI.<br \/>\nNote that when a path on the local file system is used, the file must also be accessible at the same path on the worker nodes. We can either copy the file to all worker nodes or use a network-mounted shared file system.<br \/>\n<strong>Learn:<\/strong><a href=\"https:\/\/data-flair.training\/blogs\/scala-spark-shell-commands\/\"><strong> Spark Shell Commands to Interact with Spark-<\/strong>Scala<\/a><br \/>\nThe DataFrameReader interface is used to load a Dataset from external storage systems (e.g. file systems, key-value stores, etc.). 
Use SparkSession.read to access an instance of DataFrameReader. DataFrameReader supports many file formats:<\/p>\n<h4>a. csv (String path)<\/h4>\n<p>It loads a CSV file and returns the result as a Dataset&lt;Row&gt;.<br \/>\n<strong>Example:<\/strong><br \/>\n[php]import org.apache.spark.sql.SparkSession<br \/>\nobject DataFormat {<br \/>\n  def main(args: Array[String]): Unit = {<br \/>\n    val spark = SparkSession.builder.appName(&quot;AvgAnsTime&quot;).master(&quot;local&quot;).getOrCreate()<br \/>\n    \/\/ Read the CSV file and convert the resulting Dataset to an RDD<br \/>\n    val dataRDD = spark.read.csv(&quot;path\/of\/csv\/file&quot;).rdd<br \/>\n    spark.stop()<br \/>\n  }<br \/>\n}[\/php]<br \/>\n<strong>Note<\/strong> &#8211; Here the .rdd method converts the Dataset&lt;Row&gt; to an RDD&lt;Row&gt;.<\/p>\n<h4>b. json (String path)<\/h4>\n<p>It loads a JSON file (one object per line) and returns the result as a Dataset&lt;Row&gt;.<br \/>\n[php]val dataRDD = spark.read.json(&quot;path\/of\/json\/file&quot;).rdd[\/php]<\/p>\n<h4>c. textFile (String path)<\/h4>\n<p>It loads text files and returns a Dataset of String.<br \/>\n[php]val dataRDD = spark.read.textFile(&quot;path\/of\/text\/file&quot;).rdd[\/php]<br \/>\n<strong>Learn:<a href=\"https:\/\/data-flair.training\/blogs\/rdd-lineage\/\"> RDD lineage in Spark: ToDebugString Method<\/a><\/strong><\/p>\n<h3>iii. Creating RDD from existing RDD<\/h3>\n<p>A transformation derives a new RDD from an existing one, so transformations are the way to create an RDD from an already existing RDD. This is one of the differences between <a href=\"http:\/\/data-flair.training\/blogs\/comparison-between-apache-spark-vs-hadoop-mapreduce\/\"><strong>Apache Spark and Hadoop MapReduce<\/strong><\/a>. A transformation acts as a function that takes an RDD as input and produces one or more new RDDs. The input RDD is not changed, because RDDs are immutable. 
Some of the transformations that can be applied to an RDD are filter, distinct,<strong> <a href=\"http:\/\/data-flair.training\/blogs\/comparison-between-map-vs-flatmap-operation-spark\/\">map, flatMap<\/a><\/strong> etc.<br \/>\n<strong>Example:<\/strong><br \/>\n[php]val words = spark.sparkContext.parallelize(Seq(&quot;the&quot;, &quot;quick&quot;, &quot;brown&quot;, &quot;fox&quot;, &quot;jumps&quot;, &quot;over&quot;, &quot;the&quot;, &quot;lazy&quot;, &quot;dog&quot;))<br \/>\nval wordPair = words.map(w =&gt; (w.charAt(0), w))<br \/>\nwordPair.foreach(println)[\/php]<br \/>\n<strong>Note<\/strong> &#8211; In the above code, the RDD &#8220;wordPair&#8221; is created from the existing RDD &#8220;words&#8221; using the map() transformation; each element pairs a word with its starting character.<br \/>\n<strong>Read: <a href=\"https:\/\/data-flair.training\/blogs\/rdd-lineage\/\">Limitations of Spark RDD<\/a><\/strong><\/p>\n<h2>3. Conclusion<\/h2>\n<p>Now that we have seen how to create RDDs in Apache Spark, let us learn <strong><a href=\"http:\/\/data-flair.training\/blogs\/rdd-transformations-actions-apis-apache-spark\/\">RDD transformations and actions in Apache Spark<\/a><\/strong> with the help of examples. 
If you liked this blog or have any questions about creating RDDs in Apache Spark, let us know by leaving a comment in the comment box.<br \/>\n<strong>Reference:<\/strong><br \/>\n<strong><a href=\"http:\/\/spark.apache.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/spark.apache.org\/<\/a><\/strong><br \/>\n<strong>See Also-<\/strong><\/p>\n<ul>\n<li><strong><a href=\"http:\/\/data-flair.training\/blogs\/6-important-reasons-to-learn-apache-spark\/\">6 important reasons to learn Apache Spark<\/a><\/strong><\/li>\n<li><strong><a href=\"http:\/\/data-flair.training\/blogs\/apache-spark-hadoop-compatibility\/\">Apache Spark Compatibility with Hadoop<\/a><\/strong><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>1. Objective In this Spark tutorial, we are going to understand different ways of how to create RDDs in Apache Spark. 
We will understand Spark RDDs and 3 ways of creating RDDs in Spark&#46;&#46;&#46;<\/p>\n","protected":false},"author":6,"featured_media":73667,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[10],"tags":[946,11340,11351,11352,11575,13106,13142],"class_list":["post-1830","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-spark","tag-apache-spark-rdds","tag-rdd","tag-rdds-in-apache-spark","tag-rdds-in-spark","tag-resilient-distributed-dataset-in-apache-spark","tag-spark-rdd-creation","tag-spark-tutorial"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v28.0 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>How to Create RDDs in Apache Spark? - DataFlair<\/title>\n<meta name=\"description\" content=\"Learn what is RDD in Spark,how to create RDDs in Apache Spark-using parallelized collection,from existing RDDs &amp; from external datasets,3 ways to create RDD\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/data-flair.training\/blogs\/create-rdds-in-apache-spark\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"How to Create RDDs in Apache Spark? 
- DataFlair\" \/>\n<meta property=\"og:description\" content=\"Learn what is RDD in Spark,how to create RDDs in Apache Spark-using parallelized collection,from existing RDDs &amp; from external datasets,3 ways to create RDD\" \/>\n<meta property=\"og:url\" content=\"https:\/\/data-flair.training\/blogs\/create-rdds-in-apache-spark\/\" \/>\n<meta property=\"og:site_name\" content=\"DataFlair\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DataFlairWS\/\" \/>\n<meta property=\"article:published_time\" content=\"2017-03-30T12:20:58+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2019-12-11T11:08:36+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/03\/ways-to-create-RDDs-in-spark-2.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"802\" \/>\n\t<meta property=\"og:image:height\" content=\"420\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"DataFlair Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:site\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"DataFlair Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"How to Create RDDs in Apache Spark? 
- DataFlair","description":"Learn what is RDD in Spark,how to create RDDs in Apache Spark-using parallelized collection,from existing RDDs & from external datasets,3 ways to create RDD","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/data-flair.training\/blogs\/create-rdds-in-apache-spark\/","og_locale":"en_US","og_type":"article","og_title":"How to Create RDDs in Apache Spark? - DataFlair","og_description":"Learn what is RDD in Spark,how to create RDDs in Apache Spark-using parallelized collection,from existing RDDs & from external datasets,3 ways to create RDD","og_url":"https:\/\/data-flair.training\/blogs\/create-rdds-in-apache-spark\/","og_site_name":"DataFlair","article_publisher":"https:\/\/www.facebook.com\/DataFlairWS\/","article_published_time":"2017-03-30T12:20:58+00:00","article_modified_time":"2019-12-11T11:08:36+00:00","og_image":[{"width":802,"height":420,"url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/03\/ways-to-create-RDDs-in-spark-2.jpg","type":"image\/jpeg"}],"author":"DataFlair Team","twitter_card":"summary_large_image","twitter_creator":"@DataFlairWS","twitter_site":"@DataFlairWS","twitter_misc":{"Written by":"DataFlair Team","Est. 
reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/data-flair.training\/blogs\/create-rdds-in-apache-spark\/#article","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/create-rdds-in-apache-spark\/"},"author":{"name":"DataFlair Team","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/2c58ecb4f73a39f0ef993f1ddfcd7b89"},"headline":"How to Create RDDs in Apache Spark?","datePublished":"2017-03-30T12:20:58+00:00","dateModified":"2019-12-11T11:08:36+00:00","mainEntityOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/create-rdds-in-apache-spark\/"},"wordCount":1014,"commentCount":5,"publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/create-rdds-in-apache-spark\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/03\/ways-to-create-RDDs-in-spark-2.jpg","keywords":["Apache Spark RDDs","RDD","RDDs in Apache Spark","RDDs in spark","resilient distributed dataset in Apache Spark","Spark RDD creation","spark tutorial"],"articleSection":["Apache Spark Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/data-flair.training\/blogs\/create-rdds-in-apache-spark\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/data-flair.training\/blogs\/create-rdds-in-apache-spark\/","url":"https:\/\/data-flair.training\/blogs\/create-rdds-in-apache-spark\/","name":"How to Create RDDs in Apache Spark? 
- DataFlair","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/#website"},"primaryImageOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/create-rdds-in-apache-spark\/#primaryimage"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/create-rdds-in-apache-spark\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/03\/ways-to-create-RDDs-in-spark-2.jpg","datePublished":"2017-03-30T12:20:58+00:00","dateModified":"2019-12-11T11:08:36+00:00","description":"Learn what is RDD in Spark,how to create RDDs in Apache Spark-using parallelized collection,from existing RDDs & from external datasets,3 ways to create RDD","breadcrumb":{"@id":"https:\/\/data-flair.training\/blogs\/create-rdds-in-apache-spark\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/data-flair.training\/blogs\/create-rdds-in-apache-spark\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/create-rdds-in-apache-spark\/#primaryimage","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/03\/ways-to-create-RDDs-in-spark-2.jpg","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/03\/ways-to-create-RDDs-in-spark-2.jpg","width":802,"height":420,"caption":"how to create RDDs in spark"},{"@type":"BreadcrumbList","@id":"https:\/\/data-flair.training\/blogs\/create-rdds-in-apache-spark\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Blog Home","item":"https:\/\/data-flair.training\/blogs\/"},{"@type":"ListItem","position":2,"name":"Apache Spark Tutorials","item":"https:\/\/data-flair.training\/blogs\/category\/spark\/"},{"@type":"ListItem","position":3,"name":"How to Create RDDs in Apache Spark?"}]},{"@type":"WebSite","@id":"https:\/\/data-flair.training\/blogs\/#website","url":"https:\/\/data-flair.training\/blogs\/","name":"DataFlair","description":"Learn Today. 
Lead Tomorrow.","publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/data-flair.training\/blogs\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/data-flair.training\/blogs\/#organization","name":"DataFlair","url":"https:\/\/data-flair.training\/blogs\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","width":106,"height":48,"caption":"DataFlair"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DataFlairWS\/","https:\/\/x.com\/DataFlairWS","https:\/\/www.linkedin.com\/company\/dataflair-web-services-pvt-ltd\/","https:\/\/www.youtube.com\/user\/DataFlairWS"]},{"@type":"Person","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/2c58ecb4f73a39f0ef993f1ddfcd7b89","name":"DataFlair Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","caption":"DataFlair Team"},"description":"The DataFlair Team provides industry-driven content on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. 
Our expert educators focus on delivering value-packed, easy-to-follow resources for tech enthusiasts and professionals.","url":"https:\/\/data-flair.training\/blogs\/author\/dfteam2\/"}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/1830","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/comments?post=1830"}],"version-history":[{"count":7,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/1830\/revisions"}],"predecessor-version":[{"id":73669,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/1830\/revisions\/73669"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media\/73667"}],"wp:attachment":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media?parent=1830"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/categories?post=1830"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/tags?post=1830"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}