

{"id":5829,"date":"2018-01-17T09:04:04","date_gmt":"2018-01-17T09:04:04","guid":{"rendered":"https:\/\/data-flair.training\/blogs\/?p=5829"},"modified":"2018-09-17T17:56:15","modified_gmt":"2018-09-17T12:26:15","slug":"spark-mllib-data-types","status":"publish","type":"post","link":"https:\/\/data-flair.training\/blogs\/spark-mllib-data-types\/","title":{"rendered":"Spark MLlib Data Types | Apache Spark Machine Learning"},"content":{"rendered":"<h2>1. Objective &#8211; Spark MLlib Data Types<\/h2>\n<p>Today, in this Spark tutorial, we will learn about all the <strong><a href=\"https:\/\/data-flair.training\/blogs\/apache-spark-tutorial\/\">Apache Spark<\/a><\/strong> MLlib Data Types.\u00a0<strong><a href=\"https:\/\/data-flair.training\/blogs\/machine-learning-tutorial\/\">Machine learning library<\/a><\/strong> supports many Data Types. 
Moreover, in this Spark Machine Learning Data Types, we will discuss local vector, labeled points, local matrix, and distributed matrix.<\/p>\n<p>So, let&#8217;s start Spark MLlib Data Types Tutorial.<\/p>\n<div id=\"attachment_5850\" style=\"width: 1210px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Data-Types-in-Machine-learning-Apache-Spark-01-01-1.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5850\" class=\"wp-image-5850 size-full\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Data-Types-in-Machine-learning-Apache-Spark-01-01-1.jpg\" alt=\"Spark MLlib Data Types\" width=\"1200\" height=\"628\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Data-Types-in-Machine-learning-Apache-Spark-01-01-1.jpg 1200w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Data-Types-in-Machine-learning-Apache-Spark-01-01-1-150x79.jpg 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Data-Types-in-Machine-learning-Apache-Spark-01-01-1-300x157.jpg 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Data-Types-in-Machine-learning-Apache-Spark-01-01-1-768x402.jpg 768w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Data-Types-in-Machine-learning-Apache-Spark-01-01-1-1024x536.jpg 1024w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><\/a><p id=\"caption-attachment-5850\" class=\"wp-caption-text\">Spark MLlib Data Types | Apache Spark Machine Learning<\/p><\/div>\n<p><strong><a href=\"https:\/\/data-flair.training\/blogs\/apache-spark-mcqs-part-3\/\">Test how much you learned in Spark so far.<\/a><\/strong><\/p>\n<h2>2. 
Spark MLlib Data Types &#8211; RDD-based API<\/h2>\n<p><span style=\"font-weight: 400\">The <strong><a href=\"https:\/\/data-flair.training\/blogs\/machine-learning-applications\/\">Machine learning library<\/a><\/strong> (MLlib) supports local vectors and matrices stored on a single machine, as well as distributed matrices backed by one or more <a href=\"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-tutorial\/\"><strong>RDDs<\/strong><\/a>. Local vectors and local matrices are simple data models that serve as public interfaces. <\/span><br \/>\n<span style=\"font-weight: 400\">The underlying linear algebra operations are provided by Breeze. A training example used in supervised learning is called a \u201clabeled point\u201d in MLlib.<\/span><br \/>\n<span style=\"font-weight: 400\">Spark MLlib data types fall into the following categories:<\/span><\/p>\n<ol>\n<li><span style=\"font-weight: 400\"> Local vector<\/span><\/li>\n<li><span style=\"font-weight: 400\"> Labeled point<\/span><\/li>\n<li><span style=\"font-weight: 400\"> Local matrix<\/span><\/li>\n<li><span style=\"font-weight: 400\"> Distributed matrix<\/span><\/li>\n<\/ol>\n<ul>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">RowMatrix<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">IndexedRowMatrix<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">CoordinateMatrix<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">BlockMatrix<\/span><\/li>\n<\/ul>\n<p>So, let&#8217;s discuss these Spark MLlib Data Types in detail:<\/p>\n<h3>a. Local Vector Data Types<\/h3>\n<p><span style=\"font-weight: 400\">A local vector has integer-typed, 0-based indices and double-typed values, and is stored on a single machine. 
<a href=\"https:\/\/data-flair.training\/blogs\/bayesian-network-applications\/\">Spark MLlib<\/a> supports two types of local vectors: dense and sparse.<\/span><\/p>\n<h4>i. Dense vector data types<\/h4>\n<p><span style=\"font-weight: 400\">A dense vector is backed by a double array representing its entry values.<\/span><\/p>\n<h4>ii. Sparse vector data types<\/h4>\n<p><span style=\"font-weight: 400\">A sparse vector is backed by two parallel arrays: indices and values. <\/span><br \/>\nFor example,<br \/>\n<span style=\"font-weight: 400\">we can represent the vector (1.0, 0.0, 3.0) in dense format as [1.0, 0.0, 3.0], or in sparse format as (3, [0, 2], [1.0, 3.0]), where 3 is the size of the vector.<\/span><br \/>\n<span style=\"font-weight: 400\">Vector is the base class of local vectors, with two implementations: DenseVector and SparseVector. To create local vectors, we recommend using the factory methods implemented in Vectors.<\/span><\/p>\n<p><strong><a href=\"https:\/\/data-flair.training\/blogs\/spark-interview-questions\/\">Prepare yourself for Spark Interviews<\/a><\/strong><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">import org.apache.spark.mllib.linalg.{Vector, Vectors}\r\n\/\/ Create a dense vector (1.0, 0.0, 3.0).\r\nval dv: Vector = Vectors.dense(1.0, 0.0, 3.0)\r\n\/\/ Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.\r\nval sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))\r\n\/\/ Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries.\r\nval sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))<\/pre>\n<p><b>Note:<\/b><br \/>\n<span style=\"font-weight: 400\">By default, Scala imports <\/span><b>scala.collection.immutable.Vector<\/b><span style=\"font-weight: 400\">. 
Hence, you must import <\/span><b>org.apache.spark.mllib.linalg.Vector<\/b><span style=\"font-weight: 400\"> explicitly to use MLlib\u2019s Vector.<\/span><\/p>\n<h3>b. Labeled Point\u00a0Data Types in Spark<\/h3>\n<p><span style=\"font-weight: 400\">A labeled point is a local vector, either dense or sparse, associated with a label\/response. In <a href=\"https:\/\/data-flair.training\/blogs\/artificial-neural-network-applications\/\">MLlib<\/a>, supervised learning algorithms use labeled points. The label is stored as a double, so labeled points can be used in both regression and classification. For binary classification, a label should be either 0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero: 0, 1, 2,&#8230;<\/span><br \/>\n<span style=\"font-weight: 400\">A labeled point is represented by the case class LabeledPoint.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">import org.apache.spark.mllib.linalg.Vectors\r\nimport org.apache.spark.mllib.regression.LabeledPoint\r\n\/\/ Create a labeled point with a positive label and a dense feature vector.\r\nval pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))\r\n\/\/ Create a labeled point with a negative label and a sparse feature vector.\r\nval neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))<\/pre>\n<h3>c. Local Matrix Data Types<\/h3>\n<p><span style=\"font-weight: 400\">A local matrix has integer-typed row and column indices and double-typed values, and is stored on a single machine. MLlib supports dense matrices, whose entry values are stored in a single double array in column-major order, as well as sparse matrices. 
Sparse matrices store their non-zero entry values in the compressed sparse column (CSC) format, in column-major order. For example, the dense matrix shown below is stored in the one-dimensional array [1.0, 3.0, 5.0, 2.0, 4.0, 6.0] with the matrix size (3, 2).<\/span><\/p>\n<div id=\"attachment_5834\" style=\"width: 124px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Capture7.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5834\" class=\"wp-image-5834 size-full\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Capture7.png\" alt=\"Spark MLlib Data Types\" width=\"114\" height=\"85\" \/><\/a><p id=\"caption-attachment-5834\" class=\"wp-caption-text\">Local Matrix Spark MLlib Data Types<\/p><\/div>\n<p><span style=\"font-weight: 400\">Matrix is the base class of local matrices, with two implementations: DenseMatrix and SparseMatrix. To create local matrices, we recommend using the factory methods implemented in Matrices. <\/span><br \/>\n<strong>Note:<\/strong><br \/>\n<span style=\"font-weight: 400\">In MLlib, local matrices are stored in column-major order.<\/span><\/p>\n<p><strong><a href=\"https:\/\/data-flair.training\/blogs\/apache-spark-machine-learning-algorithm\/\">Let&#8217;s revise the Spark Machine learning algorithm<\/a><\/strong><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">import org.apache.spark.mllib.linalg.{Matrix, Matrices}\r\n\/\/ Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))\r\nval dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))\r\n\/\/ Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))\r\nval sm: Matrix = Matrices.sparse(3, 2, Array(0, 1, 3), Array(0, 2, 1), Array(9, 6, 8))<\/pre>\n<h3>d. 
Distributed Matrix Data Types<\/h3>\n<p><span style=\"font-weight: 400\">A distributed matrix has long-typed row and column indices and double-typed values, stored distributively in one or more <a href=\"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-tutorial\/\"><strong>RDDs<\/strong><\/a>. It is very important to choose the right format to store large and distributed matrices. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive. There are four types of distributed matrices: RowMatrix, IndexedRowMatrix, CoordinateMatrix, and BlockMatrix.<\/span><br \/>\n<b>Note: <\/b><br \/>\n<span style=\"font-weight: 400\">The underlying RDDs of a distributed matrix must be deterministic because we cache the matrix size. In general, the use of non-deterministic RDDs can lead to errors.<\/span><\/p>\n<h4>i. RowMatrix in Spark MLlib<\/h4>\n<p><span style=\"font-weight: 400\">A RowMatrix is a row-oriented distributed matrix without meaningful row indices. It is backed by an RDD of its rows, where each row is a local vector. Because each row is represented by a local vector, the number of columns is limited by the integer range, although it should be much smaller in practice.<\/span><br \/>\n<span style=\"font-weight: 400\">We can create a RowMatrix from an RDD[Vector] instance, and then compute its column summary statistics and decompositions. QR decomposition is of the form A = QR, where Q is an orthogonal matrix and R is an upper triangular matrix.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">import org.apache.spark.mllib.linalg.Vector\r\nimport org.apache.spark.mllib.linalg.distributed.RowMatrix\r\nimport org.apache.spark.rdd.RDD\r\nval rows: RDD[Vector] = ... 
\/\/ an RDD of local vectors\r\n\/\/ Create a RowMatrix from an RDD[Vector].\r\nval mat: RowMatrix = new RowMatrix(rows)\r\n\/\/ Get its size.\r\nval m = mat.numRows()\r\nval n = mat.numCols()\r\n\/\/ QR decomposition; passing true also computes Q.\r\nval qrResult = mat.tallSkinnyQR(true)<\/pre>\n<p><strong><a href=\"https:\/\/data-flair.training\/blogs\/spark-machine-learning-with-r\/\">Have a look at Spark Machine Learning with R<\/a><\/strong><\/p>\n<h4>ii. IndexedRowMatrix<\/h4>\n<p><span style=\"font-weight: 400\">An IndexedRowMatrix is similar to a RowMatrix, but with meaningful row indices. It is backed by an RDD of indexed rows, so each row is represented by its index (long-typed) and a local vector.<\/span><br \/>\n<span style=\"font-weight: 400\">We can create an IndexedRowMatrix from an RDD[IndexedRow] instance, where IndexedRow is a wrapper over (Long, Vector). We can also convert it to a RowMatrix by dropping its row indices.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}\r\nimport org.apache.spark.rdd.RDD\r\nval rows: RDD[IndexedRow] = ... \/\/ an RDD of indexed rows\r\n\/\/ Create an IndexedRowMatrix from an RDD[IndexedRow].\r\nval mat: IndexedRowMatrix = new IndexedRowMatrix(rows)\r\n\/\/ Get its size.\r\nval m = mat.numRows()\r\nval n = mat.numCols()\r\n\/\/ Drop its row indices.\r\nval rowMat: RowMatrix = mat.toRowMatrix()<\/pre>\n<h4>iii. CoordinateMatrix<\/h4>\n<p><span style=\"font-weight: 400\">A CoordinateMatrix is backed by an RDD of its entries. Each entry is a tuple of (i: Long, j: Long, value: Double), where i is the row index, j is the column index, and value is the entry value. A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse.<\/span><br \/>\n<span style=\"font-weight: 400\">We can create a CoordinateMatrix from an RDD[MatrixEntry] instance. 
Here MatrixEntry is a wrapper over (Long, Long, Double). We can also convert a CoordinateMatrix to an IndexedRowMatrix with sparse rows by calling toIndexedRowMatrix. <\/span><br \/>\n<b>Note<\/b><span style=\"font-weight: 400\">:<\/span><br \/>\n<span style=\"font-weight: 400\">Other computations on CoordinateMatrix are not currently supported.<\/span><\/p>\n<p><strong><a href=\"https:\/\/data-flair.training\/blogs\/fault-tolerance-in-apache-spark\/\">You must read\u00a0fault tolerance in Apache Spark<\/a><\/strong><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}\r\nimport org.apache.spark.rdd.RDD\r\nval entries: RDD[MatrixEntry] = ... \/\/ an RDD of matrix entries\r\n\/\/ Create a CoordinateMatrix from an RDD[MatrixEntry].\r\nval mat: CoordinateMatrix = new CoordinateMatrix(entries)\r\n\/\/ Get its size.\r\nval m = mat.numRows()\r\nval n = mat.numCols()\r\n\/\/ Convert it to an IndexedRowMatrix whose rows are sparse vectors.\r\nval indexedRowMatrix = mat.toIndexedRowMatrix()<\/pre>\n<h4>iv. BlockMatrix<\/h4>\n<p><span style=\"font-weight: 400\">A BlockMatrix is backed by an RDD of MatrixBlocks, where a MatrixBlock is a tuple of ((Int, Int), Matrix). The (Int, Int) is the index of the block, and Matrix is the sub-matrix at the given index with size rowsPerBlock x colsPerBlock. <\/span><br \/>\n<span style=\"font-weight: 400\">The easiest way to create a BlockMatrix is from an IndexedRowMatrix or CoordinateMatrix by calling toBlockMatrix. By default, toBlockMatrix creates blocks of size 1024 x 1024. Users can change the block size by supplying the values through toBlockMatrix(rowsPerBlock, colsPerBlock).<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}\r\nimport org.apache.spark.rdd.RDD\r\nval entries: RDD[MatrixEntry] = ... 
\/\/ an RDD of (i, j, v) matrix entries\r\n\/\/ Create a CoordinateMatrix from an RDD[MatrixEntry].\r\nval coordMat: CoordinateMatrix = new CoordinateMatrix(entries)\r\n\/\/ Transform the CoordinateMatrix to a BlockMatrix\r\nval matA: BlockMatrix = coordMat.toBlockMatrix().cache()\r\n\/\/ Validate whether the BlockMatrix is set up properly. Throws an Exception when it is not valid.\r\n\/\/ Nothing happens if it is valid.\r\nmatA.validate()\r\n\/\/ Calculate A^T A.\r\nval ata = matA.transpose.multiply(matA)<\/pre>\n<p>So, this was all about Apache Spark MLlib Data Types. We hope you liked our explanation.<\/p>\n<h2>3. Conclusion<\/h2>\n<p>Hence, in this Spark MLlib Data Types tutorial, we have seen all the Data Types of <a href=\"https:\/\/data-flair.training\/blogs\/deep-learning\/\">Machine Learning<\/a>. If you have any query regarding Spark MLlib Data Types, feel free to ask in the comment section.<\/p>\n<p><strong>See also &#8211;<\/strong><\/p>\n<p><strong><a href=\"https:\/\/data-flair.training\/blogs\/data-type-mapping-between-r-and-spark\/\">Data Type mapping Between R and Spark<\/a><\/strong><br \/>\n<a href=\"https:\/\/spark.apache.org\/\">For reference<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>1. Objective &#8211; Spark MLlib Data Types Today, in this Spark tutorial, we will learn about all the Apache Spark MLlib Data Types.\u00a0Machine learning library supports many Data Types. 
Moreover, in this Spark Machine&#46;&#46;&#46;<\/p>\n","protected":false},"author":6,"featured_media":6337,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[10],"tags":[3488,3996,8053,8360,8366,8447,15351],"class_list":["post-5829","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-spark","tag-data-types-in-machine-learning","tag-distributed-matrix-data-types","tag-labeled-point-data-types","tag-local-matrix-data-types","tag-local-vector-data-types","tag-machine-learning-data-types","tag-various-categorizes-of-data-types"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Spark MLlib Data Types | Apache Spark Machine Learning - DataFlair<\/title>\n<meta name=\"description\" content=\"Data Types in Spark Machine Learning- Spark MLlib Data Types in RDD-based API,datatypes-Local vector,Labeled point,Local matrix,Distributed matrix\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/data-flair.training\/blogs\/spark-mllib-data-types\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Spark MLlib Data Types | Apache Spark Machine Learning - DataFlair\" \/>\n<meta property=\"og:description\" content=\"Data Types in Spark Machine Learning- Spark MLlib Data Types in RDD-based API,datatypes-Local vector,Labeled point,Local matrix,Distributed matrix\" \/>\n<meta property=\"og:url\" content=\"https:\/\/data-flair.training\/blogs\/spark-mllib-data-types\/\" \/>\n<meta property=\"og:site_name\" content=\"DataFlair\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DataFlairWS\/\" \/>\n<meta property=\"article:published_time\" 
content=\"2018-01-17T09:04:04+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2018-09-17T12:26:15+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Data-Types-in-Machine-learning-Apache-Spark-01-01-1-1.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"DataFlair Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:site\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"DataFlair Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"8 minutes\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Spark MLlib Data Types | Apache Spark Machine Learning - DataFlair","description":"Data Types in Spark Machine Learning- Spark MLlib Data Types in RDD-based API,datatypes-Local vector,Labeled point,Local matrix,Distributed matrix","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/data-flair.training\/blogs\/spark-mllib-data-types\/","og_locale":"en_US","og_type":"article","og_title":"Spark MLlib Data Types | Apache Spark Machine Learning - DataFlair","og_description":"Data Types in Spark Machine Learning- Spark MLlib Data Types in RDD-based API,datatypes-Local vector,Labeled point,Local matrix,Distributed 
matrix","og_url":"https:\/\/data-flair.training\/blogs\/spark-mllib-data-types\/","og_site_name":"DataFlair","article_publisher":"https:\/\/www.facebook.com\/DataFlairWS\/","article_published_time":"2018-01-17T09:04:04+00:00","article_modified_time":"2018-09-17T12:26:15+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Data-Types-in-Machine-learning-Apache-Spark-01-01-1-1.jpg","type":"image\/jpeg"}],"author":"DataFlair Team","twitter_card":"summary_large_image","twitter_creator":"@DataFlairWS","twitter_site":"@DataFlairWS","twitter_misc":{"Written by":"DataFlair Team","Est. reading time":"8 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/data-flair.training\/blogs\/spark-mllib-data-types\/#article","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/spark-mllib-data-types\/"},"author":{"name":"DataFlair Team","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/2c58ecb4f73a39f0ef993f1ddfcd7b89"},"headline":"Spark MLlib Data Types | Apache Spark Machine Learning","datePublished":"2018-01-17T09:04:04+00:00","dateModified":"2018-09-17T12:26:15+00:00","mainEntityOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/spark-mllib-data-types\/"},"wordCount":1161,"commentCount":0,"publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/spark-mllib-data-types\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Data-Types-in-Machine-learning-Apache-Spark-01-01-1-1.jpg","keywords":["Data Types in Machine learning","Distributed matrix data types","Labeled point data types","Local matrix data types","Local vector data types","Machine Learning Data Types","various categorizes of data types"],"articleSection":["Apache Spark 
Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/data-flair.training\/blogs\/spark-mllib-data-types\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/data-flair.training\/blogs\/spark-mllib-data-types\/","url":"https:\/\/data-flair.training\/blogs\/spark-mllib-data-types\/","name":"Spark MLlib Data Types | Apache Spark Machine Learning - DataFlair","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/#website"},"primaryImageOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/spark-mllib-data-types\/#primaryimage"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/spark-mllib-data-types\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Data-Types-in-Machine-learning-Apache-Spark-01-01-1-1.jpg","datePublished":"2018-01-17T09:04:04+00:00","dateModified":"2018-09-17T12:26:15+00:00","description":"Data Types in Spark Machine Learning- Spark MLlib Data Types in RDD-based API,datatypes-Local vector,Labeled point,Local matrix,Distributed matrix","breadcrumb":{"@id":"https:\/\/data-flair.training\/blogs\/spark-mllib-data-types\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/data-flair.training\/blogs\/spark-mllib-data-types\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/spark-mllib-data-types\/#primaryimage","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Data-Types-in-Machine-learning-Apache-Spark-01-01-1-1.jpg","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Data-Types-in-Machine-learning-Apache-Spark-01-01-1-1.jpg","width":1200,"height":628,"caption":"data types in Machine learning"},{"@type":"BreadcrumbList","@id":"https:\/\/data-flair.training\/blogs\/spark-mllib-data-types\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Blog 
Home","item":"https:\/\/data-flair.training\/blogs\/"},{"@type":"ListItem","position":2,"name":"Apache Spark Tutorials","item":"https:\/\/data-flair.training\/blogs\/category\/spark\/"},{"@type":"ListItem","position":3,"name":"Spark MLlib Data Types | Apache Spark Machine Learning"}]},{"@type":"WebSite","@id":"https:\/\/data-flair.training\/blogs\/#website","url":"https:\/\/data-flair.training\/blogs\/","name":"DataFlair","description":"Learn Today. Lead Tomorrow.","publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/data-flair.training\/blogs\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/data-flair.training\/blogs\/#organization","name":"DataFlair","url":"https:\/\/data-flair.training\/blogs\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","width":106,"height":48,"caption":"DataFlair"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DataFlairWS\/","https:\/\/x.com\/DataFlairWS","https:\/\/www.linkedin.com\/company\/dataflair-web-services-pvt-ltd\/","https:\/\/www.youtube.com\/user\/DataFlairWS"]},{"@type":"Person","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/2c58ecb4f73a39f0ef993f1ddfcd7b89","name":"DataFlair 
Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","caption":"DataFlair Team"},"description":"The DataFlair Team provides industry-driven content on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. Our expert educators focus on delivering value-packed, easy-to-follow resources for tech enthusiasts and professionals.","url":"https:\/\/data-flair.training\/blogs\/author\/dfteam2\/"}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/5829","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/comments?post=5829"}],"version-history":[{"count":4,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/5829\/revisions"}],"predecessor-version":[{"id":34458,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/5829\/revisions\/34458"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media\/6337"}],"wp:attachment":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media?parent=5829"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/categories?post=5829"},{"tax
onomy":"post_tag","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/tags?post=5829"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}