SVM – Support Vector Machine Tutorial for Beginners
1. SVM Tutorial – Objective
In this Support Vector Machine tutorial, we are going to develop a deep understanding of what SVM is. We will also discuss the SVM algorithm in the separable and non-separable cases, linear SVM, and the advantages and disadvantages of SVM in detail.
In machine learning, support vector machines are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis.
So, let’s start SVM Tutorial.
2. SVM Introduction
SVM stands for Support Vector Machine. It is a machine learning approach used for classification and regression analysis. It relies on supervised learning models trained by learning algorithms, which analyze large amounts of data to identify patterns.
An SVM partitions a high-dimensional space with a flat, linear boundary called a hyperplane, generated in a single pass over the data and typically using almost all attributes. The two categories are divided by a clear gap, bounded by two parallel hyperplanes, that should be as wide as possible.
An SVM therefore creates the hyperplane with the largest margin to separate the given data into classes. The margin is the distance between the closest data points of the two classes across the separating hyperplane.
The larger the margin, the lower is the generalization error of the classifier.
After training, new data points are mapped into the same space and categorized according to the partition into which they fall.
Of all the available classifiers, SVMs are among the most flexible.
SVMs are like probabilistic approaches but do not consider dependencies among attributes.
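As a concrete starting point, here is a minimal sketch of training an SVM classifier with scikit-learn. The library, the synthetic dataset, and all parameter values are illustrative assumptions, not part of this tutorial's own material:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic data: two clusters of points, one per class.
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# Train a linear SVM, then classify a few points.
clf = SVC(kernel="linear")
clf.fit(X, y)
pred = clf.predict(X[:5])
```

On well-separated data like this, the fitted model places the widest possible gap between the two clusters.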
3. SVM Algorithm
To understand the algorithm of SVM, consider two cases:
- Separable case – Infinitely many boundaries can separate the data into two classes.
- Non-separable case – The two classes are not separable but overlap with each other.
3.1. The Separable Case
In the separable case, infinitely many boundaries can separate the data. The boundary that gives the largest distance to the nearest observation is called the optimal hyperplane; it ensures both the fit and the robustness of the model. The optimal hyperplane is defined by the equation:
a.x + b = 0
Here, a.x is the scalar product of a and x. The hyperplane must satisfy the following two conditions:
- It should separate the two classes A and B well, so that the function f(x) = a.x + b satisfies:
- f(x) > 0 if and only if x ∈ A
- f(x) ≤ 0 if and only if x ∈ B
- It lies as far away as possible from all the observations (robustness of the model), given that the distance from an observation x to the hyperplane is |a.x + b| / ||a||.
The width of the space between the closest observations of the two classes is 2/||a||. It is called the margin, and it should be as large as possible.
The hyperplane depends only on the closest points, called the support points. The generalization capacity of an SVM increases as the number of support points decreases.
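The margin 2/||a|| can be read directly off a fitted linear SVM. This is a sketch assuming scikit-learn and a synthetic, well-separated dataset; in scikit-learn the vector a is stored as `coef_` and b as `intercept_`:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Well-separated synthetic data, so the separable-case picture applies.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.5, random_state=0)

# A very large C approximates the hard-margin (separable) SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

a = clf.coef_[0]                  # the vector a in a.x + b = 0
b = clf.intercept_[0]             # the offset b
margin = 2.0 / np.linalg.norm(a)  # width of the gap, 2/||a||
```

Shrinking ||a|| widens the margin, which is exactly what the optimization below pursues.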
3.2. The Non-Separable Case
If the two classes are not perfectly separated but overlap, a term measuring the classification error must be added to each of the following two conditions:
- For every i, yi(a.xi + b) ≥ 1 (correct separation)
- 1/2 ||a||² is minimal (greatest margin)
These conditions are relaxed for each observation xi on the wrong side of the boundary by measuring the distance separating it from the boundary of the margin on the side of its class. This distance is then normalized by dividing it by the half-margin 1/||a||, giving a term ξi called the slack variable. An observation for which ξi > 1 is an error of the model, and the sum of all the ξi represents the total classification error. So, the previous two constraints for finding the optimal hyperplane become:
- For every i, yi(a.xi + b) ≥ 1 – ξi
- 1/2 ||a||² + δ Σi ξi is minimal
The quantity δ is a parameter that penalizes errors: it controls the trade-off between the width of the margin and the classification errors. As δ increases, sensitivity to errors rises and the model adapts more closely to the training data.
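scikit-learn exposes an error-penalty parameter `C` that plays essentially the role of δ above (this correspondence, the library, and the synthetic overlapping dataset are assumptions of this sketch):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Overlapping classes: some slack (nonzero ξi) is unavoidable here.
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, flip_y=0.1,
                           random_state=0)

# C acts like the error penalty δ in the text (an assumed correspondence).
tolerant = SVC(kernel="linear", C=0.01).fit(X, y)  # wide margin, errors tolerated
strict = SVC(kernel="linear", C=100.0).fit(X, y)   # errors penalized heavily

# A heavier penalty usually leaves fewer points inside the margin,
# i.e. fewer support vectors.
n_tolerant = int(tolerant.n_support_.sum())
n_strict = int(strict.n_support_.sum())
```

Comparing `n_tolerant` with `n_strict` makes the effect of the penalty visible without plotting anything.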
Another way of handling the non-separable case is to move to a space of high enough dimension for a linear separation to exist. We search for a nonlinear transformation from the original space to this higher-dimensional space, choosing a target space that has a scalar product. In SVMs, this restructuring of the data is known as a transformation, and it is done by a function, referred to as the transformation function and represented by the symbol Φ. Technically, a kernel function computes the scalar product of the transformed data points directly, without mapping them explicitly into the higher-dimensional space.
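The effect of such a transformation can be seen by comparing a linear kernel with a nonlinear (RBF) kernel on data that no straight line can separate. A sketch, assuming scikit-learn and a synthetic "circles" dataset:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: no straight line separates the two classes.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)  # implicit map to a higher dimension

lin_acc = linear.score(X, y)
rbf_acc = rbf.score(X, y)
```

The RBF kernel separates the circles almost perfectly, while the linear model cannot do much better than chance, illustrating why moving to a higher-dimensional space helps.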
4. Linear SVM
We can use a linear SVM to find the maximum-margin hyperplane that divides the training data D, a set of n points.
If the training data are separable, select two parallel hyperplanes that separate the data with no points between them; the distance between them is known as the margin, and the goal is to maximize it. You can calculate the distance between these two hyperplanes by simple geometry: it is given by the quantity 2/||a||, so to increase the distance you have to reduce ||a||.
- Primal Form – The linear SVM problem can be solved in its primal form using standard quadratic programming techniques and software.
- Dual Form – Writing the classification rule in its dual form shows that the maximum-margin hyperplane, and hence the classification task, is a function only of the support vectors: the subset of the training data that lies on the margin.
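In scikit-learn, the support vectors selected by the dual problem are exposed directly on the fitted model (a sketch under the same assumptions as before: scikit-learn and a synthetic dataset):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear").fit(X, y)

# The decision rule depends only on these points, not the whole set.
support_vectors = clf.support_vectors_
n_sv = len(support_vectors)
```

Typically only a small fraction of the 100 training points end up as support vectors, which is what keeps the model compact.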
4.1. Biased and Unbiased Hyperplanes
Data points and hyperplanes are represented in the same coordinate system. Hyperplanes are divided into two types on the basis of their position in that system:
- Biased hyperplanes – Hyperplanes that do not pass through the origin of the coordinate system.
- Unbiased hyperplanes – Those that pass through the origin of the coordinate system.
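With scikit-learn's `LinearSVC`, the `fit_intercept` flag chooses between the two kinds of hyperplane (a hedged sketch; the library and the synthetic data are assumptions):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# Biased: the offset b is learned, so the hyperplane can sit anywhere.
biased = LinearSVC(fit_intercept=True, max_iter=10000).fit(X, y)

# Unbiased: the hyperplane is forced through the origin, so b = 0.
unbiased = LinearSVC(fit_intercept=False, max_iter=10000).fit(X, y)
b_unbiased = float(np.ravel(unbiased.intercept_)[0])
```

Unless the classes happen to straddle the origin, the biased (intercept-fitting) model is usually the better choice.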
5. Advantages and Disadvantages of SVM
Let us now look at some advantages and disadvantages of SVM.
- Advantages – SVMs can model nonlinear phenomena through the choice of an appropriate kernel. They generally provide precise predictions. SVMs determine the optimal hyperplane from the nearest points (the support vectors) only, not from distant points, which enhances the robustness of the model in some cases.
- Disadvantages – The models are opaque. Although you can approximate them with a decision tree, there is a risk of loss of precision. SVMs are very sensitive to the choice of the kernel parameters, and the difficulty of choosing them correctly may compel you to test many possible values. As a result, the computation time is sometimes lengthy.
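A common way to cope with this sensitivity is to test many kernel parameter values systematically. A sketch using scikit-learn's `GridSearchCV` (the parameter grid and dataset shown are arbitrary illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Every (C, gamma) pair is cross-validated, which is why such a
# search can take a long time on larger grids and datasets.
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                    cv=3)
grid.fit(X, y)
best = grid.best_params_
```

The cost grows multiplicatively with each parameter added to the grid, which is precisely the computational drawback the text describes.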
So, this was all about SVM Tutorial. Hope you like our explanation.
In conclusion, the support vector machine is one of the most popular machine learning algorithms. The maximal-margin classifier explains how an SVM actually works, and in practice SVMs are implemented using kernels. The learning of the hyperplane in a linear SVM is done by transforming the problem using some linear algebra, which is beyond the scope of this introduction to SVM.
If you have any questions about SVM or this post, ask in the comments and I will do my best to answer.