Gradient Boosting Algorithm – Working and Improvements

DataFlair Team

6 years ago

Free Machine Learning courses with 130+ real-time projects Start Now!!

1. Gradient Boosting Algorithm

In this Machine Learning Tutorial, we will study Gradient Boosting Algorithm. Also, we will learn Boosting Algorithm history & purpose. Along with this, we will also study the working of Gradient Boosting Algorithm, at last, we will discuss improvements to Gradient Boosting Algorithm.

What is Gradient Boosting Algorithm

2. What is Gradient Boosting in Machine Learning?

Gradient boosting is a machine learning technique for regression and classification problems. That produces a prediction model in the form of an ensemble of weak prediction models.

The accuracy of a predictive model can be boosted in two ways:

a. Either by embracing feature engineering or

b. By applying boosting algorithms straight away.

There are many boosting algorithms like

Gradient Boosting,

XGBoost,

AdaBoost,

Gentle Boost etc.

Every boosting algorithm has its own underlying mathematics. Also, a slight variation is observed while applying them.

3. History of Boosting Algorithm

Boosting Algorithm is one of the most powerful learning ideas introduced in the last twenty years. It was designed for classification problems, but it can be extended to regression as well. The motivation for Gradient boosting was a procedure. That combines the outputs of many “weak” classifiers to produce a powerful “committee.” A weak classifier (e.g. decision tree) is one whose error rate is only better than random guessing.

4. Purpose of Boosting Algorithm

The purpose of boosting Algorithm is to sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers $G_m(x)$, $m = 1, 2, … , M$.

Read about Machine Learning Algorithms

5. Stagewise Additive Modeling

Boosting builds an additive model:

$$F(x) = \sum_{m=1}^M \beta_m b(x; \gamma_m)$$

where $b(x; \gamma_m)$ is a tree and $\gamma_m$ parameterizes the splits. With boosting, the parameters, $(\beta_m, \gamma_m)$ fit in a stage-wise fashion. This slows the process down and overfits less quickly.

6. AdaBoost

AdaBoost builds an additive logistic regression model by stagewise fitting

IN AdaBoost, we use an exponential loss function of the form-

$L(y, F(x)) = exp(-yF(x))$, like the negative binomial log-likelihood loss.

7. Gradient Boosting Algorithm

Friedman’s Gradient Boosting Algorithm for a generic loss function, $L(y_i, \gamma)$:

Source: Elements of Statistical Learning

What is Gradient Boosting Algorithm

a. Loss Functions and Gradients

Source: Elements of Statistical Learning

The optimal number of iterations, T, and the learning rate, λ, depend on each other.

Read about Machine Learning Applications in the real world

8. Stochastic Gradient Boosting Algorithm

This was proposed by the stochastic gradient boosting algorithm. Also, it samples without replacement from the dataset before estimating the next step. He found that this additional step improved performance.

9. How is the Gradient Boosting Algorithm Works?

Gradient boosting Algorithm involves three elements:

A loss function to be optimized.

Weak learner to make predictions.

An additive model to add weak learners to minimize the loss function.

a. Loss Function

The loss function used depends on the type of problem being solved.

It must be differentiable. Although, many standard loss functions are supported and you can define your own.

b. Weak Learner

We use decision trees as the weak learner in gradient boosting algorithm.

Specifically, we use regression tree that output real values for splits. And whose output can be added together. It allows next models outputs to be added and “correct” the residuals in the predictions.

Trees need to construct in a greedy manner. It helps in choosing the best split points based on purity scores like Gini or to cut the loss.

Initially, such as in the case of AdaBoost. Also, we use very short decision trees that only had a single split, called a decision stump.

Generally, we use larger trees with 4-to-8 levels.

It is common to constrain the weak learners in specific ways. Such as a maximum number of layers, nodes, splits or leaf nodes.

This is to ensure that the learners remain weak, but can still need to construct in a greedy manner.

c. Additive Model

Trees need to add one at a time, and existing trees in the model need not change.

We use a gradient descent procedure to minimize the loss when adding trees.

Traditionally, we use gradient tree to cut a set of parameters. Such as the coefficients in a regression equation or weights in a neural network. After calculating error or loss, the weights need to be update to minimize that error.

Instead of parameters, we have weak learner sub-models or more specifically decision trees. After calculating the loss, to perform the gradient descent procedure. We must add a tree to the model that reduces the loss.

We do this by parameterizing the tree. Then change the parameters of the tree and move in the right direction by (reducing the residual loss.

Learn about Pros And Cons of Machine Learning

10. Improvements to Basic Gradient Boosting Algorithm

Gradient boosting algorithm is a greedy algorithm and can overfit a training dataset quickly.

It can enjoy regularization methods. That penalize various parts of boosting algorithm. And generally improve the performance of the algorithm by reducing overfitting.

Tree Constraints

Shrinkage

Random sampling

Penalized Learning

a. Tree Constraints

It is important that the weak learners have skill but remain weak.

There are many ways that the trees need to be a constraint.

b. Weighted Updates

The predictions of each tree have to add together sequentially.

The contribution of each tree to this sum needs to be weight to slow down the learning by the algorithm. This weighting is referred as a shrinkage or a learning rate.

c. Stochastic Gradient Boosting algorithm

A big insight into bagging ensembles. Also, the random forest was allowing trees to create.

This same benefit can be used to reduce the correlation between the trees.

This variation of boosting is referred as stochastic gradient boosting.

d. Penalized Gradient Boosting algorithm

We can impose additional constraints on the parameterized trees.

We can’t use classical decision tree as weak learners. Instead, a modified form called a regression tree is used that has numeric values in the leaf nodes. The values in the leaves of the trees can be called weights in some literature.

As such, the leaf weight values of the trees have to regularize. For this, we use popular regularization functions, such as:

L1 regularization of weights.

L2 regularization of weights.

11. Conclusion

As a result, we have studied Gradient Boosting Algorithm. Also, have learned Gradient Boosting Algorithm history, purpose and it’s working. As Gradient Boosting Algorithm is a very hot topic. Moreover, we have covered everything related to Gradient Boosting Algorithm in this blog. Furthermore, if you feel any query, feel free to ask in a comment section.
For reference