Gradient Boosting Algorithm – Working and Improvements


In this Machine Learning tutorial, we will study the Gradient Boosting Algorithm. We will learn the history and purpose of boosting algorithms, see how the Gradient Boosting Algorithm works, and finally discuss improvements to the basic Gradient Boosting Algorithm.


What is Gradient Boosting in Machine Learning?

Gradient boosting is a machine learning technique for regression and classification problems that produces a prediction model in the form of an ensemble of weak prediction models.
The accuracy of a predictive model can be improved in two ways:
a. By investing in feature engineering, or
b. By applying boosting algorithms directly.
There are many boosting algorithms like
  • Gradient Boosting,
  • XGBoost,
  • AdaBoost,
  • Gentle Boost etc.
Every boosting algorithm has its own underlying mathematics, and there are slight variations in how each is applied.

History of Boosting Algorithm

Boosting is one of the most powerful learning ideas introduced in the last twenty years. It was originally designed for classification problems, but it can be extended to regression as well. The motivation for boosting was a procedure that combines the outputs of many “weak” classifiers to produce a powerful “committee.” A weak classifier (e.g. a small decision tree) is one whose error rate is only slightly better than random guessing.

Purpose of Boosting Algorithm

The purpose of a boosting algorithm is to sequentially apply a weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers $G_m(x)$, $m = 1, 2, \ldots, M$. The predictions from all of them are then combined through a weighted vote to produce the final prediction.

Stagewise Additive Modeling

Boosting builds an additive model:
$$F(x) = \sum_{m=1}^M \beta_m b(x; \gamma_m)$$
where $b(x; \gamma_m)$ is a tree and $\gamma_m$ parameterizes its splits. With boosting, the parameters $(\beta_m, \gamma_m)$ are fit in a stage-wise fashion: each new term is fit while the terms already added are left unchanged. This slows the process down, so the model overfits less quickly.
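As a worked equation (the standard forward stagewise formulation), at stage $m$ the previously fitted terms are frozen and only the new pair is optimized:
$$(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{N} L\big(y_i,\; F_{m-1}(x_i) + \beta\, b(x_i; \gamma)\big), \qquad F_m(x) = F_{m-1}(x) + \beta_m b(x; \gamma_m)$$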

AdaBoost

AdaBoost builds an additive logistic regression model by stagewise fitting.
In AdaBoost, we use an exponential loss function of the form
$$L(y, F(x)) = \exp(-y\,F(x)),$$
which plays a role similar to the negative binomial log-likelihood (deviance) loss.
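One consequence of this loss (stated here as a brief note, following the usual derivation) is that stagewise fitting under exponential loss weights each observation by
$$w_i^{(m)} = \exp\big(-y_i\, F_{m-1}(x_i)\big),$$
so examples the current model gets wrong receive exponentially larger weight at the next stage. This is exactly the reweighting scheme of the original AdaBoost algorithm.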

Gradient Boosting Algorithm

Friedman’s Gradient Boosting Algorithm for a generic loss function, $L(y_i, \gamma)$:
(The full algorithm is given as a figure in The Elements of Statistical Learning.)
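As a rough sketch of the same idea (not the book’s exact pseudocode), here is a minimal Python implementation using scikit-learn’s DecisionTreeRegressor as the weak learner. The loss is squared error, so the pseudo-residuals (negative gradients) are simply the ordinary residuals:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_stages=100, learning_rate=0.1, max_depth=3):
    """Minimal sketch of gradient boosting with squared-error loss."""
    # Step 1: initialize F_0(x) with the constant that minimizes the loss
    # (for squared error this is the mean of y).
    f0 = np.mean(y)
    pred = np.full(len(y), f0)
    trees = []
    for m in range(n_stages):
        # Step 2a: pseudo-residuals = negative gradient of the loss,
        # which for squared error is y - F_{m-1}(x).
        residuals = y - pred
        # Step 2b: fit a small regression tree to the pseudo-residuals.
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        # Steps 2c-2d: add a shrunken version of the new tree to the model.
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def boosted_predict(f0, trees, X, learning_rate=0.1):
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```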

a. Loss Functions and Gradients

(A table of common loss functions and their gradients is given in The Elements of Statistical Learning.)
The optimal number of iterations, $T$, and the learning rate, $\lambda$, depend on each other: a smaller learning rate generally requires more iterations to reach the same training loss.
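For concreteness, here are two standard regression losses and their negative gradients (the pseudo-residuals fitted at each stage), written out from their usual definitions rather than copied from the book’s table:
$$L(y, F) = \tfrac{1}{2}(y - F)^2 \;\Rightarrow\; -\frac{\partial L}{\partial F} = y - F, \qquad L(y, F) = |y - F| \;\Rightarrow\; -\frac{\partial L}{\partial F} = \operatorname{sign}(y - F)$$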

Stochastic Gradient Boosting Algorithm

Friedman also proposed a stochastic variant of the gradient boosting algorithm: at each iteration, it samples a random subset of the training data without replacement and fits the next tree on that subsample only. He found that this additional step improved performance.
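In scikit-learn, for example, this row subsampling is exposed through the subsample parameter; setting it below 1.0 turns ordinary gradient boosting into stochastic gradient boosting (the dataset and values below are only illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy data; in practice use your own X, y.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# subsample < 1.0 means each tree is fit on a random fraction of the rows,
# which is the "stochastic" part of stochastic gradient boosting.
model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                   subsample=0.5, random_state=0)
model.fit(X, y)
```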

How Does the Gradient Boosting Algorithm Work?

Gradient boosting Algorithm involves three elements:

  • A loss function to be optimized.
  • A weak learner to make predictions.
  • An additive model to add weak learners to minimize the loss function.

a. Loss Function

  • The loss function used depends on the type of problem being solved.
  • It must be differentiable. Many standard loss functions are supported, and you can define your own.

b. Weak Learner

  • We use decision trees as the weak learner in the gradient boosting algorithm.
  • Specifically, we use regression trees, which output real values at their leaves, so that their outputs can be added together. This allows the output of the next model to be added to “correct” the residuals in the predictions.
  • Trees are constructed in a greedy manner, choosing the best split points based on purity scores such as Gini impurity or on the reduction in loss.
  • Initially, as in the case of AdaBoost, very short decision trees with only a single split, called decision stumps, were used.
  • Gradient boosting generally uses larger trees with 4-to-8 levels.
  • It is common to constrain the weak learners in specific ways, such as a maximum number of layers, nodes, splits, or leaf nodes.
  • This ensures that the learners remain weak while still being constructed in a greedy manner.

Furthermore, larger trees can capture more complicated structure in the data and make the model fit better. However, this also means the model is more prone to over-fitting, so cross-validation is often needed to choose the right tree size and number of iterations.
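A minimal sketch of such a search with scikit-learn’s GridSearchCV (the dataset and parameter ranges are only illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Toy regression data; in practice use your own X, y.
X, y = make_regression(n_samples=500, n_features=10, noise=0.5, random_state=0)

# Cross-validate over tree size (max_depth) and number of iterations (n_estimators).
param_grid = {"max_depth": [2, 3, 4], "n_estimators": [100, 300, 500]}
search = GridSearchCV(GradientBoostingRegressor(learning_rate=0.1, random_state=0),
                      param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)
```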

c. Additive Model

  • Trees are added one at a time, and the existing trees in the model are not changed.
  • We use a gradient descent procedure to minimize the loss when adding trees.
  • Traditionally, gradient descent is used to minimize a set of parameters, such as the coefficients in a regression equation or the weights in a neural network. After calculating the error or loss, the weights are updated to minimize that error.
  • Instead of parameters, here we have weak learner sub-models, or more specifically, decision trees. After calculating the loss, performing the gradient descent step means adding a tree to the model that reduces the loss.
  • We do this by parameterizing the tree, then changing the parameters of the tree so as to move in the direction that reduces the residual loss.


Improvements to Basic Gradient Boosting Algorithm

Gradient boosting algorithm is a greedy algorithm and can overfit a training dataset quickly.
It can benefit from regularization methods that penalize various parts of the algorithm and generally improve its performance by reducing overfitting. Four common improvements are:
  • Tree Constraints
  • Shrinkage
  • Random sampling
  • Penalized Learning

a. Tree Constraints

  • It is important that the weak learners have skill but remain weak.
  • There are many ways in which the trees can be constrained, such as limiting the tree depth, the number of nodes or leaf nodes, or the minimum number of observations per split, as sketched below.
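In scikit-learn’s gradient boosting implementation, for instance, several of these constraints map directly to constructor parameters (the values below are only illustrative):

```python
from sklearn.ensemble import GradientBoostingRegressor

# Each parameter limits the weak learners in a different way.
model = GradientBoostingRegressor(
    max_depth=3,           # maximum number of levels in each tree
    min_samples_split=20,  # minimum samples required to attempt a split
    min_samples_leaf=10,   # minimum samples required in each leaf node
)
```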

b. Weighted Updates

  • The predictions of the trees are added together sequentially.
  • The contribution of each tree to this sum is weighted in order to slow down the learning of the algorithm. This weighting is referred to as shrinkage or a learning rate; see the equation below.
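In symbols (a standard way of writing the shrunken update, with $h_m$ the tree fitted at stage $m$ and $\lambda$ the learning rate, typically between 0.01 and 0.3):
$$F_m(x) = F_{m-1}(x) + \lambda\, h_m(x), \qquad 0 < \lambda \le 1$$
Smaller values of $\lambda$ slow learning down and usually require more trees to reach the same accuracy.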

c. Stochastic Gradient Boosting algorithm

  • A big insight behind bagging ensembles and random forests is that trees are built from random subsamples of the training data.
  • This same idea can be used in boosting to reduce the correlation between the trees in the sequence.
  • This variation of boosting is referred to as stochastic gradient boosting.

d. Penalized Gradient Boosting algorithm

We can impose additional constraints on the parameterized trees.
Classical decision trees are not used as the weak learners; instead, a modified form called a regression tree is used, which has numeric values in its leaf nodes. These leaf values are called weights in some of the literature.
As such, the leaf weight values of the trees can be regularized using popular regularization functions, such as:
  • L1 regularization of weights.
  • L2 regularization of weights.

L1 and L2 penalties keep the leaf weights small, which helps prevent overfitting. This is especially important when applying the algorithm to high-dimensional data, where the risk of over-fitting is higher.
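XGBoost, for example, exposes these penalties directly through its reg_alpha (L1) and reg_lambda (L2) parameters (the dataset and values below are only illustrative):

```python
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

# Toy high-dimensional regression data; in practice use your own X, y.
X, y = make_regression(n_samples=500, n_features=50, noise=0.5, random_state=0)

model = XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=4,
                     reg_alpha=0.1,   # L1 regularization of leaf weights
                     reg_lambda=1.0)  # L2 regularization of leaf weights
model.fit(X, y)
```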

Conclusion

In this tutorial, we studied the Gradient Boosting Algorithm: its history, its purpose, how it works, and the improvements that can be made to the basic algorithm. If you have any queries, feel free to ask in the comment section.