
R Decision Trees – The Best Tutorial on Tree Based Modeling in R!


In this tutorial, we will cover all the important aspects of Decision Trees in R. We will build these trees as well as comprehend their underlying concepts. We will also go through their applications and types, as well as their various advantages and disadvantages.

Let’s now begin with the tutorial on R Decision Trees.

What are R Decision Trees?

Decision Trees are a popular Data Mining technique that uses a tree-like structure to deliver consequences based on input decisions. One important property of decision trees is that they can be used for both regression and classification. This type of method can handle heterogeneous as well as missing data. Decision Trees are also capable of producing understandable rules. Furthermore, classifications can be performed without many computations.

As mentioned above, both classification and regression tasks can be performed with Decision Trees, though a given tree carries out only one of the two. A Decision Tree can be visualised as a flowchart-like structure in which each internal node tests a variable, each branch represents an outcome of that test, and each leaf holds a class label or predicted value.

Applications of Decision Trees

Decision Trees are used in the following areas of application:

How to Create Decision Trees in R

Decision Tree techniques detect criteria for dividing the individual items of a group into n predetermined classes.

In the first step, the variable for the root node is chosen. This variable should be selected based on its ability to separate the classes efficiently. The operation starts with the division of this variable into the given classes, which creates subpopulations. The operation is repeated on each subpopulation until no further useful separation can be obtained.


A tree in which every node has at most two child nodes is a binary tree. The origin node is referred to as the root node, and the terminal nodes are the leaves.

To create a decision tree, you need to follow certain steps:

1. Choosing a Variable

The choice depends on the type of Decision Tree, and the same goes for the choice of the separation condition.

In the case of a binary variable, there is only one possible separation, whereas for a continuous variable with n distinct values there are n - 1 possibilities.

The separation condition is as follows:

X <= (x_k + x_(k+1)) / 2
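For instance, the candidate thresholds for a continuous variable are simply the midpoints between consecutive sorted values; a minimal R sketch of this idea, with made-up values of x, looks as follows:

# Illustration: candidate split thresholds for a continuous variable
x <- c(2.1, 3.5, 3.9, 5.0, 7.2)   # made-up predictor values
x_sorted <- sort(unique(x))
# midpoints between consecutive values give the n - 1 candidate splits
thresholds <- (head(x_sorted, -1) + tail(x_sorted, -1)) / 2
thresholds
# 2.80 3.70 4.45 6.10  -> each defines a split of the form X <= threshold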

After finding the best separation, the operation is repeated to increase discrimination among the nodes.

The density of a node is the ratio of its individuals to the entire population.

After finding the best separation, the classes are split into child nodes, and a new splitting variable is derived at each step. We choose the best separation criterion using one of the following tests:

The X2 Test – To test the independence of variables X and Y, we compute the statistic

X2 = sum over all cells (i, j) of (Oij – Tij)^2 / Tij

where Oij is the observed count in cell (i, j) of the contingency table of X and Y, and Tij is the count expected under independence of X and Y.

The number of degrees of freedom is calculated as:

p = (no. of rows – 1) * (no. of columns – 1)
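As an illustration, base R's chisq.test() reports the observed counts, the expected counts under independence, the X2 statistic, and its degrees of freedom for a contingency table; the counts below are hypothetical:

# Hypothetical 2 x 3 contingency table of observed counts Oij
obs <- matrix(c(30, 10, 20,
                15, 25, 20), nrow = 2, byrow = TRUE)
test <- chisq.test(obs)
test$observed    # the Oij values
test$expected    # the Tij values expected under independence
test$statistic   # the X2 statistic
test$parameter   # degrees of freedom: (2 - 1) * (3 - 1) = 2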

The Gini Index – With this test, we measure the purity of nodes. It can be used with any type of dependent variable, and we calculate it as follows:

Gini = 1 – sum of fi^2, for i = 1, ..., p

In the preceding formula, fi, i = 1, ..., p, corresponds to the frequencies in the node of the p classes that we need to predict.

The more evenly the classes are distributed within a node, the higher the Gini index; as the purity of the node increases, the Gini index decreases.
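A minimal R sketch of this formula, with made-up class counts, shows that a pure node scores 0 while an evenly mixed two-class node scores the maximum of 0.5:

# Gini index of a node from its class counts (illustrative values)
gini <- function(counts) {
  f <- counts / sum(counts)   # class frequencies fi
  1 - sum(f^2)
}
gini(c(50, 0))    # pure node           -> 0
gini(c(40, 10))   # mostly one class    -> 0.32
gini(c(25, 25))   # evenly mixed node   -> 0.5 (maximum for two classes)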

Wait a minute! Have you checked – Logistic Regression in R

2. Assigning Data to Nodes

After the construction is completed and the decision criteria are established, every individual is assigned to one leaf. This assignment is determined by the independent variables. An individual is placed in a given leaf only if assigning it to any other leaf would carry a higher cost for its present class.

3. Pruning the Tree

In order to remove irrelevant nodes from the tree, we perform pruning. An algorithm is considered good if it first grows a large tree and then prunes it back automatically. To select the best subtree, we perform cross-validation and compare the aggregated error rates of all the subtrees. Furthermore, we shorten the deep branches of the tree to limit the creation of very small nodes.
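Below is a minimal sketch of this grow-then-prune workflow using rpart and the kyphosis dataset that ships with the package; the specific control values are illustrative:

library(rpart)
# Grow a deliberately large tree on the kyphosis data shipped with rpart
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             control = rpart.control(cp = 0.001, minsplit = 5))
printcp(fit)   # cross-validated error (xerror) for every subtree size
# Keep the subtree whose complexity parameter minimises xerror
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)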

Common R Decision Trees Algorithms

The most common Decision Tree algorithms include:

1. Classification and Regression Tree (CART)

CART is the most popular and widely used Decision Tree algorithm. The primary tool used in CART for finding the separation at each node is the Gini Index.

Performance and generality are the two main advantages of a CART tree.

However, CART has some drawbacks as well.

You must definitely have a look at Binomial and Poisson Distribution in R

2. Chi-Square Automatic Interaction Detection (CHAID)

CHAID was developed as an early Decision Tree, based on the 1963 AID tree model. Unlike CART, CHAID does not substitute missing values with surrogate values; all missing values are treated as a single class, which can then be merged with another class. For finding the significant variables, we make use of the X2 test, which is only valid for qualitative or discrete variables. A CHAID tree is created in several successive steps.

CHAID trees tend to be wider rather than deeper. Furthermore, no pruning function is available for them; the construction simply halts when the largest tree has been created.

With the help of CHAID, we can transform quantitative data into qualitative data.
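A simple way to sketch this transformation in base R is with cut(), which bins a numeric variable into labelled classes; the ages and break points below are illustrative:

# Binning a quantitative variable into qualitative classes with cut()
age <- c(4, 17, 23, 35, 41, 58, 63, 79)   # made-up ages
age_class <- cut(age,
                 breaks = c(0, 18, 40, 65, Inf),
                 labels = c("child", "young adult", "adult", "senior"))
table(age_class)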

Take a deep dive into Contingency Tables in R

Guidelines for Building Decision Trees in R

Decision Trees belong to the class of recursive partitioning algorithms, which can be implemented easily: the data is split repeatedly on the best separating variable until a halting condition is met.

This halting condition may take the form of a statistical significance test or a minimum record count. Since Decision Trees are non-linear predictors, the decision boundaries between the target classes are also non-linear, and the non-linearities change with the number of splits.
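In rpart, for example, such halting conditions are expressed through rpart.control(); the sketch below, using the built-in iris data, shows the usual parameters with illustrative values:

library(rpart)
# Halting conditions for recursive partitioning, set through rpart.control():
#   minsplit - smallest node that may still be split
#   cp       - minimum improvement a split must bring
#   maxdepth - maximum depth of any node of the tree
ctrl <- rpart.control(minsplit = 20, cp = 0.01, maxdepth = 5)
fit <- rpart(Species ~ ., data = iris, control = ctrl)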

Some of the important guidelines for creating decision trees are as follows:

You must learn about the Non-Linear Regression in R

Decision Tree Options

Get a deep insight into the Chi-Square Test in R with Examples

How to Build Decision Trees in R

We will use the rpart package for building our Decision Tree in R and use it for classification by generating decision and regression trees. We will use recursive partitioning (rpart) as well as conditional inference trees (ctree from the party package) to build our Decision Tree. R builds a Decision Tree as a two-stage process: the full tree is grown first and is then pruned back.

We will make use of the popular Titanic survival dataset. We will first import our essential libraries such as rpart, dplyr, party, and rpart.plot.

#Author DataFlair
library(rpart)       # recursive partitioning trees
library(readr)       # data import helpers
library(caTools)     # sample.split for the train/test split
library(dplyr)       # data manipulation
library(party)       # conditional inference trees (ctree)
library(partykit)    # handling and plotting party objects
library(rpart.plot)  # plotting rpart trees

After this, we will read our data and store it inside the titanic_data variable.

titanic_data <- "https://goo.gl/At238b" %>%  #DataFlair
  read.csv %>% # read in the data
  select(survived, embarked, sex, 
         sibsp, parch, fare) %>%
  mutate(embarked = factor(embarked),
         sex = factor(sex))

Output:

After this, we will split our data into training and testing sets as follows:

set.seed(123)
sample_data = sample.split(titanic_data$survived, SplitRatio = 0.75)
train_data <- subset(titanic_data, sample_data == TRUE)
test_data <- subset(titanic_data, sample_data == FALSE)

Output:

Then, we will proceed to plot our Decision Tree using the rpart function as follows:

rtree <- rpart(survived ~ ., data = train_data)
rpart.plot(rtree)

Output:

We will also plot our conditional inference tree as follows:

ctree_ <- ctree(survived ~ ., train_data)
plot(ctree_)

Output:

Let’s master the Survival Analysis in R Programming

Prediction by Decision Tree

Similar to classification, Decision Trees can also be used for prediction. In order to carry out the latter, the node split criterion is changed. The aim of implementing this is the following:

We want to select the child nodes so that they reduce intra-class variance and amplify inter-class variance. Consider a CHAID tree as an example: given an input of 163 countries, it groups them into five clusters based on the differences in their citizens' GNP. After these groups are made, there is a further split based on life expectancy.
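As a quick sketch of prediction with the tree built earlier (assuming the rtree model and the test_data split from the code above), predict() returns a survival probability for each passenger because survived was read in as 0/1; thresholding it gives a class prediction and a simple confusion matrix:

# Predictions from the rpart tree built above on the held-out test set
pred_prob <- predict(rtree, test_data)          # predicted survival probability
pred_class <- ifelse(pred_prob > 0.5, 1, 0)     # threshold into classes
conf <- table(actual = test_data$survived, predicted = pred_class)
conf                                            # confusion matrix
sum(diag(conf)) / sum(conf)                     # overall accuracy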

Advantages of R Decision Trees

Decision Trees are highly popular Data Mining and Machine Learning techniques. The following are their main advantages:

Get to know about the Machine Learning Techniques with Python

Disadvantages of R Decision Trees

Decision Trees possess the following disadvantages:

This was all about our tutorial on Decision Trees. We hope you liked it and gained useful information.

Summary

In the above tutorial, we understood the concept of R Decision Trees. We further discussed the various applications of these trees. Furthermore, we learnt how to build these trees in R. We also learnt some useful algorithms such as CART and CHAID. We went through the advantages and disadvantages of R trees.

Don’t forget to check out the article on Random Forest in R Programming

If you have any queries, we will be glad to solve them for you.
