OLS Regression in R – 8 Simple Steps to Implement OLS Regression Model
Struggling to implement OLS regression in R?
Don't worry, you have landed on the right page. This article is a complete guide to Ordinary Least Squares (OLS) regression modelling. It will make you comfortable writing the commands needed to build an OLS model in R.
1. What is OLS Regression in R?
OLS regression in R is a statistical technique used for modelling and for analysing the linear relationship between a response variable and one or more predictor variables. If the relationship between two variables appears to be linear, a straight line can be fitted to the data to model the relationship.
The linear equation for a bivariate regression takes the following form:

y = mx + c

Where y = response (dependent) variable
m = gradient (slope)
x = predictor (independent) variable
c = intercept
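As a quick illustration of this equation, fitting a straight line with lm() recovers the slope m and the intercept c. (This sketch uses the built-in mtcars dataset purely for illustration, not the housing data introduced later.)

```r
# Fit the straight-line model y = m*x + c on the built-in mtcars data:
# mpg (response) modelled as a linear function of wt (predictor)
fit <- lm(mpg ~ wt, data = mtcars)

# coef() returns c (the intercept) and m (the slope for wt)
coef(fit)
```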
2. OLS in R (Linear Model Estimation Using Ordinary Least Squares)
ols(formula, data, weights, subset, na.action=na.delete,
    method="qr", model=FALSE,
    x=FALSE, y=FALSE, se.fit=FALSE, linear.predictors=TRUE,
    penalty=0, penalty.matrix, tol=1e-7, sigma,
    var.penalty=c('simple','sandwich'), ...)
These are the arguments of ols() (from the rms package):
- formula – an S formula object, e.g.
Y ~ rcs(x1,5)*lsp(x2,c(10,20))
- data – the name of an S data frame containing all the needed variables.
- weights – weights to use in the fitting process.
- subset – an expression defining a subset of the observations to use in the fit. The default is to use all observations.
- na.action – specifies an S function to handle missing data.
- method – specifies a particular fitting method, or "model.frame".
- model – the default is FALSE. Set to TRUE to return the model frame as an element of the fit object.
- x – the default is FALSE. Set to TRUE to return the expanded design matrix as element x of the returned fit object. Set both x=TRUE and y=TRUE if you are going to use the residuals function.
- y – the default is FALSE. Set to TRUE to return the vector of response values as element y of the fit.
- se.fit – the default is FALSE. Set to TRUE to compute the estimated standard errors of the estimate of Xβ and store them in element se.fit of the fit.
- linear.predictors – the default is TRUE. Set to FALSE to cause predicted values not to be stored.
- penalty, penalty.matrix – see lrm.
- tol – tolerance for information matrix singularity.
- sigma – if sigma is given, it is used as the actual root mean squared error parameter for the model; otherwise sigma is estimated from the data using the usual formulas.
- var.penalty – the type of variance-covariance matrix to be stored in the var component of the fit when penalization is used.
- … – additional arguments passed to lm.wfit or lm.fit.
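A minimal call looks like the following (a sketch that assumes the rms package is installed; the built-in mtcars data is used here only for illustration):

```r
# Ordinary least squares via rms::ols; coefficients match those of lm()
library(rms)

# x = TRUE, y = TRUE store the design matrix and response in the fit,
# which the residuals() method for ols fits needs
f <- ols(mpg ~ wt + hp, data = mtcars, x = TRUE, y = TRUE)
coef(f)
```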
3. OLS Data Analysis: Descriptive Stats
- Several built-in commands for describing data are present in R.
- The list() command outputs all elements of an object.
- The summary() command describes all variables contained within a data frame.
- The summary() command can also be used with individual variables.
- Simple plots can also provide familiarity with the data.
- The hist() command produces a histogram for any given data values.
- The plot() command produces both univariate and bivariate plots for any given objects.
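For instance, on the built-in mtcars dataset (used here only for illustration), these commands look like this:

```r
# Summary statistics for every variable in the data frame
summary(mtcars)

# Summary statistics for a single variable
summary(mtcars$mpg)

# Histogram of one variable
hist(mtcars$mpg)

# Bivariate scatter plot of weight against fuel consumption
plot(mtcars$wt, mtcars$mpg)
```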
4. OLS Regression Commands for Data Analysis
These are useful OLS regression commands for data analysis:
- lm – linear model.
- lme – mixed-effects models (from the nlme package).
- glm – generalised linear models.
- multinom – multinomial logit (from the nnet package).
- optim – general-purpose optimiser.
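As a sketch of how two of these are called (again on the built-in mtcars data; lm() and glm() are base R, so no extra packages are needed):

```r
# Linear model: continuous response
m_lin <- lm(mpg ~ wt + hp, data = mtcars)

# Generalised linear model: here a logistic regression on the
# binary transmission indicator am
m_log <- glm(am ~ wt, family = binomial, data = mtcars)

summary(m_lin)
summary(m_log)
```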
Before we move further with OLS regression, you should master importing data in R.
5. How to Implement OLS Regression in R?
To implement OLS in R, we will use the lm() command, which performs linear modelling. The dataset that we will be using is the UCI Boston Housing dataset, which is openly available.
For the implementation of OLS regression in R, we use this data (CSV).
So, let’s start the steps with our first R linear regression model –
First, we import the library that we will be using in our code. The sample.split() function used below for the train/test split comes from the caTools package:
> library(caTools)
Now, we read our data that is present in the .csv format (CSV stands for Comma Separated Values).
> data = read.csv("/home/admin1/Desktop/Data/hou_all.csv")
Now, we will display the compact structure of our data and its variables with the help of the str() function.
> str(data)
Then, to get a brief idea about our data, we will output the first 6 rows using the head() function.
> head(data)
We will obtain the following output –
X0.00632 X18 X2.31 X0 X0.538 X6.575 X65.2 X4.09 X1 X296 X15.3 X396.9 X4.98 X24 X1.1
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6 1
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7 1
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4 1
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2 1
5 0.02985 0.0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7 1
6 0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 395.60 12.43 22.9 1
Now, in order to get an understanding of the various statistical features of our variables, such as the mean, median, 1st quartile value, etc., we use the summary() function.
> summary(data)
We obtain the following output –
X0.00632 X18 X2.31 X0 X0.538
Min. : 0.00906 Min. : 0.00 Min. : 0.46 Min. :0.00000 Min. :0.3850
1st Qu.: 0.08221 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000 1st Qu.:0.4490
Median : 0.25915 Median : 0.00 Median : 9.69 Median :0.00000 Median :0.5380
Mean : 3.62067 Mean : 11.35 Mean :11.15 Mean :0.06931 Mean :0.5547
3rd Qu.: 3.67822 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000 3rd Qu.:0.6240
Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000 Max. :0.8710
X6.575 X65.2 X4.09 X1 X296
Min. :3.561 Min. : 2.90 Min. : 1.130 Min. : 1.000 Min. :187.0
1st Qu.:5.885 1st Qu.: 45.00 1st Qu.: 2.100 1st Qu.: 4.000 1st Qu.:279.0
Median :6.208 Median : 77.70 Median : 3.199 Median : 5.000 Median :330.0
Mean :6.284 Mean : 68.58 Mean : 3.794 Mean : 9.566 Mean :408.5
3rd Qu.:6.625 3rd Qu.: 94.10 3rd Qu.: 5.212 3rd Qu.:24.000 3rd Qu.:666.0
Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.000 Max. :711.0
X15.3 X396.9 X4.98 X24 X1.1
Min. :12.60 Min. : 0.32 Min. : 1.73 Min. : 5.00 Min. :1
1st Qu.:17.40 1st Qu.:375.33 1st Qu.: 7.01 1st Qu.:17.00 1st Qu.:1
Median :19.10 Median :391.43 Median :11.38 Median :21.20 Median :1
Mean :18.46 Mean :356.59 Mean :12.67 Mean :22.53 Mean :1
3rd Qu.:20.20 3rd Qu.:396.21 3rd Qu.:16.96 3rd Qu.:25.00 3rd Qu.:1
Max. :22.00 Max. :396.90 Max. :37.97 Max. :50.00 Max. :1
Now, we take our first step towards building our linear model. First, we call the set.seed() function with the value 125. In R, set.seed() fixes the random number generator so that results from simulation and modelling are reproducible.
> set.seed(125)
The next important step is to divide our data into training data and test data. We set the split ratio to 75%, meaning that 75% of our data will be training data and the remaining 25% will be the test data.
> data_split = sample.split(data, SplitRatio = 0.75)
> train <- subset(data, data_split == TRUE)
> test <- subset(data, data_split == FALSE)
Now that our data has been split into training and test set, we implement our linear modeling model as follows:
> model <- lm(X1.1 ~ X0.00632 + X6.575 + X15.3 + X24, data = train) #DataFlair
Lastly, we display the summary of our model using the same summary() function that we used above.
> summary(model)
We obtain the following output –
Call:
lm(formula = X1.1 ~ X0.00632 + X6.575 + X15.3 + X24, data = train)

Residuals:
       Min         1Q     Median         3Q        Max
-1.673e-15 -4.040e-16 -1.980e-16 -3.800e-17  9.741e-14

Coefficients:
              Estimate Std. Error   t value Pr(>|t|)
(Intercept) 1.000e+00 4.088e-15 2.446e+14 <2e-16 ***
X0.00632 1.616e-18 3.641e-17 4.400e-02 0.965
X6.575 2.492e-16 5.350e-16 4.660e-01 0.642
X15.3 5.957e-17 1.428e-16 4.170e-01 0.677
X24 3.168e-17 4.587e-17 6.910e-01 0.490
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.12e-15 on 365 degrees of freedom
Multiple R-squared: 0.4998, Adjusted R-squared: 0.4944
F-statistic: 91.19 on 4 and 365 DF, p-value: < 2.2e-16
And that's it! You have implemented your first OLS regression model in R using the lm() function!
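Note that the test set created above is never actually used. As a sketch of the remaining step, evaluating the model on held-out data, here is the full workflow on the built-in mtcars dataset, using base R's sample() for the split so the example runs without the housing CSV (whose path above is machine-specific):

```r
set.seed(125)

# 75/25 train/test split using base R
idx <- sample(seq_len(nrow(mtcars)), size = floor(0.75 * nrow(mtcars)))
train_set <- mtcars[idx, ]
test_set  <- mtcars[-idx, ]

# Fit the OLS model on the training data only
m <- lm(mpg ~ wt + hp, data = train_set)

# Predict on the held-out test data
pred <- predict(m, newdata = test_set)

# Root mean squared prediction error on the test set
rmse <- sqrt(mean((test_set$mpg - pred)^2))
rmse
```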
Now is the right time to uncover Logistic Regression in R.
6. OLS Diagnostics in R
- Post-estimation diagnostics are key to data analysis.
- Furthermore, diagnostics give us the opportunity to show off some of R's graphs and to examine what could be driving our data:
- Outlier – an unusual observation.
- Leverage – the ability of an observation to change the slope of the regression line.
- Influence – the combined impact of strong leverage and outlier status.
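In base R, the standard diagnostic plots and the usual outlier, leverage, and influence measures are available for any lm fit. A minimal sketch (again on the built-in mtcars data):

```r
fit <- lm(mpg ~ wt, data = mtcars)

# The plot() method for lm produces four diagnostic plots:
# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(fit)

# Outliers: standardised residuals far from 0 (e.g. |r| > 2)
rstandard(fit)

# Leverage: hat values measure each point's pull on the fitted line
hatvalues(fit)

# Influence: Cook's distance combines leverage and residual size
cooks.distance(fit)
```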
Hence, we have seen how OLS regression in R works using ordinary least squares. We have also learned its usage as well as its commands. Moreover, we have studied the diagnostics in R, which help us inspect the fit graphically. You are now well on your way to mastering OLS regression in R.
If you have any suggestions or feedback, please comment below.