Mastering Multiple Linear Regression In R
Linear Regression in R — Part 2: Multiple Regression and Forward Selection
In the previous article I talked about linear regression in R: how to perform linear regression, how to plot the results and how to asses the quality of the fit. I will leave the link here if you are interested.
In this article I would like to expand the model and introduce some new concepts, which will allow us to make sharper predictions and inferences about the data.
Multiple Linear Regression
Quick theoretical recap
Simple linear regression is a useful approach for predicting a response on the basis of a single predictor variable. However, in practice we often have more than one predictor, this is why we will use extend the simple linear regression model so that it can directly accommodate multiple predictors. We can do this by giving each predictor a separate slope coefficient in a single model. If we have p distinct predictors. Then the multiple inear regression model takes the form
Y = β0 + β1X1 + β2X2 + ·· · + βpXp + ϵ
where Xj represents the jth predictor and βj quantifies the association between that variable and Y. We interpret βj as the average effect on Y of a one unit increase in Xj, holding all other predictors fixed.
Viewing and understanding the data frame
We will need again the ISLR2 library, but this time we are going to use the Carseats dataset, which is a simulated data set containing sales of child car seats at 400 different stores.
install.packages("ISLR2")
library(ISLR2)
#Display Carseats dataframe
Carseats
head(Carseats)
dim(Carseats)
#Get the description and relevant information
?Carseats
Using the lm() function to perform multiple linear regression
First of all we try to perform linear regression using only one variable. We are interested in predicting Sales using Price, the price that the company charges for car seats at each site, as the predictor variable.
lm.fit <- lm(Sales ~ Price, data = Carseats)
summary(lm.fit)
#Output:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.641915 0.632812 21.558 <2e-16 ***
Price -0.053073 0.005354 -9.912 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.532 on 398 degrees of freedom
Multiple R-squared: 0.198, Adjusted R-squared: 0.196
F-statistic: 98.25 on 1 and 398 DF, p-value: < 2.2e-16
We see that the effect of a unit increase in Price on Sales is significantly different than zero, but this model still has a low R-squared, meaning that it isn’t able to fully capture the variation of Sales based solely on Price.
We then try to add one predictor, Advertising, the local advertising budget for company at each location. This is now a multiple linear regression model.
lm2.fit <- lm(Sales ~ Price + Advertising, data = Carseats)
summary(lm2.fit)
Output:
> summary(lm2.fit)
Call:
lm(formula = Sales ~ Price + Advertising, data = Carseats)
Residuals:
Min 1Q Median 3Q Max
-7.9011 -1.5470 -0.0223 1.5361 6.3748
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.003427 0.606850 21.428 < 2e-16 ***
Price -0.054613 0.005078 -10.755 < 2e-16 ***
Advertising 0.123107 0.018079 6.809 3.64e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.399 on 397 degrees of freedom
Multiple R-squared: 0.2819, Adjusted R-squared: 0.2782
F-statistic: 77.91 on 2 and 397 DF, p-value: < 2.2e-16
We can now see that not only the coefficient of Advertising is different from zero at the 1% significance level, but also that the R-squared of the model has increased to 28%, that’s a 50% increase !
Plotting the results
We can even plot the results, since this model has only two variables, but in general it isn’t as helpful in understanding the results as it may be with the simple regression model or requires a very messy chunk of code. I will still leave you a simple code snippet to use if you are interested in trying to visualize the results. You will need the scatterplot3d
library.
library(scatterplot3d)
s3d <- scatterplot3d(Carseats$Price, Carseats$Advertising, Carseats$Sales,
main = "Regression Plane", angle = 125, pch=16)
s3d$plane3d(lm2.fit, draw_polygon = TRUE)
In a three-dimensional setting, with two predictors and one response, the least squares regression line becomes a plane.
Deciding on the important variables
How can we decide which variables to include in the model ?
The most direct approach is called all subsets or best subsets regression: we compute the least squares fit for all possible subsets and then choose between them based on some criterion that balances training error with model size. However we often can not examine all possible models, since there are 2^p of them. In our case there would be 2¹⁰ = 1024 possible models.
Forward Selection
One way to decide is to perform forward selection. These are the steps:
- Begin with the null model — a model that contains an intercept but no predictors.
- Fit p simple linear regression models by adding one predictor at a time to the null model and selecting the one with the lowest residual sum of squares (RSS).
- Add the predictor that results in the lowest RSS to the null model.
- Repeat step 2 and 3 until a stopping rule is satisfied.
To do this in R you can create a linear regression model, called fit0
, that has all the coefficients equal to zero. Create another multiple regression model, fit2
, which contains all variables.
fit0 <- lm(Sales ~ 1, data=Carseats)
fit2 <- lm(Sales ~ ., data = Carseats)
Then use the function step(object, direction, scope, k)
to perform the forward selection. The function doesn’t exactly choose the lowest RSS but
uses a criterion that is basically equivalent.
step(fit0, direction="forward",
scope=list(lower = fit0, upper = fit2), k=0)
object
is used as the initial model in the stepwise searchscope
defines the range of models examined in the stepwise searchdirection
is the mode of stepwise searchk
is the multiple of the number of degrees of freedom used for the penalty
If you run this code you will get a very long output, with many numbers. You have to look at the AIC (Akaike information criterion). I won’t go in more detail for now on what that means, but the basic idea is that the lower the AIC the better.
Start: AIC=829.55
Sales ~ 1
Df Sum of Sq RSS AIC
+ ShelveLoc 2 1009.53 2172.7 676.91
+ Price 1 630.03 2552.2 741.31
+ Advertising 1 231.14 2951.1 799.39
+ Age 1 171.01 3011.3 807.46
+ US 1 99.80 3082.5 816.81
+ Income 1 73.48 3108.8 820.21
+ CompPrice 1 13.07 3169.2 827.91
+ Education 1 8.59 3173.7 828.47
+ Population 1 8.11 3174.2 828.53
+ Urban 1 0.76 3181.5 829.46
<none> 3182.3 829.55
Step: AIC=676.91
Sales ~ ShelveLoc
Df Sum of Sq RSS AIC
+ Price 1 717.10 1455.6 516.69
+ Advertising 1 178.35 1994.4 642.65
+ Age 1 175.00 1997.8 643.32
+ Income 1 99.93 2072.8 658.08
+ US 1 66.54 2106.2 664.47
+ Population 1 10.71 2162.0 674.94
+ CompPrice 1 6.00 2166.7 675.81
+ Education 1 4.19 2168.6 676.14
+ Urban 1 1.70 2171.1 676.60
<none> 2172.7 676.91
I included here the output of the first two steps of the forward selection process. You can see that R starts with the regression model Sales ~ 1
which has an AIC of 829.55, then notices that by adding ShelveLoc to the model it would reduce the AIC to 676.91. The new model is Sales ~ ShelveLoc
. After that R also sees that it could add another variable to the model in order to further decrease the AIC. Can you guess how the model will look in the next step ? Look at the variable that with the lowest AIC associated !
Solution: Sales ~ ShelveLoc + Price
And so on until the lowest possible AIC obtainable by only adding variables is reached.
Conclusion
In conclusion, multiple linear regression is a powerful tool to predict a response based on several predictor variables. In this article, we introduced the concept of multiple linear regression and used the Carseats dataset to demonstrate how to perform multiple linear regression in R using the lm()
function. We started by using only one variable to predict Sales and then added Advertising as a predictor variable, which increased the R-squared of the model by 50%. Although it can be challenging to plot the results of a multiple regression model, we introduced a code snippet using the scatterplot3d library to visualize the results. Then we talked about forward regression and how to choose the variables to include in the model. Understanding and mastering multiple linear regression is an essential skill for data analysts, data scientists and statisticians who wish to make more accurate predictions and inferences about their data.