Mastering Linear Regression in R

Joe Godot
5 min read · Feb 28, 2023


The R programming language is a popular choice among data analysts and statisticians for its ability to manipulate and analyze data. In this article, we will explore the functionality of R by examining the Boston dataset, which contains information about various suburbs in Boston.

Viewing and understanding the data frame

First, we install and load the ISLR2 library and display the Boston data frame to explore its structure and contents.

install.packages("ISLR2")
library(ISLR2)

# Display data frame Boston
Boston
head(Boston)
class(Boston)
str(Boston)
dim(Boston) #506 observations -> 506 suburbs of Boston

#Get the description and information about the Boston data frame
?Boston

The data frame contains 506 observations, one per Boston suburb, and each observation has 13 variables. These include the per capita crime rate, the average number of rooms per dwelling, the median value of owner-occupied homes in $1000s, ….
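As a small aside (not part of the original walkthrough), you can also list all variable names and summarise the variable we are going to model, medv:

# List the variable names and summarise the response variable
names(Boston)
summary(Boston$medv)  # median home value, in $1000s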

Using the lm() function to perform linear regression

We then access the help page of the lm function, which is the function we will use for linear regression analysis.

#Access help page of lm function
?lm

The function takes the model formula as its first argument and the data to use as its second. We use lm() to fit a linear regression of the median value of owner-occupied homes (medv) on the percentage of the population classified as lower status (lstat).

lm(medv ~ lstat, data = Boston)
lm.fit <- lm(medv ~ lstat, data = Boston)
lm.fit

We should get this output:

Call:
lm(formula = medv ~ lstat, data = Boston)

Coefficients:
(Intercept)        lstat  
      34.55        -0.95  

We have computed the intercept and the slope of the regression line. It is also possible to compute the 95% confidence interval for the regression coefficients using the confint function.

> confint(lm.fit)
                2.5 %     97.5 %
(Intercept) 33.448457 35.6592247
lstat       -1.026148 -0.8739505
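If you prefer to work with these numbers programmatically rather than read them off the console, coef() returns the point estimates and confint() accepts a level argument; this is a small aside, not part of the original walkthrough:

coef(lm.fit)                   # named vector with (Intercept) and lstat
confint(lm.fit, level = 0.99)  # 99% confidence interval instead of the default 95%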

Plotting the results

Now we can plot the data and the regression line using the functions plot for the former and abline for the latter.

plot(medv ~ lstat, data = Boston, pch=16)
abline(lm.fit, lwd=2, col='red')

Assessing the quality of the regression

We assess the quality of the fit using summary statistics and residual plots.

> summary(lm.fit)

Call:
lm(formula = medv ~ lstat, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.168   -3.990   -1.318    2.034   24.500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.55384    0.56263   61.41   <2e-16 ***
lstat       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16

From this output we notice that both coefficients have very low p-values, i.e. Pr(>|t|), and are therefore significantly different from zero (we reject the null hypothesis that each coefficient is zero). The R-squared is 0.5441, meaning that the predictor lstat explains about 54% of the variance in the response medv.
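To make the R-squared figure concrete, here is a short sketch (not in the original code) that recomputes it as one minus the ratio of the residual sum of squares to the total sum of squares:

# R-squared = 1 - RSS/TSS
rss <- sum(residuals(lm.fit)^2)
tss <- sum((Boston$medv - mean(Boston$medv))^2)
1 - rss / tss  # should match the Multiple R-squared reported by summary()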

We then plot the residuals against the fitted values and notice a clear pattern: positive residuals for low and high values of lstat and negative residuals for intermediate values.

plot(residuals(lm.fit) ~ fitted(lm.fit))
abline(h=0, lty = 2, col='blue')
#with h=0 we get a horizontal line at y = 0 and
#with lty=2 we get a dashed line (the default, lty=1, is a solid line)
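As an alternative to building this plot by hand, R's plot() method for lm objects produces a standard set of diagnostic plots (residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage). This is just a convenience, not something the analysis above depends on:

par(mfrow = c(2, 2))  # 2x2 grid so the four diagnostic plots fit on one device
plot(lm.fit)
par(mfrow = c(1, 1))  # reset the layout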

This suggests that a linear model may not be the best fit for the data; a more flexible model, such as a quadratic or higher-order polynomial, may be needed to capture the non-linear relationship between lstat and medv. A linear model assumes a constant slope between the predictor and the response, which does not hold when the true relationship is curved. As a result, the linear model under-predicts medv for low and high lstat values and over-predicts it for intermediate values, producing positive residuals for the former and negative residuals for the latter.
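As a minimal sketch of that idea (the article itself stops at the linear fit), a quadratic term can be added with I(lstat^2) and the two models compared with anova():

# Fit medv as a quadratic function of lstat and test it against the linear fit
lm.fit2 <- lm(medv ~ lstat + I(lstat^2), data = Boston)
summary(lm.fit2)
anova(lm.fit, lm.fit2)  # a small p-value indicates the quadratic term improves the fit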

Finally, we want to check whether there are observations with high leverage (a strong influence on the regression). To this end we use the hatvalues() function to compute the leverage of each observation and then look for the largest values.

hatvalues(lm.fit) 
plot(hatvalues(lm.fit))
max(hatvalues(lm.fit))

> which.max(hatvalues(lm.fit))
375
375

Observation 375 has the highest leverage. High leverage can be a problem because it means that some observations have a larger effect on the regression results than others. Leverage measures how far an observation's predictor values are from the mean of the predictors: observations far from that mean can have a large impact on the estimated coefficients. In extreme cases, a single high-leverage observation can pull the regression line towards itself, leading to unreliable coefficient estimates and less accurate predictions of the response.
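A common rule of thumb, added here as an aside, is to compare each leverage value with the average leverage (p + 1)/n, where p is the number of predictors and n the number of observations; values several times larger than that average deserve a closer look:

h <- hatvalues(lm.fit)
p <- length(coef(lm.fit)) - 1  # number of predictors (here 1)
n <- nrow(Boston)
which(h > 3 * (p + 1) / n)     # observations with unusually high leverage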

Predicting new or chosen values

If you want to predict a Y value for a specific X, and also obtain a confidence or prediction interval, you can use the predict() function. Remember to pass the new data as a data frame.

> predict(lm.fit, newdata = data.frame(lstat = c(5, 10, 15)),
          interval = 'prediction')  # use 'confidence' if you want a CI
       fit       lwr      upr
1 29.80359 17.565675 42.04151
2 25.05335 12.827626 37.27907
3 20.30310  8.077742 32.52846
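For comparison, and as a small aside, the same call with interval = 'confidence' returns the narrower interval for the mean value of medv at each lstat, rather than for an individual suburb:

predict(lm.fit, newdata = data.frame(lstat = c(5, 10, 15)),
        interval = 'confidence')  # CI for the mean response at each lstat value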

Conclusion

In conclusion, this article has provided an overview of linear regression and its implementation in R using the Boston data set from the ISLR2 package. We have covered various aspects of linear regression, including model fitting, assessing model quality, detecting outliers and influential observations, and the potential problem of high leverage.

As we have seen, linear regression can be a powerful tool for modeling relationships between variables and making predictions. However, it is important to carefully evaluate the quality and reliability of the regression model, especially when dealing with potential outliers or influential observations.

If you found this article helpful and are interested in learning more about polynomial regression and its implementation in R, please let me know in the comments section. I would be happy to provide a follow-up article on this topic. Thank you for reading!

For part 2, click this link

Written by Joe Godot

Statistics, machine learning and quantitative finance connoisseur, programming lover, and humanities aficionado.