Linear regression in R

infoart.ca
Feb 24, 2024

Linear regression is a fundamental statistical technique for modeling the relationship between a continuous dependent variable and one or more independent variables. In this blog post, we walk through the steps for conducting linear regression in R using the parameters and performance packages. With their help, we can easily fit and evaluate linear regression models and gain insight into the relationships between variables in the dataset.
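Before fitting anything, we need the packages loaded and the predictors prepared. The regression output below reports separate coefficients for cyl [6], cyl [8], carb [2], and so on, which means am, cyl, and carb are treated as factors rather than numeric columns. Here is a minimal setup sketch (using dplyr for the conversion; the install.packages() line is only needed once):

# install.packages(c("parameters", "performance", "pubh", "ggplot2", "dplyr"))
library(parameters)
library(performance)
library(dplyr)

# Convert the predictors to factors so each level gets its own coefficient,
# matching the model output shown below
mtcars <- mtcars %>%
  mutate(across(c(am, cyl, carb), factor))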

Now let’s fit a regression model using the mtcars data and store it for further inspection:

model <- lm(mpg ~ am + cyl + carb, data = mtcars)

> model %>% parameters()
Parameter | Coefficient | SE | 95% CI | t(23) | p
-------------------------------------------------------------------
(Intercept) | 23.68 | 1.62 | [ 20.34, 27.02] | 14.66 | < .001
am [1] | 4.27 | 1.44 | [ 1.30, 7.25] | 2.97 | 0.007
cyl [6] | -2.73 | 1.99 | [ -6.84, 1.38] | -1.37 | 0.183
cyl [8] | -6.90 | 1.97 | [-10.98, -2.83] | -3.50 | 0.002
carb [2] | -0.23 | 1.54 | [ -3.42, 2.96] | -0.15 | 0.883
carb [3] | -0.48 | 2.44 | [ -5.54, 4.58] | -0.20 | 0.847
carb [4] | -3.94 | 1.85 | [ -7.76, -0.12] | -2.13 | 0.044
carb [6] | -5.53 | 3.59 | [-12.94, 1.89] | -1.54 | 0.137
carb [8] | -6.05 | 3.73 | [-13.78, 1.67] | -1.62 | 0.119

The results look nice and clean. We could beautify the table with the help of the pubh package:

library(gtsummary)  # provides tbl_regression()
library(pubh)

model %>%
  tbl_regression() %>%
  cosm_reg() %>%
  theme_pubh()

─────────────────────────────────────────────
Variable     Beta (95% CI)           p-value
─────────────────────────────────────────────
am                                   0.007
  0          —
  1          4.3 (1.3 to 7.2)
cyl                                  0.004
  4          —
  6          -2.7 (-6.8 to 1.4)
  8          -6.9 (-11 to -2.8)
carb                                 0.19
  1          —
  2          -0.23 (-3.4 to 3.0)
  3          -0.48 (-5.5 to 4.6)
  4          -3.9 (-7.8 to -0.12)
  6          -5.5 (-13 to 1.9)
  8          -6.1 (-14 to 1.7)
─────────────────────────────────────────────

Now, let’s interpret the model output:

The coefficient estimates represent the relationship between the predictor variables (am, cyl, and carb) and the outcome variable (mpg).

The intercept term is estimated to be 23.68, indicating the expected mpg when all predictors are at their reference levels (automatic transmission, 4 cylinders, and 1 carburetor).

For the predictor variable ‘am’ (which represents the type of transmission), the coefficient estimate is 4.27. This means that, holding the other variables constant, cars with a manual transmission (am = 1) have, on average, 4.27 higher mpg than cars with an automatic transmission (am = 0).

The ‘cyl’ variable represents the number of cylinders in the engine. The coefficient estimate for ‘cyl’ is -2.73 for cars with 6 cylinders and -6.90 for cars with 8 cylinders, relative to cars with 4 cylinders. These negative coefficients suggest that as the number of cylinders increases, mpg tends to decrease; that is, the relationship between the two variables is negative.

The ‘carb’ variable represents the number of carburetors. All levels of ‘carb’ have negative coefficient estimates relative to the reference level of one carburetor, but only the coefficient for carb = 4 is statistically significant at the 5% level (p = 0.044); the remaining levels show no clear association with mpg.
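To see how these coefficients combine for an individual car, here is a small illustrative sketch; the new_car data frame is a made-up example (manual transmission, 6 cylinders, 4 carburetors), not something from the original analysis:

# Hypothetical car: manual transmission (am = 1), 6 cylinders, 4 carburetors
new_car <- data.frame(am   = factor(1, levels = c(0, 1)),
                      cyl  = factor(6, levels = c(4, 6, 8)),
                      carb = factor(4, levels = c(1, 2, 3, 4, 6, 8)))

# Predicted mpg = intercept + am [1] + cyl [6] + carb [4]
#               ≈ 23.68 + 4.27 - 2.73 - 3.94 ≈ 21.28
predict(model, newdata = new_car)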

The standard errors (SE) quantify the uncertainty in the coefficient estimates, and the 95% confidence intervals give a range of plausible values for the true population coefficients.
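For reference, the same intervals can be reproduced with base R; a quick check against the am [1] row:

# 95% confidence intervals for all coefficients
confint(model)

# For am [1]: estimate ± t critical value (23 df) × SE
4.27 + c(-1, 1) * qt(0.975, df = 23) * 1.44   # ≈ [1.30, 7.25]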

Model performance

Next, we will examine model performance using the performance function from the performance package. These performance metrics provide insights into the quality of the linear regression model and its ability to explain the variability in the dependent variable. A lower AIC, AICc, and BIC, along with higher R-squared and adjusted R-squared values, indicate better model fit and predictive power. Similarly, a lower RMSE and sigma value indicate smaller errors between the observed and predicted values.

> model %>% performance()
# Indices of model performance

AIC | AICc | BIC | R2 | R2 (adj.) | RMSE | Sigma
---------------------------------------------------------------
161.392 | 163.700 | 168.721 | 0.811 | 0.791 | 2.577 | 2.755

The R-squared value of 0.811 indicates that approximately 81.1% of the variance in the dependent variable (mpg) is explained by the predictor variables (am, cyl, and carb) in the model. The adjusted R-squared value of 0.791 accounts for the number of predictors in the model. It provides a more conservative estimate of the model’s explanatory power, considering the degrees of freedom.
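Most of these indices can also be reproduced directly with base R, which makes it clearer what they measure (a rough sketch; the values should match the table above up to rounding):

# Residual and total sums of squares
ss_res <- sum(resid(model)^2)
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

1 - ss_res / ss_tot          # R2, share of variance explained (≈ 0.811)
sqrt(mean(resid(model)^2))   # RMSE, typical size of a prediction error (≈ 2.577)
sigma(model)                 # residual standard deviation, Sigma (≈ 2.755)
AIC(model); BIC(model)       # information criteria (≈ 161.39 and 168.72)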

In addition, let’s take advantage of the resid() function to extract and visually inspect the residuals. Residuals are calculated as the difference between the observed values and the values predicted by the regression model.

Residual = Observed - Predicted

In short, residual plots help evaluate how well the regression model fits the data. By examining the patterns and distribution of the residuals, we can determine if the model adequately captures the underlying relationships between the predictors and the outcome variable.

Residual plots also help check the linearity assumption by showing whether there are any systematic patterns in the data that the model has missed. If the assumption holds, the points should be scattered evenly above and below the horizontal line at 0. If the residuals exhibit a clear pattern (e.g., a curved or U-shaped trend), it suggests that the relationship between the predictors and the outcome may be non-linear.

 
library(ggplot2)

# Obtain the residuals from the fitted model
data <- data.frame(Residuals = resid(model))

# Plot the residuals in observation order, with a reference line at 0
ggplot(data, aes(x = seq_along(Residuals), y = Residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  xlab("Observation") +
  ylab("Residuals") +
  ggtitle("Residual Plot")


Center for Social Capital & Environmental Research | Posts by Bishwajit Ghose, lecturer at the University of Ottawa