1 Introduction

Model selection is the process of choosing the most relevant features from a set of candidate variables. It is crucial because it yields a final model that is accurate and interpretable while remaining computationally efficient and less prone to overfitting. Stepwise regression algorithms iteratively add or remove features from the model according to a chosen criterion, such as a significance level (P-value) or an information criterion like AIC or BIC. The procedure continues until no further improvement can be made under that criterion, leaving a final model that contains the selected features and their coefficients.
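
To make the procedure concrete, here is a minimal sketch of AIC-guided stepwise selection using base R's step() function on the mtcars data; it only illustrates the general idea described above, not StepReg's interface, which is introduced below.

# Illustration only: stepwise selection guided by AIC with base R's step()
null_model <- lm(mpg ~ 1, data = mtcars)  # start from the intercept-only model
step(null_model,
     scope = ~ wt + qsec + am + hp,  # candidate predictors to add or drop
     direction = "both",             # allow both additions and removals
     k = 2)                          # k = 2 corresponds to AIC; k = log(n) gives BIC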

StepReg simplifies model selection tasks by providing a unified programming interface. It currently supports model building for five distinct response variable types (section 3.1), four model selection strategies (section 3.2), including the best subsets algorithm, and a variety of selection metrics (section 3.3). Moreover, StepReg detects and addresses multicollinearity issues when they arise (section 3.4). The output of StepReg includes multiple tables summarizing the final model and the variable selection procedure. Additionally, StepReg offers a plot function to visualize the selection steps (section 4). For demonstration, the vignettes include four use cases covering distinct regression scenarios (section 5). Non-programmers can access the tool through an interactive Shiny app, detailed in a dedicated section.

2 Quick demo

The following example selects an optimal linear regression model with the mtcars dataset.

library(StepReg)
data(mtcars)
formula <- mpg ~ .  # model mpg as a function of all other variables
# bidirectional selection by AIC, with qsec forced into every candidate model
res <- stepwise(formula = formula,
                data = mtcars,
                type = "linear",
                include = c("qsec"),
                strategy = "bidirection",
                metric = c("AIC"))

Breakdown of the parameters:

  • formula: specifies the dependent and independent variables
  • type: specifies the regression category; choose from “linear”, “logit”, “cox”, etc., depending on your data (a logistic sketch follows this list)
  • include: specifies the variables that must be in the final model
  • strategy: specifies the stepwise strategy; choose from “forward”, “backward”, “bidirection”, “subset”
  • metric: specifies the model fit evaluation metric; choose one or more from “AIC”, “AICc”, “BIC”, “SL”, etc.
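
The same interface extends to the other response types, strategies, and metrics. As a hedged sketch, assuming a binary response (here the vs column of mtcars) and default values for the remaining arguments, a forward selection by BIC for a logistic model could look like:

res_logit <- stepwise(formula = vs ~ .,
                      data = mtcars,
                      type = "logit",        # binary response
                      strategy = "forward",  # only add variables, never remove
                      metric = c("BIC"))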

The output contains the final model, which can be viewed using:

res
$bidirection
$bidirection$AIC

Call:
lm(formula = mpg ~ 1 + qsec + wt + am, data = data, weights = NULL)

Coefficients:
(Intercept)         qsec           wt           am  
      9.618        1.226       -3.917        2.936  

You can further explore the results with generic functions such as summary(), coef(), and others. For example:

summary(res$bidirection$AIC)

Call:
lm(formula = mpg ~ 1 + qsec + wt + am, data = data, weights = NULL)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4811 -1.5555 -0.7257  1.4110  4.6610 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.6178     6.9596   1.382 0.177915    
qsec          1.2259     0.2887   4.247 0.000216 ***
wt           -3.9165     0.7112  -5.507 6.95e-06 ***
am            2.9358     1.4109   2.081 0.046716 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.459 on 28 degrees of freedom
Multiple R-squared:  0.8497,    Adjusted R-squared:  0.8336 
F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11
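
Likewise, coef() extracts the estimated coefficients of the selected model (the same values printed above):

coef(res$bidirection$AIC)  # named vector of estimates for (Intercept), qsec, wt, and am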

You can also visualize the variable selection procedures with:

plot(res, strategy = "bidirection", process = "overview")

plot(res, strategy = "bidirection", process = "details")