This document demonstrates some basic uses of recipes. First, some definitions are required:

- **variables** are the original (raw) data columns in a data frame or tibble. For example, in a traditional formula `Y ~ A + B + A:B`, the variables are `A`, `B`, and `Y`.
- **roles** define how variables will be used in the model. Examples are: `predictor` (independent variables), `response`, and `case weight`. This is meant to be open-ended and extensible.
- **terms** are columns in a design matrix such as `A`, `B`, and `A:B`. These can be other derived entities that are grouped, such as a set of principal components or a set of columns, that define a basis function for a variable. These are synonymous with features in machine learning. Variables that have `predictor` roles would automatically be main effect terms.
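The difference between variables and terms can be seen with base R's `model.matrix()`, which is not part of recipes and is used here only as an illustration. In this minimal sketch, the data frame `toy` and its values are invented for the example:

```r
# Toy data, invented for illustration: three raw variables Y, A, and B
toy <- data.frame(
  Y = c(1.2, 3.4, 2.2, 5.1),
  A = c(1, 2, 3, 4),
  B = c(0.5, 0.1, 0.9, 0.3)
)

# The variables are the raw columns of the data frame
variables <- names(toy)

# The terms are the columns of the design matrix built from the formula
design <- model.matrix(Y ~ A + B + A:B, data = toy)
terms_in_model <- colnames(design)

variables
terms_in_model
```

The design matrix gains an intercept column and an `A:B` column that do not exist as raw variables.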

The `modeldata` package contains a data set used to predict whether a person
will pay back a bank loan. It has 13 predictor columns and a factor
variable `Status` (the outcome). We will first separate the
data into a training and test set:

```
library(recipes)
library(rsample)
library(modeldata)
data("credit_data")
set.seed(55)
train_test_split <- initial_split(credit_data)

credit_train <- training(train_test_split)
credit_test <- testing(train_test_split)
```

Note that there are some missing values in these data:

```
vapply(credit_train, function(x) mean(!is.na(x)), numeric(1))
#> Status Seniority Home Time Age Marital Records Job
#> 1.000 1.000 0.998 1.000 1.000 1.000 1.000 0.999
#> Expenses Income Assets Debt Amount Price
#> 1.000 0.910 0.989 0.996 1.000 1.000
```

Rather than remove these, their values will be imputed.

The idea is that the preprocessing operations will all be created using the training set and then these steps will be applied to both the training and test set.

First, we will create a recipe object from the original data and then specify the processing steps.

Recipes can be created manually by sequentially adding roles to variables in a data set.
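As a rough sketch of the manual approach, roles can be assigned with `update_role()` on a recipe that was created without a formula. The data frame `loans` and the object names below are hypothetical and stand in for the real data:

```r
library(recipes)

# Toy data, invented for illustration
loans <- data.frame(
  Status = factor(c("good", "bad", "good", "bad")),
  Income = c(100, 80, 120, 60),
  Age    = c(30, 45, 52, 23)
)

# Start with no roles, then assign them one group at a time
manual_rec <- recipe(loans) %>%
  update_role(Income, Age, new_role = "predictor") %>%
  update_role(Status, new_role = "outcome")

# summary() lists each variable together with its current role
role_table <- summary(manual_rec)
role_table
```

This route is mostly useful when a variable needs a role other than outcome or predictor.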

If the analysis only requires **outcomes** and
**predictors**, the easiest way to create the initial
recipe is to use the standard formula method:

```
rec_obj <- recipe(Status ~ ., data = credit_train)
rec_obj
#> Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 13
```

The data contained in the `data` argument need not be the
training set; this data is only used to catalog the names of the
variables and their types (e.g. numeric).

(Note that the formula method is used here to declare the variables,
their roles and nothing else. If you use inline functions
(e.g. `log`) it will complain. These types of operations can
be added later.)
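One way to add such an operation later is a dedicated step. The sketch below uses `step_log()` on the built-in `mtcars` data rather than the credit data, and relies on `prep()` and `bake()`, which are described further down; the object names are made up for the example:

```r
library(recipes)

# Declare variables and roles only; no inline functions in the formula
log_rec <- recipe(mpg ~ disp + hp, data = mtcars) %>%
  step_log(disp)  # the log transformation is added as a step instead

log_prepped <- prep(log_rec, training = mtcars)
log_baked <- bake(log_prepped, new_data = mtcars)

head(log_baked)
```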

From here, preprocessing steps for some step *X* can be added
sequentially in one of two ways:

```
rec_obj <- step_{X}(rec_obj, arguments)    ## or
rec_obj <- rec_obj %>% step_{X}(arguments)
```

`step_dummy` and the other functions will always return
updated recipes.
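To make the two calling styles concrete, here is a small sketch with `step_center()` on the built-in `mtcars` data (the data set and object names are chosen only for illustration). Both forms should give the same processed data:

```r
library(recipes)

base_rec <- recipe(mpg ~ disp + hp, data = mtcars)

# Form 1: pass the recipe as the first argument
rec_call <- step_center(base_rec, all_numeric_predictors())

# Form 2: pipe the recipe into the step
rec_pipe <- base_rec %>% step_center(all_numeric_predictors())

# Prepare and apply both recipes to compare their output
baked_call <- bake(prep(rec_call, training = mtcars), new_data = mtcars)
baked_pipe <- bake(prep(rec_pipe, training = mtcars), new_data = mtcars)

all.equal(as.data.frame(baked_call), as.data.frame(baked_pipe))
```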

One other important facet of the code is the method for specifying
which variables should be used in different steps. The manual page
`?selections` has more details but `dplyr`-like
selector functions can be used:

- use basic variable names (e.g. `x1, x2`),
- `dplyr` functions for selecting variables: `contains()`, `ends_with()`, `everything()`, `matches()`, `num_range()`, and `starts_with()`,
- functions that subset on the role of the variables that have been specified so far: `all_outcomes()`, `all_predictors()`, `has_role()`,
- similar functions for the type of data: `all_nominal()`, `all_numeric()`, and `has_type()`, or
- compound selectors such as `all_nominal_predictors()` or `all_numeric_predictors()`.

Note that the methods listed above are the only ones that can be used to select variables inside the steps. Also, minus signs can be used to deselect variables.
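A short sketch of combining a selector with a minus sign, using the built-in `iris` data instead of the credit data (the object names are invented for the example). The numeric predictors are centered, except those whose names start with `Sepal`:

```r
library(recipes)

# Center the numeric predictors, but deselect the Sepal columns
sel_rec <- recipe(Species ~ ., data = iris) %>%
  step_center(all_numeric_predictors(), -starts_with("Sepal"))

sel_baked <- bake(prep(sel_rec, training = iris), new_data = iris)

# The Petal column now has mean zero; the Sepal column is unchanged
colMeans(sel_baked[, c("Sepal.Length", "Petal.Length")])
```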

For our data, we can add an operation to impute the predictors. There
are many ways to do this and `recipes` includes a few steps
for this purpose:

```
grep("impute_", ls("package:recipes"), value = TRUE)
#> [1] "step_impute_bag" "step_impute_knn" "step_impute_linear"
#> [4] "step_impute_lower" "step_impute_mean" "step_impute_median"
#> [7] "step_impute_mode" "step_impute_roll"
```

Here, *K*-nearest neighbor imputation will be used. This works
for both numeric and non-numeric predictors and defaults *K* to
five. To do this, the step selects all predictors:

```
imputed <- rec_obj %>%
  step_impute_knn(all_predictors())
imputed
#> Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 13
#>
#> Operations:
#>
#> K-nearest neighbor imputation for all_predictors()
```
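As a self-contained illustration of the imputation step, the following sketch uses the built-in `airquality` data, which has missing values in `Solar.R`, in place of the credit data. The `neighbors` argument is written out explicitly, and the object names are hypothetical:

```r
library(recipes)

# airquality ships with R and has missing values in Ozone and Solar.R
knn_rec <- recipe(Ozone ~ ., data = airquality) %>%
  step_impute_knn(all_predictors(), neighbors = 5)

knn_baked <- bake(prep(knn_rec, training = airquality), new_data = airquality)

# Count the missing values in one predictor before and after imputation
missing_before <- sum(is.na(airquality$Solar.R))
missing_after  <- sum(is.na(knn_baked$Solar.R))

c(before = missing_before, after = missing_after)
```

Only the predictors are imputed here; the outcome `Ozone` is not selected by `all_predictors()` and keeps its missing values.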

It is important to realize that the *specific* variables have
not been declared yet (as shown when the recipe is printed above). In
some preprocessing steps, variables will be added or removed from the
current list of possible variables.

Since some predictors are categorical in nature (i.e. nominal), it
would make sense to convert these factor predictors into numeric dummy
variables (aka indicator variables) using `step_dummy()`. To
do this, the step selects all non-numeric predictors:

```
ind_vars <- imputed %>%
  step_dummy(all_nominal_predictors())
ind_vars
#> Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 13
#>
#> Operations:
#>
#> K-nearest neighbor imputation for all_predictors()
#> Dummy variables from all_nominal_predictors()
```
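To see what the dummy variable step produces, here is a minimal sketch on a toy data frame (`paint` and its columns are invented for the example). A factor with three levels is replaced by two indicator columns, with the first level acting as the reference:

```r
library(recipes)

# Toy data, invented for illustration
paint <- data.frame(
  y     = c(1, 2, 3, 4),
  color = factor(c("red", "green", "blue", "red"))
)

dummy_rec <- recipe(y ~ color, data = paint) %>%
  step_dummy(all_nominal_predictors())

dummy_baked <- bake(prep(dummy_rec, training = paint), new_data = paint)

# The original factor column is replaced by numeric indicator columns
names(dummy_baked)
```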

At this point in the recipe, all of the predictors should be encoded as numeric, so we can add further steps to center and scale them:

```
standardized <- ind_vars %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors())
standardized
#> Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 13
#>
#> Operations:
#>
#> K-nearest neighbor imputation for all_predictors()
#> Dummy variables from all_nominal_predictors()
#> Centering for all_numeric_predictors()
#> Scaling for all_numeric_predictors()
```
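The effect of centering followed by scaling can be checked on a small example. This sketch uses the built-in `mtcars` data and also compares the result with `step_normalize()`, which appears in the package's list of steps; the object names are made up for the illustration:

```r
library(recipes)

# Centering and scaling as two separate steps
two_step <- recipe(mpg ~ disp + hp, data = mtcars) %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors())

# A single step that does both
one_step <- recipe(mpg ~ disp + hp, data = mtcars) %>%
  step_normalize(all_numeric_predictors())

scaled_two <- bake(prep(two_step, training = mtcars), new_data = mtcars)
scaled_one <- bake(prep(one_step, training = mtcars), new_data = mtcars)

# Each predictor should now have mean 0 and standard deviation 1
c(mean = mean(scaled_two$disp), sd = sd(scaled_two$disp))
```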

If these are the only preprocessing steps for the predictors, we can
now estimate the means and standard deviations from the training set.
The `prep` function is used with a recipe and a data set:

```
trained_rec <- prep(standardized, training = credit_train)
trained_rec
#> Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 13
#>
#> Training data contained 3340 data points and 322 incomplete rows.
#>
#> Operations:
#>
#> K-nearest neighbor imputation for Seniority, Home, Time, Age, Marital, Records, ... [trained]
#> Dummy variables from Home, Marital, Records, Job [trained]
#> Centering for Seniority, Time, Age, Expenses, Income, Assets,... [trained]
#> Scaling for Seniority, Time, Age, Expenses, Income, Assets,... [trained]
```

Note that the real variables are listed (e.g. `Home` etc.)
instead of the selectors (`all_numeric_predictors()`).
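The estimated statistics themselves can also be inspected. The sketch below uses the `tidy()` method for recipes, which is not discussed in this document, on a small `mtcars` example with hypothetical object names:

```r
library(recipes)

center_rec <- recipe(mpg ~ disp + hp, data = mtcars) %>%
  step_center(all_numeric_predictors())

center_prepped <- prep(center_rec, training = mtcars)

# tidy() on a trained recipe returns the estimates for one step;
# number = 1 picks the first (and only) step here
centers <- tidy(center_prepped, number = 1)
centers
```

The `terms` column holds the resolved variable names and the `value` column holds the training set means.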

Now that the statistics have been estimated, the preprocessing can be
*applied* to the training and test set:

```
train_data <- bake(trained_rec, new_data = credit_train)
test_data <- bake(trained_rec, new_data = credit_test)
```
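The following sketch illustrates that the statistics come from the training set only. It uses a fixed, non-random split of the built-in `mtcars` data (chosen purely for illustration, with invented object names) and checks that the test set is shifted by the training mean:

```r
library(recipes)

# A fixed, non-random split purely for illustration
toy_train <- mtcars[1:20, ]
toy_test  <- mtcars[21:32, ]

split_rec <- recipe(mpg ~ disp + hp, data = toy_train) %>%
  step_center(all_numeric_predictors())

# Means are estimated on the training rows only
split_prepped <- prep(split_rec, training = toy_train)
baked_test <- bake(split_prepped, new_data = toy_test)

# The test set is shifted by the training mean, not by its own mean
expected_disp <- toy_test$disp - mean(toy_train$disp)
all.equal(baked_test$disp, expected_disp)
```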

`bake` returns a tibble that, by default, includes all of
the variables:

```
class(test_data)
#> [1] "tbl_df" "tbl" "data.frame"
test_data
#> # A tibble: 1,114 × 23
#> Seniority Time Age Expen…¹ Income Assets Debt Amount Price Status
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 1.09 0.924 1.88 -0.385 -0.131 -0.488 -0.295 -0.0817 0.297 good
#> 2 -0.977 0.924 -0.459 1.77 -0.437 0.845 -0.295 0.333 0.760 good
#> 3 -0.977 0.103 0.349 1.77 -0.783 -0.488 -0.295 0.333 0.00254 bad
#> 4 -0.247 0.103 -0.280 0.231 -0.207 -0.133 -0.295 0.229 0.171 good
#> 5 -0.125 -0.718 -0.729 0.231 -0.258 -0.222 -0.295 -0.807 -0.854 good
#> 6 -0.855 0.924 -0.549 -1.05 -0.0539 -0.488 -0.295 0.436 -0.331 good
#> 7 2.31 0.924 0.349 0.949 -0.0155 -0.488 -0.295 -0.185 0.0475 good
#> 8 0.848 -0.718 0.529 1.00 1.40 -0.133 -0.295 1.58 1.69 good
#> 9 -0.977 -0.718 -1.27 -0.538 -0.246 -0.266 -0.295 -1.32 -1.65 bad
#> 10 -0.855 0.514 -0.100 0.744 -0.540 -0.488 -0.295 -0.185 -0.800 bad
#> # … with 1,104 more rows, 13 more variables: Home_X1 <dbl>, Home_X2 <dbl>,
#> # Home_X3 <dbl>, Home_X4 <dbl>, Home_X5 <dbl>, Marital_X1 <dbl>,
#> # Marital_X2 <dbl>, Marital_X3 <dbl>, Marital_X4 <dbl>, Records_X1 <dbl>,
#> # Job_X1 <dbl>, Job_X2 <dbl>, Job_X3 <dbl>, and abbreviated variable name
#> # ¹Expenses
vapply(test_data, function(x) mean(!is.na(x)), numeric(1))
#> Seniority Time Age Expenses Income Assets Debt
#> 1 1 1 1 1 1 1
#> Amount Price Status Home_X1 Home_X2 Home_X3 Home_X4
#> 1 1 1 1 1 1 1
#> Home_X5 Marital_X1 Marital_X2 Marital_X3 Marital_X4 Records_X1 Job_X1
#> 1 1 1 1 1 1 1
#> Job_X2 Job_X3
#> 1 1
```

Selectors can also be used. For example, if only the predictors are
needed, you can use `bake(object, new_data, all_predictors())`.
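A brief sketch of the difference, again on the built-in `mtcars` data with object names made up for the example:

```r
library(recipes)

pred_rec <- recipe(mpg ~ disp + hp, data = mtcars) %>%
  step_center(all_numeric_predictors())
pred_prepped <- prep(pred_rec, training = mtcars)

# Default: every variable in the recipe is returned
all_columns <- bake(pred_prepped, new_data = mtcars)

# With a selector: only the predictors are returned
predictors_only <- bake(pred_prepped, new_data = mtcars, all_predictors())

names(all_columns)
names(predictors_only)
```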

There are a number of other steps included in the package:

```
#> [1] "step_BoxCox" "step_YeoJohnson"
#> [3] "step_arrange" "step_bagimpute"
#> [5] "step_bin2factor" "step_bs"
#> [7] "step_center" "step_classdist"
#> [9] "step_corr" "step_count"
#> [11] "step_cut" "step_date"
#> [13] "step_depth" "step_discretize"
#> [15] "step_dummy" "step_dummy_extract"
#> [17] "step_dummy_multi_choice" "step_factor2string"
#> [19] "step_filter" "step_filter_missing"
#> [21] "step_geodist" "step_harmonic"
#> [23] "step_holiday" "step_hyperbolic"
#> [25] "step_ica" "step_impute_bag"
#> [27] "step_impute_knn" "step_impute_linear"
#> [29] "step_impute_lower" "step_impute_mean"
#> [31] "step_impute_median" "step_impute_mode"
#> [33] "step_impute_roll" "step_indicate_na"
#> [35] "step_integer" "step_interact"
#> [37] "step_intercept" "step_inverse"
#> [39] "step_invlogit" "step_isomap"
#> [41] "step_knnimpute" "step_kpca"
#> [43] "step_kpca_poly" "step_kpca_rbf"
#> [45] "step_lag" "step_lincomb"
#> [47] "step_log" "step_logit"
#> [49] "step_lowerimpute" "step_meanimpute"
#> [51] "step_medianimpute" "step_modeimpute"
#> [53] "step_mutate" "step_mutate_at"
#> [55] "step_naomit" "step_nnmf"
#> [57] "step_nnmf_sparse" "step_normalize"
#> [59] "step_novel" "step_ns"
#> [61] "step_num2factor" "step_nzv"
#> [63] "step_ordinalscore" "step_other"
#> [65] "step_pca" "step_percentile"
#> [67] "step_pls" "step_poly"
#> [69] "step_poly_bernstein" "step_profile"
#> [71] "step_range" "step_ratio"
#> [73] "step_regex" "step_relevel"
#> [75] "step_relu" "step_rename"
#> [77] "step_rename_at" "step_rm"
#> [79] "step_rollimpute" "step_sample"
#> [81] "step_scale" "step_select"
#> [83] "step_shuffle" "step_slice"
#> [85] "step_spatialsign" "step_spline_b"
#> [87] "step_spline_convex" "step_spline_monotone"
#> [89] "step_spline_natural" "step_spline_nonnegative"
#> [91] "step_sqrt" "step_string2factor"
#> [93] "step_time" "step_unknown"
#> [95] "step_unorder" "step_window"
#> [97] "step_zv"
```

Another type of operation that can be added to a recipe is a
*check*. Checks conduct some sort of data validation and, if no
issue is found, return the data as-is; otherwise, an error is
thrown.

For example, `check_missing` will fail if any of the
variables selected for validation have missing values. This check is
done when the recipe is prepared as well as when any data are baked.
Checks are added in the same way as steps:

```
trained_rec <- trained_rec %>%
  check_missing(contains("Marital"))
```
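To show a check passing and failing, here is a minimal sketch with two toy data frames (`complete` and `with_na`, both invented for the example). The error is caught with `tryCatch()` so that the script keeps running:

```r
library(recipes)

# Toy data, invented for illustration
complete <- data.frame(y = c(1, 2, 3), x = c(10, 20, 30))
with_na  <- data.frame(y = c(4, 5),    x = c(40, NA))

check_rec <- recipe(y ~ x, data = complete) %>%
  check_missing(x)

# No missing values in the training data, so prep() succeeds
checked <- prep(check_rec, training = complete)
passed <- bake(checked, new_data = complete)

# Baking data with a missing value in x should trigger the check
result <- tryCatch(
  {
    bake(checked, new_data = with_na)
    "passed"
  },
  error = function(e) "failed"
)
result
```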

Currently, `recipes` includes:

```
#> [1] "check_class" "check_cols" "check_missing" "check_name"
#> [5] "check_new_data" "check_new_values" "check_range" "check_type"
```