prioritylasso Vignette

Simon Klau

2017-09-25

Introduction

prioritylasso was developed for situations in which different types of high dimensional omics-data in combination with clinical data are available. We want to include these so called blocks of data together in a prediction model. prioritylasso thereby tries to find a good compromise between prediction accuracy and practical aspects. This is done by letting the practitioner include prior knowlegde in the form of priorities, which define the importance of every block. The priorities can be chosen in several different ways.
The model is successively calculated as LASSO-based, while the Linear Predictor after each fit is taken as an Offset in the LASSO Regression of the block with the next lowest priority. The Penalized Regression models are computed via the R package glmnet. Moreover, the R package survival is used when the outcome consists of survival data. In addition to the main function, the package prioritylasso contains an extension of the standard function named cvm_prioritylasso. It is also explained under practical aspects in the following examples.

Getting started with prioritylasso

The data we will use for the examples was simulated and shall represent AML data with 4 blocks of variables. We first have to load the package which contains this data set.

library(prioritylasso)

To take a deeper look at the data, open the description through ?pl_data. The different types of data are all stored together. To get and define the block structure which implies the priorities, we look at the names of the variables and the matrix dimension.

dim(pl_data)
## [1]  400 1029
colnames(pl_data)[1:30]
##  [1] "b1.1"  "b1.2"  "b1.3"  "b1.4"  "b2.1"  "b2.2"  "b2.3"  "b2.4" 
##  [9] "b2.5"  "b3.1"  "b3.2"  "b3.3"  "b3.4"  "b3.5"  "b3.6"  "b3.7" 
## [17] "b3.8"  "b3.9"  "b3.10" "b3.11" "b3.12" "b3.13" "b3.14" "b3.15"
## [25] "b3.16" "b3.17" "b3.18" "b3.19" "b4.1"  "b4.2"
colnames(pl_data)[1025:1029]
## [1] "b4.997"  "b4.998"  "b4.999"  "b4.1000" "pl_out"

We see that the last column contains the outcome. We will re-save the outcome and the matrix of predictors seperately and define the argument blocks afterwards. We will need that for the application of prioritylasso where the indices have to correspond with the variables in the predictor matrix.

pl_out <- pl_data[,1029]
pl_pred <- pl_data[,1:1028]
blocks <- list(bp1=1:4, bp2=5:9, bp3=10:28, bp4=29:1028)

For an easier interpretation, we can consider the first block as clinical variables of most importance, the second block as other clinical data, and the third block as mutations. These blocks are all low-dimensional. That will later be an advantage regarding computation time. Another type of data is high dimensional gene expression data. Moreover, pl_out consists of a binary outcome for 400 subjects. Before we go further we want to split the data into training and validation set. The training set should be used to build the model with prioritylasso while the validation data should later be used to assess the prediction accuracy in an independent data set. Here we just do a random split such that 2/3 of the data belongs to the training set and 1/3 to the validation set.

set.seed(1234)
label <- sample(dim(pl_pred)[1],round(dim(pl_pred)[1]*(2/3)))
pl_train <- pl_pred[label,]
pl_val <- pl_pred[-label,]
pl_out_train <- pl_out[label]
pl_out_val <- pl_out[-label]

We can now run prioritylasso for the first time.

set.seed(1234)
pl1 <- prioritylasso(X = pl_train, Y = pl_out_train, family = "binomial", type.measure = "auc", 
                     blocks = blocks, standardize = FALSE)

Because we have a binary outcome, we specified family = "binomial" and type.measure = "auc". The priority structure was given through the block definition in the data set. An extended use of the block definition and the general functionality of prioritylasso will be given in the following sections.

prioritylasso - extended application

Specifying the priorities

If we want to specify another block order, we have to redefine the argument blocks. Let us assume that we do not want to include the variables b1 first in the model, but b3 first, b1 second and b2 third. That would be done by blocks = list(bp1 = 10:28, bp2 = 1:4, bp3 = 5:9, bp4 = 29:1028). So the position on the list indicates when the variables are considered in the model or in other words, the priority. It is not neccessary to name the entries of the list, but it may help in avoiding confusion. We chose the names such that “bp1”" means “block of priority 1”.

set.seed(1234)
pl2 <- prioritylasso(X = pl_train, Y = pl_out_train, family = "binomial", type.measure = "auc", 
                     block1.penalization = TRUE, blocks = list(10:28, 1:4, 5:9,  29:1028),
                     standardize = FALSE)
pl2$nzero
## [[1]]
## [1] 2
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 1
## 
## [[4]]
## [1] 14

The output lists are to be interpreted according to the definition in blocks. Therefore, the first entry corresponds to the variables of columns 10:28 in the data set.

Set a maximal number of nonzero coefficients.

As we can see in pl2, we have a lot of nonzero coefficients of bp4, although it is the block with lowest priority. In general, there should be a reason why the glmnet procedure chooses these coefficients. On the other hand, it is sometimes not appropriate from a practical point of view that the number exceeds a particular value. Therefore we made it possible to set a maximal number of nonzero coefficients. Because we just want to set a limit for bp4, we set the other entries to Inf.

set.seed(1234)
pl3 <- prioritylasso(X = pl_train, Y = pl_out_train, family = "binomial", type.measure = "auc", 
                     block1.penalization = TRUE, blocks = list(10:28, 1:4, 5:9,  29:1028), 
                     max.coef = c(Inf, Inf, Inf, 10), standardize = FALSE)
pl3$nzero
## [[1]]
## [1] 2
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 1
## 
## [[4]]
## [1] 9

Include the block with the highest priority without a penalty

We can further diversify the option of whether or not bp1 is included without a penalty in the model. In a not penalized version, the block has to be low-dimensional and the model includes all of its variables. In this case, the model fit is stored in block1unpen. Otherwise, the block is treated like the other blocks and a LASSO model is fitted.

set.seed(1234)
pl4 <- prioritylasso(X = pl_train, Y = pl_out_train, family = "binomial", type.measure = "auc", 
                     block1.penalization = FALSE, blocks = list(1:4, 5:9, 10:28, 29:1028), 
                     max.coef = c(Inf, Inf, Inf, 10), standardize = FALSE)
pl4$block1unpen
## 
## Call:  glm(formula = Y ~ X[, blocks[[1]]], family = family)
## 
## Coefficients:
##           b1.1    b1.2    b1.3    b1.4  
## 0.1220  1.2192  1.0789  0.4475  0.3029  
## 
## Degrees of Freedom: 266 Total (i.e. Null);  262 Residual
## Null Deviance:       352.1 
## Residual Deviance: 335.7     AIC: 345.7
pl4$lambda.ind
## [[1]]
## NULL
## 
## [[2]]
## [1] 30
## 
## [[3]]
## [1] 19
## 
## [[4]]
## [1] 20

The first entries of the values which correspond to the blocks list remain empty as you can see for the example of lambda.ind.

Options for the cross validation procedure

nfolds and lambda.type are options which refer to the cross-validation procedure in cv.glmnet. nfolds specifies the number of folds in the procedure while lambda.type handles the amount of cross-validation error. It can be set to either lambda.min, which is also the default, or lambda.1se. lambda.min gives the result with minimium mean cross-validation error, whereas lambda.1se gives the result such that the cross-validation error is within 1 standard error of the minimum, and thus leads to more sparse results. Note that the latter can only be chosen in combination with no restrictions in max.coef.

set.seed(1234)
pl5_min <- prioritylasso(X = pl_train, Y = pl_out_train, family = "binomial", 
                         type.measure = "auc", block1.penalization = TRUE, 
                         blocks = list(1:4, 5:9, 10:28, 29:1028), 
                         lambda.type = "lambda.min", standardize = FALSE)

set.seed(1234)
pl5_1se <- prioritylasso(X = pl_train, Y = pl_out_train, family = "binomial", 
                         type.measure = "auc", block1.penalization = TRUE, 
                         blocks = list(1:4, 5:9, 10:28, 29:1028), 
                         lambda.type = "lambda.1se", standardize = FALSE)

pl5_min$nzero
## [[1]]
## [1] 2
## 
## [[2]]
## [1] 1
## 
## [[3]]
## [1] 2
## 
## [[4]]
## [1] 11
pl5_1se$nzero
## [[1]]
## [1] 2
## 
## [[2]]
## [1] 1
## 
## [[3]]
## [1] 0
## 
## [[4]]
## [1] 1

In addition, other options for prioritylasso can be specified. For example, we can change the type of cross validation measurement for binary outcome to the misclassification error. Examples for other outcome variables are shown in the function documentation. Further arguments can be passed to prioritylasso which are used in cv.glmnet, e.g. the elasticnet mixing parameter. Here, we are only using the default value 1 which corresponds to the standard LASSO method, but a parameter alpha between 0 and 1 can be chosen as well.

Some notes on the output

The function used for every LASSO fit is glmnet which creates a sequence of lambda values. The lambda which is chosen according to lambda.type is the lambda on position lambda.ind of the sequence and its value is stored in lambda.min. In general, the lower the value of lambda.ind, the higher the lambda.min and thus the penalization. This leads to more sparse models. The number of lambda values can be chosen with an optional argument. We give the output of pl1 as an example.

pl1$lambda.ind
## [[1]]
## [1] 5
## 
## [[2]]
## [1] 3
## 
## [[3]]
## [1] 5
## 
## [[4]]
## [1] 23
pl1$lambda.min
## [[1]]
## [1] 0.02041998
## 
## [[2]]
## [1] 0.08908832
## 
## [[3]]
## [1] 0.01942847
## 
## [[4]]
## [1] 0.0641479

The values of lambda are generated in every call of glmnet and hence cannot be compared. The corresponding mean cross validated error is stored in the list min.cvm. The interpretation of its values depend on the argument type.measure. In our examples with a binary outcome it is usually the area under the ROC curve (AUC).

pl1$min.cvm
## [[1]]
## [1] 0.6328067
## 
## [[2]]
## [1] 0.6618344
## 
## [[3]]
## [1] 0.6875681
## 
## [[4]]
## [1] 0.7882044

We see that for the example of pl1 it grows with the number of considered modalities. That is what we expected because the more coefficients we have in our model, the better the prediciton should be.

Use of the function cvm_prioritylasso

prioritylasso was generally implemented with the idea, that practitioners have prior knowledge about the data that allows them to specify the priorities. However, it might be, especially in the presence of several blocks, that it is not clear which order of blocks is the best. That is why we implemented cvm_prioritylasso that allows us to define more than one possible list of blocks. The function chooses the best block order according to the mean cross-validated error. Analogously, several vectors with maximal coefficients can be specified in max.coef.list. In the next example we do not know if it is better to include variables b4 before those of b3.

set.seed(1234)
cvm_pl1 <- cvm_prioritylasso(X = pl_train, Y = pl_out_train, family = "binomial", 
                             type.measure = "auc", standardize = FALSE, 
                             block1.penalization = FALSE, 
                             blocks.list = list(list(1:4, 5:9, 10:28, 29:1028), 
                                                list(1:4, 5:9, 29:1028, 10:28)), 
                             max.coef.list = list(c(Inf, Inf, Inf, 10), c(Inf, Inf, 10, Inf)))
cvm_pl1$best.blocks
## [1] "bp1 = 1:4"     "bp2 = 5:9"     "bp3 = 10:28"   "bp4 = 29:1028"

We see that the order of blocks which leads to the best result is specified first in blocks.list. It might be inconvenient to define a lot of lists in the block argument. Note that it is not the general idea of prioritylasso to be applied to any order of blocks and fish for the best.

Validation

Now we give a detailed example with a validation.

set.seed(1234)
pl_fit1 <- prioritylasso(X = pl_train, Y = pl_out_train, family = "binomial", 
                         type.measure = "auc", blocks = list(1:4, 5:9, 10:28, 29:1028), 
                         block1.penalization = FALSE, max.coef = c(Inf, Inf, Inf, 10), 
                         standardize = FALSE)

The idea before simulating the data was to include it in the order as it was defined in the data matrix. The first block, a block with only 4 variables should not be penalized. Because of prior knowledge (we have it because we generated the data, normally a practitioner could have it, too) we know that all of its variables are relevant. The maximal number of coefficients for the last block is restricted to 10. We can easily extract the coefficients and print those which are nonzero.

coeff1 <- pl_fit1$coefficients
coeff1 <- coeff1[coeff1 != 0]
print(round(coeff1, 4))
##            b1.1    b1.2    b1.3    b1.4    b2.1    b2.3    b2.4    b2.5 
##  0.1220  1.2192  1.0789  0.4475  0.3029  0.1770  0.4689  0.0234 -0.1100 
##    b3.1    b3.2    b3.3    b3.4    b3.5    b3.7    b3.8   b3.11   b3.14 
## -0.4265 -0.1310 -0.2987  0.9576 -0.4942 -0.8483  0.1802 -0.0151 -0.0671 
##   b3.19    b4.1    b4.6   b4.10   b4.29  b4.106  b4.123  b4.269  b4.309 
##  0.3853  0.1346  0.2014  0.1676  0.0037  0.0480  0.0295  0.0611  0.1935 
##  b4.607  b4.623 
##  0.0230  0.0732

We see that every variable from block 1 is nonzero, but we also have variables from every other block in our model.

library(pROC)

The R package pROC gives a lot possibilities to evaluate and visualize the results. First we can calculate the score and the ROC curve in the training set and get an AUC value.

pl1_score <- pl_train[ , names(coeff1)[-1], drop=F] %*% coeff1[-1]
pl1_roc <- roc(factor(pl_out_train), pl1_score[,1])
auc(pl1_roc)
## Area under the curve: 0.855

Of course we can’t use this as a performance measure because it is estimated based on the data that was already used for training the model and is thus biased. A more appropriate value can be obtained with the test set.

val1_score <- pl_val[ , names(coeff1)[-1], drop=F] %*% coeff1[-1]
val1_roc <- roc(factor(pl_out_val), c(val1_score))
auc(val1_roc)
## Area under the curve: 0.7439

To investigate the influence of the priorities on the prediction accuracy, we can calculate the model again with another order of blocks and get the ROC curve in the same way.

set.seed(1234)
pl_fit2 <- prioritylasso(X = pl_train, Y = pl_out_train, family = "binomial", 
                         type.measure = "auc", blocks = list(1:4, 10:28, 5:9, 29:1028), 
                         block1.penalization = FALSE, max.coef = c(Inf, Inf, Inf, 10), 
                         standardize = FALSE)

coeff2 <- pl_fit2$coefficients
coeff2 <- coeff2[coeff2 != 0]

val2_score <- pl_val[ , names(coeff2)[-1], drop=F] %*% coeff2[-1]
val2_roc <- roc(factor(pl_out_val), c(val2_score))
auc(val2_roc)
## Area under the curve: 0.739

We can plot the ROC curves and perform a paired test if the two AUC values are equal.

roc.test(val1_roc, val2_roc, paired=TRUE)
## 
##  DeLong's test for two correlated ROC curves
## 
## data:  val1_roc and val2_roc
## Z = 0.21872, p-value = 0.8269
## alternative hypothesis: true difference in AUC is not equal to 0
## sample estimates:
## AUC of roc1 AUC of roc2 
##   0.7438554   0.7390361

The test result shows that we cannot reject the hypotheses that the AUC values are equal.

plot.roc(val1_roc, grid=0.1)
plot.roc(val2_roc, col="red", add=TRUE)
legend("bottomright", legend=c("prioritylasso Score 1", "prioritylasso Score 2"),
       col=c("black", "red"), lwd=2)