Stable Iterative Variable Selection (SIVS) is a feature selection method that wraps machine learning methods with internal feature selection (e.g shrinkage methods such as LASSO). In this vignette we demonstrate simple usage of this method. We would like to encourage users to always stay critical of the results they get and not use SIVS as a blackbox method.
For the purpose of this simple tutorial, we use Arcene Data Set which is a sample dataset to demonstrate feature selection methods via a binary classification. As the webpage states:
ARCENE’s task is to distinguish cancer versus normal patterns from mass-spectrometric data.
train_x <- read.csv(file = "https://archive.ics.uci.edu/ml/machine-learning-databases/arcene/ARCENE/arcene_train.data", header = F, sep = " ", strip.white = T, blank.lines.skip = T, stringsAsFactors = F)
train_y <- read.csv(file = "https://archive.ics.uci.edu/ml/machine-learning-databases/arcene/ARCENE/arcene_train.labels", header = F, sep = " ", strip.white = T, blank.lines.skip = T, stringsAsFactors = F)
validation_x <- read.csv(file = "https://archive.ics.uci.edu/ml/machine-learning-databases/arcene/ARCENE/arcene_valid.data", header = F, sep = " ", strip.white = T, blank.lines.skip = T, stringsAsFactors = F)
validation_y <- read.csv(file = "https://archive.ics.uci.edu/ml/machine-learning-databases/arcene/arcene_valid.labels", header = F, sep = " ", strip.white = T, blank.lines.skip = T, stringsAsFactors = F)
train_y <- train_y[, 1]
validation_y <- validation_y[, 1]
Check for possible missing values in the data:
column_index | column_name | number_of_NAs | ratio_of_NA |
---|---|---|---|
10001 | V10001 | 100 | 1 |
column_index | column_name | number_of_NAs | ratio_of_NA |
---|---|---|---|
10001 | V10001 | 100 | 1 |
There is a column (V10001) that has 100% NA
in each of
the two datasets (probably because the data is not CSV and some
white-space issues). Let’s remove that and move on:
Now let’s have a better understanding of what variables we are dealing with in this R session:
name | class | size | detail |
---|---|---|---|
train_x | data.frame | 5 Mb | dimension: 100, 10000 |
validation_x | data.frame | 5 Mb | dimension: 100, 10000 |
train_y | integer | 448 bytes | length: 100 |
validation_y | integer | 448 bytes | length: 100 |
Now we can run SIVS to get the most important features. Note that for
the sake of making the vignette, the progressbar is set to FALSE and the
verbosity has set to none so that the function does not print its
progress. It’s worth noting that due to the nature of this feature
selection method, running a high-dimensional data in SIVS would take
time, therefore, the more CPU core you allocate to it, the faster it
will do the job. The following would take about 19 minutes with 2 CPU
cores (parallel.cores = 2
) and it will be much faster if
you use parallel.cores = "grace"
which will use all CPU
cores except 1.
library("sivs")
sivs_obj <- sivs::sivs(x = train_x, # it should be a matrix or dataframe with features as columns and samples as rows
y = factor(train_y),
verbose = "none", # This is for demonstration, you leave the verbose to be "general"
progressbar = FALSE) # This is for demonstration, you leave the progressbar on
By providing factor with 2 levels, we are implying to use binomial.
Alternatively, you can define family = "binomial"
and it
will be passed to internal method (which the default is glmnet).
Having the SIVS object, we can plot the object to have better understanding of what it has found:
The top-left plot shows the how many times each feature was selected
in the iterative step. The top-right plot shows the coefficient
distribution of each of the features selected in the iterative step.
This plot is sorted based on the median of the coefficient of each
feature. Lastly, the bottom plot shows the result of recursive feature
elimination (rfe) step. Each blue bar shows the importance of that
feature, whereas each boxplot shows the distribution of areas under the
receiver operating characteristic curves (AUROC) after removal of that
feature. The more we move towards left-hand side of the plot the more
features we have removed and as expected, the AUROCs start decreasing.
The read vertical line is the suggested cutoffs based on the
sivs::suggest()
function when provided by the
sivs_obj
. Note that In this example due to the data we see
only the 0.05 suggested threshold, but in datasets with more difficult
setup that have more of less-important features, we would also see
another suggestion as a vertical red line for 0.01 cutoff. These cutoffs
are suggestions and users are encourage to treat them as such. These two
cutoffs are pre-defined in the function as loose cutoffs and we
encourage users to investigate the suggested features and make the
cutoff more strict is possible according to their data and setup and the
question they are trying to answer.
Finally, we can extract the list of suggested features by SIVS. Considering that we cannot see 0.01 line in the bottom plot of the figure above and the 0.05 is where the boxplots start dropping, let’s choose a stricter threshold. The strictness can be between 0 (very loose) to 1 (very strict). Here for the sake of the example, we use 0.5 as it is the exact middle of the strictness range:
Ultimately, one can compare the performance of a machine learning
method with and without SIVS. In the following example we use
glmnet
package as in this vignette it was used by SIVS as
the internal method.
library("glmnet")
#> Loading required package: Matrix
#> Loaded glmnet 4.1-8
library("pROC")
#> Type 'citation("pROC")' for a citation.
#>
#> Attaching package: 'pROC'
#> The following objects are masked from 'package:stats':
#>
#> cov, smooth, var
# build a model without SIVS
set.seed(12345)
glmnet_model <- glmnet::cv.glmnet(x = data.matrix(train_x),
y = factor(train_y),
family = "binomial")
# build a model with SIVS
set.seed(12345)
sivs_glmnet_model <- glmnet::cv.glmnet(x = data.matrix(train_x[, sivs::suggest(sivs_obj, strictness = 0.5)]),
y = factor(train_y),
family = "binomial")
# predict both training set and validation sets
glmnet_train_pred <- predict(object = glmnet_model,
newx = data.matrix(train_x),
s = "lambda.min",
type = "response")
glmnet_validation_pred <- predict(object = glmnet_model,
newx = data.matrix(validation_x),
s = "lambda.min",
type = "response")
sivs_glmnet_train_pred <- predict(object = sivs_glmnet_model,
newx = data.matrix(train_x[, sivs::suggest(sivs_obj, strictness = 0.5)]),
s = "lambda.min",
type = "response")
sivs_glmnet_validation_pred <- predict(object = sivs_glmnet_model,
newx = data.matrix(validation_x[, sivs::suggest(sivs_obj, strictness = 0.5)]),
s = "lambda.min",
type = "response")
glmnet_train_roc <- pROC::roc(response = factor(train_y),
predictor = as.numeric(glmnet_train_pred))
#> Setting levels: control = -1, case = 1
#> Setting direction: controls < cases
glmnet_validation_roc <- pROC::roc(response = factor(validation_y),
predictor = as.numeric(glmnet_validation_pred))
#> Setting levels: control = -1, case = 1
#> Setting direction: controls < cases
sivs_glmnet_train_roc <- pROC::roc(response = factor(train_y),
predictor = as.numeric(sivs_glmnet_train_pred))
#> Setting levels: control = -1, case = 1
#> Setting direction: controls < cases
sivs_glmnet_validation_roc <- pROC::roc(response = factor(validation_y),
predictor = as.numeric(sivs_glmnet_validation_pred))
#> Setting levels: control = -1, case = 1
#> Setting direction: controls < cases
layout(mat = matrix(1:2, nrow = 2))
{
plot(glmnet_train_roc, col = "salmon", main = "Performance on training data")
plot(sivs_glmnet_train_roc, col = "cornflowerblue", add = T)
legend("bottomright",
fill = c("salmon", "cornflowerblue"),
legend = c(paste0("glmnet (AUROC=",
round(pROC::auc(glmnet_train_roc),
digits = 4),
")"),
paste0("SIVS + glmnet (AUROC=",
round(pROC::auc(sivs_glmnet_train_roc),
digits = 4),
")")))
plot(glmnet_validation_roc, col = "salmon", main = "Performance on validation data")
plot(sivs_glmnet_validation_roc, col = "cornflowerblue", add = T)
legend("bottomright",
fill = c("salmon", "cornflowerblue"),
legend = c(paste0("glmnet (AUROC=",
round(pROC::auc(glmnet_validation_roc),
digits = 4),
")"),
paste0("SIVS + glmnet (AUROC=",
round(pROC::auc(sivs_glmnet_validation_roc),
digits = 4),
")")))
layout(1)
}
As we can see, the model build without SIVS has AUROC of 74.88% on validation set where as the model built based on the features selected by SIVS has AUROC of 72.73%. Now let’s compare the number of selected features:
# extract the coefficients
sum(coef(glmnet_model) != 0)
#> [1] 29
sum(coef(sivs_glmnet_model) != 0)
#> [1] 5
This shows that without SIVS we would have selected 29 features, but with SIVS we have reduced it to just 5.