forked from bips-hb/datatrain_workshop_ml
-
Notifications
You must be signed in to change notification settings - Fork 0
/
04-feature-selection-importance.Rmd
301 lines (224 loc) · 10.1 KB
/
04-feature-selection-importance.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
---
title: "04: Feature Selection & Importance"
date: "`r Sys.time()`"
output:
html_notebook:
toc: yes
theme: flatly
number_sections: yes
editor_options:
chunk_output_type: console
---
```{r setup}
library(mlr3verse) # All the mlr3 things
library(ggplot2) # For plotting
library(iml) # Interpretability methods
# Spam Task setup
spam_task <- tsk("spam")
set.seed(26)
spam_split <- partition(spam_task, ratio = 2/3)
# # Penguin Task setup
penguins <- na.omit(palmerpenguins::penguins)
penguin_task <- TaskClassif$new(
id = "penguins",
backend = penguins,
target = "species"
)
```
Goals of this part:
1. Introduce feature selection
2. Introduce (some) variable importance measures
3. Compare measures for different settings
# Feature Selection
There is a lot more to cover than we have time for here, see e.g.:
- [mlr3gallery post with example](https://mlr-org.com/gallery/optimization/2020-09-14-mlr3fselect-basic/index.html)
- [mlr3book](https://mlr3book.mlr-org.com/chapters/chapter6/feature_selection.html)
Selecting features with `{mlr3}` is similar to parameter tuning: We need to set
a budget (e.g. 20 evaluations like before) and a criterion (like the AUC.)
with a resampling strategy (here holdout for simplicity).
```{r fselect-instance}
fselect_instance = FSelectInstanceSingleCrit$new(
task = spam_task,
learner = lrn("classif.rpart", predict_type = "prob"),
resampling = rsmp("holdout"),
measure = msr("classif.auc"),
terminator = trm("evals", n_evals = 20)
)
fselect_instance
```
There are multiple feature selection methods available:
- Random Search (`FSelectorRandomSearch`)
- Exhaustive Search (`FSelectorExhaustiveSearch`)
- Sequential Search (`FSelectorSequential`)
- Recursive Feature Elimination (`FSelectorRFE`)
- Design Points (`FSelectorDesignPoints`)
```{r mlr-fselectors}
mlr_fselectors
```
As you might be able to imagine, doing an exhaustive search is not often feasible
when we're working with a lot of features. For a dataset with 10 features,
examining every possible subset of features would yield over 1000 models to evaluate.
You can imagine how feasible that approach would be for genome-wide studies with thousands of variables.
Random search it is, then!
```{r fselector-optimize}
fselector <- fs("random_search")
fselector$optimize(fselect_instance)
```
```{r fselect-results}
fselect_instance$result_feature_set
fselect_instance$result_y
```
```{r fselect-results-long}
as.data.table(fselect_instance$archive)[1:5, ]
```
Similar to the `auto_tuner` we used for parameter tuning, there's also
an `auto_fselector` which basically works the same way, giving us an optimized
learner as a result
```{r auto-fselect}
fselected_rpart <- auto_fselector(
learner = lrn("classif.rpart", predict_type = "prob"),
resampling = rsmp("holdout"),
measure = msr("classif.ce"),
terminator = trm("evals", n_evals = 20),
fselector = fs("random_search")
)
fselected_rpart
```
And of course it should be worth it to compare our variable-selected learner
with a learner that uses all variables, just to make sure we're not sacrificing any predictive performance:
```{r fselect-bm}
design <- benchmark_grid(
tasks = spam_task,
learners = list(
fselected_rpart,
lrn("classif.rpart", predict_type = "prob")
),
resamplings = rsmp("cv", folds = 3)
)
bmr <- benchmark(design)
bmr$aggregate(msr("classif.auc"))
```
# Feature Importance
Interpretability is a huge topic, and there's a lot to explore - we'll cover some basics and if you're interested, you can always find more at
- The [`{iml}` package](https://christophm.github.io/iml/articles/intro.html)
- The [Interpretable Machine Learning book](https://christophm.github.io/interpretable-ml-book)
- The [`{mlr3}` book](https://mlr3book.mlr-org.com/interpretation.html)
Before we get started with the general methods, it should be noted that some learners bring their own method-specific importance measures.
Random Forests (via `{ranger}`) for example has some built-in importance
metrics, like the corrected Gini impurity:
```{r ranger-importance}
lrn_ranger <- lrn("classif.ranger", importance = "impurity_corrected")
lrn_ranger$train(penguin_task)
sort(lrn_ranger$importance(), decreasing = TRUE)
```
Which shows us that for our penguins, `bill_length_mm` is probably the most relevant feature, whereas `body_mass_g` does not turn out to be as important
with regard to species classification.
## Feature Importance with `{iml}`
We can to the same (or a similar) thing with a general approach provided by the `{iml}` package, which lets us analyze any given learner wrapped by `{mlr3}` (and many other models created outside `mlr3`).
It has a similar object-oriented approach as `{mlr3}` with a `Predictor` object
we create using our learner and our data, separately for predictors and target:
```{r iml-predictor}
lrn_ranger <- lrn("classif.ranger", predict_type = "prob")
lrn_ranger$train(penguin_task)
# Get dataset of predictors only
penguin_x <- penguins[, names(penguins) != "species"]
# Assemble Predictor
predictor_rf <- Predictor$new(lrn_ranger, data = penguin_x, y = penguins$species)
```
We can then create a `FeatureImp` object for our importance values:
```{r iml-importance}
penguin_importance <- FeatureImp$new(predictor_rf, loss = "ce")
plot(penguin_importance)
```
This type of feature importance is based on permutation, i.e. how much the prediction
error changes after a feature was randomly shuffled.
It's not the same approach as the one we saw with ranger, but it's model-agnostic,
meaning it can be applied to any method.
In some cases, different methods may disagree.
As an example, let's look at the Bikesharing dataset.
This dataset contains data on the number of bike rentals over an extended period of time --- see also `?mlr3data::bike_sharing`.
Since this is a regression problem, we switch to `regr.ranger` as our learner:
```{r bike-importance-example}
bikeshare <- tsk("bike_sharing")
# Since it's a relatively large task and some of these methods take some time, we downsample the dataset by 50% --- feel free to comment this out at home when you have the time, or reduce it even further if it's still too slow.
bikeshare$filter(partition(bikeshare, 0.5)$train)
# Remove variables of unsupported types (for iml Predictor)
bikeshare$set_col_roles(c("date", "working_day", "holiday"), remove_from = "feature")
# Ranger: Built-in Gini impurity
lrn_ranger <- lrn("regr.ranger", importance = "impurity_corrected")
lrn_ranger$train(bikeshare)
sort(lrn_ranger$importance(), decreasing = TRUE)
# iml: Permutation feature importance
bike_predictor <- Predictor$new(
model = lrn_ranger,
data = bikeshare$data(cols = bikeshare$feature_names),
y = bikeshare$data(cols = "count")
)
# This takes a while (warnings can be ignored)
bike_importance <- FeatureImp$new(bike_predictor, loss = "mse")
# Merging importance results
# Also rescaling importance measures for visual comparison
importance_ranger <- data.frame(
feature = names(lrn_ranger$importance()),
importance = lrn_ranger$importance(),
method = "Gini",
row.names = NULL
)
importance_ranger$importance <- importance_ranger$importance/max(importance_ranger$importance)
importance_iml <- bike_importance$results[c("feature", "importance")]
importance_iml$importance <- importance_iml$importance/max(importance_iml$importance)
importance_iml$method <- "Permutation"
# Binding them together & plotting
importance_df <- rbind(importance_ranger, importance_iml)
ggplot(importance_df, aes(y = reorder(feature, importance), x = importance, fill = method)) +
geom_col(position = "dodge") +
labs(
title = "Feature importance comparison",
subtitle = "Bike Sharing regression task w/ Random Forest",
x = "Importance (Rescaled within method)",
y = "Feature", fill = "Importance Method"
) +
theme_minimal(base_size = 16) +
theme(legend.position = "top")
```
## Your Turn!
So far we've only looked at feature importance based on a Random Forest learner,
but it's also possible that importance scores vary when different learners are applied.
Based on the `{iml}` code above, train two or more different learners on a task
of your choice (`tsk("bike_sharing")`, `tsk("spam")`, ...) and calculate feature importances
for both learners.
Do they agree?
Hint: If you want to train an SVM on the bikeshare dataset, use this set of numeric-only features:
```{r task-numeric-features}
bikeshare$col_roles$feature <- c("hour", "month", "weekday", "year", "humidity", "temperature", "windspeed")
```
Note that `bike_sharing` is a relatively large task, so feature importance calculations might be slow!
## Feature Effects
Getting a number for "is this feature important" is nice, but often we want a better picture of the feature's effect. Think of linear models and how we can interpret $\beta_j$ as the linear relationship between $X_j$ and $Y$ --- often things aren't linear though.
One approach to visualize feature effects is via *Partial Dependence Plots* or preferably via
*Accumulated Local Effect* plots (ALE), which `{iml}` thankfully offers.
Let's recycle our `ranger` learner and plot some effects:
```{r feature-effects-ale}
lrn_ranger <- lrn("regr.ranger")
lrn_ranger$train(bikeshare)
predictor_rf <- Predictor$new(
model = lrn_ranger,
data = bikeshare$data(cols = bikeshare$feature_names),
y = bikeshare$data(cols = "count")
)
# Compute all feature effects (use FeatureEffect$new() for singular effects)
# Per default calculates ALE
ranger_effects <- FeatureEffects$new(predictor_rf)
```
Visualizing the (conditional) feature effect of `hour` or `temperature`
```{r feature-effects-ale-hour-temp}
ranger_effects$plot(feature = "hour")
ranger_effects$plot(feature = "temperature")
```
To only get one effect as PDP or ALE (computationally faster that calculation effects for all features)
```{r feature-effects-pdp-vs-ale}
ranger_effects_rm_ale <- FeatureEffect$new(predictor_rf, feature = "temperature", method = "ale")
ranger_effects_rm_pdp <- FeatureEffect$new(predictor_rf, feature = "temperature", method = "pdp")
ranger_effects_rm_ale$plot() + labs(title = "ALE of temperature on cyclist count")
ranger_effects_rm_pdp$plot() + labs(title = "PDP of temperature on cyclist count")
```