# Exercises {#exercises}
Now that you are familiar with the basic steps for supervised machine learning, you can get your hands on the data yourself and implement code for addressing the modelling task outlined in Chapter \@ref(motivation).
## Reading and cleaning
Read the CSV file `"./data/FLX_CH-Dav_FLUXNET2015_FULLSET_DD_1997-2014_1-3.csv"`, select all variables with names ending in `"_F"`, plus the variables `"TIMESTAMP"`, `"GPP_NT_VUT_REF"`, and `"NEE_VUT_REF_QC"`, and drop all variables that contain `"JSB"` in their name. Then convert the variable `"TIMESTAMP"` to a date object with the function `ymd()` from the *lubridate* package, and interpret all values of `-9999` as missing values. Then, set all values of `"GPP_NT_VUT_REF"` to missing where the corresponding quality control variable `"NEE_VUT_REF_QC"` indicates that less than 90% of the underlying data are actual measurements. Finally, drop the variable `"NEE_VUT_REF_QC"`; we won't use it anymore.
```{r}
## write your code here
```
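As a hint, here is a minimal sketch of one possible approach. It assumes that the *tidyverse* and *lubridate* packages are available; the object name `ddf` and the reading of `"NEE_VUT_REF_QC"` as a fraction between 0 and 1 are assumptions, not prescriptions.

```{r eval=FALSE}
library(tidyverse)
library(lubridate)

## read, select, and clean the data (sketch; the object name `ddf` is arbitrary)
ddf <- read_csv("./data/FLX_CH-Dav_FLUXNET2015_FULLSET_DD_1997-2014_1-3.csv") %>%
  ## keep the required variables, drop everything containing "JSB"
  select(TIMESTAMP, GPP_NT_VUT_REF, NEE_VUT_REF_QC, ends_with("_F"), -contains("JSB")) %>%
  ## convert the time stamp to a date object
  mutate(TIMESTAMP = ymd(TIMESTAMP)) %>%
  ## interpret -9999 as missing values
  mutate(across(where(is.numeric), ~na_if(., -9999))) %>%
  ## assumption: NEE_VUT_REF_QC gives the fraction of measured (good-quality) data
  mutate(GPP_NT_VUT_REF = ifelse(NEE_VUT_REF_QC < 0.9, NA, GPP_NT_VUT_REF)) %>%
  ## drop the quality control variable
  select(-NEE_VUT_REF_QC)
```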
## Data splitting
Split the data into a training and a testing set that contain 70% and 30% of the total available data, respectively.
```{r}
## write your code here
```
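One way to do this is sketched below with the *rsample* package (base-R `sample()` would work just as well). The seed value is arbitrary; the object names `ddf_train` and `ddf_test` are used in the exercises that follow.

```{r eval=FALSE}
library(rsample)

set.seed(123)  # any seed; only for reproducibility
split <- initial_split(ddf, prop = 0.7)
ddf_train <- training(split)
ddf_test  <- testing(split)
```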
## Linear model
### Training
Fit a linear regression model using the base-R function `lm()` and the training set. The target variable is `"GPP_NT_VUT_REF"`, and the predictors are all available meteorological variables in the dataset. Answer the following questions:
- What is the $R^2$ of predicted vs. observed `"GPP_NT_VUT_REF"`?
- Is the linear regression slope significantly different from zero for all predictors?
- Is a linear regression model with "poor" predictors removed better supported by the data than a model including all predictors?
```{r}
## write your code here
```
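A sketch of one way to address these questions, assuming the cleaned data from above: `TIMESTAMP` is dropped because it is not a meteorological variable, and the model comparison is done here via AIC-based stepwise selection (one possible choice among several).

```{r eval=FALSE}
## fit a linear model with all meteorological predictors
linmod <- lm(
  GPP_NT_VUT_REF ~ .,
  data = ddf_train %>% drop_na() %>% select(-TIMESTAMP)
)

## R-squared and significance of the regression slopes
summary(linmod)

## compare against a model with "poor" predictors removed,
## here via AIC-based stepwise selection
linmod_reduced <- step(linmod, trace = 0)
summary(linmod_reduced)
AIC(linmod, linmod_reduced)
```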
Use caret and the function `train()` for fitting the same linear regression model (with all predictors) on the same data. Does it yield results identical to using `lm()` directly? You will have to set the argument `trControl` accordingly to avoid resampling, and instead fit the model on all the data in `ddf_train`. You can use `summary()` also on the object returned by the function `train()`.
```{r}
## write your code here
```
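A possible implementation with *caret*; `trainControl(method = "none")` suppresses resampling so that a single model is fit to all of `ddf_train`.

```{r eval=FALSE}
library(caret)

linmod_caret <- train(
  GPP_NT_VUT_REF ~ .,
  data = ddf_train %>% drop_na() %>% select(-TIMESTAMP),
  method = "lm",
  trControl = trainControl(method = "none")
)
summary(linmod_caret)
```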
### Prediction
With the model containing all predictors and fitted on `ddf_train`, make predictions using first `ddf_train` and then `ddf_test`. Compute the $R^2$ and the root-mean-square error, and visualise modelled vs. observed values to evaluate both predictions.
Do you expect the linear regression model trained on `ddf_train` to predict substantially better on `ddf_train` than on `ddf_test`? Why (not)?
Hints:
- To calculate predictions, use the generic function `predict()` with the argument `newdata = ...`.
- The $R^2$ can be extracted from the model object as `summary(model_object)$r.squared`, or, like the RMSE, read from the metrics data frame returned by `metrics()` from the *yardstick* package.
- For visualising the model performance, consider a scatterplot, or (better) a plot that reveals the density of overlapping points. (We're plotting information from over 4000 data points here!)
```{r}
## write your code here
```
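A sketch of the evaluation, assuming the *yardstick* package for the metrics. The object names and bin counts are arbitrary, and `geom_bin2d()` is only one of several ways to reveal point density.

```{r eval=FALSE}
library(yardstick)

## add model predictions to the (complete-case) training and testing data
df_pred_train <- ddf_train %>% drop_na()
df_pred_train$pred <- predict(linmod_caret, newdata = df_pred_train)

df_pred_test <- ddf_test %>% drop_na()
df_pred_test$pred <- predict(linmod_caret, newdata = df_pred_test)

## R-squared and RMSE (among other metrics)
metrics(df_pred_train, truth = GPP_NT_VUT_REF, estimate = pred)
metrics(df_pred_test,  truth = GPP_NT_VUT_REF, estimate = pred)

## modelled vs. observed on the test set, showing point density
ggplot(df_pred_test, aes(x = pred, y = GPP_NT_VUT_REF)) +
  geom_bin2d(bins = 60) +
  geom_abline(slope = 1, intercept = 0, linetype = "dotted") +
  labs(x = "Modelled GPP", y = "Observed GPP")
```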
## KNN
### Check data
```{r}
## write your code here
```
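No specific instruction is given here; one natural check is to look at each variable's distribution, for example with histograms as sketched below (`summary(ddf_train)` is a quicker, non-graphical alternative).

```{r eval=FALSE}
## histograms of all numeric variables
ddf_train %>%
  pivot_longer(cols = where(is.numeric), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 50) +
  facet_wrap(~variable, scales = "free")
```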
The variable `PA_F` has a conspicuous distribution and was not significant in the linear model. Therefore, we won't use it for the models below.
### Training
Fit two KNN models on `ddf_train` (excluding `"PA_F"`), one with $k = 2$ and one with $k = 30$, both without resampling. Use the RMSE as the loss function. Center and scale the data as part of the pre-processing, specifying the model formulation and pre-processing steps with the function `recipe()`.
```{r}
## write your code here
```
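A sketch of how the two models could be specified, assuming *caret* and *recipes*; the object names (`df_knn_train`, `pp`, `mod_knn_k2`, `mod_knn_k30`) are arbitrary choices.

```{r eval=FALSE}
library(recipes)

df_knn_train <- ddf_train %>% drop_na() %>% select(-TIMESTAMP, -PA_F)

## model formulation plus centering and scaling of the predictors
pp <- recipe(GPP_NT_VUT_REF ~ ., data = df_knn_train) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes())

## KNN with k = 2 and k = 30, no resampling
mod_knn_k2 <- train(
  pp, data = df_knn_train, method = "knn", metric = "RMSE",
  trControl = trainControl(method = "none"),
  tuneGrid = data.frame(k = 2)
)
mod_knn_k30 <- train(
  pp, data = df_knn_train, method = "knn", metric = "RMSE",
  trControl = trainControl(method = "none"),
  tuneGrid = data.frame(k = 30)
)
```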
### Prediction
With the two models fitted above, predict `"GPP_NT_VUT_REF"` for both the training and the testing sets, and evaluate the predictions as above (metrics and visualisation).
Which model do you expect to perform better on the training set and which to perform better on the testing set? Why? Do you find evidence for overfitting in any of the models?
```{r}
## write your code here
```
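The evaluation follows the same pattern as for the linear model; a small helper function avoids repeating the code. The name `eval_model()` is a hypothetical choice, and the plots can be produced as sketched above.

```{r eval=FALSE}
## evaluate a fitted model on a given data frame (metrics only; plots as above)
eval_model <- function(mod, df){
  df <- df %>% drop_na()
  df$pred <- predict(mod, newdata = df)
  metrics(df, truth = GPP_NT_VUT_REF, estimate = pred)
}

eval_model(mod_knn_k2,  ddf_train)
eval_model(mod_knn_k2,  ddf_test)
eval_model(mod_knn_k30, ddf_train)
eval_model(mod_knn_k30, ddf_test)
```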
### Sample hyperparameters
Train a KNN model on the training set with the hyperparameter $k$ tuned using five-fold cross-validation. As the loss function, use RMSE. Sample the following values for $k$: 2, 5, 10, 15, 18, 20, 22, 24, 26, 30, 35, 40, 60, 100. Visualise the RMSE as a function of $k$.
Hint:
- Cross-validation results can be visualised with `plot(model_object)` or `ggplot(model_object)`.
```{r}
## write your code here
```
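A possible implementation, reusing the recipe `pp` and the data frame `df_knn_train` from the sketch above; the grid holds exactly the $k$ values listed in the exercise.

```{r eval=FALSE}
mod_knn_cv <- train(
  pp, data = df_knn_train, method = "knn", metric = "RMSE",
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid = data.frame(k = c(2, 5, 10, 15, 18, 20, 22, 24, 26, 30, 35, 40, 60, 100))
)

## RMSE on the validation folds as a function of k
ggplot(mod_knn_cv)
```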
## Random forest
### Training
Fit a random forest model on `ddf_train`, using all predictors except `"PA_F"`, with five-fold cross-validation. Use RMSE as the loss function.
Hints:
- Use the package *ranger* which implements the random forest algorithm.
- See [here](https://topepo.github.io/caret/available-models.html) for information about hyperparameters available for tuning with caret.
- Set the argument `savePredictions = "final"` of the function `trainControl()`.
```{r}
## write your code here
```
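A sketch using *ranger* through *caret*. The tuning grid values for `mtry`, `min.node.size`, and `splitrule` are assumptions and can be adapted (see the linked list of tunable hyperparameters).

```{r eval=FALSE}
library(ranger)

df_rf_train <- ddf_train %>% drop_na() %>% select(-TIMESTAMP, -PA_F)

mod_rf <- train(
  GPP_NT_VUT_REF ~ .,
  data = df_rf_train,
  method = "ranger",
  metric = "RMSE",
  trControl = trainControl(method = "cv", number = 5, savePredictions = "final"),
  tuneGrid = expand.grid(
    mtry = c(2, 4, 6),        # number of predictors sampled at each split (assumed values)
    min.node.size = 5,        # minimal terminal node size (assumed value)
    splitrule = "variance"    # split rule for regression
  )
)
mod_rf
```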
### Prediction
Evaluate the trained model on the training and on the test set, giving metrics and a visualisation as above.
How are differences in performance to be interpreted? Compare the performances of linear regression, KNN, and random forest, considering the evaluation on the test set.
```{r}
## write your code here
```
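The evaluation can reuse the hypothetical `eval_model()` helper sketched in the KNN exercise, applied to the training and testing sets.

```{r eval=FALSE}
eval_model(mod_rf, ddf_train)
eval_model(mod_rf, ddf_test)
```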
Show the model performance (metrics and visualisation) on the validation sets of all cross-validation folds combined.
Do you expect it to be more similar to the model performance on the training set or the testing set in the evaluation above? Why?
```{r}
## write your code here
```
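Since `savePredictions = "final"` was set, the out-of-sample predictions of all validation folds are stored in the model object and can be evaluated directly. A sketch, where `obs` and `pred` are the column names *caret* uses for the stored predictions:

```{r eval=FALSE}
## predictions on the validation folds, all five folds combined
df_cv <- mod_rf$pred

metrics(df_cv, truth = obs, estimate = pred)

ggplot(df_cv, aes(x = pred, y = obs)) +
  geom_bin2d(bins = 60) +
  geom_abline(slope = 1, intercept = 0, linetype = "dotted") +
  labs(x = "Modelled GPP (validation folds)", y = "Observed GPP")
```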