Use cases #86
base: gh-pages
Conversation
Thank you very much. I left a lot of comments.
A very general remark which is important for all use cases:
Don't make it more technical than necessary (e.g. don't explain learner APIs)! Just concisely explain what you are doing (without paraphrasing the code). mlr is (hopefully) intuitive enough that the reader will have no problems understanding the syntax.
src/use_case_classification.Rmd (Outdated)

```{r}
head(data)
```
Our aim - as mentioned before - is to predict which kind of people would have survided.
survided -> survived
src/use_case_classification.Rmd (Outdated)

```{r}
data[, c("Survived", "Pclass", "Sex", "SibSp", "Embarked")] = lapply(data[, c("Survived", "Pclass", "Sex", "SibSp", "Embarked")], as.factor)
```
Next, unuseful columns will be dropped.
unuseful -> useless
src/use_case_classification.Rmd (Outdated)

```{r}
data = dropNamed(data, c("Cabin","PassengerId", "Ticket", "Name"))
```
And missing values will be imputed, in this case Age and Fare.
Please also add a half sentence to say which imputation methods you use.
src/use_case_classification.Rmd (Outdated)

### Define a learner

A classification learner is selected. You can find an overview of all learners [here](http://mlr-org.github.io/mlr-tutorial/devel/html/integrated_learners/index.html)
- missing . at the end of the 2nd sentence
- Regarding the first sentence in l. 65: Please give more specific information about what you are doing. To illustrate what I mean (feel free to tweak):
"As this method often works well off the shelf, we use a random forest and set it up to predict class probabilities in addition to class labels"
```{r}
n = getTaskSize(task)
trainSet = seq(1, n, by = 2)
testSet = seq(2, n, by = 2)
```
Please use random train and test sets here.
(I know that we use this kind of train/test sets elsewhere in the tutorial, but this is in general bad and I'm going to change that.)
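A random split could look roughly like this (a base-R sketch; `n = 891` is just a stand-in for `getTaskSize(task)`):

```r
set.seed(1)  # make the random split reproducible
n = 891      # size of the titanic training data; in the use case: n = getTaskSize(task)
# Draw two thirds of the indices at random for training, keep the rest for testing
trainSet = sample(n, size = round(2/3 * n))
testSet = setdiff(seq_len(n), trainSet)
```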
src/use_case_classification.Rmd (Outdated)

```{r}
plotROCCurves(df)
```

### Extension of the original use case
This header is too meta. Please make this more specific. You are making predictions on an additional test set, right?
```{r}
data = titanic_train
head(data)
```
What is wrong with titanic_train?
I'm using this data set for training and testing and thought that "data" is less confusing than titanic_train.
src/use_case_classification.Rmd (Outdated)

* train the model, [here](http://mlr-org.github.io/mlr-tutorial/devel/html/train/index.html)
* predict the survival chance, [here](http://mlr-org.github.io/mlr-tutorial/devel/html/predict/index.html)
* validate the model, [here](http://mlr-org.github.io/mlr-tutorial/devel/html/performance/index.html) and [here](http://mlr-org.github.io/mlr-tutorial/devel/html/resample/index.html)
In general I like this list.
But in my opinion we could make it a little less technical. Things like "create a task" or "provide a learner" won't necessarily mean sth. to someone new to mlr. Therefore, I would change these lines to sth. like "Define the learning task/problem" and "select a learning method"
As you might have seen, the titanic library also provides a second dataset.

```{r}
test = titanic_test
```
I would keep the name titanic_test
```{r}
test = dropNamed(test, c("Cabin","PassengerId", "Ticket", "Name"))

test = impute(test, cols = list(Age = imputeMedian(), Fare = imputeMedian()))
```
I'm not totally happy about how you are doing the imputation.
- For titanic_train you are imputing on the whole data set before splitting it into train and test samples.
- You are doing the imputation separately for titanic_train and titanic_test.
I think the cleanest way would be imputation on the train samples and reimputation (using the estimated means and medians from the train samples) on the test samples, respectively. This is easy to do with an imputation wrapper, but the least beginner-friendly variant.
Therefore, I would stick to approach 2., but use it consistently.
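The train-then-reimpute order the reviewer describes can be sketched in base R (made-up Age vectors, purely illustrative):

```r
train_age = c(22, NA, 38, 26, 35)  # hypothetical training-sample Age column
test_age  = c(NA, 54, 2)           # hypothetical test-sample Age column

# Estimate the median on the training sample only ...
train_median = median(train_age, na.rm = TRUE)

# ... and reuse that same value to fill NAs in both samples,
# so no information flows from test to train
train_age[is.na(train_age)] = train_median
test_age[is.na(test_age)]   = train_median
```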
Hi Julia,
I've got your point, but I thought that I already used approach 2 because I imputed separately for test and training?
What should I change?
Thank you!
I believe what Julia is saying is both:
- Demeaning the whole data set before splitting into test and train gives information about the testing data to your training data.
- Say you had one data point in test and demeaned it, you would have a value of zero.

I believe you can use `makeImputeWrapper()`, which should handle this for you.
Hi,
sorry for the late reply.

> I've got your point, but I thought that I already used approach 2 because I imputed separately for test and training?
To better explain what I mean:
Your use case contains two instances where you train and evaluate a learner.
1. You train and evaluate your learner on train and test samples drawn from `data`.
2. You train on `data` and evaluate your learner on `test`.

The first thing I was trying to say is that your example is not consistent w.r.t. the order of imputation and splitting:
- Situation 1: Here, you first impute the whole `data` (l. 49) and then split it into train and test sets (l. 77-78).
- Situation 2: You impute `data` and `test` separately. So in contrast to 1., the whole data set, consisting of `data` and `test` combined, is split first and then the two parts are imputed.

What Steve is saying is that the order in Situation 1 is usually discouraged because you are using information from the test sample (by calculating mean and median) for training, which might lead to overoptimistic performance values.

> What should I change?
I thought about it a little more and now think it would be easiest to use the imputation wrapper.
Let me know if you need help.
Thanks!!
EDIT: A more beginner-friendly variant would be (if it's possible)
- to first remove NAs and show train, predict, performance and
- then argue that throwing away information is bad and proceed with the imputation wrapper.
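The wrapper approach the reviewers mention could look roughly like this (a sketch; assumes the mlr package is installed; the wrapper attaches the imputation to the learner, so each training split estimates its own medians/modes and reuses them on the matching test split):

```r
library(mlr)
# Base learner as before
lrn = makeLearner("classif.randomForest", predict.type = "prob")
# Impute numerics with the median and factors with the mode,
# estimated on each training set and re-applied to the test set
lrn = makeImputeWrapper(lrn,
  classes = list(numeric = imputeMedian(), factor = imputeMode()))
```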
Very cool! Left a few small comments

Therefore we will work off the following steps:

* preprocessing, [here](http://mlr-org.github.io/mlr-tutorial/devel/html/preproc/index.html) and [here](http://mlr-org.github.io/mlr-tutorial/devel/html/impute/index.html)
I overlooked this during my first review. If you want to link to other tutorial pages please use e.g. [here](preproc.md).
#### Preprocessing

The data set is corrected regarding its data types.
I would show beforehand that they are bad types and say, "Hey we need to fix this"
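Showing the bad types first could be as simple as calling `str()` before and after the conversion (a sketch on a hypothetical three-column excerpt of the data):

```r
# Hypothetical excerpt: Survived and Pclass are read in as numeric,
# Sex as character (in R >= 4.0)
data = data.frame(Survived = c(0, 1, 1),
                  Pclass   = c(3, 1, 3),
                  Sex      = c("male", "female", "female"))
str(data)  # reveals the wrong types: "Hey, we need to fix this"

cols = c("Survived", "Pclass", "Sex")
data[cols] = lapply(data[cols], as.factor)
str(data)  # now all three columns are factors
```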
```{r}
data = dropNamed(data, c("Cabin","PassengerId", "Ticket", "Name"))
```
And missing values will be imputed. Age and Fare are numerival variables and the missing values will be replaced through the median. Embarked is a character data type and its missing values will be imputed with the mode.
> Fare are numerival variables and

typo (numerival -> numerical)
A classification learner is selected. You can find an overview of all learners [here](http://mlr-org.github.io/mlr-tutorial/devel/html/integrated_learners/index.html).

In this showcase we have a classification problem, so we need to select a learner for this kind of question. As this method often works well off the shelf, we use a random forest and set it up to predict class probabilities in addition to class labels.
Not necessary, but something like
"If you want to try an alternative learner for predicting probabilities you can check them with `listLearners(properties = "prob")[, 1:6]`"
might be nice.
@juliafried Do you still have motivation to finish this here? 🎉
added a first draft of the classification use case (titanic data set)
very basic functions -> should I add more advanced ones? e.g. comparing more learners / models etc.