Use cases #86
base: gh-pages
Conversation
Thank you very much. I left a lot of comments.
A very general remark which is important for all use cases:
Don't make it more technical than necessary (e.g. don't explain learner APIs)! Just concisely explain what you are doing (without paraphrasing the code). mlr is (hopefully) intuitive enough that the reader will have no problems understanding the syntax.
src/use_case_classification.Rmd (Outdated)

```{r}
head(data)
```
Our aim - as mentioned before - is to predict which kind of people would have survided.
survided -> survived
src/use_case_classification.Rmd (Outdated)

```{r}
data[, c("Survived", "Pclass", "Sex", "SibSp", "Embarked")] = lapply(data[, c("Survived", "Pclass", "Sex", "SibSp", "Embarked")], as.factor)
```
Next, unuseful columns will be dropped.
unuseful -> useless
src/use_case_classification.Rmd (Outdated)

```{r}
data = dropNamed(data, c("Cabin","PassengerId", "Ticket", "Name"))
```
And missing values will be imputed, in this case Age and Fare.
Please also add a half sentence to say which imputation methods you use.
src/use_case_classification.Rmd (Outdated)

### Define a learner

A classification learner is selected. You can find an overview of all learners [here](http://mlr-org.github.io/mlr-tutorial/devel/html/integrated_learners/index.html)
- missing . at the end of the 2nd sentence
- Regarding the first sentence in l. 65: Please give more specific information about what you are doing. To illustrate what I mean (feel free to tweak):
"As this method often works well off the shelf, we use a random forest and set it up to predict class probabilities in addition to class labels"
```{r}
n = getTaskSize(task)
trainSet = seq(1, n, by = 2)
testSet = seq(2, n, by = 2)
```
Please use random train and test sets here.
(I know that we use this kind of train/test sets elsewhere in the tutorial, but this is in general bad and I'm going to change that.)
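A random split could look roughly like this (a base-R sketch; `n = 891` is just a stand-in for `getTaskSize(task)`):

```r
set.seed(1)  # make the random split reproducible
n = 891      # size of the titanic training data; in the use case: n = getTaskSize(task)
# Draw two thirds of the indices at random for training, keep the rest for testing
trainSet = sample(n, size = round(2/3 * n))
testSet = setdiff(seq_len(n), trainSet)
```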
src/use_case_classification.Rmd (Outdated)

```{r}
plotROCCurves(df)
```

### Extension of the original use case
This header is too meta. Please make this more specific. You are making predictions on an additional test set, right?
```{r}
data = titanic_train
head(data)
```
What is wrong with titanic_train?
I'm using this data set for training and testing and thought that "data" is less confusing than titanic_train.
src/use_case_classification.Rmd (Outdated)

* train the model, [here](http://mlr-org.github.io/mlr-tutorial/devel/html/train/index.html)
* predict the survival chance, [here](http://mlr-org.github.io/mlr-tutorial/devel/html/predict/index.html)
* validate the model, [here](http://mlr-org.github.io/mlr-tutorial/devel/html/performance/index.html) and [here](http://mlr-org.github.io/mlr-tutorial/devel/html/resample/index.html)
In general I like this list.
But in my opinion we could make it a little less technical. Things like "create a task" or "provide a learner" won't necessarily mean sth. to someone new to mlr. Therefore, I would change these lines to sth. like "Define the learning task/problem" and "select a learning method"
As you might have seen, the titanic library also provides a second dataset.

```{r}
test = titanic_test
```
I would keep the name titanic_test
```{r}
test = dropNamed(test, c("Cabin","PassengerId", "Ticket", "Name"))

test = impute(test, cols = list(Age = imputeMedian(), Fare = imputeMedian()))
```
I'm not totally happy about how you are doing the imputation.
- For titanic_train you are imputing on the whole data set before splitting it into train and test samples.
- You are doing the imputation separately for titanic_train and titanic_test.
I think the cleanest way would be imputation on the train samples and reimputation (using the estimated means and medians from the train samples) on the test samples, respectively. This is easy to do with an imputation wrapper, but the least beginner-friendly variant.
Therefore, I would stick to approach 2., but use it consistently.
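The train-then-reimpute order the reviewer describes can be sketched in base R (made-up Age vectors, purely illustrative):

```r
train_age = c(22, NA, 38, 26, 35)  # hypothetical training-sample Age column
test_age  = c(NA, 54, 2)           # hypothetical test-sample Age column

# Estimate the median on the training sample only ...
train_median = median(train_age, na.rm = TRUE)

# ... and reuse that same value to fill NAs in both samples,
# so no information flows from test to train
train_age[is.na(train_age)] = train_median
test_age[is.na(test_age)]   = train_median
```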
Hi Julia,
I've got your point, but I thought that I already used approach 2 because I imputed separately for test and training?
What should I change?
Thank you!
I believe what Julia is saying is both:
- Demeaning the whole data set before splitting into test and train gives information about the testing data to your training data.
- Say you had one data point in test and demeaned it, you would have a value of zero.

I believe you can use `makeImputeWrapper()`, which should handle this for you.
Hi,
sorry for the late reply.

> I've got your point, but I thought that I already used approach 2 because I imputed separately for test and training?
To better explain what I mean:
Your use case contains two instances where you train and evaluate a learner.
1. You train and evaluate your learner on train and test samples drawn from `data`.
2. You train on `data` and evaluate your learner on `test`.

The first thing I was trying to say is that your example is not consistent w.r.t. the order of imputation and splitting:
- Situation 1: Here, you first impute the whole `data` (l. 49) and then split it into train and test sets (l. 77-78).
- Situation 2: You impute `data` and `test` separately. So in contrast to 1., the whole data set, consisting of `data` and `test` combined, is split first and then the two parts are imputed.

What Steve is saying is that the order in Situation 1 is usually discouraged because you are using information from the test sample (by calculating mean and median) for training, which might lead to overoptimistic performance values.

> What should I change?
I thought about it a little more and now think it would be easiest to use the imputation wrapper.
Let me know if you need help.
Thanks!!
EDIT: A more beginner-friendly variant would be (if it's possible)
- to first remove NAs and show train, predict, performance and
- then argue that throwing away information is bad and proceed with the imputation wrapper.
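The wrapper approach the reviewers mention could look roughly like this (a sketch; assumes the mlr package is installed; the wrapper attaches the imputation to the learner, so each training split estimates its own medians/modes and reuses them on the matching test split):

```r
library(mlr)
# Base learner as before
lrn = makeLearner("classif.randomForest", predict.type = "prob")
# Impute numerics with the median and factors with the mode,
# estimated on each training set and re-applied to the test set
lrn = makeImputeWrapper(lrn,
  classes = list(numeric = imputeMedian(), factor = imputeMode()))
```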
Very cool! Left a few small comments

Therefore we will work off the following steps:

* preprocessing, [here](http://mlr-org.github.io/mlr-tutorial/devel/html/preproc/index.html) and [here](http://mlr-org.github.io/mlr-tutorial/devel/html/impute/index.html)
I overlooked this during my first review. If you want to link to other tutorial pages please use e.g. [here](preproc.md).
#### Preprocessing

The data set is corrected regarding its data types.
I would show beforehand that they are bad types and say, "Hey we need to fix this"
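Showing the bad types first could be as simple as calling `str()` before and after the conversion (a sketch on a hypothetical three-column excerpt of the data):

```r
# Hypothetical excerpt: Survived and Pclass are read in as numeric,
# Sex as character (in R >= 4.0)
data = data.frame(Survived = c(0, 1, 1),
                  Pclass   = c(3, 1, 3),
                  Sex      = c("male", "female", "female"))
str(data)  # reveals the wrong types: "Hey, we need to fix this"

cols = c("Survived", "Pclass", "Sex")
data[cols] = lapply(data[cols], as.factor)
str(data)  # now all three columns are factors
```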
```{r}
data = dropNamed(data, c("Cabin","PassengerId", "Ticket", "Name"))
```
And missing values will be imputed. Age and Fare are numerival variables and the missing values will be replaced through the median. Embarked is a character data type and its missing values will be imputed with the mode.
> Fare are numerival variables and

typo (numerival -> numerical)
A classification learner is selected. You can find an overview of all learners [here](http://mlr-org.github.io/mlr-tutorial/devel/html/integrated_learners/index.html).

In this showcase we have a classification problem, so we need to select a learner for this kind of question. As this method often works well off the shelf, we use a random forest and set it up to predict class probabilities in addition to class labels.
Not necessary, but something like
"If you want to try an alternative learner for predicting probabilities you can check them with `listLearners(properties = "prob")[, 1:6]`"
might be nice.
@juliafried Do you still have motivation to finish this here? 🎉
added a first draft of the classification use case (titanic data set)
very basic functions -> should I add more advanced ones? e.g. comparing more learners / models etc.