Use cases #86

Open · wants to merge 2 commits into gh-pages

Conversation

juliabrosig (Collaborator):

added a first draft of the classification use case (titanic data set)
very basic functions -> should I add more advanced ones, e.g. comparing more learners / models?

@juliabrosig changed the base branch from master to gh-pages (March 8, 2017 15:41)
@juliabrosig requested a review from schiffner (March 9, 2017 09:34)
@schiffner (Contributor) left a comment:

Thanks very much. I left a lot of comments.

A very general remark I have which is important for all use cases:
Don't make it more technical than necessary (e.g. don't explain learner APIs)! Just concisely explain what you are doing (without paraphrasing the code). mlr is (hopefully) intuitive enough that the reader will have no problems understanding the syntax.

```{r}
head(data)
```

Our aim - as mentioned before - is to predict which kind of people would have survided.
Contributor:

survided -> survived

```{r}
data[, c("Survived", "Pclass", "Sex", "SibSp", "Embarked")] = lapply(data[, c("Survived", "Pclass", "Sex", "SibSp", "Embarked")], as.factor)
```

Next, unuseful columns will be dropped.
Contributor:

unuseful -> useless

```{r}
data = dropNamed(data, c("Cabin","PassengerId", "Ticket", "Name"))
```

And missing values will be imputed, in this case Age and Fare.
Contributor:

Please also add a half sentence to say which imputation methods you use.
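For instance, if the methods stay as in the code further below, that half sentence could be backed by a chunk like this (a sketch, assuming Age and Fare are imputed with their medians):

```{r}
# impute missing Age and Fare values with the column medians
imp = impute(data, cols = list(Age = imputeMedian(), Fare = imputeMedian()))
data = imp$data
```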


### Define a learner

A classification learner is selected. You can find an overview of all learners [here](http://mlr-org.github.io/mlr-tutorial/devel/html/integrated_learners/index.html)
Contributor:

  • missing . at the end of the 2nd sentence
  • Regarding the first sentence in l. 65: Please give more specific information about what you are doing. To illustrate what I mean (feel free to tweak):
    "As this method often works well off the shelf, we use a random forest and set it up to predict class probabilities in addition to class labels" (see the sketch below)

```{r}
n = getTaskSize(task)
trainSet = seq(1, n, by = 2)
testSet = seq(2, n, by = 2)
```
@schiffner (Contributor), Mar 8, 2017:

Please use random train and test sets here.
(I know that we use this kind of train/test split elsewhere in the tutorial, but it is generally bad practice and I'm going to change that.)
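A random split of roughly the same size could be sketched like this (set.seed only added for reproducibility):

```{r}
n = getTaskSize(task)
set.seed(1)
trainSet = sample(n, size = n %/% 2)
testSet = setdiff(seq_len(n), trainSet)
```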

```{r}
plotROCCurves(df)
```

### Extension of the original use case
Contributor:

This header is too meta. Please make this more specific. You are making predictions on an additional test set, right?
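For example, something along these lines (a sketch; `mod` for the trained model and `test` for the additional preprocessed test set are assumed names):

```{r}
# predict survival for the additional test set
predTest = predict(mod, newdata = test)
```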

```{r}
data = titanic_train
head(data)
```
Contributor:

What is wrong with titanic_train?

juliabrosig (Collaborator, Author):

I'm using this data set for training and testing and thought that "data" is less confusing than titanic_train.

* train the model, [here](http://mlr-org.github.io/mlr-tutorial/devel/html/train/index.html)
* predict the survival chance, [here](http://mlr-org.github.io/mlr-tutorial/devel/html/predict/index.html)
* validate the model, [here](http://mlr-org.github.io/mlr-tutorial/devel/html/performance/index.html) and [here](http://mlr-org.github.io/mlr-tutorial/devel/html/resample/index.html)

Contributor:

In general I like this list.
But in my opinion we could make it a little less technical. Things like "create a task" or "provide a learner" won't necessarily mean sth. to someone new to mlr. Therefore, I would change these lines to sth. like "Define the learning task/problem" and "select a learning method"

As you might have seen, the titanic library also provides a second dataset.

```{r}
test = titanic_test
```
Contributor:

I would keep the name titanic_test


```{r}
test = dropNamed(test, c("Cabin","PassengerId", "Ticket", "Name"))

test = impute(test, cols = list(Age = imputeMedian(), Fare = imputeMedian()))
```
@schiffner (Contributor), Mar 9, 2017:

I'm not totally happy about how you are doing the imputation.

  1. For titanic_train you are imputing on the whole data set before splitting it into train and test samples.
  2. You are doing the imputation separately for titanic_train and titanic_test.

I think the cleanest way would be imputation on the train samples and reimputation (using the estimated means and medians from the train samples) on the test samples, respectively. This is easy to do with an imputation wrapper, but it is the least beginner-friendly variant.

Therefore, I would stick to approach 2., but use it consistently.
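For illustration, a minimal sketch of the imputation-wrapper variant (assuming the random forest learner used elsewhere in the use case):

```{r}
# the wrapper estimates the medians on the training data only and re-applies
# them when the model predicts on new data
lrn = makeLearner("classif.randomForest", predict.type = "prob")
lrn = makeImputeWrapper(lrn, cols = list(Age = imputeMedian(), Fare = imputeMedian()))
```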

juliabrosig (Collaborator, Author):

Hi Julia,
I've got your point, but I thought that I already used approach 2 because I've imputed separately for test and training?
What should I change?
Thank you!

Collaborator:

I believe what Julia is saying is both of the following:

  1. Demeaning the whole data set before splitting into test and train gives information about the testing data to your training data.
  2. Say you had one data point in test and demeaned it, you would have a value of zero.

I believe you can use makeImputeWrapper(), which should handle this for you.

@schiffner (Contributor), Mar 13, 2017:

Hi,

sorry for the late reply.

I've got your point, but I thought that I already used approach 2 because I've imputed separately for test and training?

To better explain what I mean:
Your use case contains two instances where you train and evaluate a learner.

  1. You train and evaluate your learner on train and test samples drawn from data.
  2. You train on data and evaluate your learner on test.

The first thing I was trying to say is that your example is not consistent w.r.t. the order of imputation and splitting:
Situation 1: Here you first impute the whole data (l. 49) and then split it into train and test sets (l. 77-78).
Situation 2: You impute data and test separately. So, in contrast to 1., the whole data set, consisting of data and test combined, is split first and then the two parts are imputed.

What Steve is saying is that the order in Situation 1 is usually discouraged because you are using information from the test sample (by calculating mean and median) for training, which might lead to overoptimistic performance values.

What should I change?

I thought about it a little more and now think it would be easiest to use the imputation wrapper.

Let me know if you need help.
Thanks!!

Contributor:

EDIT: A more beginner-friendly variant would be (if it's possible)

  • to first remove NAs and show train, predict, performance (see the sketch below) and
  • then argue that throwing away information is bad and proceed with the imputation wrapper.
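A minimal sketch of that first variant (assuming `data` and a learner `lrn` from the use case; names and split proportion are illustrative):

```{r}
# drop incomplete rows, then train, predict and evaluate
dataComplete = na.omit(data)
task = makeClassifTask(data = dataComplete, target = "Survived")

n = getTaskSize(task)
trainSet = sample(n, size = round(2/3 * n))
testSet = setdiff(seq_len(n), trainSet)

mod = train(lrn, task, subset = trainSet)
pred = predict(mod, task, subset = testSet)
performance(pred, measures = list(acc, mmce))
```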

@SteveBronder (Collaborator):

Very cool! Left a few small comments.


Therefore we will work off the following steps:

* preprocessing, [here](http://mlr-org.github.io/mlr-tutorial/devel/html/preproc/index.html) and [here](http://mlr-org.github.io/mlr-tutorial/devel/html/impute/index.html)
Contributor:

I overlooked this during my first review. If you want to link to other tutorial pages please use e.g. [here](preproc.md).


#### Preprocessing

The data set is corrected regarding their data types.
Collaborator:

I would show beforehand that they are bad types and say, "Hey we need to fix this"
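For example, the problematic types could be shown first (a sketch; the comment describing the output is illustrative):

```{r}
# inspect the column types before fixing them
str(data)
# e.g. Survived, Pclass and SibSp show up as integers and Sex / Embarked as
# character vectors, although all of them are categorical
```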

```{r}
data = dropNamed(data, c("Cabin","PassengerId", "Ticket", "Name"))
```

And missing values will be imputed. Age and Fare are numerival variables and the missing values will be replaced through the median. Embarked is a character data type and its missing values will be imputed with the mode.
Collaborator:

Fare are numerival variables and

typo: numerival -> numerical


A classification learner is selected. You can find an overview of all learners [here](http://mlr-org.github.io/mlr-tutorial/devel/html/integrated_learners/index.html).

In this showcase we have a classification problem, so we need to select a learner for this kind of question. As this method often works well off the shelf, we use a random forest and set it up to predict class probabilities in addition to class labels.
Collaborator:

Not necessary, but something like,

"If you want to try your an alternative learner for predicting probabilities you can check them with listLearners(properties = "prob")[,1:6]"

might be nice
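As a runnable chunk, the suggested check could look like this (mirroring the call above):

```{r}
# learners that support probability predictions; show the first six columns
listLearners(properties = "prob")[, 1:6]
```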

@pat-s (Collaborator) commented Jun 21, 2018:

@juliafried Do you still have motivation to finish this here? 🎉
