get_smarties

Like pd.get_dummies... but smarter.

The problem

When working with a categorical dataset, most people use the pandas.get_dummies function for easy dummy variable generation. This is well and good until you have to compare two subsets of your dataset (as in prediction). If a subset doesn't contain at least one row for every possible value of a feature, the resulting dummy datasets will have different shapes.

For example, say we have a dataset with a 'gender' feature that has two possible values: Male and Female.

	...	gender
1	...	Male
2	...	Female
3	...	Male

The pd.get_dummies function would give you:

	...	gender_Male	gender_Female
1	...	1	0
2	...	0	1
3	...	1	0

But now, say we have another instance and do some machine learning voodoo to predict their gender. Say we predict Male. Calling get_dummies on that prediction would give:

	...	gender_Male
1	...	1

Since pandas never saw a Female in this subset, it only generates a column for Male. The result is that your new and original samples have different shapes, which causes all kinds of trouble when, for example, computing loss.
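
To make the mismatch concrete, here is a small sketch that uses only pandas. The DataFrame names `train` and `new` are made up for illustration.

```python
import pandas as pd

# Original data contains both possible values of 'gender'.
train = pd.DataFrame({"gender": ["Male", "Female", "Male"]})

# A new sample (e.g. a prediction) that only contains 'Male'.
new = pd.DataFrame({"gender": ["Male"]})

train_dummies = pd.get_dummies(train)
new_dummies = pd.get_dummies(new)

# The two frames end up with a different number of columns.
print(train_dummies.shape)  # two dummy columns
print(new_dummies.shape)    # only one dummy column
print(set(train_dummies.columns) - set(new_dummies.columns))
```

The last line prints the dummy column that is missing from the new sample.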

See more discussion of this issue at this thread.

The solution

get_smarties allows you to easily generate dummy variables while persisting the possible values under each category for you. You can use conventional fit_transform and transform methods and solve this problem with virtually no additional effort, like so:

from get_smarties import Smarties
gs = Smarties()

# generate dummies on original dataset, store values for later
X = gs.fit_transform(data)

# generate more dummies on new sample using previously stored values
Y = gs.transform(prediction)
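
One way to picture what "persisting the possible values" means is the sketch below. It is not the get_smarties source, just a minimal illustration in plain pandas under that assumption: a hypothetical `ColumnMemory` class remembers the dummy columns seen at fit time and reindexes new data onto them.

```python
import pandas as pd


class ColumnMemory:
    """Hypothetical stand-in for illustration; not the Smarties class."""

    def fit_transform(self, df):
        # Build dummies and remember which columns were produced.
        dummies = pd.get_dummies(df, dtype=int)
        self.columns_ = dummies.columns
        return dummies

    def transform(self, df):
        # Build dummies for the new data, then align them to the stored
        # columns; values never seen in this sample are filled with 0.
        dummies = pd.get_dummies(df, dtype=int)
        return dummies.reindex(columns=self.columns_, fill_value=0)


data = pd.DataFrame({"gender": ["Male", "Female", "Male"]})
prediction = pd.DataFrame({"gender": ["Male"]})

memory = ColumnMemory()
X = memory.fit_transform(data)
Y = memory.transform(prediction)

# Both frames now share the same columns, in the same order.
print(list(X.columns) == list(Y.columns))
```

With the columns aligned this way, `X` and `Y` have the same width, so they can be compared or fed to the same model.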

Pipelines

Because get_smarties has fit/transform capabilities, you can even inject your dummy variable creation directly into sklearn pipelines:

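from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from get_smarties import Smarties

# dummy variable creation runs as the first step, the classifier as the second
training_pipeline = Pipeline([
    ('smarties', Smarties()),
    ('clf', MultinomialNB()),
])

training_pipeline.fit(data, labels)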

See a working example with k-fold cross validation at kfold-pipeline-demo.ipynb.

Setup

With pip, simply run:

pip install -e git+https://github.com/joeddav/get_smarties.git#egg=get_smarties
