auto extend mixin classes
Automatically load the RandomForestExplainer and XGBExplainer mixin
classes whenever a RandomForest or xgboost model is passed.
oegedijk committed Sep 27, 2020
1 parent 40e6ad0 commit 6af36e7
Showing 10 changed files with 145 additions and 42 deletions.
Binary file modified .DS_Store
Binary file not shown.
11 changes: 5 additions & 6 deletions RELEASE_NOTES.md
@@ -11,23 +11,22 @@
XGBClassifierExplainer and XGBRegressionExplainer.
- new parameter n_jobs for calculations that can be parallelized (e.g. permutation importances)
- contrib_df, plot_shap_contributions: can order by global shap feature
-'importance' (as well as 'abs', 'high-to-low', 'low-to-high')
+'importance' (as well as 'abs', 'high-to-low' and 'low-to-high')
- added actual outcome to plot_trees

### Bug Fixes
-
-

### Improvements

- added selenium integration tests for dashboards (also working with github actions)
- added tests for multiclass classification, DecisionTree and ExtraTrees models
- added proper docstrings to explainer_methods.py
- optimized code for calculating permutation importance, adding the possibility to calculate in parallel
- shap dependence component: if no color col selected, output standard blue dots instead of ignoring update

### Other Changes
-
- added selenium integration tests for dashboards (also working with github actions)
- added tests for multiclass classification, DecisionTree and ExtraTrees models
- added tests for XGBExplainers
- added proper docstrings to explainer_methods.py

## Version 0.2.2

5 changes: 0 additions & 5 deletions TODO.md
@@ -27,7 +27,6 @@
- add target name
- add plain language explanations


## notebooks:
- add binder/colab links on github

@@ -48,11 +47,9 @@
- test model_output='probability' and 'raw' or 'logodds' separately
- write tests for explainer_methods
- write tests for explainer_plots
- write tests for XGBoostExplainers
- add test coverage

## Docs:
- add documentation for XGBExplainers
- add docstrings to explainer_plots
- add screenshots of components to docs
- move screenshots to separate folder
@@ -62,8 +59,6 @@


## Library level:
- Add launch from colab option (mode='colab'?):
- https://amitness.com/2020/06/google-colaboratory-tips/?s=03
- Add Altair (vega) plots for easy inclusion in websites or fastpages blogs
- Long term: add option to load from directory with pickled model, data csv and config file
- add more screenshots to README with https://postimages.org/
Binary file modified docs/.DS_Store
Binary file not shown.
14 changes: 8 additions & 6 deletions docs/source/dashboards.rst
@@ -94,22 +94,23 @@ However it would be easy to turn this custom ``FeatureListTab`` into a proper
Starting a multitab dashboard
-----------------------------

-Besided using the booleans as described above, you can also pass a list of
+Besides the single page dashboard above, you can also pass a list of
``ExplainerComponents`` to construct multiple tabs. These can be a mix of
the different types discussed above. E.g.::

ExplainerDashboard(explainer, [ImportancesTab, imp_tab, "importances", features]).run()

This would start a dashboard with three importances tabs, plus our custom
-feature list tab. (not sure why you would do that, but you get the point :)
+feature list tab. (not sure why you would do that, but hopefully you get the point :)


-Using Dash or JupyterDash
--------------------------
+Using explainerdashboard inside Jupyter notebook or google colab
+----------------------------------------------------------------

You can start the dashboard with the standard ``dash.Dash()`` server or with the
-new notebook friendly ``jupyter_dash`` library server. The latter will allow you
+new notebook friendly ``jupyter_dash`` server. The latter will allow you
to keep working interactively in your notebook while the dashboard is running.
Also, this allows you to run an explainerdashboard from within google colab!

The default dash server is started with ``mode='dash'``. There are three different
options for ``jupyter_dash``: ``mode='inline'`` for running the dashboard in an
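(The rest of this passage is collapsed in the diff. A minimal sketch of how the
modes would be used; ``mode='external'`` is an assumption based on
``jupyter_dash``'s documented modes)::

    ExplainerDashboard(explainer, mode='dash').run()      # standard dash server
    ExplainerDashboard(explainer, mode='inline').run()    # render inside the notebook cell
    ExplainerDashboard(explainer, mode='external').run()  # serve in a separate browser tab (assumed)
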
@@ -136,7 +137,7 @@ it with the ``external_stylesheets`` parameter. Additional info on styling boots
layout can be found at: https://dash-bootstrap-components.opensource.faculty.ai/docs/themes/

You can add a theme by putting it in an ``/assets/`` subfolder, or by linking to it directly.
-`dash_bootstrap_components` offer a convenient way of inserting these::
+``dash_bootstrap_components`` offers a convenient way of inserting these::

import dash_bootstrap_components as dbc
ExplainerDashboard(explainer, ["contributions", "model_summary"],
@@ -190,6 +191,7 @@ You then start the dashboard on the commandline with::

gunicorn dashboard:server

See the deployment section for more info on using explainerdashboard in production.

ExplainerDashboard documentation
--------------------------------
42 changes: 35 additions & 7 deletions docs/source/deployment.rst
@@ -7,7 +7,7 @@ server but use more robust and scalable options like ``gunicorn`` and ``nginx``.
Deploying a single dashboard instance
=====================================

-``Dash`` is built on top of ``Flask``, and so the dashbaord instance
+``Dash`` is built on top of ``Flask``, and so the dashboard instance
contains a Flask server. You can simply expose this server to host your dashboard.

The server can be found in ``ExplainerDashboard().app.server`` or with
@@ -21,21 +21,17 @@ The code below is from `the deployed example to heroku <https://github.com/oeged
from explainerdashboard.dashboards import *
from explainerdashboard.datasets import *

-print('loading data...')
X_train, y_train, X_test, y_test = titanic_survive()
train_names, test_names = titanic_names()

-print('fitting model...')
model = RandomForestClassifier(n_estimators=50, max_depth=5)
model.fit(X_train, y_train)

-print('building Explainer...')
explainer = RandomForestClassifierExplainer(model, X_test, y_test,
                            cats=['Sex', 'Deck', 'Embarked'],
                            idxs=test_names,
                            labels=['Not survived', 'Survived'])

-print('Building ExplainerDashboard...')
db = ExplainerDashboard(explainer)

server = db.app.server
@@ -55,7 +51,6 @@ to preload the app before starting::
gunicorn -w 3 --preload -b localhost:8050 dashboard:server



Deploying dashboard as part of Flask app on specific route
==========================================================

@@ -84,7 +79,40 @@ Now you can start the dashboard by::

And you can visit the dashboard on ``http://localhost:8050/dashboard``.
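The mounting code itself is collapsed in this diff. A minimal sketch of the
usual pattern (the ``server`` and ``url_base_pathname`` arguments are an
assumption here, based on later releases)::

    from flask import Flask
    from explainerdashboard.dashboards import ExplainerDashboard

    app = Flask(__name__)

    # hypothetical: mount the dashboard's Dash app on the existing Flask server
    db = ExplainerDashboard(explainer, server=app, url_base_pathname="/dashboard/")

    # run the Flask app; the dashboard is then served under /dashboard/
    app.run(port=8050)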

Avoid timeout by precalculating explainers and loading with joblib
==================================================================

Some of the calculations needed to generate e.g. the SHAP values and permutation
importances can take quite a long time (especially shap interaction values).
Long enough to break the startup timeout of gunicorn. Therefore it is better
to first calculate all these values, save the explainer to disk, and then load
the explainer when starting the dashboard::

import joblib
from explainerdashboard.explainer import ClassifierExplainer
explainer = ClassifierExplainer(model, X_test, y_test,
                                cats=['Sex', 'Deck', 'Embarked'],
                                labels=['Not survived', 'Survived'])
explainer.calculate_properties()
joblib.dump(explainer, "explainer.pkl")

Then in ``dashboard.py`` load the explainer and start the dashboard::

import joblib
from explainerdashboard.dashboards import ExplainerDashboard

explainer = joblib.load("explainer.pkl")
db = ExplainerDashboard(explainer)
server = db.app.server

And start it with gunicorn::

gunicorn -b localhost:8050 dashboard:server


Deploying as part of a multipage dash app
=========================================

**Under Construction**

74 changes: 59 additions & 15 deletions docs/source/explainers.rst
@@ -5,13 +5,15 @@ Simple example
==============

In order to start an ``ExplainerDashboard`` you first need to construct an
-``Explainer`` instance. They come in four flavours and at its most basic they
+``Explainer`` instance. They come in six flavours and at its most basic they
only need a model and a test set X and y::

explainer = ClassifierExplainer(model, X_test, y_test)
explainer = RegressionExplainer(model, X_test, y_test)
explainer = RandomForestClassifierExplainer(model, X_test, y_test)
explainer = RandomForestRegressionExplainer(model, X_test, y_test)
explainer = XGBClassifierExplainer(model, X_test, y_test)
explainer = XGBRegressionExplainer(model, X_test, y_test)

This is enough to launch an ExplainerDashboard::

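# collapsed in this diff; the minimal launch, as shown in dashboards.rst
ExplainerDashboard(explainer).run()
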
@@ -78,7 +80,7 @@ For the titanic example this would be:
So you would pass ``cats=['Sex', 'Deck', 'Embarked']``. You can now use these
categorical features directly as input for plotting methods, e.g.
``explainer.plot_shap_dependence("Deck")``. For other methods you can pass
-a parameter ``shap=True``, to indicate that you'd like to group the categorical
+a parameter ``cats=True``, to indicate that you'd like to group the categorical
features in your output.
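For example (the dependence plot call is quoted from the text above;
``plot_importances(cats=True)`` is an assumed illustration of passing
``cats=True`` to another method)::

    explainer = ClassifierExplainer(model, X_test, y_test,
                                    cats=['Sex', 'Deck', 'Embarked'])
    explainer.plot_shap_dependence("Deck")   # plot the grouped categorical feature
    explainer.plot_importances(cats=True)    # group categorical features in the output (assumed)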

idxs
@@ -89,7 +91,7 @@ If you pass these to the Explainer object,
numerical index, e.g. ``explainer.contrib_df(0)`` for the first row, or using the
identifier, e.g. ``explainer.contrib_df("Braund, Mr. Owen Harris")``.

-The proper name or idx will be use used in all ``ExplainerComponents`` that
+The proper name or idxs will be used in all ``ExplainerComponents`` that
allow index selection.

descriptions
@@ -169,7 +171,7 @@ LogisticRegression::
permutation_cv
--------------

-Normally permuation importances get calculated over a single fold (assuming the
+Normally permutation importances get calculated over a single fold (assuming the
data is the test set). However if you pass the training set to the explainer,
you may wish to cross-validate the permutation importance calculation. In that
case pass the number of folds to ``permutation_cv``.
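For example, assuming the explainer gets fitted on the training set::

    explainer = ClassifierExplainer(model, X_train, y_train,
                                    permutation_cv=5)  # 5-fold cross-validated permutation importances
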
@@ -179,7 +181,7 @@ na_fill

If you fill missing values with some extreme value such as ``-999`` (typical for
tree based methods), these can mess with the horizontal axis of your plots.
-In order to filter these out, you need to tell the expaliner what the extreme value
+In order to filter these out, you need to tell the explainer what the extreme value
is that you used to fill. Defaults to ``-999``.
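For example, if you filled missing values with ``-999`` before training::

    explainer = RegressionExplainer(model, X_test, y_test, na_fill=-999)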

Plots
@@ -336,12 +338,12 @@ plot_residuals_vs_feature
.. automethod:: explainerdashboard.explainers.RegressionExplainer.plot_residuals_vs_feature


-RandomForest Plots
+DecisionTree Plots
------------------

-There is an additional mixin class specifically for ``sklearn`` ``RandomForests``
-that defines additional methods and plots to investigate and visualize
-individual decision trees within the random forest. ``RandomForestExplainer``
+There are additional mixin classes specifically for ``sklearn`` ``RandomForests``
+and for xgboost models that define additional methods and plots to investigate
+and visualize individual decision trees within the ensemble. This functionality
uses the ``dtreeviz`` library to visualize individual decision trees.

You can get a pd.DataFrame summary of the path that a specific index row took
@@ -350,7 +352,7 @@ You can also plot the individual predictions of each individual tree for
a specific row in your data identified by ``index``::

explainer.decisiontree_df(tree_idx, index)
-explainer.decisiontree_df_summary(tree_idx, index)
+explainer.decisiontree_summary_df(tree_idx, index)
explainer.plot_trees(index)

And for dtreeviz visualization of individual decision trees (svg format)::
@@ -359,10 +361,14 @@
explainer.decision_path_file(tree_idx, index)
explainer.decision_path_encoded(tree_idx, index)

-This also works with classifiers and regression models::
+These methods are not available with the standard ``ClassifierExplainer`` and
+``RegressionExplainer`` classes, so you need to use the specific
+Explainer classes for these models::

explainer = RandomForestClassifierExplainer(model, X, y)
explainer = RandomForestRegressionExplainer(model, X, y)
explainer = XGBClassifierExplainer(model, X, y)
explainer = XGBRegressionExplainer(model, X, y)

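Putting the tree inspection methods above together (the ``tree_idx`` and
``index`` values are arbitrary examples)::

    explainer = RandomForestClassifierExplainer(model, X, y)
    explainer.plot_trees(index=0)                           # every tree's prediction for row 0
    explainer.decisiontree_summary_df(tree_idx=5, index=0)  # path row 0 takes through tree 5
    explainer.decision_path(tree_idx=5, index=0)            # dtreeviz visualization of that path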

plot_trees
@@ -514,7 +520,7 @@ RandomForest outputs
For ``RandomForestExplainer``::

decisiontree_df(tree_idx, index, pos_label=None)
-decisiontree_df_summary(tree_idx, index, round=2, pos_label=None)
+decisiontree_summary_df(tree_idx, index, round=2, pos_label=None)
decision_path_file(tree_idx, index)


@@ -523,10 +529,10 @@ decisiontree_df

.. automethod:: explainerdashboard.explainers.RandomForestExplainer.decisiontree_df

-decisiontree_df_summary
+decisiontree_summary_df
^^^^^^^^^^^^^^^^^^^^^^^

-.. automethod:: explainerdashboard.explainers.RandomForestExplainer.decisiontree_df_summary
+.. automethod:: explainerdashboard.explainers.RandomForestExplainer.decisiontree_summary_df

decision_path_file
^^^^^^^^^^^^^^^^^^
@@ -654,8 +660,14 @@ More examples in the `notebook on the github repo. <https://github.com/oegedijk/
RandomForestExplainer
=====================

The ``RandomForestExplainer`` mixin class provides additional functionality
in order to explore individual decision trees within the RandomForest.
This can be very useful for showing stakeholders that a RandomForest is
indeed just a collection of simple decision trees that you then calculate
the average of.

.. autoclass:: explainerdashboard.explainers.RandomForestExplainer
-:members: decisiontree_df, decisiontree_df_summary, plot_trees, decision_path
+:members: decisiontree_df, decisiontree_summary_df, plot_trees, decision_path
:member-order: bysource
:exclude-members:
:noindex:
@@ -677,4 +689,36 @@ RandomForestRegressionExplainer
:noindex:


XGBExplainer
============

The ``XGBExplainer`` mixin class provides additional functionality
in order to explore individual decision trees within an xgboost ensemble model.
This can be very useful for showing stakeholders that an xgboost model is
indeed just a collection of simple decision trees that get summed together.


.. autoclass:: explainerdashboard.explainers.XGBExplainer
:members: decisiontree_df, decisiontree_summary_df, plot_trees, decision_path
:member-order: bysource
:exclude-members:
:noindex:

XGBClassifierExplainer
----------------------

.. autoclass:: explainerdashboard.explainers.XGBClassifierExplainer
:member-order: bysource
:exclude-members: __init__
:noindex:

XGBRegressionExplainer
----------------------

.. autoclass:: explainerdashboard.explainers.XGBRegressionExplainer
:member-order: bysource
:exclude-members: __init__
:noindex:



2 changes: 1 addition & 1 deletion docs/source/inline.rst
@@ -4,7 +4,7 @@ InlineExplainer
As a data scientist you often work inside a notebook environment where you
like to quickly and interactively explore your data. The ``InlineExplainer`` allows
you to do this by running ``ExplainerComponents`` (or whole tabs) inline
-inside your Jupyter notebook.
+inside your Jupyter notebook (also works in google colab!).

.. image:: inline_screenshot.png

