From 749fca1f4ac864ac33b1f52808fd74c617ebead0 Mon Sep 17 00:00:00 2001 From: Mike McKiernan Date: Wed, 15 Mar 2017 14:20:49 -0400 Subject: [PATCH] show the getting started code --- sas_kernel/doc/source/getting-started.rst | 249 ++++++++++++++++------ sas_kernel/doc/source/install.rst | 3 +- 2 files changed, 188 insertions(+), 64 deletions(-) diff --git a/sas_kernel/doc/source/getting-started.rst b/sas_kernel/doc/source/getting-started.rst index cb39f9e..f127afe 100644 --- a/sas_kernel/doc/source/getting-started.rst +++ b/sas_kernel/doc/source/getting-started.rst @@ -1,163 +1,286 @@ +############### +Getting started +############### -Getting Started -=============== +The SAS kernel for Juypter is designed to enable users to write programs +for SAS with Jupyter Notebooks. The kernel makes SAS the analytical engine +or "calculator" for data analysis. + +The following SAS program is a basic example of programming with SAS +and Jupyter Notebook. The HR_comma_sep.csv file that is referenced in +the sample code is available from https://www.kaggle.com/ludobenistant/hr-analytics. -The SAS Kernel for Juypter is designed to allow users to use Jupyter Notebooks -to interact with SAS (SAS 9.4 or SAS Viya). It makes SAS the -analytical engine or "calculator" for data analysis. In its most simple -form, SASPY is a code translator taking python commands and converting -them into SAS procedure and data step calls and then displaying the -results. -Load Data Into SAS ------------------- +****************** +Load data into SAS +****************** -.. code:: sas +The FILENAME statement is used to specify an external file. The +IMPORT procedure can read data from a variety of external file formats. + +.. code-block:: none filename x "./HR_comma_sep.csv"; proc import datafile=x out=_csv dbms=csv replace; run; + +**************** Explore the data ----------------- +**************** -.. code:: sas +The CONTENTS procedure can be used to display the column names and +data types in a SAS data set. The PRINT procedure can be used to +display rows from a data set. In this case, the results are limited to +the first five rows. + +.. code-block:: none proc contents data=work._csv; ods select Variables; run; -.. code:: sas - proc print data=work._csv(obs=5); run; -.. code:: sas +The MEANS procedure is used to provide descriptive statistics. + +.. code-block:: none proc means data=work._csv n nmiss median mean std min p25 p50 p75 max; run; -.. code:: sas + +The SGPLOT procedure is used to produce a vertical bar charts of frequencies +for salary and sales. + +.. code-block:: none proc sgplot data=work._csv; vbar salary; + xaxis discreteorder=data; run; -.. code:: sas - proc sgplot data=work._csv; vbar sales; run; -.. code:: sas +.. image:: ./images/sgplot_vbar_salary.png + :scale: 60 % + :alt: Plot of low, medium, and high salary frequencies. + +The plot of sales across groups, such as IT, RandD, and so on is similar. + + +Plot histograms with normal distribution curves. + +.. code-block:: none proc sgplot data=work._csv; - histogram last_evaluation / scale=count; - density last_evaluation; + histogram satisfaction_level / scale=count; + density satisfaction_level; run; -.. code:: sas - proc sgplot data=work._csv; histogram time_spend_company / scale=count; density time_spend_company; run; + proc sgplot data=work._csv; histogram last_evaluation / scale=count; density last_evaluation; run; - proc sgplot data=work._csv; - histogram satisfaction_level / scale=count; - density satisfaction_level; - run; -.. code:: sas +.. image:: ./images/hist_satisfaction_level.png + :scale: 60 % + :alt: Histogram of employee satisfaction. + +The plots for time spent with the company and last evaluation +are similar. + +Plot a heatmap that shows the relationship between employee +satisfaction and the last evaluation. + +.. code-block:: none proc sgplot data=work._csv; heatmap x=last_evaluation y=satisfaction_level; run; -.. code:: sas +.. image:: ./images/heatmap_satisfaction_evaluation.png + :scale: 60 % + :alt: Heatmap of employee statisfaction and evaluation. + +There is a small frequency spike in the lower-right corner +of the heatmap. + +Narrow the heatmap to the employees that have low satisfaction +but were evaluated highly. - proc sgplot data=work._csv(where=(satisfaction_level <.2 and last_evaluation>.7)); +.. code-block:: none + + proc sgplot data=work._csv(where=(satisfaction_level <.2 and last_evaluation >. 7)); heatmap x=last_evaluation y=satisfaction_level; run; -.. code:: sas + +Finally, split the median satisfaction level for retained employees +side-by-side with the median satisfaction for employees who left. + +.. code-block:: none proc sgpanel data=work._csv; - *where satisfaction_level <.2 and last_evaluation>.7; + *where satisfaction_level <.2 and last_evaluation > .7; PANELBY left; - hbar sales / response=last_evaluation stat=median; + hbar sales / response=last_evaluation stat=median; hbar sales / response=satisfaction_level stat=median ; run; +.. tip:: You can remove the asterisk to plot the employees with + low satisfaction and high evaluations. + +.. image:: ./images/panel_left_sales.png + :scale: 60 % + :alt: Heatmap of employee statisfaction and evaluation for + employees with low satisfaction and high evaluation. + + +************************************* Split the data into training and test -------------------------------------- +************************************* + +Replace the data set with one that includes a partitioning indicator. +The variable is named _PartInd_ and indicates whether the partition +is part of the training data (1) or the test data (0). Seventy percent +of the data is included in training and the remainder is used for testing. -.. code:: sas +.. code-block:: none proc hpsample data=work._csv out=work._csv samppct=70 seed=9878 partition; class left _character_; target left; - var work_accident average_montly_hours last_evaluation number_project promotion_last_5years satisfaction_level - time_spend_company; + var work_accident average_montly_hours last_evaluation number_project + promotion_last_5years satisfaction_level time_spend_company; run; -`Decision Tree `__ -------------------------------------------------------------------------------------------------------------------------------- -.. code:: sas +************************ +Train a series of models +************************ + +Decision tree +============= + +The HPSPLIT procedure can train a decision tree model. For documentation, +see +http://documentation.sas.com/?docsetId=statug&docsetVersion=14.2&docsetTarget=statug_hpsplit_toc.htm. + +.. code-block:: none proc hpsplit data=work._csv(where=(_partInd_=1)) plot=all; class work_accident promotion_last_5years sales salary; model left = work_accident promotion_last_5years sales salary - satisfaction_level time_spend_company number_project average_montly_hours; + satisfaction_level time_spend_company number_project + average_montly_hours; run; -GLM ---- +The results include several tables and plots. Only the variable +information table is shown below. + +.. image:: ./images/hpsplit_variable_importance.png + :scale: 60 % + :alt: Variable importance for modeling 'left' with the HPSPLIT procedure. + + +Generalized linear model +======================== + +The GLM procedure can train a generalized linear model. For documentation, +see +http://documentation.sas.com/?docsetId=statug&docsetVersion=14.2&docsetTarget=statug_glm_toc.htm. -.. code:: sas + +.. code-block:: none proc glm data=work._csv(where=(_partInd_=1)) plot=all; class work_accident promotion_last_5years sales salary; model left = work_accident promotion_last_5years sales salary - satisfaction_level time_spend_company number_project average_montly_hours; + satisfaction_level time_spend_company number_project + average_montly_hours; run; -Logistic --------- +The results include several tables. Only the basic statistics and Type I sum of +squares are shown. + +.. image:: ./images/glm_rsq_ss1.png + :scale: 60 % + :alt: R-square, basic statistics, and Type I sum of squares. + +Logistic regression +=================== + +The HPLOGISTIC procedure can create logistic regression models. For documentation, see +http://documentation.sas.com/?docsetId=statug&docsetVersion=14.2&docsetTarget=statug_glm_overview.htm. -.. code:: sas +.. code-block:: none proc hplogistic data=work._csv(where=(_partInd_=1)); class work_accident promotion_last_5years sales salary; - model left = work_accident promotion_last_5years sales salary - satisfaction_level time_spend_company number_project average_montly_hours; + model left (event='1') = work_accident promotion_last_5years sales salary + satisfaction_level time_spend_company number_project + average_montly_hours; + run; -Neural Network --------------- +Neural network +============== + +The HPNERUAL procedure is available with a SAS Enterprise Miner license. + +The procedure trains a multilayer preceptron neural network. For documentation, see +http://documentation.sas.com/?docsetId=emhpprcref&docsetVersion=14.2&docsetTarget=emhpprcref_hpneural_toc.htm. -.. code:: sas +.. code-block:: none proc hpneural data=work._csv; hidden 19; - input work_accident promotion_last_5years sales salary / level=nominal; - input satisfaction_level time_spend_company number_project average_montly_hours / level=interval; - target left /level=nominal; + input work_accident promotion_last_5years sales salary + / level=nominal; + input satisfaction_level time_spend_company number_project average_montly_hours + / level=interval; + target left / level=nominal; train numtries=15 maxiter=300; run; -Decision Forest ---------------- +The results include several tables. The fit statistics and misclassification tables +are shown below. -.. code:: sas +.. image:: ./images/neural_fit_misclass.png + :scale: 60 % + :alt: Fit statistics and misclassification information. + +Decision forest +=============== + +The HPFOREST procedure is available with a SAS Enterprise Miner license. + +The HPFOREST procedure creates a forest of many decision trees and creates a predictive +model. For documentation, see http://documentation.sas.com/?docsetId=emhpprcref&docsetVersion=14.2&docsetTarget=emhpprcref_hpforest_toc.htm. + +.. code-block:: none proc hpforest data=work._csv; - input work_accident promotion_last_5years sales salary / level=nominal; - input satisfaction_level time_spend_company number_project average_montly_hours / level=interval; - target left /level=nominal; + input work_accident promotion_last_5years sales salary + / level=nominal; + input satisfaction_level time_spend_company number_project average_montly_hours + / level=interval; + target left / level=nominal; run; +The results include several tables of information. The loss reduction and variable +importance table is shown below. + +.. image:: ./images/forest_loss_reduction.png + :scale: 60 % + :alt: Loss reduction and variable importance for the HPFOREST procedure. + diff --git a/sas_kernel/doc/source/install.rst b/sas_kernel/doc/source/install.rst index 3cd4a95..41885c2 100644 --- a/sas_kernel/doc/source/install.rst +++ b/sas_kernel/doc/source/install.rst @@ -244,6 +244,7 @@ OSX (Mac) install sas /usr/local/share/jupyter/kernels/sas #. Verify that the SAS executable is correct. + #. Find the sascfg.py file -- it is located relative to the Python3 installation location (see above, [install location]/site-packages/saspy/sascfg.py). @@ -366,6 +367,6 @@ Example ======= There is a `notebook`_ that walks through the steps to install and -enable the extensions: +enable the extensions. .. _notebook: https://github.com/sassoftware/sas_kernel/blob/master/notebook/loadSASExtensions.ipynb