show the getting started code
Mike McKiernan committed Mar 15, 2017
1 parent 94a457a commit 749fca1
Showing 2 changed files with 188 additions and 64 deletions.
249 changes: 186 additions & 63 deletions sas_kernel/doc/source/getting-started.rst
@@ -1,163 +1,286 @@
###############
Getting started
###############

Getting Started
===============
The SAS kernel for Jupyter is designed to enable users to write programs
for SAS with Jupyter Notebooks. The kernel makes SAS the analytical engine
or "calculator" for data analysis.

The following SAS program is a basic example of programming with SAS
and Jupyter Notebook. The HR_comma_sep.csv file that is referenced in
the sample code is available from https://www.kaggle.com/ludobenistant/hr-analytics.

The SAS Kernel for Jupyter is designed to allow users to use Jupyter Notebooks
to interact with SAS (SAS 9.4 or SAS Viya). It makes SAS the
analytical engine or "calculator" for data analysis. In its simplest
form, SASPY is a code translator that takes Python commands, converts
them into SAS procedure and DATA step calls, and then displays the
results.

Load Data Into SAS
------------------
******************
Load data into SAS
******************

.. code:: sas
The FILENAME statement is used to specify an external file. The
IMPORT procedure can read data from a variety of external file formats.

.. code-block:: none
filename x "./HR_comma_sep.csv";
proc import datafile=x out=_csv dbms=csv replace; run;
****************
Explore the data
----------------
****************

.. code:: sas
The CONTENTS procedure can be used to display the column names and
data types in a SAS data set. The PRINT procedure can be used to
display rows from a data set. In this case, the results are limited to
the first five rows.

.. code-block:: none
proc contents data=work._csv;
ods select Variables;
run;
.. code:: sas
proc print data=work._csv(obs=5);
run;
.. code:: sas
The MEANS procedure is used to provide descriptive statistics.

.. code-block:: none
proc means data=work._csv n nmiss median mean std min p25 p50 p75 max;
run;
.. code:: sas
The SGPLOT procedure is used to produce vertical bar charts of frequencies
for salary and sales.

.. code-block:: none
proc sgplot data=work._csv;
vbar salary;
xaxis discreteorder=data;
run;
.. code:: sas
proc sgplot data=work._csv;
vbar sales;
run;
.. code:: sas
.. image:: ./images/sgplot_vbar_salary.png
:scale: 60 %
:alt: Plot of low, medium, and high salary frequencies.

The plot of sales across groups, such as IT, RandD, and so on, is similar.
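
The counts behind the two bar charts can also be read as tables. The following is a minimal sketch, assuming the data was imported as ``work._csv`` as shown earlier. It uses the FREQ procedure, which is not part of the original walkthrough.

```sas
/* One-way frequency tables for the two character columns.     */
/* NOCUM suppresses the cumulative count and percent columns.  */
proc freq data=work._csv;
    tables salary sales / nocum;
run;
```

Each table row should correspond to one bar in the matching SGPLOT chart.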


Plot histograms with normal distribution curves.

.. code-block:: none
proc sgplot data=work._csv;
histogram last_evaluation / scale=count;
density last_evaluation;
histogram satisfaction_level / scale=count;
density satisfaction_level;
run;
.. code:: sas
proc sgplot data=work._csv;
histogram time_spend_company / scale=count;
density time_spend_company;
run;
proc sgplot data=work._csv;
histogram last_evaluation / scale=count;
density last_evaluation;
run;
proc sgplot data=work._csv;
histogram satisfaction_level / scale=count;
density satisfaction_level;
run;
.. code:: sas
.. image:: ./images/hist_satisfaction_level.png
:scale: 60 %
:alt: Histogram of employee satisfaction.

The plots for time spent with the company and last evaluation
are similar.

Plot a heatmap that shows the relationship between employee
satisfaction and the last evaluation.

.. code-block:: none
proc sgplot data=work._csv;
heatmap x=last_evaluation y=satisfaction_level;
run;
.. code:: sas
.. image:: ./images/heatmap_satisfaction_evaluation.png
:scale: 60 %
:alt: Heatmap of employee satisfaction and evaluation.

There is a small frequency spike in the lower-right corner
of the heatmap.

Narrow the heatmap to the employees that have low satisfaction
but were evaluated highly.

proc sgplot data=work._csv(where=(satisfaction_level <.2 and last_evaluation>.7));
.. code-block:: none
proc sgplot data=work._csv(where=(satisfaction_level < .2 and last_evaluation > .7));
heatmap x=last_evaluation y=satisfaction_level;
run;
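
To see how many employees fall into that corner of the heatmap, the same condition can be counted directly. This is a sketch that is not part of the original walkthrough: it assumes the ``work._csv`` data set from the import step, and ``n_low_sat_high_eval`` is an illustrative column alias.

```sas
/* Count the observations that meet the same condition as the */
/* narrowed heatmap (low satisfaction, high evaluation).       */
/* n_low_sat_high_eval is an illustrative alias.               */
proc sql;
    select count(*) as n_low_sat_high_eval
    from work._csv
    where satisfaction_level < .2 and last_evaluation > .7;
quit;
```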
.. code:: sas
Finally, plot the median satisfaction level for retained employees
side by side with the median satisfaction level for employees who left.

.. code-block:: none
proc sgpanel data=work._csv;
*where satisfaction_level <.2 and last_evaluation>.7;
*where satisfaction_level <.2 and last_evaluation > .7;
PANELBY left;
hbar sales / response=last_evaluation stat=median;
hbar sales / response=last_evaluation stat=median;
hbar sales / response=satisfaction_level stat=median ;
run;
.. tip:: Remove the asterisk at the start of the WHERE statement to plot
only the employees with low satisfaction and high evaluations.

.. image:: ./images/panel_left_sales.png
:scale: 60 %
:alt: Heatmap of employee satisfaction and evaluation for
employees with low satisfaction and high evaluation.



*************************************
Split the data into training and test
-------------------------------------
*************************************

Replace the data set with one that includes a partitioning indicator.
The variable is named ``_PartInd_`` and indicates whether an observation
is part of the training data (1) or the test data (0). Seventy percent
of the data is included in training and the remainder is used for testing.

.. code:: sas
.. code-block:: none
proc hpsample data=work._csv out=work._csv samppct=70 seed=9878 partition;
class left _character_;
target left;
var work_accident average_montly_hours last_evaluation number_project promotion_last_5years satisfaction_level
time_spend_company;
var work_accident average_montly_hours last_evaluation number_project
promotion_last_5years satisfaction_level time_spend_company;
run;
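
One way to confirm the split is to tabulate the partition indicator. This is a sketch that is not part of the original walkthrough: it assumes the HPSAMPLE step above has run and replaced ``work._csv``, and it uses the FREQ procedure.

```sas
/* Overall share of observations in training (1) and test (0). */
proc freq data=work._csv;
    tables _PartInd_;
    /* Row percentages show the split within each level of the target. */
    tables left*_PartInd_ / nocol nopercent;
run;
```

The percentage for ``_PartInd_`` = 1 should be close to 70, both overall and within each level of ``left``.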
`Decision Tree <http://support.sas.com/documentation/cdl/en/stathpug/68163/HTML/default/viewer.htm#stathpug_hpsplit_toc.htm>`__
-------------------------------------------------------------------------------------------------------------------------------
.. code:: sas
************************
Train a series of models
************************

Decision tree
=============

The HPSPLIT procedure can train a decision tree model. For documentation,
see
http://documentation.sas.com/?docsetId=statug&docsetVersion=14.2&docsetTarget=statug_hpsplit_toc.htm.

.. code-block:: none
proc hpsplit data=work._csv(where=(_partInd_=1)) plot=all;
class work_accident promotion_last_5years sales salary;
model left = work_accident promotion_last_5years sales salary
satisfaction_level time_spend_company number_project average_montly_hours;
satisfaction_level time_spend_company number_project
average_montly_hours;
run;
GLM
---
The results include several tables and plots. Only the variable
importance table is shown below.

.. image:: ./images/hpsplit_variable_importance.png
:scale: 60 %
:alt: Variable importance for modeling 'left' with the HPSPLIT procedure.


General linear model
====================

The GLM procedure can train a general linear model. For documentation,
see
http://documentation.sas.com/?docsetId=statug&docsetVersion=14.2&docsetTarget=statug_glm_toc.htm.

.. code:: sas

.. code-block:: none
proc glm data=work._csv(where=(_partInd_=1)) plot=all;
class work_accident promotion_last_5years sales salary;
model left = work_accident promotion_last_5years sales salary
satisfaction_level time_spend_company number_project average_montly_hours;
satisfaction_level time_spend_company number_project
average_montly_hours;
run;
Logistic
--------
The results include several tables. Only the basic statistics and Type I sum of
squares are shown.

.. image:: ./images/glm_rsq_ss1.png
:scale: 60 %
:alt: R-square, basic statistics, and Type I sum of squares.

Logistic regression
===================

The HPLOGISTIC procedure can create logistic regression models. For documentation, see
http://documentation.sas.com/?docsetId=statug&docsetVersion=14.2&docsetTarget=statug_glm_overview.htm.

.. code:: sas
.. code-block:: none
proc hplogistic data=work._csv(where=(_partInd_=1));
class work_accident promotion_last_5years sales salary;
model left = work_accident promotion_last_5years sales salary
satisfaction_level time_spend_company number_project average_montly_hours;
model left (event='1') = work_accident promotion_last_5years sales salary
satisfaction_level time_spend_company number_project
average_montly_hours;
run;
Neural Network
--------------
Neural network
==============

The HPNEURAL procedure is available with a SAS Enterprise Miner license.

The procedure trains a multilayer perceptron neural network. For documentation, see
http://documentation.sas.com/?docsetId=emhpprcref&docsetVersion=14.2&docsetTarget=emhpprcref_hpneural_toc.htm.

.. code:: sas
.. code-block:: none
proc hpneural data=work._csv;
hidden 19;
input work_accident promotion_last_5years sales salary / level=nominal;
input satisfaction_level time_spend_company number_project average_montly_hours / level=interval;
target left /level=nominal;
input work_accident promotion_last_5years sales salary
/ level=nominal;
input satisfaction_level time_spend_company number_project average_montly_hours
/ level=interval;
target left / level=nominal;
train numtries=15 maxiter=300;
run;
Decision Forest
---------------
The results include several tables. The fit statistics and misclassification tables
are shown below.

.. code:: sas
.. image:: ./images/neural_fit_misclass.png
:scale: 60 %
:alt: Fit statistics and misclassification information.

Decision forest
===============

The HPFOREST procedure is available with a SAS Enterprise Miner license.

The HPFOREST procedure builds a predictive model from a forest of many decision
trees. For documentation, see http://documentation.sas.com/?docsetId=emhpprcref&docsetVersion=14.2&docsetTarget=emhpprcref_hpforest_toc.htm.

.. code-block:: none
proc hpforest data=work._csv;
input work_accident promotion_last_5years sales salary / level=nominal;
input satisfaction_level time_spend_company number_project average_montly_hours / level=interval;
target left /level=nominal;
input work_accident promotion_last_5years sales salary
/ level=nominal;
input satisfaction_level time_spend_company number_project average_montly_hours
/ level=interval;
target left / level=nominal;
run;
The results include several tables of information. The loss reduction and variable
importance table is shown below.

.. image:: ./images/forest_loss_reduction.png
:scale: 60 %
:alt: Loss reduction and variable importance for the HPFOREST procedure.

3 changes: 2 additions & 1 deletion sas_kernel/doc/source/install.rst
@@ -244,6 +244,7 @@ OSX (Mac) install
sas /usr/local/share/jupyter/kernels/sas

#. Verify that the SAS executable is correct.

#. Find the sascfg.py file -- it is located relative to the Python3 installation
location (see above, [install location]/site-packages/saspy/sascfg.py).

@@ -366,6 +367,6 @@ Example
=======

There is a `notebook`_ that walks through the steps to install and
enable the extensions:
enable the extensions.

.. _notebook: https://github.com/sassoftware/sas_kernel/blob/master/notebook/loadSASExtensions.ipynb
