From 39e7f27998fbf2d345048e3a519e28cb963d281f Mon Sep 17 00:00:00 2001
From: Aaron Franklin <455404+afranklin@users.noreply.github.com>
Date: Wed, 7 Feb 2018 14:37:22 -0800
Subject: [PATCH] Add more documentation datasets (#251)

---
 userguide/activity_classifier/README.md               |  4 ++--
 userguide/activity_classifier/data-preperation.md     |  4 ++--
 userguide/clustering/dbscan.md                        |  2 +-
 userguide/clustering/kmeans.md                        |  5 +++--
 userguide/datasets.md                                 |  2 ++
 userguide/image_classifier/README.md                  | 11 ++++++-----
 userguide/image_similarity/README.md                  |  2 +-
 userguide/recommender/README.md                       |  2 +-
 userguide/sframe/sframe-intro.md                      |  2 +-
 .../supervised-learning/boosted_trees_classifier.md   |  2 +-
 .../supervised-learning/boosted_trees_regression.md   |  2 +-
 .../supervised-learning/decision_tree_classifier.md   |  2 +-
 .../supervised-learning/decision_tree_regression.md   |  2 +-
 .../supervised-learning/random_forest_classifier.md   |  2 +-
 .../supervised-learning/random_forest_regression.md   |  2 +-
 15 files changed, 25 insertions(+), 21 deletions(-)
 create mode 100644 userguide/datasets.md
diff --git a/userguide/activity_classifier/README.md b/userguide/activity_classifier/README.md
index 486c2325db..b648174cc3 100644
--- a/userguide/activity_classifier/README.md
+++ b/userguide/activity_classifier/README.md
@@ -8,7 +8,7 @@ The activity classifier in Turi Create creates a deep learning model capable of
 
 #### Introductory Example
 
-In this example we create a model to classify physical activities done by users of a handheld phone, using both accelerometer and gyroscope data. We will use data from the [HAPT experiment](http://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions) which contains recording sessions of multiple users, each performing certain physical activities. The performed activities are walking, climbing up stairs, climbing down stairs, sitting, standing, and laying.
+In this example we create a model to classify physical activities done by users of a handheld phone, using both accelerometer and gyroscope data. We will use data from the [HAPT experiment](http://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions) which contains recording sessions of multiple users, each performing certain physical activities.[<sup>1</sup>](../datasets.md) The performed activities are walking, climbing up stairs, climbing down stairs, sitting, standing, and laying.
 
 Sensor data can be collected at varying frequencies. In the HAPT dataset, the sensors were sampled at 50Hz each - meaning 50 times per second. However, most applications would want to show outputs to the user at larger intervals. We control the output prediction rate via the ```prediction_window``` parameter. For example, if we want to produce a prediction every 5 seconds, and the sensors are sampled at 50Hz - we would set the ```prediction_window``` to 250 (5 sec * 50 samples per second).
 
@@ -66,4 +66,4 @@ We've seen how we can quickly create an activity classifier given recorded sessi
 
 * [Advanced usage](advanced-usage.md)
 * [Deployment via Core ML](export_coreml.md)
-* [How does it work](how-it-works.md)
\ No newline at end of file
+* [How does it work](how-it-works.md)
diff --git a/userguide/activity_classifier/data-preperation.md b/userguide/activity_classifier/data-preperation.md
index 3086cd6e3c..b9ac934bed 100644
--- a/userguide/activity_classifier/data-preperation.md
+++ b/userguide/activity_classifier/data-preperation.md
@@ -1,6 +1,6 @@
 # HAPT Data Preparation
 
-In this section we will see how to get the [HAPT experiment](http://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions) data into the SFrame format expected by the activity classifier.
+In this section we will see how to get the [HAPT experiment](http://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions) data into the SFrame format expected by the activity classifier.[<sup>1</sup>](../datasets.md)
 
 First we need to download the data from [here](http://archive.ics.uci.edu/ml/machine-learning-databases/00341/HAPT%20Data%20Set.zip) in zip format. The code below assumes the data was unzipped into a directory named `HAPT Data Set`. This folder contains 3 types of files - a file containing the performed activities for each experiment, files containing the collected accelerometer samples, and files containing the collected gyroscope samples.
 
@@ -93,4 +93,4 @@ data = data.remove_column('activity_id')
 data.save('hapt_data.sframe')
 ```
 
-To learn more about the expected input format of the activity classifier please visit the [advanced usage](advanced-usage.md) section.
\ No newline at end of file
+To learn more about the expected input format of the activity classifier please visit the [advanced usage](advanced-usage.md) section.
diff --git a/userguide/clustering/dbscan.md b/userguide/clustering/dbscan.md
index 2d38216555..6fdd389ced 100644
--- a/userguide/clustering/dbscan.md
+++ b/userguide/clustering/dbscan.md
@@ -50,7 +50,7 @@ advantages:
 
 To illustrate the basic usage of DBSCAN and how the results can differ from
 K-means, we simulate non-spherical, low-dimensional data using the scikit-learn
-datasets module.
+datasets module.[<sup>1</sup>](../datasets.md)
 
 ```python
 import turicreate as tc
diff --git a/userguide/clustering/kmeans.md b/userguide/clustering/kmeans.md
index 3b43ada48b..515fe0bae1 100644
--- a/userguide/clustering/kmeans.md
+++ b/userguide/clustering/kmeans.md
@@ -28,8 +28,9 @@ distance from point $$x$$ to center $$B$$ when assigning $$x$$ to a cluster.
 
 #### Basic Usage
 
-We illustrate usage of Turi Create K-means with a dataset used to classify 
-schizophrenic subjects based on MRI scans. The original data consists of
+We illustrate usage of Turi Create K-means with the dataset from the [June
+2014 Kaggle competition to classify schizophrenic subjects based on MRI
+scans](https://www.kaggle.com/c/mlsp-2014-mri). Download **Train.zip** from the data tab.[<sup>1</sup>](../datasets.md) The original data consists of
 two sets of features: functional network connectivity (FNC) features and
 source-based morphometry (SBM) features, which we incorporate into a single
 [`SFrame`](https://apple.github.io/turicreate/docs/api/generated/turicreate.SFrame.html)
diff --git a/userguide/datasets.md b/userguide/datasets.md
new file mode 100644
index 0000000000..d7d3210a61
--- /dev/null
+++ b/userguide/datasets.md
@@ -0,0 +1,2 @@
+# User Guide Datasets
+Apple has provided links to certain datasets for reference purposes only and on an “as is” basis. You are solely responsible for your use of the datasets and for complying with applicable terms and conditions, including any use restrictions and attribution requirements. Apple shall not be liable for, and specifically disclaims any warranties, express or implied, in connection with, the use of the datasets, including any warranties of fitness for a particular purpose or non-infringement.
diff --git a/userguide/image_classifier/README.md b/userguide/image_classifier/README.md
index 86da57b7d7..6b58ffbb17 100644
--- a/userguide/image_classifier/README.md
+++ b/userguide/image_classifier/README.md
@@ -12,16 +12,16 @@ create a high quality image classifier model.
 
 #### Loading Data
 
-Suppose we have a dataset containing labeled cat and dog images.
+The [Kaggle Cats and Dogs Dataset](https://www.microsoft.com/en-us/download/details.aspx?id=54765) provides labeled cat and dog images.[<sup>1</sup>](../datasets.md) After downloading and decompressing the dataset, navigate to the main **kagglecatsanddogs** folder, which contains a **PetImages** subfolder.
 
 ```python
 import turicreate as tc
 
-# Load images
-data = tc.image_analysis.load_images('train', with_path=True)
+# Load images (Note: you can ignore 'Not a JPEG file' errors)
+data = tc.image_analysis.load_images('PetImages', with_path=True)
 
 # From the path-name, create a label column
-data['label'] = data['path'].apply(lambda path: 'dog' if 'dog' in path else 'cat')
+data['label'] = data['path'].apply(lambda path: 'dog' if '/Dog' in path else 'cat')
 
 # Save the data for future use
 data.save('cats-dogs.sframe')
@@ -44,7 +44,8 @@ data =  tc.SFrame('cats-dogs.sframe')
 # Make a train-test split
 train_data, test_data = data.random_split(0.8)
 
-# Automatically picks the right model based on your data.
+# Automatically pick the right model based on your data.
+# Note: Because the dataset is large, model creation may take hours.
 model = tc.image_classifier.create(train_data, target='label')
 
 # Save predictions to an SArray
diff --git a/userguide/image_similarity/README.md b/userguide/image_similarity/README.md
index bb0c1ad023..c7413e406f 100644
--- a/userguide/image_similarity/README.md
+++ b/userguide/image_similarity/README.md
@@ -13,7 +13,7 @@ unsupervised.
 In this example, we use the [Caltech-101
 dataset](http://www.vision.caltech.edu/Image_Datasets/Caltech101/)
 which contains images objects belonging to 101 categories with about 40
-to 800 images per category.
+to 800 images per category.[<sup>1</sup>](../datasets.md)
 
 ```python
 import turicreate as tc
diff --git a/userguide/recommender/README.md b/userguide/recommender/README.md
index cd7199e798..5891e50c87 100644
--- a/userguide/recommender/README.md
+++ b/userguide/recommender/README.md
@@ -9,7 +9,7 @@ interaction data and use that model to make recommendations.
 Creating a recommender model typically requires a data set to use for
 training the model, with columns that contain the user IDs, the item
 IDs, and (optionally) the ratings. For this example, we use the [MovieLens
- dataset](https://grouplens.org/datasets/movielens/).
+20M dataset](https://grouplens.org/datasets/movielens/20m/).[<sup>1</sup>](../datasets.md)
 
 ```python
 import turicreate as tc
diff --git a/userguide/sframe/sframe-intro.md b/userguide/sframe/sframe-intro.md
index 471f0ad1c0..ddf79e0a47 100644
--- a/userguide/sframe/sframe-intro.md
+++ b/userguide/sframe/sframe-intro.md
@@ -13,7 +13,7 @@ A very common data format is the comma separated value (csv) file, which
 is what we'll use for these examples.  We will use some preprocessed data from
 the
 [Million Song Dataset](https://labrosa.ee.columbia.edu/millionsong/) to
-aid our SFrame-related examples.  The first table contains metadata
+aid our SFrame-related examples.[<sup>1</sup>](../datasets.md)  The first table contains metadata
 about each song in the database.  Here's how we load it into an SFrame:
 
 ```python
diff --git a/userguide/supervised-learning/boosted_trees_classifier.md b/userguide/supervised-learning/boosted_trees_classifier.md
index fccd720dec..1fc45f5341 100644
--- a/userguide/supervised-learning/boosted_trees_classifier.md
+++ b/userguide/supervised-learning/boosted_trees_classifier.md
@@ -10,7 +10,7 @@ decision trees.
 
 ##### Introductory Example
 
-In this example, we will use the [Mushrooms dataset](https://archive.ics.uci.edu/ml/datasets/mushroom).
+In this example, we will use the [Mushrooms dataset](https://archive.ics.uci.edu/ml/datasets/mushroom).[<sup>1</sup>](../datasets.md)
 ```python
 import turicreate as tc
 
diff --git a/userguide/supervised-learning/boosted_trees_regression.md b/userguide/supervised-learning/boosted_trees_regression.md
index 518df45285..d16a5c9ea5 100644
--- a/userguide/supervised-learning/boosted_trees_regression.md
+++ b/userguide/supervised-learning/boosted_trees_regression.md
@@ -51,7 +51,7 @@ The algorithm simply fit a new decision tree to the residual at each iteration.
 
 ##### Introductory Example
 
-In this example, we will use the [Mushrooms dataset](https://archive.ics.uci.edu/ml/datasets/mushroom).
+In this example, we will use the [Mushrooms dataset](https://archive.ics.uci.edu/ml/datasets/mushroom).[<sup>1</sup>](../datasets.md)
 
 ```python
 import turicreate as tc
diff --git a/userguide/supervised-learning/decision_tree_classifier.md b/userguide/supervised-learning/decision_tree_classifier.md
index 55498d5c5a..724e8d820b 100644
--- a/userguide/supervised-learning/decision_tree_classifier.md
+++ b/userguide/supervised-learning/decision_tree_classifier.md
@@ -8,7 +8,7 @@ on decision trees.
 
 ##### Introductory Example
 
-In this example, we will use the [Mushrooms dataset](https://archive.ics.uci.edu/ml/datasets/mushroom).
+In this example, we will use the [Mushrooms dataset](https://archive.ics.uci.edu/ml/datasets/mushroom).[<sup>1</sup>](../datasets.md)
 ```python
 import turicreate as tc
 
diff --git a/userguide/supervised-learning/decision_tree_regression.md b/userguide/supervised-learning/decision_tree_regression.md
index 6249bb1d3e..b23d98f4fa 100644
--- a/userguide/supervised-learning/decision_tree_regression.md
+++ b/userguide/supervised-learning/decision_tree_regression.md
@@ -11,7 +11,7 @@ for more details).
 
 ##### Introductory Example
 
-In this example, we will use the [Mushrooms dataset](https://archive.ics.uci.edu/ml/datasets/mushroom).
+In this example, we will use the [Mushrooms dataset](https://archive.ics.uci.edu/ml/datasets/mushroom).[<sup>1</sup>](../datasets.md)
 
 ```python
 import turicreate as tc
diff --git a/userguide/supervised-learning/random_forest_classifier.md b/userguide/supervised-learning/random_forest_classifier.md
index 1352cbbf59..0ea31378d8 100644
--- a/userguide/supervised-learning/random_forest_classifier.md
+++ b/userguide/supervised-learning/random_forest_classifier.md
@@ -8,7 +8,7 @@ forests.
 
 ##### Introductory Example
 
-In this example, we will use the [Mushrooms dataset](https://archive.ics.uci.edu/ml/datasets/mushroom).
+In this example, we will use the [Mushrooms dataset](https://archive.ics.uci.edu/ml/datasets/mushroom).[<sup>1</sup>](../datasets.md)
 ```python
 import turicreate as tc
 
diff --git a/userguide/supervised-learning/random_forest_regression.md b/userguide/supervised-learning/random_forest_regression.md
index 9013a06873..72637d78f1 100644
--- a/userguide/supervised-learning/random_forest_regression.md
+++ b/userguide/supervised-learning/random_forest_regression.md
@@ -24,7 +24,7 @@ forests, all the base models are constructed independently using a
 
 ##### Introductory Example
 
-In this example, we will use the [Mushrooms dataset](https://archive.ics.uci.edu/ml/datasets/mushroom).
+In this example, we will use the [Mushrooms dataset](https://archive.ics.uci.edu/ml/datasets/mushroom).[<sup>1</sup>](../datasets.md)
 
 ```python
 import turicreate as tc