add tensorflow example (#42)

* first version from google docs
* in-progress effort on tensorflow. submitting for visibility of status
* accuracy tweak
* fix description
* small tweak
* save the link id
* fixing parsing of json article
* adding in rapidApi integration
* final working script, with minimal API usage
* final README.md
* fix formatting
* fix formatting
* removing duplicate import
* Update README.md
* adding line to main import
* fix url

Co-authored-by: margaretkennedy <[email protected]>
deephaven · Sep 23, 2021 · c9e1dc2
1 parent a43729f
commit c9e1dc2

Showing 4 changed files with 403 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -16,6 +16,8 @@ The following folders can be found in this repository:
 - **[`metriccentury`](https://github.com/deephaven/examples/tree/main/metriccentury)** - Data recorded from a 100 km bike ride
 - **[`pems`](https://pems.dot.ca.gov/)** - Traffic flow data collected near Davis, CA.
 - **[`taxi`](https://azure.microsoft.com/en-us/services/open-datasets/catalog/nyc-taxi-limousine-commission-yellow-taxi-trip-records/)** - Yellow Taxi trip records
+- **[`tensorflow`](https://www.tensorflow.org/)** - Calculate positive/negative sentiment for articles from a
+  Seeking Alpha RSS feed, using a model trained with machine learning.
 - **[`fit`](https://www.strava.com/)** - Workout results in the proprietary fit format developed by Garmin. Downloadable from Strava.
 - **[`tickingHeartRate`]** - Simulated ticking heart rate data.
 

diff --git a/tensorflow/README.md b/tensorflow/README.md
@@ -0,0 +1,36 @@
+# TensorFlow example demonstrating data from Seeking Alpha
+
+Pull an RSS feed from Seeking Alpha, and statistically calculate positive/negative sentiment using machine-learning
+training mechanisms.
+
+## Table of contents
+
+* `tensorflow.py` - Python script to run.
+* `trainData.csv` - The input data to train the AI algorithm.
+
+## Steps to run
+
+1. Install Python modules:
+   `docker exec $(basename $(pwd))_grpc-api_1 pip install tensorflow tensorflow_hub sklearn spacy bs4 lxml`
+   Note: use this exact install command rather than the variations described in
+   [How to install Python packages](https://deephaven.io/core/docs/how-to-guides/install-python-packages).
+   With other install methods, bs4 may fail to detect that lxml has been installed.
+   See <https://github.com/deephaven/deephaven-core/discussions/1299> for more information.
+1. Install the spaCy English model:
+   `docker exec $(basename $(pwd))_grpc-api_1 pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz`
+   Alternatively, use another version from here:
+   <https://github.com/explosion/spacy-models/releases/>
+1. Drag/drop the file `trainData.csv` onto the Deephaven console.
+1. Create a free account at <https://rapidapi.com/developer> and subscribe to <https://rapidapi.com/apidojo/api/seeking-alpha/>.
+    * Each time you run the script, you consume some of your API quota for this particular endpoint. Usage is
+      kept minimal: one API call for each article that Seeking Alpha publishes on a given day (tracked in the
+      `knownLinks[]` variable within the script). However, to allow repeated iterations for
+      debugging/troubleshooting, all variables are reset on each new script run, so every run requires another
+      round of API calls.
+    * The number of API calls per day is usually small (~5-30), so provided the script is run only once per day, the
+      free tier of 500 calls/month should be adequate for demonstration purposes.
+    * API call usage can be seen here: <https://rapidapi.com/developer/dashboard>
+1. Open any of the endpoint examples, then **copy and save** your unique API key. It is called `x-rapidapi-key`.
+1. Import your key into Deephaven by running:
+   `ra_sa_key='enter-your-key-here'` (do not add any extra space or quote characters)
+1. Run `tensorflow.py`.
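
Note on the README steps above: the following is a minimal Python sketch of the two ideas the README describes, namely passing the key as `x-rapidapi-key` without stray space/quote characters, and using a list of already-seen links so that each article costs only one API call per run. It is not the code in `tensorflow.py`. The names `build_headers`, `fetch_unseen`, `known_links`, and the `fetch` callable are hypothetical, and the `x-rapidapi-host` value is an assumption that should be checked against the endpoint examples on RapidAPI.

```python
from typing import Callable, Dict, Iterable, List

# Assumption: host name for the Seeking Alpha API on RapidAPI.
# Confirm the real value in the endpoint examples before use.
RAPIDAPI_HOST = "seeking-alpha.p.rapidapi.com"


def build_headers(key: str) -> Dict[str, str]:
    """Build request headers, rejecting keys with stray spaces or quotes."""
    if not key or key != key.strip() or any(c in key for c in "'\" "):
        raise ValueError("API key is empty or contains space/quote characters")
    return {"x-rapidapi-key": key, "x-rapidapi-host": RAPIDAPI_HOST}


def fetch_unseen(
    links: Iterable[str],
    known_links: List[str],
    fetch: Callable[[str], object],
) -> List[object]:
    """Call fetch(link) once for every link not already in known_links.

    known_links is updated in place, so a later call with overlapping
    links makes no further fetch calls for them.
    """
    results = []
    for link in links:
        if link in known_links:
            continue  # already fetched during this run: no API call
        known_links.append(link)
        results.append(fetch(link))
    return results
```

Here `fetch` stands in for whatever function performs the actual HTTP request. Because `known_links` lives only for the duration of a script run, rerunning the script starts from an empty list and repeats the calls, which matches the quota behavior the README warns about.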