Updated/new documentation for custom nodes, CLI, tests #83

Merged
merged 6 commits into from
May 6, 2020
258 changes: 165 additions & 93 deletions README.md
@@ -1,96 +1,168 @@
# Introduction
![Postman Tests](https://github.com/matthew-t-smith/visual-programming/workflows/Postman%20Tests/badge.svg)
![Code Coverage](./docs/media/coverage.svg)
# PyWorkflow
| | |
|------------|--------|
| Docker | TBD |
| Back-end | ![Postman Tests](https://github.com/matthew-t-smith/visual-programming/workflows/Postman%20Tests/badge.svg) |
| Front-end | TBD |
| PyWorkflow | ![Code Coverage](./docs/media/pyworkflow_coverage.svg) |
| CLI | TBD |
| Jest | TBD |

PyWorkflow is a visual programming application for building data science
pipelines and workflows. It is inspired by [KNIME](https://www.knime.com)
and aims to bring the desktop-based experience to a web-based environment.
PyWorkflow takes a Python-first approach and leverages the power of *pandas*
DataFrames to bring data science to the masses.

![Pyworkflow UI](./docs/media/pyworkflow-ui.png)

So far the app comprises a Django app and a SPA React app (bootstrapped with
create-react-app). For React to request data from Django, the `proxy` field is
set in `front-end/package.json`, telling the dev server to fetch non-static
data from `localhost:8000` **where the Django app must be running**.

## Django

### Install Dependencies
1. Install `pipenv` from home directory

- **Homebrew**:

- `brew install pipenv`

- **pip**:

- `pip install pipenv`
- or depending on your versioning setup:
- `pip3 install pipenv`

- You can install at the User level using **pip** via: `pip install --user pipenv`

2. `cd` to the top level of the project (contains `Pipfile` and `Pipfile.lock`)

3. Install dependencies

- `pipenv install`

4. Activate and exit the shell

- `pipenv shell`
- `exit`

5. Or, run single commands

- `pipenv run python [COMMAND]`

### Installing new packages
- Simply install via: `pipenv install [package-name]`

### Create dotenv file with app secret
- `echo "SECRET_KEY='TEMPORARY SECRET KEY'" > vp/.environment`

### Start dev server from app root
- `cd vp`
- `pipenv run python manage.py runserver`

---
## React

### Install Prerequisites
- `cd front-end`
- `npm install`

### Start dev server
- `npm start`

---
## CLI
1. Run `pipenv shell`.
2. Create a workflow using the UI and save it.
3. Run it as: `pyworkflow execute workflow-file`

The CLI also accepts input from stdin (e.g. `< file.csv`) and can write to stdout (e.g. `> output.csv`).



---
## Tests
PyWorkflow currently has two sets of tests: API endpoints and unit tests.
The API tests are written in Postman and can be run individually, by importing
the collection and environment into your Postman application, or via the command
line by [installing Newman](https://www.npmjs.com/package/newman) and running:

- `cd Postman`
- `newman run PyWorkflow-runner.postman_collection.json --environment Local-env.postman_environment.json`

Unit tests for the PyWorkflow package are run using Python's built-in `unittest`
package.

- `cd pyworkflow/pyworkflow`
- `pipenv run python3 -m unittest tests/*.py`

To see coverage, you can use the `coverage` package. This is included in the Pipfile
but must be installed with `pipenv install --dev`. Then, while still in the pyworkflow
directory, you can run

- `coverage run -m unittest tests/*.py`
- `coverage report` (to see a report via the CLI)
- `coverage html && open htmlcov/index.html` (to view interactive coverage)
# Introduction
PyWorkflow was developed with a few key principles in mind:

1) Easily deployed. PyWorkflow can be deployed locally or remotely with pre-built
Docker containers.

2) Highly extensible. PyWorkflow has a few key nodes built in to perform common
operations, but it is designed with custom nodes in mind. Any user can write a
custom node of their own to perform operations with *pandas* or other data
science packages.

3) Advanced features for everyone. PyWorkflow is meant to cater to users with
no programming experience, all the way to someone who writes Python code daily.
An easy-to-use command line interface allows for batch workflow execution and
scheduled runs with a tool like `cron`.

To meet these principles, the user interface is built on
[react-diagrams](https://github.com/projectstorm/react-diagrams)
to enable drag-and-drop nodes and edge creation. These packaged nodes provide
basic *pandas* functionality and easy customization options for users to create
workflows tailored to their specific needs. For users looking to create custom
nodes, please [reference the documentation on how to write your own class](docs/custom_nodes.md).

On the back-end, a computational graph stores the nodes, edges, and
configuration options using the [NetworkX package](https://networkx.github.io).
All data operations are saved in JSON format which allows for easy readability
and transfer of data to other environments.
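
As a rough illustration of the idea, the sketch below stores a small graph as JSON, reloads it, and derives an execution order. It is not PyWorkflow's actual storage format or code: the node ids, node types, JSON layout, and the `execution_order` helper are all hypothetical, and it uses the standard library's `graphlib` as a stand-in for NetworkX so that it runs without extra packages (Python 3.9+).

```python
import json
from graphlib import TopologicalSorter

# Hypothetical workflow: ids, types, and layout are illustrative only,
# not the format PyWorkflow actually writes.
workflow = {
    "nodes": [
        {"id": "1", "type": "read_csv"},
        {"id": "2", "type": "filter"},
        {"id": "3", "type": "write_csv"},
    ],
    "edges": [["1", "2"], ["2", "3"]],  # [source, target] pairs
}


def execution_order(graph):
    """Return node ids so that every node comes after its predecessors."""
    sorter = TopologicalSorter({node["id"]: set() for node in graph["nodes"]})
    for source, target in graph["edges"]:
        sorter.add(target, source)  # target depends on source
    return list(sorter.static_order())


# Round-trip through JSON, as a saved workflow file would be.
serialized = json.dumps(workflow, indent=2)
restored = json.loads(serialized)
print(execution_order(restored))
```

Because the saved form is plain JSON, the same file can be inspected in a text editor or moved to another environment and reloaded there.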

# Getting Started
The back-end consists of the PyWorkflow package, to perform all graph-based
operations, file storage/retrieval, and execution. These methods are triggered
either via API calls from the Django web-server, or from the CLI application.

The front-end is a SPA React app (bootstrapped with create-react-app). For React
to request data from Django, the `proxy` field is set in `front-end/package.json`,
telling the dev server to fetch non-static data from `localhost:8000` **where
the Django app must be running**.

## Docker

The easiest way to get started is by deploying both Docker containers on your
local machine. For help installing Docker, [reference the documentation for your
specific system](https://docs.docker.com/get-docker/).

The Docker container for PyWorkflow is built from two images: the `front-end` and
the `back-end`. The `docker-compose.yml` defines how to combine and run the two.

In order to build each image individually, from the root of the application:
- `docker build front-end --tag FE_IMAGE[:TAG]`
- `docker build back-end --tag BE_IMAGE[:TAG]`
- Example: `docker build back-end --tag backendtest:2.0`

Each individual image can be run by changing to the `front-end` or `back-end` directory and running:
- `docker run -p 3000:3000 --name FE_CONTAINER_NAME FE_IMAGE[:TAG]`
- `docker run -p 8000:8000 --name BE_CONTAINER_NAME BE_IMAGE[:TAG]`
- Example: `docker run -p 8000:8000 --name pyworkflow-be backendtest:2.0`

Note: there [is a known issue with `react-scripts` v3.4.1](https://github.com/facebook/create-react-app/issues/8688)
that may cause the front-end container to exit with code 0. If this happens,
you can add `-e CI=true` to the `docker run` command above for the front-end.

To compose and run the entire application container, from the root of the application:
- `docker-compose up`

You can then kill the container gracefully with:
- `docker-compose down`

NOTE: For development, change the `proxy` value in `./front-end/package.json` from `"http://back-end:8000"` to `"http://localhost:8000"`.


## Serve locally

Alternatively, the front- and back-ends can be compiled separately and run on
your local machine.

### Server (Django)

1. Install `pipenv`

- **Homebrew**

```
brew install pipenv
```

- **pip**

```
pip install pipenv
# or, depending on your Python setup:
pip3 install pipenv
```
2. Install dependencies
Go to the `back-end` directory, which contains `Pipfile` and `Pipfile.lock`.
```
cd back-end
pipenv install
```
3. Set up your local environment

- Create environment file with app secret
```
echo "SECRET_KEY='TEMPORARY SECRET KEY'" > vp/.environment
```

4. Start dev server from app root
```
cd vp
pipenv run python3 manage.py runserver
```

If you have trouble running commands individually, you can also enter the
virtual environment created by `pipenv` by running `pipenv shell`.

### Client (react-diagrams)
In a separate terminal window, perform the following steps to start the
front-end.

1. Install Prerequisites
```
cd front-end
npm install
```
2. Start dev server
```
npm start
```

# CLI
PyWorkflow also provides a command-line interface to execute pre-built workflows
without the client or server running. The CLI is packaged in the `back-end`
directory and can be accessed through a deployed Docker container, or locally
through the `pipenv shell`.

The CLI syntax for PyWorkflow is:
```
pyworkflow execute workflow-file...
```

For help with reading from stdin, writing to stdout, batch processing, and more,
[check out the CLI docs](docs/cli.md).

# Tests
PyWorkflow has several automated tests that are run on each push to the GitHub
repository through GitHub Actions. The status of each can be seen in the various
badges at the top of this README.

PyWorkflow currently has unit tests for both the back-end (the PyWorkflow
package) and the front-end (react-diagrams). There are also API tests
using Postman to test the integration between the front- and back-ends. For more
information on these tests and how to run them, [read the test
documentation](docs/tests.md).
80 changes: 80 additions & 0 deletions docs/cli.md
@@ -0,0 +1,80 @@
# Command-line Interface

PyWorkflow is first and foremost a visual programming application, designed to
help data scientists and many others build workflows to view, manipulate, and
output their data into new formats. Therefore, all workflows must first be
created via the user-interface and saved for later execution.

However, it may not always be ideal to have the client and server deployed
locally or on a remote server just to run your workflows. Power-users want the
ability to run multiple workflows at once, schedule workflow runs, and
dynamically pass data from workflows via stdin/stdout in traditional shell
scripts. This is where the inclusion of PyWorkflow's CLI really shines.

## Command-line syntax

```
pyworkflow execute workflow-file...
```
### Commands

#### Execute
Accepts one or more workflow files as arguments to execute. PyWorkflow will load
the file(s) specified and output status messages to `stdout`. If a workflow
fails to run because of an exception, the exception will be logged to `stderr`.

**Single-file example**
```
pyworkflow execute ./workflows/my_workflow.json
```
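
When `execute` is driven from a script, the two streams can be captured separately. The following is a minimal Python sketch under stated assumptions: `run_and_capture` is a hypothetical helper, and a small stand-in command is used in place of `pyworkflow` so the example runs without PyWorkflow installed. The exit code PyWorkflow returns on failure is not documented here, so the sketch only reports it.

```python
import subprocess
import sys


def run_and_capture(command):
    """Run a command and return (exit code, stdout text, stderr text)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


# Stand-in for: ["pyworkflow", "execute", "./workflows/my_workflow.json"]
# It prints one status line to stdout and one error line to stderr.
demo_command = [
    sys.executable,
    "-c",
    "import sys; print('status: ok'); print('error: boom', file=sys.stderr)",
]

code, out, err = run_and_capture(demo_command)
print("exit code:", code)
print("status messages:", out.strip())
print("logged errors:", err.strip())
```

Keeping the streams apart makes it possible to store status output in one log and surface exceptions in another.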

**Batch processing**

Many shells offer different wildcards that can be used to work with multiple
files on the command line, or in scripts. A useful one is the `*` wildcard that
matches anything. Used in the following example, it has the effect of
passing all files located within the `workflows` directory to the `execute`
command.

```
pyworkflow execute ./workflows/*
```
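
The shell expands `*` before `pyworkflow` ever starts, so the command receives a plain list of file paths. A script that does not go through a shell has to do that expansion itself. The sketch below shows one way in Python; `build_execute_command` is a hypothetical helper, and the temporary directory and file names exist only to make the example self-contained.

```python
import glob
import os
import tempfile


def build_execute_command(directory):
    """Expand `directory/*` the way a shell would and build the argument list."""
    files = sorted(glob.glob(os.path.join(directory, "*")))
    return ["pyworkflow", "execute", *files]


# Demo with throwaway files standing in for saved workflows.
with tempfile.TemporaryDirectory() as workflows_dir:
    for name in ("a.json", "b.json"):
        with open(os.path.join(workflows_dir, name), "w") as handle:
            handle.write("{}")
    command = build_execute_command(workflows_dir)
    names = [os.path.basename(path) for path in command[2:]]
    print(command[:2], names)
```

The resulting list can be handed to `subprocess.run` as is. Like the shell wildcard, `glob` skips hidden files whose names start with a dot.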

## Using `stdin`/`stdout` to modify workflows

Two powerful tools when writing shell scripts are redirection and pipes, which
allow you to dynamically pass data from one command to another. Using these
tools, you can pass different data into and out of a workflow, overriding the
default input and output that the workflow defines.

PyWorkflow comes with a Read CSV input node and Write CSV output node. When data
is provided via `stdin` on the command-line, it will modify the workflow
behavior to redirect the Read CSV node to that data. Similarly, if a destination
is specified for `stdout`, the Write CSV node output will be redirected there.
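
One common way for a command-line tool to notice that data was provided on `stdin` is to check whether the stream is attached to a terminal. The sketch below illustrates that general technique only; it is not taken from PyWorkflow's source, and `resolve_csv_source`, `FakeTerminal`, and the file name are hypothetical.

```python
import io


def resolve_csv_source(stdin, configured_path):
    """Prefer piped or redirected stdin; otherwise use the workflow's own file."""
    if stdin is not None and not stdin.isatty():
        return ("stdin", stdin.read())
    return ("file", configured_path)


class FakeTerminal(io.StringIO):
    """Test double that behaves like an interactive terminal."""

    def isatty(self):
        return True


piped = resolve_csv_source(io.StringIO("a,b\n1,2\n"), "sample_file.csv")
interactive = resolve_csv_source(FakeTerminal(), "sample_file.csv")
print(piped)
print(interactive)
```

In a real program the first argument would be `sys.stdin`. The same check on `sys.stdout` can tell a tool whether its output is being redirected or piped.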

Input data can be passed to PyWorkflow in a few ways.
1) Redirection
```
# Data from sample_file.csv is passed to a Read CSV node
pyworkflow execute my_workflow.json < sample_file.csv
```
2) Pipes
```
# Two CSV files are combined and passed in to a Read CSV node
cat sample_file.csv more_data.csv | pyworkflow execute my_workflow.json

# Data from a 'csv_exporter' tool is passed to a Read CSV node
csv_exporter generate | pyworkflow execute my_workflow.json
```

Output data can be passed from PyWorkflow in a few ways.
1) Redirection
```
# Output from a Write CSV node is stored in a new file 'output.csv'
pyworkflow execute my_workflow.json > output.csv
```
2) Pipes
```
# Output from a Write CSV node is searched for the phrase 'foobar'
pyworkflow execute my_workflow.json | grep "foobar"
```