diff --git a/README.md b/README.md
index 90d589e..ad24742 100644
--- a/README.md
+++ b/README.md
@@ -1,96 +1,168 @@
-# Introduction
-![Postman Tests](https://github.com/matthew-t-smith/visual-programming/workflows/Postman%20Tests/badge.svg)
-![Code Coverage](./docs/media/coverage.svg)
+# PyWorkflow
+| Component  | Status |
+|------------|--------|
+| Docker | TBD |
+| Back-end | ![Postman Tests](https://github.com/matthew-t-smith/visual-programming/workflows/Postman%20Tests/badge.svg) |
+| Front-end | TBD |
+| PyWorkflow | ![Code Coverage](./docs/media/pyworkflow_coverage.svg) |
+| CLI | TBD |
+| Jest | TBD |
+
+PyWorkflow is a visual programming application for building data science
+pipelines and workflows. It is inspired by [KNIME](https://www.knime.com)
+and aims to bring the desktop-based experience to a web-based environment.
+PyWorkflow takes a Python-first approach and leverages the power of *pandas*
+DataFrames to bring data science to the masses.
 
 ![Pyworkflow UI](./docs/media/pyworkflow-ui.png)
 
-So far the app comprises a Django app and a SPA React app (bootstrapped with
-create-react-app). For React to request data from Django, the `proxy` field is
-set in `front-end/package.json`, telling the dev server to fetch non-static
-data from `localhost:8000` **where the Django app must be running**.
-
-## Django
-
-### Install Dependencies
-1. Install `pipenv` from home directory
-
-   - **Homebrew**:
-
-     - `brew install pipenv`
-
-   - **pip**:
-
-     - `pip install pipenv`
-
-       or depending on your versioning setup:
-
-       `pip3 install pipenv`
-
-   - You can install at the User level using **pip** via: `pip install --user pipenv`
-
-2. `cd` to top level of project (contains `Pipfile` and `Pipefile.lock`)
-
-3. Install dependencies
-
-   - `pipenv install`
-
-4. Activate and exit the shell
-
-   - `pipenv shell`
-   - `exit`
-
-5. Or, run single commands
-
-   - `pipenv run python [COMMAND]`
-
-### Installing new packages
-- Simply install via: `pipenv install [package-name]`
-
-### Create dotenv file with app secret
-- `echo "SECRET_KEY='TEMPORARY SECRET KEY'" > vp/.environment`
-
-### Start dev server from app root
-- `cd vp`
-- `pipenv run python manage.py runserver`
-
----
-## React
-
-### Install Prerequisites
-- `cd front-end`
-- `npm install`
-
-### Start dev server
-- `npm start`
-
----
-## CLI
-1. Run pipenv shell.
-2. Create a workflow using UI and save it.
-3. Run it as: pyworkflow execute workflow-file
-
-Also accepts reading input from std (i.e < file.csv) and writing to sdt out (i.e > output.csv)
-
----
-## Tests
-PyWorkflow currently has two sets of tests: API endpoints and unit tests.
-The API tests are written in Postman and can be run individually, by importing
-the collection and environment into your Postman application, or via the command
-line by [installing Newman](https://www.npmjs.com/package/newman) and running:
-
-- `cd Postman`
-- `newman run PyWorkflow-runner.postman_collection.json --environment Local-env.postman_environment.json`
-
-Unit tests for the PyWorkflow package are run using Python's built-in `unittest`
-package.
-
-- `cd pyworkflow/pyworkflow`
-- `pipenv run python3 -m unittest tests/*.py`
-
-To see coverage, you can use the `coverage` package. This is included in the Pipfile
-but must be installed with `pipenv install -dev`. Then, while still in the pyworkflow
-directory, you can run
-
-- `coverage run -m unittest tests/*.py`
-- `coverage report` (to see a report via the CLI)
-- `coverage html && open /htmlcov/index.html` (to view interactive coverage)
+# Introduction
+PyWorkflow was developed with a few key principles in mind:
+
+1) Easily deployed. PyWorkflow can be deployed locally or remotely with pre-built
+Docker containers.
+
+2) Highly extensible. PyWorkflow has a few key nodes built in to perform common
+operations, but it is designed with custom nodes in mind. Any user can write a
+custom node of their own to perform *pandas* operations, or to draw on other
+data science packages.
+
+3) Advanced features for everyone. PyWorkflow is meant to cater to users with
+no programming experience, all the way to those who write Python code daily.
+An easy-to-use command-line interface allows for batch workflow execution and
+scheduled runs with a tool like `cron`.
+
+To meet these principles, the user interface is built on
+[react-diagrams](https://github.com/projectstorm/react-diagrams)
+to enable drag-and-drop node and edge creation. The packaged nodes provide
+basic *pandas* functionality and easy customization options for users to create
+workflows tailored to their specific needs. For users looking to create custom
+nodes, please [reference the documentation on how to write your own node class](docs/custom_nodes.md).
+
+On the back-end, a computational graph stores the nodes, edges, and
+configuration options using the [NetworkX package](https://networkx.github.io).
+All data operations are saved in JSON format, which allows for easy readability
+and transfer of data to other environments.
+
+# Getting Started
+The back-end consists of the PyWorkflow package, which performs all graph-based
+operations, file storage/retrieval, and execution. These methods are triggered
+either via API calls from the Django web server or from the CLI application.
+
+The front-end is a SPA React app (bootstrapped with create-react-app). For React
+to request data from Django, the `proxy` field is set in `front-end/package.json`,
+telling the dev server to fetch non-static data from `localhost:8000` **where
+the Django app must be running**.
+
+## Docker
+
+The easiest way to get started is by deploying both Docker containers on your
+local machine. For help installing Docker, [reference the documentation for your
+specific system](https://docs.docker.com/get-docker/).
+
+The PyWorkflow application is built from two images: the `front-end` and the
+`back-end`. The `docker-compose.yml` defines how to combine and run the two.
+
+To build each image individually, run the following from the root of the
+application:
+- `docker build front-end --tag FE_IMAGE[:TAG]`
+- `docker build back-end --tag BE_IMAGE[:TAG]`
+  (e.g., `docker build back-end --tag backendtest:2.0`)
+
+Each image can then be run on its own:
+- `docker run -p 3000:3000 --name FE_CONTAINER_NAME FE_IMAGE[:TAG]`
+- `docker run -p 8000:8000 --name BE_CONTAINER_NAME BE_IMAGE[:TAG]`
+  (e.g., `docker run -p 8000:8000 --name pyworkflow-be backendtest:2.0`)
+
+Note: there [is a known issue with `react-scripts` v3.4.1](https://github.com/facebook/create-react-app/issues/8688)
+that may cause the front-end container to exit with code 0. If this happens,
+you can add `-e CI=true` to the `docker run` command above for the front-end.
+
+To compose and run the entire application, run the following from the root of
+the application:
+- `docker-compose up`
+
+You can then stop the application gracefully with:
+- `docker-compose down`
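+
+For reference, a minimal `docker-compose.yml` for this setup might look like
+the following. This is a hedged sketch rather than the file shipped with the
+repository: the service names, build contexts, and the `CI` workaround are
+assumptions based on the commands and notes above.
+
+```yaml
+version: "3"
+services:
+  back-end:
+    build: ./back-end
+    ports:
+      - "8000:8000"
+  front-end:
+    build: ./front-end
+    ports:
+      - "3000:3000"
+    environment:
+      # Works around the react-scripts v3.4.1 exit-with-code-0 issue
+      - CI=true
+    depends_on:
+      - back-end
+```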
+
+NOTE: For local development outside of Docker, change the `proxy` field in
+`./front-end/package.json` from `"http://back-end:8000"` to
+`"http://localhost:8000"`.
+
+## Serve locally
+
+Alternatively, the front- and back-ends can be compiled separately and run on
+your local machine.
+
+### Server (Django)
+
+1. Install `pipenv`
+
+- **Homebrew**
+
+```
+brew install pipenv
+```
+
+- **pip**
+
+```
+pip install pipenv
+# or, depending on your Python setup:
+pip3 install pipenv
+```
+
+2. Install dependencies
+
+Go to the `back-end` directory, which contains `Pipfile` and `Pipfile.lock`.
+```
+cd back-end
+pipenv install
+```
+
+3. Set up your local environment
+
+- Create an environment file with the app secret:
+```
+echo "SECRET_KEY='TEMPORARY SECRET KEY'" > vp/.environment
+```
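+
+The placeholder value above is fine to get started, but a real secret key is
+easy to generate. A hedged sketch using Python's standard library (any
+sufficiently long random string works):
+
+```
+echo "SECRET_KEY='$(python3 -c "import secrets; print(secrets.token_urlsafe(50))")'" > vp/.environment
+```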
+
+4. Start the dev server from the app root
+```
+cd vp
+pipenv run python3 manage.py runserver
+```
+
+If you have trouble running commands individually, you can also enter the
+virtual environment created by `pipenv` by running `pipenv shell`.
+
+### Client (react-diagrams)
+In a separate terminal window, perform the following steps to start the
+front-end.
+
+1. Install prerequisites
+```
+cd front-end
+npm install
+```
+
+2. Start the dev server
+```
+npm start
+```
+
+# CLI
+PyWorkflow also provides a command-line interface to execute pre-built workflows
+without the client or server running. The CLI is packaged in the `back-end`
+directory and can be accessed through a deployed Docker container, or locally
+through the `pipenv` shell.
+
+The CLI syntax for PyWorkflow is:
+```
+pyworkflow execute workflow-file...
+```
+
+For help reading from stdin, writing to stdout, batch processing, and more,
+[check out the CLI docs](docs/cli.md).
+
+# Tests
+PyWorkflow has several automated tests that are run on each push to the GitHub
+repository through GitHub Actions. The status of each can be seen in the
+badges at the top of this README.
+
+PyWorkflow currently has unit tests for both the back-end (the PyWorkflow
+package) and the front-end (react-diagrams). There are also API tests written
+in Postman to verify the integration between the front- and back-ends. For
+details on these tests, and how to run them, [read the test documentation](docs/tests.md).
diff --git a/docs/cli.md b/docs/cli.md
new file mode 100644
index 0000000..3236359
--- /dev/null
+++ b/docs/cli.md
@@ -0,0 +1,80 @@
+# Command-line Interface
+
+PyWorkflow is first and foremost a visual programming application, designed to
+help data scientists and many others build workflows to view, manipulate, and
+output their data into new formats. Therefore, all workflows must first be
+created via the user interface and saved for later execution.
+
+However, it may not always be ideal to have the client and server deployed
+locally or on a remote server just to run your workflows. Power users want the
+ability to run multiple workflows at once, schedule workflow runs, and
+dynamically pass data to and from workflows via stdin/stdout in traditional
+shell scripts. This is where PyWorkflow's CLI shines.
+
+## Command-line syntax
+
+```
+pyworkflow execute workflow-file...
+```
+
+### Commands
+
+#### Execute
+Accepts one or more workflow files as arguments to execute. PyWorkflow loads
+the file(s) specified and outputs status messages to `stdout`. If a workflow
+fails to run because of an exception, the details are logged to `stderr`.
+
+**Single-file example**
+```
+pyworkflow execute ./workflows/my_workflow.json
+```
+
+**Batch processing**
+
+Many shells offer wildcards that can be used to work with multiple files on
+the command line, or in scripts. A useful one is the `*` wildcard, which
+matches anything. Used in the following example, it has the effect of passing
+all files located within the `workflows` directory to the `execute` command.
+
+```
+pyworkflow execute ./workflows/*
+```
+
+## Using `stdin`/`stdout` to modify workflows
+
+Two powerful tools when writing shell scripts are redirection and pipes, which
+allow you to dynamically pass data from one command to another. Using these
+tools, you can pass different data into and out of workflows, overriding the
+default behavior they define.
+
+PyWorkflow comes with a Read CSV input node and a Write CSV output node. When
+data is provided via `stdin` on the command line, PyWorkflow modifies the
+workflow's behavior by redirecting the Read CSV node to that data. Similarly,
+if a destination is specified for `stdout`, the Write CSV node output is
+redirected there.
+
+Input data can be passed to PyWorkflow in a few ways.
+1) Redirection
+```
+# Data from sample_file.csv is passed to a Read CSV node
+pyworkflow execute my_workflow.json < sample_file.csv
+```
+2) Pipes
+```
+# Two CSV files are combined and passed in to a Read CSV node
+cat sample_file.csv more_data.csv | pyworkflow execute my_workflow.json
+
+# Data from a 'csv_exporter' tool is passed to a Read CSV node
+csv_exporter generate | pyworkflow execute my_workflow.json
+```
+
+Output data can be passed from PyWorkflow in a few ways.
+1) Redirection
+```
+# Output from a Write CSV node is stored in a new file 'output.csv'
+pyworkflow execute my_workflow.json > output.csv
+```
+2) Pipes
+```
+# Output from a Write CSV node is searched for the phrase 'foobar'
+pyworkflow execute my_workflow.json | grep "foobar"
+```
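+
+## Scheduling workflow runs
+
+As mentioned above, the CLI also lends itself to scheduled runs with a tool
+like `cron`. Below is a hedged sketch of a `crontab` entry; the paths are
+hypothetical, and `pyworkflow` must be available on `cron`'s `PATH`.
+
+```
+# Execute the nightly workflow at 2:00 AM, appending output and errors to a log
+0 2 * * * pyworkflow execute /home/user/workflows/nightly.json >> /tmp/pyworkflow.log 2>&1
+```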
diff --git a/docs/custom_nodes.md b/docs/custom_nodes.md
new file mode 100644
index 0000000..34fd9f9
--- /dev/null
+++ b/docs/custom_nodes.md
@@ -0,0 +1,187 @@
+# Custom Nodes
+The power of PyWorkflow comes from its support for custom nodes. New data
+science and other Python packages are constantly being developed. With custom
+nodes, you can build workflows tailored to your specific needs, using the
+packages relevant to your field.
+
+Custom nodes were designed to be easy to write and highly extensible. You
+don't need to worry about React, Django, or any specifics of PyWorkflow to get
+started. All you need to do is:
+
+1) Create a `.py` file that subclasses the main `Node` class.
+2) Add any parameters your node might need for execution.
+3) Write an `execute()` method using your package of choice.
+4) That's it!
+
+The rest is handled for you, from flow variable overrides to input data from
+other nodes in the workflow.
+
+# Getting started
+A custom node will look something like the following.
+
+```python
+from pyworkflow.node import Node, NodeException
+from pyworkflow.parameters import *
+import pandas as pd
+
+
+class MyCustomNode(Node):
+    name = "My Node Name"
+    num_in = 1
+    num_out = 1
+
+    # Parameters shown in the node's configuration form
+    OPTIONS = {
+        "input": StringParameter(
+            "My Input Parameter",
+            default="",
+            docstring="A place to provide input"
+        )
+    }
+
+    def execute(self, predecessor_data, flow_vars):
+        try:
+            # Do custom node operations here
+            my_json_data = {"message": flow_vars["input"].get_value()}
+            return my_json_data
+        except Exception as e:
+            raise NodeException('my_node', str(e))
+```
+
+Let's break it down to see how you can take this example and make your own
+custom node!
+
+## Imports
+All custom nodes require a few classes defined by the PyWorkflow package. In the
+example above, we import `Node`, `NodeException`, and all (`*`) classes from
+the `parameters.py` file. If you take a look at `pyworkflow/node.py`, you'll see
+there are several subclasses defined in addition to `Node`. These classes are
+described in their docstring comments and include:
+- `FlowNode`: for flow variable parameter overrides
+- `IONode`: for reading/writing data
+- `ManipulationNode`: for altering data
+- `VizNode`: for visualizing data with graphs, charts, etc.
+
+In the example above, we subclass the main `Node` class, but you can also
+import/subclass one of the others mentioned above depending on your use case.
+
+The final import, `import pandas as pd`, is important, as all PyWorkflow nodes
+use a pandas DataFrame as the atom of data representation. If your custom node
+reads or writes data, it must start or end with a pandas DataFrame.
+
+## Class attributes
+There are three class-level attributes defined in the example above. This
+information is used by both the front- and back-ends to properly display and
+validate your custom node. The attributes are:
+- `name`: The display name you want your node to have in the UI.
+- `num_in`: The number of 'in' ports your node accepts.
+- `num_out`: The number of 'out' ports your node provides.
+
+## Parameter options
+The next part of the example is the `OPTIONS` dictionary, which defines any
+number of parameters your custom node might need for execution. You can find
+out more about the different parameter types in `pyworkflow/parameters.py`, but
+as a general overview, there are:
+- `FileParameter`: accepts a file upload in the configuration form
+- `StringParameter`: accepts any string input, displayed as an HTML `<input>`
+- `TextParameter`: accepts any string input, but displayed as an HTML `<textarea>`
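+
+To tie these pieces together, here is a sketch of a small `ManipulationNode`
+that filters rows using the parameter types above. This is a hedged
+illustration rather than code from the package: the exact structure of
+`predecessor_data` and the expected return format are assumptions based on the
+descriptions in this document.
+
+```python
+from pyworkflow.node import ManipulationNode, NodeException
+from pyworkflow.parameters import *
+import pandas as pd
+
+
+class FilterRowsNode(ManipulationNode):
+    name = "Filter Rows"
+    num_in = 1
+    num_out = 1
+
+    OPTIONS = {
+        "column": StringParameter(
+            "Column",
+            default="",
+            docstring="Column to match against"
+        ),
+        "value": StringParameter(
+            "Value",
+            default="",
+            docstring="Keep rows where the column equals this value"
+        )
+    }
+
+    def execute(self, predecessor_data, flow_vars):
+        try:
+            # Assumes the first predecessor's data deserializes to a DataFrame
+            df = pd.DataFrame.from_dict(predecessor_data[0])
+
+            # Read the configured options, honoring flow variable overrides
+            column = flow_vars["column"].get_value()
+            value = flow_vars["value"].get_value()
+
+            # Keep only rows whose column matches the configured value
+            filtered = df[df[column].astype(str) == value]
+
+            # Return the result as JSON, the format PyWorkflow saves data in
+            return filtered.to_json()
+        except Exception as e:
+            raise NodeException('filter rows', str(e))
+```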