How To Use This Template

This document describes how to use this transformer template to develop a custom algorithm that processes plot-level RGB data.

Additional Information

Additional technical information can be found on our GitHub IO page for this repository.

If you are not familiar with working with an image's alpha channel, there is information and examples on how to do that on the technical page.
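If it helps, here is a minimal sketch of masking out non-plot pixels, assuming the image loads as an RGBA numpy array in which a zero alpha value marks pixels outside the plot boundary (the file name is hypothetical; see the technical page for the exact conventions):

import numpy as np
from PIL import Image

# Load the image as an RGBA array with shape (height, width, 4)
pxarray = np.array(Image.open('plot_image.tif'))

# Assumption: the alpha channel (index 3) is zero outside the plot boundary
mask = pxarray[:, :, 3] != 0

# Keep only the RGB values of the pixels inside the plot
plot_pixels = pxarray[mask][:, 0:3]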

Assumptions

It is assumed that:

  • you are generating a Docker image containing your algorithm and that you have Docker installed on your computer
  • you are familiar with GitHub template repositories, or know how to use git

Steps to take

The following steps can be taken to develop your algorithm for inclusion in a processing pipeline.

  1. Setup: Click the Use this template button in GitHub to make a copy of this repository (or run git clone)
  2. Definitions: Fill in and modify the definitions in the algorithm_rgb.py file
  3. Algorithm: Replace the code in the calculate function with your algorithm
  4. Generate: Run python3 generate.py to create a Dockerfile
  5. Test: Run the testing.py script to execute your algorithm and validate the results
  6. Docker: Create a Docker image for your algorithm and publish it
  7. Docker testing: Use Docker to run testing.py against your test images (optional)
  8. Production testing: Run your Docker image as it would run in production (optional)
  9. Finishing: Wrap up your development efforts

Set up your repo

The first thing to do is to create a copy of this repository that has a meaningful name and that you are able to modify. In GitHub this is easy: browse to this repository and click the Use this template button. You will be led through the steps necessary to create a clone in a location of your choosing.

If you are not on GitHub, you will need to set up your git environment and clone the repository.

Fill in your definitions

To fill in the needed definitions, first open the algorithm_rgb.py file in your favorite editor.

If you are modifying your existing code, you should consider updating the version number definition: VERSION. It's assumed that Semantic Versioning will be used, but any versioning methodology is acceptable.

Fill in the algorithm definitions with the creator(s) of the algorithm: ALGORITHM_AUTHOR, ALGORITHM_AUTHOR_EMAIL, ALGORITHM_NAME, and ALGORITHM_DESCRIPTION. Multiple names for ALGORITHM_AUTHOR and multiple emails for ALGORITHM_AUTHOR_EMAIL are supported. It's best if only one algorithm name is used, but call it what you want. The safest naming convention is to replace any white-space or other special characters with periods (.), which allows different systems to more easily transform the name if needed.

Next fill in the citation information that will be used in the generated CSV file: CITATION_AUTHOR, CITATION_TITLE, and CITATION_YEAR. Be sure to enter the citation information accurately since some systems may expect exact matches.

The names of the variables are used to determine the number of returned values your algorithm produces: VARIABLE_NAMES. Enter one variable name for each returned value, in the order the values are returned, separated by commas. Be sure to enter them accurately since some systems may expect exact matches. It is considered an error to have a mismatch between the number of variable names and the number of returned values.

A CSV file suitable for ingestion into BETYdb is generated depending on the value of the WRITE_BETYDB_CSV variable. Setting this value to False will suppress the generation of this file by default.

A CSV file suitable for ingestion into TERRA REF Geostreams is generated depending on the value of the WRITE_GEOSTREAMS_CSV variable. Setting this value to False will suppress the generation of this file by default.
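As an illustrative sketch, the filled-in definitions in algorithm_rgb.py might look like the following (all values shown are hypothetical):

VERSION = '1.0.0'

ALGORITHM_AUTHOR = 'Jane Doe, John Smith'
ALGORITHM_AUTHOR_EMAIL = 'jdoe@example.org, jsmith@example.org'
ALGORITHM_NAME = 'my.greenness.algorithm'
ALGORITHM_DESCRIPTION = 'Calculates a mean greenness value for each plot-level RGB image'

CITATION_AUTHOR = 'Jane Doe'
CITATION_TITLE = 'A Hypothetical Greenness Study'
CITATION_YEAR = '2020'

# One name per returned value, in the order the values are returned
VARIABLE_NAMES = 'greenness'

WRITE_BETYDB_CSV = True
WRITE_GEOSTREAMS_CSV = True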

Be sure to save your changes.

Add your algorithm

Open the algorithm_rgb.py file in your favorite editor, if it isn't opened already.

Scroll to the bottom of the file to the function named calculate.

Replace the comment starting with # ALGORITHM and the line below it with your calculation(s). As needed, change the name of the array used in your algorithm to the function's parameter pxarray.

Once you have your algorithm in place, replace the comment starting with # RETURN and the line below with your return values. Remember to order your return values to match the declared names in the VARIABLE_NAMES definition.

Modify the rest of the file as necessary if there are additional import statements, functions, classes, and other code needed by your algorithm.
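For example, a minimal sketch of a filled-in calculate function, assuming the algorithm returns a single mean-greenness value matching a VARIABLE_NAMES definition of 'greenness' (the calculation shown is only illustrative):

import numpy as np

def calculate(pxarray):
    # Illustrative calculation: the mean of the green channel (index 1)
    greenness = np.mean(pxarray[:, :, 1])

    # Return values in the order declared in VARIABLE_NAMES
    return greenness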

Be sure to save your changes.

Generate the Docker build command file

It's time to generate the Dockerfile that's used to build Docker images.

Docker images can be used as part of a workflow and this step can be used to create that image for your algorithm.

To assist in this effort we've provided a script named generate.py to produce a file containing the needed Docker commands. Running this script produces not only the Docker command file, named Dockerfile, but also two other files that can be used to install additional dependencies your algorithm needs: requirements.txt for additional Python modules and packages.txt for other dependencies.

To generate these files, run python3 generate.py.

If your algorithm has additional Python module dependencies, edit requirements.txt and add the names of the modules, one per line. The listed modules will then be installed as part of the Docker build process.

If there are other dependencies needed by your algorithm, add them to the packages.txt file. The packages listed will be installed using apt-get as part of the Docker build process.
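As a hypothetical example, an algorithm that uses the scikit-image Python module would add the line scikit-image to requirements.txt, and one that needs the GDAL development library would add the line libgdal-dev to packages.txt.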

Test your algorithm

A testing script named testing.py is provided for testing your algorithm. It checks that the configuration is correct before testing the files, making sure that the definitions in algorithm_rgb.py as well as the image files are in the correct format.

The testing script requires numpy and PIL (Pillow) to be installed on the testing system. The following command can be used to install these libraries:

python3 -m pip install numpy Pillow

If your testing files reside in a subfolder named test_images, the following command can be used to run the tests:

./testing.py ${PWD}/test_images

If your files reside in /user/myself/test_images, the command to test could be the following:

python3 testing.py /user/myself/test_images

What isn't provided in the template repository is a set of plot-level RGB images to test against. It's expected that you will either provide the images or use a standard set that can be downloaded. The following commands can be used to retrieve and extract the test images:

# Create the subfolder to hold the images
mkdir test_images
# Download the archive containing the images
curl -X GET https://de.cyverse.org/dl/d/4108BB75-AAA3-48E1-BBD4-E10B06CADF54/sample_plot_images.zip -o test_images/sample_plot_images.zip
# Extract the images into the subfolder
unzip test_images/sample_plot_images.zip -d test_images/
# Remove the archive
rm test_images/sample_plot_images.zip

The testing script expects a list of source plot image files, a folder name, or both to be specified on the command line.

Create the Docker image

Now that you have generated your Dockerfile as described above and specified any Python modules and other packages needed by your algorithm, you are ready to create a Docker image of your algorithm.

You can build the Docker image with the following command:

docker build -t my_algorithm:latest ./

Please refer to the Docker documentation for additional information on building a Docker image.

Once the image is built, you can run it locally or push it to an image repository, such as DockerHub. Please note that there may be naming requirements for pushing images to a repository; DockerHub, for example, expects images to be tagged with your account name, as in myaccount/my_algorithm:latest.

(OPTIONAL) Using Docker to run testing.py

To test your Docker image, you can use the following command:

docker run --rm -it -v ${PWD}:/mnt --entrypoint /mnt/testing.py my_algorithm:latest /mnt/test_images

Breaking apart this command line, we have the following pieces:

  • docker run tells Docker to run an instance of the image (specified later in the command) (Refer to docker run documentation)
  • --rm tells Docker to remove the container (an image instance) when it's completed
  • -it attaches a stdin stream and a terminal driver to the Docker container, allowing an interactive session
  • -v ${PWD}:/mnt bind mounts a volume to the Docker container so that the current working directory (given by ${PWD}) is available as /mnt
  • --entrypoint /mnt/testing.py overrides the default entrypoint of the Docker image to run the testing script
  • my_algorithm:latest is the Docker image to run (the running image is known as a container)
  • /mnt/test_images specifies the location where the script finds the images to test with

The output contains one line per image in the images folder, each consisting of the image name and the value calculated for that image. Example output for the sample image set in the test_images folder is shown below:

/mnt/test_images/rgb_17_7_W.tif,7000
/mnt/test_images/rgb_40_11_W.tif,7000
/mnt/test_images/rgb_6_1_E.tif,7000
/mnt/test_images/rgb_1_2_E.tif,7000
/mnt/test_images/rgb_33_8_W.tif,7000
/mnt/test_images/rgb_5_11_W.tif,7000

(OPTIONAL) Production Testing of Image

Using the same setup as when testing your algorithm, a sample command line to run the image could be:

docker run --rm -v "${PWD}:/mnt" my_algorithm:latest --working_space "/mnt" "/mnt/test_images"

Breaking apart this command line, we have the following pieces:

  • docker run tells Docker to run an instance of the image (specified later in the command) (Refer to docker run documentation)
  • --rm tells Docker to remove the container (an image instance) when it's completed
  • -v "${PWD}:/mnt" specifies the ${PWD} path is to be made available as /mnt in the container
  • my_algorithm:latest is the image to run (the running image is known as a container)
  • --working_space "/mnt" lets the software in the container know where its working disk space is located; files are created here
  • "/mnt/test_images" specifies where the plot-level image files are located

The -v command line parameter is important since it allows the running container to access the local file system. The container can then load the images from the file system directly, without having to perform any copies. The parameters after the Docker image name are all relative to the target folder specified with this command line parameter.

Once the image files have been processed, the resulting CSV file(s) will be located in the current folder at ${PWD} (in this example).

The result.json file should tell you what errors were found by the checks from testing.py. Make sure to check the output in the CSV file(s) even if the result.json file does not report any errors.

Finishing up

Now that you've created your algorithm, there are a few more things to take care of:

  1. Make sure you've checked your changes into source control; you don't want to lose all that hard work!
  2. Update the README.md file, filling out the sections with information on your algorithm; others will want to know so they can use it!
  3. Submit any requests to our ticketing system on GitHub: https://github.com/AgPipeline/computing-pipeline/issues/new/choose