Merge pull request #15 from MAAP-Project/issue-14-use-maappy-hotfixes

MAAP-Project · Aug 1, 2022 · 9d4d448
2 parents f899f1b + ec4b9ed
commit 9d4d448

Showing 9 changed files with 166 additions and 76 deletions.
diff --git a/environment.yml b/environment.yml
@@ -14,4 +14,4 @@ dependencies:
   - shapely==1.8.2
   - pip
   - pip:
-      - git+https://github.com/MAAP-Project/maap-py.git@1bfa99d#egg=maappy
+      - git+https://github.com/MAAP-Project/maap-py.git@hotfixes#egg=maappy
diff --git a/gedi-subset/CHANGELOG.md b/gedi-subset/CHANGELOG.md
@@ -7,6 +7,17 @@ variation of [Semantic Versioning], with the following difference: each version
 is prefixed with `gedi-subset-` (e.g., `gedi-subset-0.1.0`) to allow for
 distinct lines of versioning of independent work in sibling directories.
 
+## [gedi-subset-0.2.2] - 2022-08-02
+
+### Changed
+
+- Updated `maap-py` dependency to use `hotfixes` branch until the library
+  employs proper release management.
+- Improved error-handling to add clarity around error messages stemming from
+  authentication errors and other HTTP request errors.
+- Enhanced S3 authentication to automatically use granule metadata containing an
+  online resource URL representing an S3 authentication endpoint.
+
 ## [gedi-subset-0.2.1] - 2022-06-07
 
 Hotfix replacement for `gedi-subset-0.2.0`.
@@ -25,12 +36,12 @@ Hotfix replacement for `gedi-subset-0.2.0`.
 
 ## [gedi-subset-0.2.0] - 2022-06-01 [YANKED]
 
-## Added
+### Added
 
 - Added inputs `columns` and `query` to refine filtering/subsetting.  See
   `gedi-subset/README.md` for details.
 
-## Changed
+### Changed
 
 - Improved performance of subsetting/filtering logic, resulting in ~5x speedup.
 

diff --git a/gedi-subset/README.md b/gedi-subset/README.md
@@ -3,9 +3,12 @@
 - [Algorithm Outline](#algorithm-outline)
 - [Algorithm Inputs](#algorithm-inputs)
 - [Running a GEDI Subsetting DPS Job](#running-a-gedi-subsetting-dps-job)
+  - [Submitting a DPS Job](#submitting-a-dps-job)
+  - [Checking the DPS Job Status](#checking-the-dps-job-status)
+  - [Getting the DPS Job Results](#getting-the-dps-job-results)
 - [Getting the GeoJSON URL for a geoBoundary](#getting-the-geojson-url-for-a-geoboundary)
 - [Contributing](#contributing)
-  - [Repository Setup](#repository-setup)
+  - [Development Setup](#development-setup)
   - [Creating an Algorithm Release](#creating-an-algorithm-release)
   - [Registering an Algorithm Release](#registering-an-algorithm-release)
 - [Citations](#citations)
@@ -19,7 +22,7 @@ At a high level, the GEDI subsetting algorithm does the following:
 - Downloads the data file (h5) for each intersecting granule (up to specified limit)
 - Subsets each data file
 - Combines all subset files into a single output file named `gedi_subset.gpkg`,
-  in GeoPackage format, which readable with `geopandas` as a `GeoDataFrame`.
+  in GeoPackage format, readable with `geopandas` as a `GeoDataFrame`.
 
 ## Algorithm Inputs
 
@@ -37,9 +40,9 @@ To run a GEDI subsetting DPS job, you must supply the following inputs:
 - `limit`: Maximum number of GEDI granule data files to download (among those
   that intersect the specified AOI).  (**Default:** 10,000)
 
-**IMPORTANT:** When supplying input values via the ADE UI, to accept a default
-input value, enter a dash (`-`) as the input value, otherwise the UI will show
-an error message if you leave any input blank.
+|**IMPORTANT**
+|:-------------
+|_When supplying input values (either via the ADE UI or programmatically, as shown in the next section), enter a dash (`-`) as the value of any input for which you want to use the default value (where indicated).  Leaving any input blank (or unspecified) results in an error._
 
 If your AOI is a publicly available geoBoundary, see
 [Getting the GeoJSON URL for a geoBoundary](#getting-the-geojson-url-for-a-geoboundary)
@@ -50,22 +53,28 @@ Alternatively, you can make your own GeoJSON file for your AOI and place it
 within your public bucket within the ADE.  Based upon where you place your
 GeoJSON file, you can construct a URL to specify for a job's `aoi` input.
 
-Specifically, by placing your GeoJSON file at the following location within the
-ADE:
+Specifically, you should place your GeoJSON file at a location of the following
+form within the ADE (where `path/to/aoi.geojson` can be any path and filename
+for your AOI):
 
 ```plain
 ~/my-public-bucket/path/to/aoi.geojson
+^^^^^^^^^^^^^^^^^^
 ```
 
-you would then supply the following URL as the `aoi` input value when running
+You would then supply the following URL as the `aoi` input value when running
 this algorithm as a DPS job, where `<USERNAME>` is your ADE username:
 
 ```plain
 https://maap-ops-workspace.s3.amazonaws.com/shared/<USERNAME>/path/to/aoi.geojson
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+      Replace "~/my-public-bucket" with this URL prefix
 ```
 
 ## Running a GEDI Subsetting DPS Job
 
+### Submitting a DPS Job
+
 The GEDI Subsetting DPS Job is named `gedi-subset_ubuntu`, and may be executed
 from your ADE Workspace by opening the **DPS/MAS Operations** menu, choosing
 the **Execute DPS Job** menu option, and selecting `gedi-subset_ubuntu:<VERSION>`
@@ -76,8 +85,6 @@ Alternatively, for greater control of your job configuration, you may use the
 MAAP API from a Notebook (or a Python script), as follows:
 
 ```python
-import uuid
-
 from maap.maap import MAAP
 
 maap = MAAP(maap_host='api.ops.maap-project.org')
@@ -96,14 +103,39 @@ result = maap.submitJob(
     limit=limit,
 )
 
-print(result["job_id"])
+job_id = result["job_id"]
+job_id
+```
+
+### Checking the DPS Job Status
+
+To check the status of your job via the ADE UI, open the **DPS/MAS Operations**
+menu, choose **Get DPS Job Status**, and enter the value of the `job_id` to
+obtain the status, just as if you had submitted the job from the menu (rather
+than programmatically).
+
+Alternatively, to programmatically check the status of the submitted job, you
+may run the following code.  If using a notebook, use a separate cell so you can
+run it repeatedly until you get a status of either `'Succeeded'` or `'Failed'`:
+
+```python
+import re
+
+# Should evaluate to 'Accepted', 'Running', 'Succeeded', or 'Failed'
+re.search(r"Status>(?P<status>.+)</wps:Status>", maap.getJobStatus(job_id).text).group('status')
 ```
 
-Using the value of `result["job_id"]`, you may use the **DPS/MAS Operations**
-menu for job operations, just as if you had submitted the job from the menu
-(rather than programmatically).  Once the job status is **Succeeded**, you must
-obtain the jobs result (**Get DPS Job Result**), which should display 3 URLs,
-with the first URL of the following form:
+### Getting the DPS Job Results
+
+Once the job status is either **Succeeded** or **Failed**, you may obtain the
+job result either via the UI (**DPS/MAS Operations** > **Get DPS Job Result**),
+or programmatically.  Because the programmatic results are in XML format, they
+are difficult to read, so using the UI is preferable in this case.
+
+If the job status is **Failed**, the job results should show failure details.
+
+If the job status is **Succeeded**, the job results should show 3 URLs, with the
+first URL of the following form:
 
 ```plain
 http://.../<USERNAME>/dps_output/gedi-subset_ubuntu/<VERSION>/<DATETIME_PATH>
@@ -181,7 +213,7 @@ subsetting DPS job.
 
 ## Contributing
 
-### Repository Setup
+### Development Setup
 
 To contribute to this work, you must obtain access to the following:
 
@@ -195,7 +227,7 @@ To contribute to this work, you must obtain access to the following:
 
 To prepare for contributing, do the following in an ADE workspace:
 
-1. Clone the GitHub repository.
+1. Clone this GitHub repository.
 1. Change directory to the cloned repository.
 1. Add the GitLab repository as another remote (named `ade` here, but you may
    specify a different name for the remote):
@@ -204,8 +236,21 @@ To prepare for contributing, do the following in an ADE workspace:
    git remote add --tags -f ade https://repo.ops.maap-project.org/data-team/maap-documentation-examples.git
    ```
 
-If you plan to do any development work outside of the ADE (such as on your
-local workstation), perform the steps above in that location as well.
+1. Create the `gedi_subset` virtual environment by running the following
+   commands from within the repository directory (**NOTE:** _you will need to
+   repeat these steps whenever you restart your ADE workspace_):
+
+   ```bash
+   conda update conda -n base -c conda-forge -y
+   conda create -n gedi_subset python=3.10 -c conda-forge -y
+   conda activate gedi_subset
+   pip install -r gedi-subset/requirements-dev.txt
+   ```
+
+If you plan to do any development work outside of the ADE (such as on your local
+workstation), perform the steps above in that location as well.  **NOTE:** _This
+means that you must have `conda` installed (see [conda installation]) in your
+desired development location outside of the ADE workspace._
 
 During development, you will create PRs against the GitHub repository, as
 explained below.
@@ -282,6 +327,8 @@ Runfola, D. et al. (2020) geoBoundaries: A global database of political
 administrative boundaries.  PLoS ONE 15(4): e0231866.
 <https://doi.org/10.1371/journal.pone.0231866>
 
+[conda installation]:
+   https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html
 [geoBoundaries]:
   https://www.geoboundaries.org
 [geoBoundaries API]:

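Note on the README's status check: the one-line `re.search(...)` expression raises `AttributeError` when the response text contains no status element, and it has to be rerun by hand. The following is a minimal sketch of a polling helper built on the same regular expression. The names `extract_status` and `wait_for_job`, the polling interval, and the sample XML in the usage note are hypothetical and are not part of this PR or of `maap-py`.

```python
import re
import time
from typing import Callable

# Same pattern as in the README snippet for checking job status.
STATUS_PATTERN = re.compile(r"Status>(?P<status>.+)</wps:Status>")
TERMINAL_STATUSES = {"Succeeded", "Failed"}


def extract_status(xml_text: str) -> str:
    """Return the job status found in `xml_text`, or raise `ValueError`."""
    match = STATUS_PATTERN.search(xml_text)
    if match is None:
        raise ValueError("No <wps:Status> element found in response text")
    return match.group("status")


def wait_for_job(
    fetch_status_text: Callable[[], str],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
) -> str:
    """Poll until the job reaches a terminal status, then return that status.

    `fetch_status_text` is a hypothetical zero-argument callable returning the
    raw response text, e.g. `lambda: maap.getJobStatus(job_id).text`.
    """
    for attempt in range(max_polls):
        status = extract_status(fetch_status_text())
        if status in TERMINAL_STATUSES:
            return status
        if attempt < max_polls - 1:
            time.sleep(poll_seconds)  # wait before asking again
    raise TimeoutError(f"Job still not finished after {max_polls} polls")
```

Assuming the response text has the shape implied by the README's pattern, `extract_status("<wps:Status>Running</wps:Status>")` evaluates to `'Running'`. Passing the fetch step in as a callable keeps the helper independent of `maap-py`, so it can be tried against canned responses.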
diff --git a/gedi-subset/algorithm_config.yaml b/gedi-subset/algorithm_config.yaml
@@ -1,6 +1,6 @@
 description: Subset GEDI L4A granules within an area of interest (AOI)
 algo_name: gedi-subset
-version: gedi-subset-0.2.1
+version: gedi-subset-0.2.2
 environment: ubuntu
 repository_url: https://repo.ops.maap-project.org/data-team/maap-documentation-examples.git
 docker_url: mas.maap-project.org:5000/root/ade-base-images/r:latest

diff --git a/gedi-subset/fp.py b/gedi-subset/fp.py
@@ -10,6 +10,7 @@
 from typing import Callable, Iterable, TypeVar, cast
 
 from returns.curry import partial
+from returns.maybe import Maybe, Nothing, Some
 
 _A = TypeVar("_A")
 _B = TypeVar("_B")
@@ -26,7 +27,28 @@ def filter(predicate: Callable[[_A], bool]) -> Callable[[Iterable[_A]], Iterable
     return partial(builtins.filter, predicate)
 
 
-def K(a: _A) -> Callable[..., _A]:
+def find(predicate: Callable[[_A], bool]) -> Callable[[Iterable[_A]], Maybe[_A]]:
+    """Return a callable that accepts an iterable and returns the first item of the
+    iterable (in a `Some`) for which `predicate` returns `True`; otherwise `Nothing`.
+
+    >>> find(bool)([])
+    <Nothing>
+    >>> find(lambda x: x > 42)([19, 2, 42, 55, 45])
+    <Some: 55>
+    >>> find(lambda x: x > 99)([19, 2, 42, 55, 45])
+    <Nothing>
+    """
+
+    def go(xs: Iterable[_A]) -> Maybe[_A]:
+        for x in xs:
+            if predicate(x):
+                return Some(x)
+        return Nothing
+
+    return go
+
+
+def always(a: _A) -> Callable[..., _A]:
     """Return the kestrel combinator ("constant" function).
 
     Return a callable that accepts exactly one argument of any type, but always

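Note on `find`: for readers unfamiliar with the `returns` library, the sketch below is a stdlib-only analogue of the new function. It swaps `Maybe` for `Optional`, returning the bare item in place of `Some(item)` and `None` in place of `Nothing`. The name `find_or_none` is hypothetical and is not part of `fp.py`.

```python
from typing import Callable, Iterable, Optional, TypeVar

_A = TypeVar("_A")


def find_or_none(
    predicate: Callable[[_A], bool]
) -> Callable[[Iterable[_A]], Optional[_A]]:
    """Return a callable that yields the first item satisfying `predicate`.

    The returned callable gives back `None` when no item matches.
    """

    def go(xs: Iterable[_A]) -> Optional[_A]:
        # `next` with a default stops at the first match without scanning
        # the rest of the iterable.
        return next((x for x in xs if predicate(x)), None)

    return go


print(find_or_none(lambda x: x > 42)([19, 2, 42, 55, 45]))  # prints 55
print(find_or_none(lambda x: x > 99)([19, 2, 42, 55, 45]))  # prints None
```

The `Optional` version cannot distinguish "no match" from "the matching item is `None`". Wrapping the result in `Some`/`Nothing`, as `find` does, avoids that ambiguity and lets the result feed directly into `bind` and `lash`.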
diff --git a/gedi-subset/maapx.py b/gedi-subset/maapx.py
@@ -16,40 +16,39 @@
 import boto3
 from cachetools import FIFOCache, cached
 from cachetools.func import ttl_cache
-from fp import K
+from fp import always, find
 from maap.maap import MAAP
 from maap.Result import Collection, Granule
 from returns.curry import partial
 from returns.io import IOFailure, IOResultE, impure_safe
-from returns.maybe import Maybe, Nothing, Some, maybe
-from returns.pipeline import flow
-from returns.pointfree import bind, bind_result, lash, map_
-from returns.result import safe
+from returns.pipeline import flow, pipe
+from returns.pointfree import bind, bind_ioresult, bind_result, lash, map_
+from returns.result import ResultE, safe
 
 if TYPE_CHECKING:
     from maap.AWS import AWSCredentials
 
 logger = logging.getLogger(f"gedi_subset.{__name__}")
 
 
-# https://nasa-openscapes.github.io/2021-Cloud-Workshop-AGU/how-tos/Earthdata_Cloud__Single_File__Direct_S3_Access_COG_Example.html
-_S3_CREDENTIALS_ENDPOINT_BY_DAAC: Mapping[str, str] = dict(
-    po="https://archive.podaac.earthdata.nasa.gov/s3credentials",
-    gesdisc="https://data.gesdisc.earthdata.nasa.gov/s3credentials",
-    lp="https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials",
-    ornl="https://data.ornldaac.earthdata.nasa.gov/s3credentials",
-    ghrc="https://data.ghrc.earthdata.nasa.gov/s3credentials",
-)
+def _is_s3_credentials_online_resource(resource) -> bool:
+    url = resource.get("URL", "").lower()
+    description = resource.get("Description", "").lower()
 
+    # TODO: is this sufficient for identifying URL for obtaining S3 credentials?
+    return url and url.endswith("/s3credentials") or "credentials" in description
 
-def _s3_credentials_endpoint(download_url: str) -> Maybe[str]:
-    endpoints = [
-        endpoint
-        for key, endpoint in _S3_CREDENTIALS_ENDPOINT_BY_DAAC.items()
-        if key in download_url
-    ]
 
-    return Some(endpoints[0]) if endpoints else Nothing
+def _s3_credentials_endpoint(granule: Granule) -> ResultE[str]:
+    granule_ur = granule["Granule"]["GranuleUR"]
+    endpoint_error = ValueError(f"Granule {granule_ur} has no S3 credentials endpoint")
+
+    return flow(
+        granule.get("Granule", {}).get("OnlineResources", {}).get("OnlineResource", []),
+        find(_is_s3_credentials_online_resource),
+        bind(safe(operator.itemgetter("URL"))),
+        lash(always(IOFailure(endpoint_error))),
+    )
 
 
 @ttl_cache(ttl=55 * 60)
@@ -71,34 +70,37 @@ def _setup_default_boto3_session(creds: "AWSCredentials") -> boto3.session.Sessi
         aws_access_key_id=creds["accessKeyId"],
         aws_secret_access_key=creds["secretAccessKey"],
         aws_session_token=creds["sessionToken"],
+        # TODO: make this configurable
         region_name="us-west-2",
     )
 
 
 def download_granule(maap: MAAP, todir: str, granule: Granule) -> IOResultE[str]:
-    """Downloads a granule's data file.
+    """Download a granule's data file.
 
-    Automatically fetches S3 credentials appropriate for `granule`, based upon
+    Automatically fetch S3 credentials appropriate for `granule`, based upon
     its S3 URL, and automatically refresh credentials before expiry.
 
     Return `IOSuccess[str]` containing the absolute path of the downloaded file
     upon success; otherwise return `IOFailure[Exception]` containing the reason
     for failure.
     """
-    ur: str = granule["Granule"]["GranuleUR"]
+    granule_ur = granule["Granule"]["GranuleUR"]
+    logger.debug(f"Downloading granule {granule_ur} to directory {todir}")
 
-    flow(
-        maybe(granule.getDownloadUrl)(),
-        lash(K(IOFailure(ValueError(f"Missing download URL for granule {ur}")))),
-        bind(_s3_credentials_endpoint),
+    return flow(
+        _s3_credentials_endpoint(granule),
         bind(partial(_earthdata_s3_credentials, maap)),
         map_(_setup_default_boto3_session),
+        # We don't need to directly use the boto3 session object, so we discard it by
+        # mapping to the constant `todir`.  We simply want to download the granule file
+        # to `todir` after the S3 credentials are obtained and applied to the boto3
+        # default session.  If obtaining S3 credentials fails, we bypass the download
+        # attempt and return the failure.
+        map_(always(todir)),
+        bind(impure_safe(granule.getData)),
     )
 
-    logger.debug(f"Downloading granule {ur} to directory {todir}")
-
-    return impure_safe(granule.getData)(todir)
-
 
 def find_collection(
     maap: MAAP,
@@ -111,8 +113,14 @@ def find_collection(
     search; otherwise return `IOFailure[Exception]` containing the reason for
     failure, which is a `ValueError` when there is no matching collection.
     """
+    not_found_error = ValueError(f"No collection found at {cmr_host}: {params}")
+
     return flow(
         impure_safe(maap.searchCollection)(cmr_host=cmr_host, limit=1, **params),
-        bind_result(safe(operator.itemgetter(0))),
-        lash(K(IOFailure(ValueError(f"No collection found at {cmr_host}: {params}")))),
+        bind_ioresult(
+            pipe(
+                safe(operator.itemgetter(0)),
+                lash(always(IOFailure(not_found_error))),
+            )
+        ),
     )
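Note on `_s3_credentials_endpoint`: the metadata lookup can be tried in isolation with plain dictionaries. The sketch below mirrors the predicate from `maapx.py` but replaces the `returns` pipeline with `next`/`filter` and an `Optional` result. The sample granule metadata is hypothetical and only follows the `Granule` → `OnlineResources` → `OnlineResource` nesting that the diff reads, so real CMR metadata may differ.

```python
from typing import Any, Mapping, Optional


def is_s3_credentials_online_resource(resource: Mapping[str, str]) -> bool:
    url = resource.get("URL", "").lower()
    description = resource.get("Description", "").lower()
    # Equivalent to the check in the diff: "".endswith(...) is already False,
    # so no separate emptiness test on `url` is needed.
    return url.endswith("/s3credentials") or "credentials" in description


def s3_credentials_endpoint(granule: Mapping[str, Any]) -> Optional[str]:
    """Return the first matching online resource URL, or `None` if absent."""
    resources = (
        granule.get("Granule", {})
        .get("OnlineResources", {})
        .get("OnlineResource", [])
    )
    match = next(filter(is_s3_credentials_online_resource, resources), None)
    return match.get("URL") if match else None


# Hypothetical granule metadata, for illustration only.
sample_granule = {
    "Granule": {
        "GranuleUR": "EXAMPLE_GRANULE",
        "OnlineResources": {
            "OnlineResource": [
                {"URL": "https://example.com/docs", "Description": "User guide"},
                {"URL": "https://example.com/s3credentials", "Description": "S3 access"},
            ]
        },
    }
}

print(s3_credentials_endpoint(sample_granule))
# prints https://example.com/s3credentials
```

In the PR itself, a missing endpoint becomes a failure carrying a `ValueError` rather than `None`, which lets `download_granule` skip the download step and return the failure unchanged.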
diff --git a/gedi-subset/requirements.txt b/gedi-subset/requirements.txt
@@ -7,4 +7,4 @@ pyarrow==8.0.0
 returns==0.19.0
 shapely==1.8.2
 typer==0.4.1
-git+https://github.com/MAAP-Project/maap-py.git@1bfa99d#egg=maappy
+git+https://github.com/MAAP-Project/maap-py.git@hotfixes#egg=maappy