Merge pull request #15 from MAAP-Project/issue-14-use-maappy-hotfixes
chuckwondo authored Aug 1, 2022
2 parents f899f1b + ec4b9ed commit 9d4d448
Showing 9 changed files with 166 additions and 76 deletions.
2 changes: 1 addition & 1 deletion environment.yml
@@ -14,4 +14,4 @@ dependencies:
- shapely==1.8.2
- pip
- pip:
- git+https://github.com/MAAP-Project/maap-py.git@1bfa99d#egg=maappy
- git+https://github.com/MAAP-Project/maap-py.git@hotfixes#egg=maappy
15 changes: 13 additions & 2 deletions gedi-subset/CHANGELOG.md
@@ -7,6 +7,17 @@ variation of [Semantic Versioning], with the following difference: each version
is prefixed with `gedi-subset-` (e.g., `gedi-subset-0.1.0`) to allow for
distinct lines of versioning of independent work in sibling directories.

## [gedi-subset-0.2.2] - 2022-08-02

### Changed

- Updated `maap-py` dependency to use `hotfixes` branch until the library
employs proper release management.
- Improved error-handling to add clarity around error messages stemming from
authentication errors and other HTTP request errors.
- Enhanced S3 authentication to automatically use granule metadata containing an
online resource URL representing an S3 authentication endpoint.

## [gedi-subset-0.2.1] - 2022-06-07

Hotfix replacement for `gedi-subset-0.2.0`.
@@ -25,12 +36,12 @@ Hotfix replacement for `gedi-subset-0.2.0`.

## [gedi-subset-0.2.0] - 2022-06-01 [YANKED]

## Added
### Added

- Added inputs `columns` and `query` to refine filtering/subsetting. See
`gedi-subset/README.md` for details.

## Changed
### Changed

- Improved performance of subsetting/filtering logic, resulting in ~5x speedup.

87 changes: 67 additions & 20 deletions gedi-subset/README.md
@@ -3,9 +3,12 @@
- [Algorithm Outline](#algorithm-outline)
- [Algorithm Inputs](#algorithm-inputs)
- [Running a GEDI Subsetting DPS Job](#running-a-gedi-subsetting-dps-job)
- [Submitting a DPS Job](#submitting-a-dps-job)
- [Checking the DPS Job Status](#checking-the-dps-job-status)
- [Getting the DPS Job Results](#getting-the-dps-job-results)
- [Getting the GeoJSON URL for a geoBoundary](#getting-the-geojson-url-for-a-geoboundary)
- [Contributing](#contributing)
- [Repository Setup](#repository-setup)
- [Development Setup](#development-setup)
- [Creating an Algorithm Release](#creating-an-algorithm-release)
- [Registering an Algorithm Release](#registering-an-algorithm-release)
- [Citations](#citations)
@@ -19,7 +22,7 @@ At a high level, the GEDI subsetting algorithm does the following:
- Downloads the data file (h5) for each intersecting granule (up to specified limit)
- Subsets each data file
- Combines all subset files into a single output file named `gedi_subset.gpkg`,
in GeoPackage format, which readable with `geopandas` as a `GeoDataFrame`.
in GeoPackage format, readable with `geopandas` as a `GeoDataFrame`.

## Algorithm Inputs

@@ -37,9 +40,9 @@ To run a GEDI subsetting DPS job, you must supply the following inputs:
- `limit`: Maximum number of GEDI granule data files to download (among those
that intersect the specified AOI). (**Default:** 10,000)

**IMPORTANT:** When supplying input values via the ADE UI, to accept a default
input value, enter a dash (`-`) as the input value, otherwise the UI will show
an error message if you leave any input blank.
|**IMPORTANT**
|:-------------
|_When supplying input values (either via the ADE UI or programmatically, as shown in the next section), enter a dash (`-`) as the value of any input for which you want to use the default value (where one is indicated). Leaving any input blank (or unspecified) results in an error._

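As a minimal sketch of the rule in the note above, a small helper (hypothetical, not part of this repository) could substitute a dash for every blank or unspecified input before a job is submitted. The input names come from this README; the values shown are placeholders, and whether a given input actually has a default must be checked against the input list above:

```python
from typing import Any, Dict, Mapping

# Per the note above, a dash requests the default value of an input.
DEFAULT_PLACEHOLDER = "-"


def fill_default_inputs(inputs: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of `inputs` with None/blank values replaced by a dash.

    Hypothetical helper for illustration only.
    """
    filled = {}
    for name, value in inputs.items():
        is_blank = value is None or (isinstance(value, str) and not value.strip())
        filled[name] = DEFAULT_PLACEHOLDER if is_blank else value
    return filled


job_inputs = fill_default_inputs(
    {
        "aoi": "<AOI_GEOJSON_URL>",  # placeholder, supply your own URL
        "columns": None,  # unspecified, becomes "-"
        "query": "  ",  # blank, becomes "-"
        "limit": 10,
    }
)
```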
If your AOI is a publicly available geoBoundary, see
[Getting the GeoJSON URL for a geoBoundary](#getting-the-geojson-url-for-a-geoboundary)
@@ -50,22 +53,28 @@ Alternatively, you can make your own GeoJSON file for your AOI and place it
within your public bucket within the ADE. Based upon where you place your
GeoJSON file, you can construct a URL to specify for a job's `aoi` input.

Specifically, by placing your GeoJSON file at the following location within the
ADE:
Specifically, you should place your GeoJSON file at a location of the following
form within the ADE (where `path/to/aoi.geojson` can be any path and filename
for your AOI):

```plain
~/my-public-bucket/path/to/aoi.geojson
^^^^^^^^^^^^^^^^^^
```

you would then supply the following URL as the `aoi` input value when running
You would then supply the following URL as the `aoi` input value when running
this algorithm as a DPS job, where `<USERNAME>` is your ADE username:

```plain
https://maap-ops-workspace.s3.amazonaws.com/shared/<USERNAME>/path/to/aoi.geojson
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Replace "~/my-public-bucket" with this URL prefix
```
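The path-to-URL mapping described above can be sketched as a small function. This is an illustration only: the function name `public_bucket_url` and the example username are assumptions, while the directory name and URL prefix are taken from the text above.

```python
from pathlib import PurePosixPath

PUBLIC_BUCKET_DIR = "~/my-public-bucket"
PUBLIC_URL_PREFIX = "https://maap-ops-workspace.s3.amazonaws.com/shared"


def public_bucket_url(ade_path: str, username: str) -> str:
    """Map a file path under ~/my-public-bucket to its public URL.

    Hypothetical helper; raises ValueError when `ade_path` is not located
    under the public bucket directory.
    """
    relative = PurePosixPath(ade_path).relative_to(PUBLIC_BUCKET_DIR)
    return f"{PUBLIC_URL_PREFIX}/{username}/{relative.as_posix()}"


aoi_url = public_bucket_url("~/my-public-bucket/path/to/aoi.geojson", "alice")
```

Here `alice` stands in for `<USERNAME>`; the resulting value is what you would pass as the `aoi` input.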

## Running a GEDI Subsetting DPS Job

### Submitting a DPS Job

The GEDI Subsetting DPS Job is named `gedi-subset_ubuntu`, and may be executed
from your ADE Workspace by opening the **DPS/MAS Operations** menu, choosing
the **Execute DPS Job** menu option, and selecting `gedi-subset_ubuntu:<VERSION>`
@@ -76,8 +85,6 @@ Alternatively, for greater control of your job configuration, you may use the
MAAP API from a Notebook (or a Python script), as follows:

```python
import uuid

from maap.maap import MAAP

maap = MAAP(maap_host='api.ops.maap-project.org')
@@ -96,14 +103,39 @@ result = maap.submitJob(
    limit=limit,
)

print(result["job_id"])
job_id = result["job_id"]
job_id
```

### Checking the DPS Job Status

To check the status of your job via the ADE UI, open the **DPS/MAS Operations**
menu, choose **Get DPS Job Status**, and enter the value of the `job_id` to
obtain the status, just as if you had submitted the job from the menu (rather
than programmatically).

Alternatively, to programmatically check the status of the submitted job, you
may run the following code. If using a notebook, use a separate cell so you can
run it repeatedly until you get a status of either `'Succeeded'` or `'Failed'`:

```python
import re

# Should evaluate to 'Accepted', 'Running', 'Succeeded', or 'Failed'
re.search(r"Status>(?P<status>.+)</wps:Status>", maap.getJobStatus(job_id).text).group('status')
```
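Rather than re-running the cell by hand, the same check can be wrapped in a polling loop. The following is a sketch under stated assumptions: the helper names, the 30-second interval, and the cap on the number of checks are illustrative choices, and the status text is parsed with the same regular expression shown above.

```python
import re
import time
from typing import Callable

STATUS_PATTERN = re.compile(r"Status>(?P<status>.+)</wps:Status>")
TERMINAL_STATUSES = {"Succeeded", "Failed"}


def parse_job_status(response_text: str) -> str:
    """Extract the job status from the XML text of a status response."""
    match = STATUS_PATTERN.search(response_text)
    if match is None:
        raise ValueError("No job status found in response text")
    return match.group("status")


def wait_for_job(
    fetch_status_text: Callable[[], str],
    interval_seconds: float = 30.0,  # assumed polling interval
    max_checks: int = 120,  # assumed upper bound on the number of checks
) -> str:
    """Poll until the job status is 'Succeeded' or 'Failed', then return it."""
    for _ in range(max_checks):
        status = parse_job_status(fetch_status_text())
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(interval_seconds)
    raise TimeoutError(f"Job still not finished after {max_checks} checks")
```

With the `maap` and `job_id` values from the previous section, this would be called as `wait_for_job(lambda: maap.getJobStatus(job_id).text)`. Passing the fetch step as a callable keeps the loop independent of the MAAP client.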

Using the value of `result["job_id"]`, you may use the **DPS/MAS Operations**
menu for job operations, just as if you had submitted the job from the menu
(rather than programmatically). Once the job status is **Succeeded**, you must
obtain the job's result (**Get DPS Job Result**), which should display 3 URLs,
with the first URL of the following form:
### Getting the DPS Job Results

Once the job status is either **Succeeded** or **Failed**, you may obtain the
job result either via the UI (**DPS/MAS Operations** > **Get DPS Job Result**)
or programmatically. Because the programmatic results are in XML format, which
is difficult to read, using the UI is preferable in this case.

If the job status is **Failed**, the job results should show failure details.

If the job status is **Succeeded**, the job results should show 3 URLs, with the
first URL of the following form:

```plain
http://.../<USERNAME>/dps_output/gedi-subset_ubuntu/<VERSION>/<DATETIME_PATH>
@@ -181,7 +213,7 @@ subsetting DPS job.

## Contributing

### Repository Setup
### Development Setup

To contribute to this work, you must obtain access to the following:

@@ -195,7 +227,7 @@ To contribute to this work, you must obtain access to the following:

To prepare for contributing, do the following in an ADE workspace:

1. Clone the GitHub repository.
1. Clone this GitHub repository.
1. Change directory to the cloned repository.
1. Add the GitLab repository as another remote (named `ade` here, but you may
specify a different name for the remote):
@@ -204,8 +236,21 @@ To prepare for contributing, do the following in an ADE workspace:
git remote add --tags -f ade https://repo.ops.maap-project.org/data-team/maap-documentation-examples.git
```

If you plan to do any development work outside of the ADE (such as on your
local workstation), perform the steps above in that location as well.
1. Create the `gedi_subset` virtual environment by running the following
commands from within the repository directory (**NOTE:** _you will need to
repeat these steps whenever you restart your ADE workspace_):

```bash
conda update conda -n base -c conda-forge -y
conda create -n gedi_subset python=3.10 -c conda-forge -y
conda activate gedi_subset
pip install -r gedi-subset/requirements-dev.txt
```

If you plan to do any development work outside of the ADE (such as on your local
workstation), perform the steps above in that location as well. **NOTE:** _This
means that you must have `conda` installed (see [conda installation]) in your
desired development location outside of the ADE workspace._

During development, you will create PRs against the GitHub repository, as
explained below.
@@ -282,6 +327,8 @@ Runfola, D. et al. (2020) geoBoundaries: A global database of political
administrative boundaries. PLoS ONE 15(4): e0231866.
<https://doi.org/10.1371/journal.pone.0231866>

[conda installation]:
https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html
[geoBoundaries]:
https://www.geoboundaries.org
[geoBoundaries API]:
2 changes: 1 addition & 1 deletion gedi-subset/algorithm_config.yaml
@@ -1,6 +1,6 @@
description: Subset GEDI L4A granules within an area of interest (AOI)
algo_name: gedi-subset
version: gedi-subset-0.2.1
version: gedi-subset-0.2.2
environment: ubuntu
repository_url: https://repo.ops.maap-project.org/data-team/maap-documentation-examples.git
docker_url: mas.maap-project.org:5000/root/ade-base-images/r:latest
24 changes: 23 additions & 1 deletion gedi-subset/fp.py
@@ -10,6 +10,7 @@
from typing import Callable, Iterable, TypeVar, cast

from returns.curry import partial
from returns.maybe import Maybe, Nothing, Some

_A = TypeVar("_A")
_B = TypeVar("_B")
@@ -26,7 +27,28 @@ def filter(predicate: Callable[[_A], bool]) -> Callable[[Iterable[_A]], Iterable
    return partial(builtins.filter, predicate)


def K(a: _A) -> Callable[..., _A]:
def find(predicate: Callable[[_A], bool]) -> Callable[[Iterable[_A]], Maybe[_A]]:
    """Return a callable that accepts an iterable and returns the first item of the
    iterable (in a `Some`) for which `predicate` returns `True`; otherwise `Nothing`.
    >>> find(bool)([])
    <Nothing>
    >>> find(lambda x: x > 42)([19, 2, 42, 55, 45])
    <Some: 55>
    >>> find(lambda x: x > 99)([19, 2, 42, 55, 45])
    <Nothing>
    """

    def go(xs: Iterable[_A]) -> Maybe[_A]:
        for x in xs:
            if predicate(x):
                return Some(x)
        return Nothing

    return go


def always(a: _A) -> Callable[..., _A]:
    """Return the kestrel combinator ("constant" function).
    Return a callable that accepts exactly one argument of any type, but always
74 changes: 41 additions & 33 deletions gedi-subset/maapx.py
@@ -16,40 +16,39 @@
import boto3
from cachetools import FIFOCache, cached
from cachetools.func import ttl_cache
from fp import K
from fp import always, find
from maap.maap import MAAP
from maap.Result import Collection, Granule
from returns.curry import partial
from returns.io import IOFailure, IOResultE, impure_safe
from returns.maybe import Maybe, Nothing, Some, maybe
from returns.pipeline import flow
from returns.pointfree import bind, bind_result, lash, map_
from returns.result import safe
from returns.pipeline import flow, pipe
from returns.pointfree import bind, bind_ioresult, bind_result, lash, map_
from returns.result import ResultE, safe

if TYPE_CHECKING:
    from maap.AWS import AWSCredentials

logger = logging.getLogger(f"gedi_subset.{__name__}")


# https://nasa-openscapes.github.io/2021-Cloud-Workshop-AGU/how-tos/Earthdata_Cloud__Single_File__Direct_S3_Access_COG_Example.html
_S3_CREDENTIALS_ENDPOINT_BY_DAAC: Mapping[str, str] = dict(
    po="https://archive.podaac.earthdata.nasa.gov/s3credentials",
    gesdisc="https://data.gesdisc.earthdata.nasa.gov/s3credentials",
    lp="https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials",
    ornl="https://data.ornldaac.earthdata.nasa.gov/s3credentials",
    ghrc="https://data.ghrc.earthdata.nasa.gov/s3credentials",
)
def _is_s3_credentials_online_resource(resource) -> bool:
    url = resource.get("URL", "").lower()
    description = resource.get("Description", "").lower()

    # TODO: is this sufficient for identifying URL for obtaining S3 credentials?
    return url and url.endswith("/s3credentials") or "credentials" in description

def _s3_credentials_endpoint(download_url: str) -> Maybe[str]:
    endpoints = [
        endpoint
        for key, endpoint in _S3_CREDENTIALS_ENDPOINT_BY_DAAC.items()
        if key in download_url
    ]

    return Some(endpoints[0]) if endpoints else Nothing
def _s3_credentials_endpoint(granule: Granule) -> ResultE[str]:
    granule_ur = granule["Granule"]["GranuleUR"]
    endpoint_error = ValueError(f"Granule {granule_ur} has no S3 credentials endpoint")

    return flow(
        granule.get("Granule", {}).get("OnlineResources", {}).get("OnlineResource", []),
        find(_is_s3_credentials_online_resource),
        bind(safe(operator.itemgetter("URL"))),
        lash(always(IOFailure(endpoint_error))),
    )


@ttl_cache(ttl=55 * 60)
@@ -71,34 +70,37 @@ def _setup_default_boto3_session(creds: "AWSCredentials") -> boto3.session.Sessi
        aws_access_key_id=creds["accessKeyId"],
        aws_secret_access_key=creds["secretAccessKey"],
        aws_session_token=creds["sessionToken"],
        # TODO: make this configurable
        region_name="us-west-2",
    )


def download_granule(maap: MAAP, todir: str, granule: Granule) -> IOResultE[str]:
    """Downloads a granule's data file.
    """Download a granule's data file.
    Automatically fetches S3 credentials appropriate for `granule`, based upon
    Automatically fetch S3 credentials appropriate for `granule`, based upon
    its S3 URL, and automatically refreshes credentials before expiry.
    Return `IOSuccess[str]` containing the absolute path of the downloaded file
    upon success; otherwise return `IOFailure[Exception]` containing the reason
    for failure.
    """
    ur: str = granule["Granule"]["GranuleUR"]
    granule_ur = granule["Granule"]["GranuleUR"]
    logger.debug(f"Downloading granule {granule_ur} to directory {todir}")

    flow(
        maybe(granule.getDownloadUrl)(),
        lash(K(IOFailure(ValueError(f"Missing download URL for granule {ur}")))),
        bind(_s3_credentials_endpoint),
    return flow(
        _s3_credentials_endpoint(granule),
        bind(partial(_earthdata_s3_credentials, maap)),
        map_(_setup_default_boto3_session),
        # We don't need to directly use the boto3 session object, so we discard it by
        # mapping to the constant `todir`. We simply want to download the granule file
        # to `todir` after the S3 credentials are obtained and applied to the boto3
        # default session. If obtaining S3 credentials fails, we bypass the download
        # attempt and return the failure.
        map_(always(todir)),
        bind(impure_safe(granule.getData)),
    )

    logger.debug(f"Downloading granule {ur} to directory {todir}")

    return impure_safe(granule.getData)(todir)


def find_collection(
maap: MAAP,
@@ -111,8 +113,14 @@ def find_collection(
    search; otherwise return `IOFailure[Exception]` containing the reason for
    failure, which is a `ValueError` when there is no matching collection.
    """
    not_found_error = ValueError(f"No collection found at {cmr_host}: {params}")

    return flow(
        impure_safe(maap.searchCollection)(cmr_host=cmr_host, limit=1, **params),
        bind_result(safe(operator.itemgetter(0))),
        lash(K(IOFailure(ValueError(f"No collection found at {cmr_host}: {params}")))),
        bind_ioresult(
            pipe(
                safe(operator.itemgetter(0)),
                lash(always(IOFailure(not_found_error))),
            )
        ),
    )
2 changes: 1 addition & 1 deletion gedi-subset/requirements.txt
@@ -7,4 +7,4 @@ pyarrow==8.0.0
returns==0.19.0
shapely==1.8.2
typer==0.4.1
git+https://github.com/MAAP-Project/maap-py.git@1bfa99d#egg=maappy
git+https://github.com/MAAP-Project/maap-py.git@hotfixes#egg=maappy