Currently, if a user supplies an invalid column name, or an invalid query, job execution won't fail until after 2 CMR queries are made (1 to get the collection ID, and 1 to get the list of granules intersecting the AOI), and at least 1 granule file is downloaded, and an attempt is made to subset the downloaded file. This can not only take significant time (perhaps many minutes in the case of an AOI that intersects hundreds of granules, due to the CMR query), but can also produce a fair bit of logging "noise", making it hard to quickly see that the job failure is due to a bad column name or query.
To enable a more fail-fast approach, which also provides more helpful error messages for users, validate the columns and query input values as early as possible, even prior to making any CMR queries (and thus also prior to downloading any files). This should be based upon the user-specified doi since different GEDI collections have different possible column (dataset) names.
By taking a fail-fast approach to validating the column names and the query expression, we can fail within seconds (rather than minutes) when an invalid column name or query is supplied, which not only greatly shortens the feedback loop, but also greatly reduces the amount of logged output before failure, and allows for a very clear and precise error message for the user.
Based upon a design session with @jjfrench, we propose 2 possible (and similar) approaches.
Approach 1
For each DOI, create a {doi}.csv file (perhaps next to subset.py?) with a header row containing all possible column (dataset) names for the corresponding DOI, and 1 data row with a dummy value of the correct data type (int or float). The column names should include the "path" to the dataset, relative to the parent BEAMXXXX group. For example, for a dataset located at BEAMXXXX/dataset1, the column name should be dataset1, and for a dataset at BEAMXXXX/subgroup/dataset2, the column name should be subgroup/dataset2. These are the same names that a user would supply as inputs.
Creating these csv files will be rather tedious work (the column names and types must be obtained from the data dictionaries for the DOIs), but should enable us to write very little code for this enhancement. For 2D datasets, the corresponding column in the csv file should be named using the format <name>{<ncols>}, where <name> is the name of the 2D dataset, and <ncols> is the number of columns in the 2D dataset. For example, if xvar is a 2D dataset with 4 columns, then the csv file should include a column named xvar{4} (prefixed with its relative path, if any, as illustrated above [use of curly braces is not mandatory, but the final syntax should provide some means of unambiguous distinction between the name and the number of columns -- e.g., xvar[4] or xvar:4]).
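For illustration only (the column names here are placeholders; the real names must come from each DOI's data dictionary), a {doi}.csv might begin like this:

lat_lowestmode,lon_lowestmode,sensitivity,geolocation/elev_lowestmode,xvar{4}
0.0,0.0,0.0,0.0,0.0

Note that xvar{4} occupies a single csv column with a single dummy value; the expansion into xvar0 through xvar3 happens in the validation code, not in the file.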
Validate user-supplied columns (after splitting on commas and trimming) and query inputs using the following (roughly):
doi_df = pd.read_csv(f"{doi}.csv")

# TODO: expand all 2D columns into appropriate 1D columns.
# For example, replace column `xvar{4}` with columns `xvar0` through `xvar3`

if query:
    doi_df.query(query)  # Raise exception if query is invalid

for column in set(columns):
    doi_df[column]  # Raise exception if any column name is invalid

doi_df[lon]  # Raise if longitude column is invalid
doi_df[lat]  # Raise if latitude column is invalid
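As a rough sketch of the TODO above (not a definitive implementation; the helper name is hypothetical, and it simply parses the {n} suffix convention proposed earlier), the 2D-column expansion might look like:

import re

import pandas as pd

def expand_2d_columns(doi_df: pd.DataFrame) -> pd.DataFrame:
    """Replace each column named like 'xvar{4}' with 'xvar0' ... 'xvar3',
    reusing the dummy value so each expanded column keeps the correct dtype."""
    for name in list(doi_df.columns):
        match = re.fullmatch(r"(.+)\{(\d+)\}", name)
        if match:
            base, ncols = match[1], int(match[2])
            for i in range(ncols):
                doi_df[f"{base}{i}"] = doi_df[name]
            doi_df = doi_df.drop(columns=[name])
    return doi_df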
The code above (and the code to be written based on the TODO comment) should be sufficient, but will fail the job on the first error encountered. If there are multiple problems, only the first will be identified, thus requiring the user to run multiple jobs to uncover all problems with the columns and query inputs. Therefore, to allow the user to see all problems with columns and query at once, the code above should be enhanced to capture all exceptions, and then fail the job with a list of all the error messages.
Approach 2
Rather than creating a csv per DOI, obtain an actual h5 file for each DOI, and remove all but 1 value from every dataset in each file. Similar to constructing the csv files in the other approach, this too is likely tedious work, but perhaps leads to more reliable validation results.
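As a rough sketch (assuming the source file can simply be walked with h5py; attribute copying and any per-dataset quirks are glossed over), producing such a stripped-down file might look like:

import h5py

def shrink_h5(src_path: str, dst_path: str) -> None:
    """Copy the dataset layout of src_path, keeping only the first
    element (first row, for 2D datasets) of every dataset."""
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        def copy_first(name, obj):
            if isinstance(obj, h5py.Dataset):
                # Keep one row so the dtype and 2D shape survive;
                # scalar datasets are copied whole.
                data = obj[:1] if obj.shape else obj[()]
                dst.create_dataset(name, data=data)
        src.visititems(copy_first)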
Validate user-supplied columns and query values similar to the other approach, along these lines:
with h5py.File(f"{doi}.h5") as hdf5:
    gdf = subset_hdf5(
        hdf5,
        aoi=aoi_gdf,
        lon_col=lon,
        lat_col=lat,
        beam_filter=beam_filter(beams),
        columns=columns,
        query=query,
    )

for column in set(columns):
    gdf[column]  # Raise exception if any column name is invalid
Again, a bit of additional code would be required to capture all problems at once.
Capturing Multiple Errors
The only situation where not all problems would be captured at once, regardless of the choice of approach, is when the query expression has multiple problems itself. There is no clean approach to identifying them all at once because we are relying on the pandas expression parser to raise an error when the query is invalid, and the parser does not identify all problems at once.
However, we can readily capture multiple errors covering 1 error in the query and multiple column errors (i.e., multiple invalid column names) by adding a bit more logic to either code snippet given above.
Since we are using Python 3.11, this is a good case for an ExceptionGroup (introduced in 3.11): we can capture all errors in an ExceptionGroup and raise the group to fail the job (of course, only if any errors are captured while validating columns and query).
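A minimal sketch of that idea, in terms of Approach 1's doi_df (the error messages and grouping are illustrative only):

errors: list[Exception] = []

if query:
    try:
        doi_df.query(query)
    except Exception as e:
        errors.append(ValueError(f"Invalid query {query!r}: {e}"))

for column in {*columns, lon, lat}:
    if column not in doi_df.columns:
        errors.append(ValueError(f"Invalid column name: {column!r}"))

if errors:
    raise ExceptionGroup("Invalid columns/query inputs", errors)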