Currently, if a user supplies an invalid column name, or an invalid query, job execution won't fail until after 2 CMR queries are made (1 to get the collection ID, and 1 to get the list of granules intersecting the AOI), and at least 1 granule file is downloaded, and an attempt is made to subset the downloaded file. This can not only take significant time (perhaps many minutes in the case of an AOI that intersects hundreds of granules, due to the CMR query), but can also produce a fair bit of logging "noise", making it hard to quickly see that the job failure is due to a bad column name or query.
To enable a more fail-fast approach, which also provides more helpful error messages for users, validate the columns and query input values as early as possible, even prior to making any CMR queries (and thus also prior to downloading any files). This should be based upon the user-specified doi since different GEDI collections have different possible column (dataset) names.
By taking a fail-fast approach to validating the column names and the query expression, we can fail within seconds (rather than minutes) when an invalid column name or query is supplied, which not only greatly shortens the feedback loop, but also greatly reduces the amount of logged output before failure, and allows for a very clear and precise error message for the user.
Based upon a design session with @jjfrench, we propose 2 possible (and similar) approaches.
Approach 1
For each DOI, create a {doi}.csv file (perhaps next to subset.py?) with a header row containing all possible column (dataset) names for the corresponding DOI, and 1 data row with a dummy value of the correct data type (int or float). The column names should include the "path" to the dataset, relative to the parent BEAMXXXX group. For example, for a dataset located at BEAMXXXX/dataset1, the column name should be dataset1, and for a dataset at BEAMXXXX/subgroup/dataset2, the column name should be subgroup/dataset2. These are the same names that a user would supply as inputs.
Creating these csv files will be rather tedious work (the column names and types must be obtained from the data dictionaries for the DOIs), but should enable us to write very little code for this enhancement. For 2D datasets, the corresponding column in the csv file should be named using the format <name>{<ncols>}, where <name> is the name of the 2D dataset, and <ncols> is the number of columns in the 2D dataset. For example, if xvar is a 2D dataset with 4 columns, then the csv file should include a column named xvar{4} (prefixed with its relative path, if any, as illustrated above [use of curly braces is not mandatory, but the final syntax should provide some means of unambiguous distinction between the name and the number of columns -- e.g., xvar[4] or xvar:4]).
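For illustration only (the column names here are placeholders; the real names must come from each DOI's data dictionary), a {doi}.csv might begin like this:

lat_lowestmode,lon_lowestmode,sensitivity,geolocation/elev_lowestmode,xvar{4}
0.0,0.0,0.0,0.0,0.0

Note that xvar{4} occupies a single csv column with a single dummy value; the expansion into xvar0 through xvar3 happens in the validation code, not in the file.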
Validate user-supplied columns (after splitting on commas and trimming) and query inputs using the following (roughly):
doi_df = pd.read_csv(f"{doi}.csv")

# TODO: expand all 2D columns into appropriate 1D columns.
# For example, replace column `xvar{4}` with columns `xvar0` through `xvar3`

if query:
    doi_df.query(query)  # Raise exception if query is invalid

for column in set(columns):
    doi_df[column]  # Raise exception if any column name is invalid

doi_df[lon]  # Raise if longitude column is invalid
doi_df[lat]  # Raise if latitude column is invalid
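As a rough sketch of the TODO above (not a definitive implementation; the helper name is hypothetical, and it simply parses the {n} suffix convention proposed earlier), the 2D-column expansion might look like:

import re

import pandas as pd

def expand_2d_columns(doi_df: pd.DataFrame) -> pd.DataFrame:
    """Replace each column named like 'xvar{4}' with 'xvar0' ... 'xvar3',
    reusing the dummy value so each expanded column keeps the correct dtype."""
    for name in list(doi_df.columns):
        match = re.fullmatch(r"(.+)\{(\d+)\}", name)
        if match:
            base, ncols = match[1], int(match[2])
            for i in range(ncols):
                doi_df[f"{base}{i}"] = doi_df[name]
            doi_df = doi_df.drop(columns=[name])
    return doi_df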
The code above (and the code to be written based on the TODO comment) should be sufficient, but will fail the job on the first error encountered. If there are multiple problems, only the first will be identified, thus requiring the user to run multiple jobs to uncover all problems with the columns and query inputs. Therefore, to allow the user to see all problems with columns and query at once, the code above should be enhanced to capture all exceptions, and then fail the job with a list of all the error messages.
Approach 2
Rather than creating a csv per DOI, obtain an actual h5 file for each DOI, and remove all but 1 value from every dataset in each file. Similar to constructing the csv files in the other approach, this too is likely tedious work, but perhaps leads to more reliable validation results.
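As a rough sketch (assuming the source file can simply be walked with h5py; attribute copying and any per-dataset quirks are glossed over), producing such a stripped-down file might look like:

import h5py

def shrink_h5(src_path: str, dst_path: str) -> None:
    """Copy the dataset layout of src_path, keeping only the first
    element (first row, for 2D datasets) of every dataset."""
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        def copy_first(name, obj):
            if isinstance(obj, h5py.Dataset):
                # Keep one row so the dtype and 2D shape survive;
                # scalar datasets are copied whole.
                data = obj[:1] if obj.shape else obj[()]
                dst.create_dataset(name, data=data)
        src.visititems(copy_first)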
Validate user-supplied columns and query values similar to the other approach, along these lines:
with h5py.File(f"{doi}.h5") as hdf5:
    gdf = subset_hdf5(
        hdf5,
        aoi=aoi_gdf,
        lon_col=lon,
        lat_col=lat,
        beam_filter=beam_filter(beams),
        columns=columns,
        query=query,
    )

for column in set(columns):
    gdf[column]  # Raise exception if any column name is invalid
Again, a bit of additional code would be required to capture all problems at once.
Capturing Multiple Errors
The only situation where not all problems would be captured at once, regardless of the choice of approach, is when the query expression has multiple problems itself. There is no clean approach to identifying them all at once because we are relying on the pandas expression parser to raise an error when the query is invalid, and the parser does not identify all problems at once.
However, we can readily capture multiple errors covering 1 error in the query and multiple column errors (i.e., multiple invalid column names) by adding a bit more logic to either code snippet given above.
Since we are using Python 3.11, this is a good case for an ExceptionGroup (introduced in 3.11): we can capture all errors in an ExceptionGroup and raise the group to fail the job (of course, only if any errors are captured while validating columns and query).
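A minimal sketch of that idea, in terms of Approach 1's doi_df (the error messages and grouping are illustrative only):

errors: list[Exception] = []

if query:
    try:
        doi_df.query(query)
    except Exception as e:
        errors.append(ValueError(f"Invalid query {query!r}: {e}"))

for column in {*columns, lon, lat}:
    if column not in doi_df.columns:
        errors.append(ValueError(f"Invalid column name: {column!r}"))

if errors:
    raise ExceptionGroup("Invalid columns/query inputs", errors)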