-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
run_udf
: Simplified and clarified the schema for data
#515
#519
base: draft
Are you sure you want to change the base?
Conversation
"description": "A single value of any data type." | ||
} | ||
] | ||
"description": "The data to be passed to the UDF.\n\nThe data must be given in a way that the UDF can understand it. Usually, this process is executed as part of a data cube process such as `reduce_dimension` and in this case the process would expect an array of values as provided in the parameter `data` by `reduce_dimension`. Please refer to the documentation of the UDF runtime for details on how to provide the data.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The data must be given in a way that the UDF can understand it.
From my UDF-usage experience this statement sounds a bit backwards. When working with UDFs in practice, the data and its form are a given and you have to tweak your UDF to match that form.
Basically in practice, the "parent" process (e.g. reduce_dimension) dictates the form of data
, so it might be confusing to the reader trying to understand how run_udf
operates by stating something like "the data must be given in a way ..."
I know that it's possible to have processes before run_udf
in reduce_dimension
's reducer, which would allow for some kind of data preparation, but I don't think there is an implementation that supports that (the geopyspark one doesn't in any case).
Also note that there isn't necessarily direct correspondence between the data
argument of run_udf
and the actual data that is available in the UDF code. For example: with apply
, the data
of the "child" callback is a single pixel value, but the UDF code could still receive larger multidimensional chunks for performance reasons. So I would avoid over-specifying the link between run_udf
's data
parameter and the actual data container available in the UDF.
Please refer to the documentation of the UDF runtime for details on how to provide the data.
Likewise: the UDF runtime docs can provide little help here, it's the parent process that dictates the form of run_udf
's data
parameter
Attempt to tweak the description:
"description": "The data to be passed to the UDF.\n\nThe data must be given in a way that the UDF can understand it. Usually, this process is executed as part of a data cube process such as `reduce_dimension` and in this case the process would expect an array of values as provided in the parameter `data` by `reduce_dimension`. Please refer to the documentation of the UDF runtime for details on how to provide the data.", | |
"description": "The data to be passed to the UDF.\n\nUsually, `run_udf` process is used as a child process of a data cube process such as `reduce_dimension` or `apply`. These data cube processes establish the actual type of this data. For example, in case of `reduce_dimension`, `data` will be an array of values, while in case of `apply` it will be a single scalar value.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also fine to just leave the description as is ("The data to be passed to the UDF."), to avoid bike-shedding this too much.
"description": "A single value of any data type." | ||
} | ||
] | ||
"description": "The data to be passed to the UDF.\n\nThe data must be given in a way that the UDF can understand it. Usually, this process is executed as part of a data cube process such as `reduce_dimension` and in this case the process would expect an array of values as provided in the parameter `data` by `reduce_dimension`. Please refer to the documentation of the UDF runtime for details on how to provide the data.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see run_udf
comment
Fixes #515