Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-46137: Enforce timeouts in GitHub Checks runs #81

Merged
merged 16 commits into from
Sep 12, 2024
Merged

Conversation

jonathansick
Copy link
Member

@jonathansick jonathansick commented Sep 6, 2024

  • Times Square now specifies a timeout for notebook executions with Noteburst. This provides better feedback for notebook executions that have hung, either when rendering new pages for users, for when doing a GitHub Check Run. Currently this timeout is set app-wide with the TS_DEFAULT_EXECUTION_TIMEOUT environment variable. The default timeout is 60 seconds. In the future we intend to add per-notebook timeout configuration.

  • Times Square enforces a timeout when polling for Noteburst results during a GitHub Check run. This prevents the Check Run from hanging indefinitely if the Noteburst service is unable to time out a notebook execution or fails for any reason. This timeout is configurable via the TS_CHECK_RUN_TIMEOUT environment variable. The default timeout is 600 seconds.

This minor refactoring is just to reduce the length of the method.
If a noteburst job is still in progress after 5 minutes in a check run
then we report a new timeout failure. This ought to be configurable but
demonstrates a pattern for making sure we don't get stuck with hanging
notebook executions.
Rather than checking the runtime time of each notebook, which doesn't
work well because we may not get the in-progress status of the notebook
we can instead use wait_for to put a timeout on the noteburst polling
call.
We want these check runs to be slightly longer than the configuration so
that the worker's code has a change to report the timeout before arq
kills the job itself and we lose the opportunity to report the check run
as failed to GitHub.

It seems that there are typing issues with using arq.worker.func, so
we'll try to ignore those for now.
jonathansick added a commit to lsst-sqre/phalanx that referenced this pull request Sep 9, 2024
Previously if the notebook ran at all, but raised an exception, we still
marked it as a success. By looking at the ipynb_error field we now can
indicate if the notebook itself is broken.
This uses the new "error" field in the noteburst response that
classifies the source of the noteburst execution error for better user
feedback. This will show the user whether the error was a result of a
timeout or an issue with noteburst itself.
This configuration sets an app-wide default configuration for the
timeout to set when executing a notebook with noteburst (either during a
check run for a PR or in actual use). Using a check run should help
ensure that we don't have hanging executions or check runs and can also
encourage authors to keep pages fast.

In the future we'll add per-page customizations to the timeout so
authors can set longer timeouts when needed, or even set shorter
timeouts to fail fast in quick notebooks.
This is now handled by noteburst when we set the notebook execution
timeout in the execution request. This is a better approach because we
don't actually see the in_progress state for a notebook execution
(again, because arq doesn't expose that to the job metadata, only to the
job result it seems(?)
This results in better printing e.g. in error messages or logs.
@jonathansick jonathansick marked this pull request as ready for review September 12, 2024 19:04
@jonathansick jonathansick merged commit a52110f into main Sep 12, 2024
4 checks passed
@jonathansick jonathansick deleted the tickets/DM-46137 branch September 12, 2024 19:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant