Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(runner): heartbeat #1493

Merged
merged 18 commits into from
Feb 7, 2025
Merged

feat(runner): heartbeat #1493

merged 18 commits into from
Feb 7, 2025

Conversation

joschahenningsen
Copy link
Member

Motivation and Context

A runner should regularly report its liveliness, load and state so gocast can make decisions when scheduling jobs.

Description

This PR

  • adds the fields LastSeen, Draining and JobCount to the runner model
  • adds an update method to the dao to allow updating the runner
  • creates a new type of Notification.data; HeartbeatNotification
  • implements receiving notifications in runner_manager.Manager
  • implements a goroutine that sends a notification every five seconds to the r.notifications channel
  • implements sending notifications from the runner to gocast, including a retry mechanism based on https://github.com/sethvargo/go-retry (an excellent lightweight library for retry handling)

Steps for Testing

  1. Start gocast
  2. Start runner
  3. Observe database (field last_seen should update periodically)

@joschahenningsen joschahenningsen marked this pull request as draft February 5, 2025 15:27
@joschahenningsen joschahenningsen marked this pull request as ready for review February 5, 2025 15:27
@joschahenningsen joschahenningsen changed the title Feat/runner heartbeat feat(runner): heartbeat Feb 5, 2025
runner/runner.go Outdated Show resolved Hide resolved
@joschahenningsen joschahenningsen requested a review from a team February 5, 2025 21:40
Co-authored-by: Frank Elsinga <[email protected]>
Copy link
Member

@carlobortolan carlobortolan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested it locally and everything works as expected. There's just one edge case/bug (?) where I'm not sure if this behavior is intended:

Setup:

  1. Start GoCast API
  2. Start one runner:
+-----------+-------+-------------------------+----------+-----------+
| hostname  | port  | last_seen               | draining | job_count |
+-----------+-------+-------------------------+----------+-----------+
| localhost | 43215 | 2025-02-07 01:12:21.015 |        0 |         0 |
+-----------+-------+-------------------------+----------+-----------+
  1. Refresh the runner table; the port should be set (here to 43215), and last_seen should be updated.
  2. Start a second runner (registered on the same hostname):
+-----------+-------+-------------------------+----------+-----------+
| hostname  | port  | last_seen               | draining | job_count |
+-----------+-------+-------------------------+----------+-----------+
| localhost | 44173 | 2025-02-07 01:13:36.044 |        0 |         0 |
+-----------+-------+-------------------------+----------+-----------+

-> As per #1490, on insert conflicts, the port is updated (here to 44173).
-> Refresh runner table; last_seen should be now updated.

To reproduce the unexpected part:

  1. Stop the second runner:
{"time":"2025-02-07T01:16:59.060683039+01:00","level":"INFO","msg":"Runner set to drain.","version":"dev"}
2025/02/07 01:17:09 INFO No jobs left, shutting down
  • I'd expect the runner to be set to draining=1, but it remains draining=0.
  • I'd expect that last_seen is set to the time of the last notification received by the runner before it was stopped. Instead, it continues to be updated by the first runner:
+-----------+-------+-------------------------+----------+-----------+
| hostname  | port  | last_seen               | draining | job_count |
+-----------+-------+-------------------------+----------+-----------+
| localhost | 44173 | 2025-02-07 01:20:11.049 |        0 |         0 |
+-----------+-------+-------------------------+----------+-----------+
  • This results in the database making it look like there's an alive runner (in this case on port 44173)

@joschahenningsen
Copy link
Member Author

joschahenningsen commented Feb 7, 2025

I think we can assume that only one runner ever runs on a single host. If we want to change that in the future, we need to add the port to the Get() query and should be good to go

@carlobortolan carlobortolan self-requested a review February 7, 2025 11:24
Copy link
Member

@carlobortolan carlobortolan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think we should keep it in mind for the future in case someone plans to deploy runners manually or if we ever support multiple runners per host.

@joschahenningsen
Copy link
Member Author

I agree, we should get back to this in the future. Thanks for pointing out!

@joschahenningsen joschahenningsen merged commit 1fe5766 into dev Feb 7, 2025
9 checks passed
@joschahenningsen joschahenningsen deleted the feat/runner-heartbeat branch February 7, 2025 11:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants