feat(runner): heartbeat #1493

joschahenningsen · 2025-02-05T15:12:33Z

Motivation and Context

A runner should regularly report its liveness, load, and state so that gocast can make informed decisions when scheduling jobs.

Description

This PR

adds the fields LastSeen, Draining and JobCount to the runner model
adds an update method to the dao to allow updating the runner
creates a new type of Notification.data: HeartbeatNotification
implements receiving notifications in runner_manager.Manager
implements a goroutine that sends a notification every five seconds to the r.notifications channel
implements sending notifications from the runner to gocast, including a retry mechanism based on https://github.com/sethvargo/go-retry (an excellent lightweight library for retry handling)

Steps for Testing

Start gocast
Start runner
Observe the database (the last_seen field should update periodically)

carlobortolan

Tested it locally and everything works as expected. There's just one edge case/bug (?) where I'm not sure if this behavior is intended:

Setup:

Start GoCast API
Start one runner:

+-----------+-------+-------------------------+----------+-----------+
| hostname  | port  | last_seen               | draining | job_count |
+-----------+-------+-------------------------+----------+-----------+
| localhost | 43215 | 2025-02-07 01:12:21.015 |        0 |         0 |
+-----------+-------+-------------------------+----------+-----------+

Refresh the runner table; the port should be set (here to 43215), and last_seen should be updated.
Start a second runner (registered on the same hostname):

+-----------+-------+-------------------------+----------+-----------+
| hostname  | port  | last_seen               | draining | job_count |
+-----------+-------+-------------------------+----------+-----------+
| localhost | 44173 | 2025-02-07 01:13:36.044 |        0 |         0 |
+-----------+-------+-------------------------+----------+-----------+

-> As per #1490, on insert conflicts, the port is updated (here to 44173).
-> Refresh the runner table; last_seen should now be updated.

To reproduce the unexpected part:

Stop the second runner:

{"time":"2025-02-07T01:16:59.060683039+01:00","level":"INFO","msg":"Runner set to drain.","version":"dev"}
2025/02/07 01:17:09 INFO No jobs left, shutting down

I'd expect the runner to be set to draining=1, but it remains draining=0.
I'd expect last_seen to be set to the time of the last notification received from that runner before it was stopped. Instead, it continues to be updated by the first runner:

+-----------+-------+-------------------------+----------+-----------+
| hostname  | port  | last_seen               | draining | job_count |
+-----------+-------+-------------------------+----------+-----------+
| localhost | 44173 | 2025-02-07 01:20:11.049 |        0 |         0 |
+-----------+-------+-------------------------+----------+-----------+

As a result, the database makes it look like there is still a live runner (in this case on port 44173).

joschahenningsen · 2025-02-07T06:13:43Z

I think we can assume that only one runner ever runs on a single host. If we want to change that in the future, we need to add the port to the Get() query and should be good to go.

carlobortolan

Ok, I think we should keep it in mind for the future in case someone plans to deploy runners manually or if we ever support multiple runners per host.

joschahenningsen · 2025-02-07T11:28:59Z

I agree, we should get back to this in the future. Thanks for pointing that out!

joschahenningsen and others added 12 commits February 2, 2025 16:09

feat(runner): implement runner registration with gocast

77e9dda

fix: use log/slog with import alias throughout cmd/tumlive/tumlive.go

229dd3f

fix(dao/runner): delete by hostname

efc80f2

fix(dao/runner): remove duplicate key column hostname in OnConflict clause

edd1a89

fix: lint protofile

ae4d17f

feat: add lint target for runner proto files

11e4871

fix: regen protos

13915c0

feat(runner): implement heartbeat notification

4fac3ae

chore: lint imports/code

fe2dc85

chore: clean up logging

a2a989d

doc(model/runner): document runner fields

2cf1c8b

chore(dao/runner): regen mock

1a1f6d9

joschahenningsen added the runner label Feb 5, 2025

joschahenningsen requested a review from DawinYurtseven February 5, 2025 15:12

joschahenningsen self-assigned this Feb 5, 2025

joschahenningsen added 2 commits February 5, 2025 16:14

Revert "Revert "chore(deps): update frontend dependencies" (#1477)"

e0000c2

This reverts commit ca0034a.

Merge branch 'dev' into feat/runner-heartbeat

73b0934

joschahenningsen marked this pull request as draft February 5, 2025 15:27

joschahenningsen marked this pull request as ready for review February 5, 2025 15:27

joschahenningsen added 2 commits February 5, 2025 16:29

fix: merge conflicts

66c55c0

Revert "Revert "Revert "chore(deps): update frontend dependencies" (#1477)""

637e6d6

This reverts commit e0000c2.

joschahenningsen changed the title ~~Feat/runner heartbeat~~ feat(runner): heartbeat Feb 5, 2025

fix further merge conflicts

414f8e9

CommanderStorm reviewed Feb 5, 2025

runner/runner.go (outdated, resolved)

joschahenningsen requested a review from a team February 5, 2025 21:40

Update runner/runner.go

9350efe

Co-authored-by: Frank Elsinga <[email protected]>

joschahenningsen mentioned this pull request Feb 6, 2025

feat(runner): implement streaming functionality #1494

Merged

carlobortolan reviewed Feb 7, 2025

carlobortolan self-requested a review February 7, 2025 11:24

carlobortolan approved these changes Feb 7, 2025

joschahenningsen merged commit 1fe5766 into dev Feb 7, 2025
9 checks passed

joschahenningsen deleted the feat/runner-heartbeat branch February 7, 2025 11:29

carlobortolan mentioned this pull request Feb 12, 2025

feat(apiv2): stream and progress endpoints for API v2 #1495

Merged