feature: simplify health/readiness checks #79

rustatian · 2024-09-15T21:26:02Z

Reason for This PR

closes: roadrunner-server/roadrunner#1998

Description of Changes

Support health/readiness checks w/o specifying plugins list.
TODO: update docs.
Strict response schema (even on 5xx status codes), array:

[
    {
        "plugin_name": "string",
        "error_message": "string",
        "status_code": "int"
    }
    {
    ...
    }
]

For the /jobs endpoint, array:

[
    {
        "pipeline": "string",
        "priority": "int",
        "ready": "bool"
        "queue": "string",
        "active": "int",
        "delayed": "int"
        "reserved": "int",
        "driver": "string",
        "error_message": "string"
    }
    {
    ...
    }
]

License Acceptance

By submitting this pull request, I confirm that my contribution is made under the terms of the MIT license.

PR Checklist

[Author TODO: Meet these criteria.]
[Reviewer TODO: Verify that these criteria are met. Request changes if not]

All commits in this PR are signed (git commit -s).
The reason for this PR is clearly provided (issue no. or explanation).
The description of changes is clear and encompassing.
Any required documentation changes (code and docs) are included in this PR.
Any user-facing changes are mentioned in CHANGELOG.md.
All added/changed functionality is tested.

Summary by CodeRabbit

New Features
- Introduced new data structures for reporting plugin and job statuses, enhancing observability.
- Improved health check and readiness handler to provide structured JSON responses with detailed error reporting.
- Increased polling capability by modifying configuration for concurrent job processing.
Bug Fixes
- Enhanced error handling in health checks and job reporting to improve clarity and robustness.
Refactor
- Replaced string-based job state reporting with structured data handling for better maintainability.
Tests
- Added new test cases for validating health and readiness of plugins, improving test reliability and clarity.

Signed-off-by: Valery Piashchynski <[email protected]>

coderabbitai · 2024-09-15T21:26:10Z

Walkthrough

The changes in this pull request involve modifications to several files related to the health check and job reporting functionalities of a Go application. Key updates include the introduction of new data structures for reporting plugin and job statuses, enhancements to the health check mechanism for improved observability, and adjustments to configuration files to support better job management. Additionally, some linters were removed from the configuration, and tests were updated to improve reliability and clarity.

Changes

Files	Change Summary
`.golangci.yml`	Removed the `gocognit` and `gocyclo` linters from the configuration file.
`api.go`, `jobs.go`	Added new data structures: `Report` and `JobsReport` in `api.go`, and refactored job reporting in `jobs.go` to use `JobsReport`.
`health.go`, `ready.go`	Enhanced health check functionality in `health.go` with structured reporting and improved error handling. Updated `ready.go` to provide detailed JSON responses for plugin readiness.
`plugin.go`	Removed the `template` string constant related to plugin status formatting.
`tests/configs/.rr-jobs-status.yaml`	Increased the number of pollers from 1 to 2 in the job status configuration file.
`tests/configs/.rr-status-503.yaml`	Added a new section for job management, specifying parameters for job processing including `num_pollers`, `pipeline_size`, and worker pool settings.
`tests/plugin_test.go`	Introduced new test cases for health and readiness checks, updated existing tests for structured assertions, and standardized error handling across tests.

Assessment against linked issues

Objective	Addressed	Explanation
Define a single source for determining the health of the entire application server (#1998)	✅
Simplify health check requests to not require explicit plugin listing (#1998)	❓	The changes improve reporting but do not fully implement the proposal for a simpler health check.

Possibly related PRs

fix: properly return 503 status when shutdown was initiated #65: Modifies the health.go file to ensure that the status server returns a 503 status during shutdown, related to health check functionalities.
fix: healthcheck endpoint during graceful shutdown #71: Changes the behavior of the /health endpoint during a graceful shutdown, connecting to improvements in health check responses.

Poem

🐇 In the meadow where plugins play,
Health checks dance in a bright array.
With reports so clear, and jobs in line,
Our server's health will surely shine!
No more lost plugins in the fray,
Hooray for changes that save the day! 🌼

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share

Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
-- I pushed a fix in commit <commit_id>, please review it.
-- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
-- @coderabbitai generate unit testing code for this file.
-- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
-- @coderabbitai generate interesting stats about this repository and render them as a table.
-- @coderabbitai read src/utils.ts and generate unit testing code.
-- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
-- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 4

Outside diff range and nitpick comments (2)

tests/configs/.rr-jobs-status.yaml (1)

14-14: Looks good! Consider monitoring performance and resource utilization post-deployment.

Increasing num_pollers from 1 to 2 can potentially improve the system's ability to process jobs concurrently, leading to better performance and reduced processing times.

However, it's important to monitor the system's resource utilization (CPU, memory) and overall performance after deploying this change to production. If the increased concurrency leads to resource contention or degraded performance, you may need to adjust the num_pollers value or scale the system's resources accordingly.

Consider collecting metrics and logs related to job processing, queue lengths, and resource utilization. Analyze this data to determine if further tuning of the num_pollers parameter is necessary to achieve the desired performance characteristics while ensuring the system remains stable and responsive.
ready.go (1)
36-37: Consider initializing the report slice with a more appropriate capacity.

The current initialization of the report slice with a capacity of 2 seems arbitrary and may not be sufficient for all scenarios. Consider initializing the slice with a capacity equal to the number of plugins in the statusRegistry to avoid unnecessary allocations and improve performance.
-report := make([]*Report, 0, 2)
+report := make([]*Report, 0, len(rd.statusRegistry))

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 1b9c916 and 2d97bc0.

Files ignored due to path filters (2)

go.work.sum is excluded by !**/*.sum
tests/go.sum is excluded by !**/*.sum

Files selected for processing (9)

.golangci.yml (0 hunks)
api.go (1 hunks)
health.go (2 hunks)
jobs.go (2 hunks)
plugin.go (0 hunks)
ready.go (2 hunks)
tests/configs/.rr-jobs-status.yaml (1 hunks)
tests/configs/.rr-status-503.yaml (2 hunks)
tests/plugin_test.go (14 hunks)

Files not reviewed due to no reviewable changes (2)

.golangci.yml
plugin.go

Additional comments not posted (17)

tests/configs/.rr-status-503.yaml (2)

14-20: LGTM! Consider adjusting the settings for production deployment.

The introduction of the jobs section with parameters for job management aligns with the PR objective of enhancing job processing capabilities. The current settings seem suitable for testing or low-traffic scenarios.

For production deployment, consider the following:

Adjust num_pollers and num_workers based on the expected workload and system resources to ensure optimal performance and scalability.

Fine-tune the pipeline_size value based on the specific use case and available system resources to strike a balance between throughput and resource utilization.

Monitor the allocate_timeout and destroy_timeout values in production and adjust them if necessary to ensure efficient resource management without causing resource contention or starvation.

32-32: Looks good!

The grace_period value of 10 seconds seems appropriate for allowing the server to shut down gracefully. It provides sufficient time for the server to complete any ongoing tasks and release resources gracefully before terminating.

api.go (2)

3-7: LGTM!

The Report struct is well-defined and aligns with the intended usage. The fields are appropriately named and tagged for JSON serialization.

9-19: LGTM!

The JobsReport struct is well-defined and captures essential job-related metrics. The fields are appropriately named and tagged for JSON serialization. This struct will be valuable for monitoring and debugging job performance and health.

jobs.go (2)

49-60: LGTM!

The code correctly populates the report slice with JobsReport instances created from the jobStates elements. The field mappings are accurate, and the logic is sound.

62-66: LGTM!

The code correctly marshals the report slice into JSON format and handles any errors that may occur during the process. The error handling ensures that any issues are logged appropriately.

health.go (5)

4-4: LGTM!

The "encoding/json" package import is necessary for JSON serialization and deserialization in the ServeHTTP method.

33-34: LGTM!

The report slice is initialized with a capacity of 2, which is a good practice to avoid unnecessary allocations. The comment provides clarity on the purpose of the report slice.

37-106: Comprehensive health check implementation!

The code block handles the case when no plugins are provided in the query. It performs a comprehensive health check by iterating over all registered plugins and checking their status. The detailed reports appended to the report slice provide valuable information about each plugin's health status, including error messages and status codes.

The code handles various scenarios such as uninitialized plugins, plugins returning nil, and unexpected status codes. The response is properly serialized to JSON and written to the response writer.

Overall, this code block enhances the health check functionality by providing a comprehensive overview of all plugins' health status when no specific plugins are requested.

108-160: Targeted health check implementation!

The code block iterates over all provided plugins in the query and performs a targeted health check. It checks the status of each plugin and appends detailed reports to the report slice, including error messages and status codes.

The code handles various scenarios such as uninitialized plugins, plugins returning nil, and unexpected status codes. Setting the response status code to unavailableStatusCode for status codes >= 500 is appropriate.

Additionally, the code logs information when a plugin does not support health checks, providing visibility into unsupported plugins.

Overall, this code block enhances the health check functionality by allowing targeted health checks for specific plugins provided in the query.

161-168: LGTM!

The report slice is serialized to JSON, and the serialized JSON response is written to the response writer, which is the correct way to send the response.

Any error during JSON marshaling is logged, which is important for debugging purposes.

ready.go (4)

40-96: LGTM!

The code changes for handling the readiness checks of all plugins when no specific plugins are provided in the query parameters look good. The code flow, error handling, and logging seem appropriate. The report slice is being populated correctly based on the readiness status of each plugin.

111-159: LGTM!

The code changes for handling the readiness checks of the provided plugins in the query parameters look good. The code flow, error handling, and HTTP status code setting seem appropriate. The report slice is being populated correctly based on the readiness status of each provided plugin.

160-162: LGTM!

The code correctly logs an informational message when a provided plugin does not support readiness checks and continues execution without adding any report for the unsupported plugin.

169-170: LGTM!

The code correctly writes the JSON response to the ResponseWriter. Although the error returned by w.Write is being ignored, it is a common practice in HTTP handlers.

tests/plugin_test.go (2)

661-680: LGTM!

The checkHTTPStatusAll function correctly verifies the response from the /health endpoint without any query parameters. The changes improve the reliability of the test by unmarshalling the JSON response and asserting the contents of the status.Report slice.

719-733: LGTM!

The checkHTTPReadinessAll function correctly verifies the response from the /ready endpoint without any query parameters. The changes improve the reliability of the test by unmarshalling the JSON response into a slice of status.Report structs and asserting the response status code.

jobs.go

ready.go

Signed-off-by: Valery Piashchynski <[email protected]>

coderabbitai

Actionable comments posted: 0

Outside diff range and nitpick comments (1)

health.go (1)
33-35: Consider setting the report slice capacity based on the expected number of plugins.

The current capacity of 2 seems arbitrary. To avoid unnecessary allocations, set the capacity based on the expected number of plugins.

For example, if the expected number of plugins is known, you can set the capacity as follows:
report := make([]*Report, 0, expectedPluginCount)
If the expected number of plugins is unknown, you can use the length of the statusRegistry map as a reasonable estimate:
report := make([]*Report, 0, len(rd.statusRegistry))

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 2d97bc0 and 8546b5e.

Files selected for processing (3)

health.go (2 hunks)
jobs.go (2 hunks)
ready.go (2 hunks)

Files skipped from review as they are similar to previous changes (1)

ready.go

Additional comments not posted (8)

jobs.go (4)

45-46: Suggestion implemented: Using len(jobStates) as the capacity for the report slice.

The code has been updated to set the capacity of the report slice to len(jobStates), as suggested in the previous review. This ensures that the slice has enough capacity to store all the job reports without the need for reallocation.

49-60: LGTM: Structured approach for job state reporting.

The code segment populates the report slice by iterating over jobStates and creating JobsReport instances. This change enhances code clarity and maintainability by using a structured approach to represent job state information. The implementation looks good.

62-66: LGTM: JSON serialization and error handling.

The code segment serializes the report slice into JSON format using json.Marshal, which is a standardized format that is easier for clients to consume. Error handling has been introduced to ensure that any issues during the JSON marshaling process are logged appropriately. The implementation looks good.

68-70: Suggestion implemented: Handling the error returned by w.Write.

The code has been updated to handle the error returned by w.Write, as suggested in the previous review. This ensures that any issues during the write operation are logged appropriately.

health.go (4)

38-106: LGTM!

The code segment handling the scenario where no plugins are provided in the query looks good. It comprehensively checks the health of all registered plugins, handles various error scenarios, and provides a structured JSON response. The error handling and logging are also implemented appropriately.

111-163: LGTM!

The code segment handling the scenario where specific plugins are provided in the query looks good. It checks the health of the specified plugins, handles various error scenarios, and provides a structured JSON response. The error handling and logging are also implemented appropriately.

165-168: LGTM!

The code segment for serializing the report slice to JSON looks good. It properly marshals the data and logs any errors that occur during the process.

170-174: LGTM!

The code segment for writing the serialized JSON response to the response writer looks good. It properly writes the data and logs any errors that occur during the process.

feature: simplify health/readiness checks

2d97bc0

Signed-off-by: Valery Piashchynski <[email protected]>

rustatian added the enhancement New feature or request label Sep 15, 2024

rustatian requested a review from wolfy-j September 15, 2024 21:26

rustatian self-assigned this Sep 15, 2024

coderabbitai bot reviewed Sep 15, 2024

View reviewed changes

jobs.go Outdated Show resolved Hide resolved

jobs.go Outdated Show resolved Hide resolved

ready.go Outdated Show resolved Hide resolved

ready.go Show resolved Hide resolved

chore: update error handling

8546b5e

Signed-off-by: Valery Piashchynski <[email protected]>

coderabbitai bot reviewed Sep 15, 2024

View reviewed changes

rustatian merged commit dc69b41 into master Sep 16, 2024
7 checks passed

rustatian deleted the feature/1998-simplify-health-checks branch September 16, 2024 07:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feature: simplify health/readiness checks #79

feature: simplify health/readiness checks #79

rustatian commented Sep 15, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Sep 15, 2024 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

coderabbitai bot left a comment

feature: simplify health/readiness checks #79

feature: simplify health/readiness checks #79

Conversation

rustatian commented Sep 15, 2024 • edited by coderabbitai bot Loading

Reason for This PR

Description of Changes

License Acceptance

PR Checklist

Summary by CodeRabbit

coderabbitai bot commented Sep 15, 2024 • edited Loading

Walkthrough

Changes

Assessment against linked issues

Possibly related PRs

Poem

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

rustatian commented Sep 15, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Sep 15, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)