NFD refactoring #55
Conversation
2. nfd consts put in nfd package and imported
2. used UndefinedValue const in gpu test instead of the string "undefined"
…ts package, and adjusted imports
Walkthrough

The changes introduce new packages and constants for Node Feature Discovery (NFD) and NVIDIA GPU functionality. Multiple configuration structs and functions have been added to streamline deployments and testing of NFD operators, including managing PCI device configurations and custom catalog sources. Test files are refactored to utilize these new constants and configurations. Additionally, dependency versions are updated in the module files while vendor libraries (glog) receive logging and platform-specific enhancements.

Changes
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
✨ Finishing Touches
Hi @AdamKaabyia. Thanks for your PR. I'm waiting for a rh-ecosystem-edge member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label.

I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@coderabbitai review

✅ Actions performed

Review triggered.
Actionable comments posted: 0
🧹 Nitpick comments (7)
pkg/nfd/consts.go (2)
3-19
: Consolidate related constants into cohesive sections.

All the constants are well-defined but could benefit from grouping. For example, place all NFD-related constants together, and separate the "common" constants such as `UndefinedValue`, `OperatorVersionFile`, and `OpenShiftVersionFile` into a distinct block for clarity.

```diff
 const (
+	// NFD constants
 	NfdCustomNFDCatalogSourcePublisherName = "Red Hat"
 	NfdCustomCatalogSourceDisplayName      = "Redhat Operators Custom"
 	NfdRhcosLabel                          = "feature.node.kubernetes.io/system-os_release.ID"
 	NfdRhcosLabelValue                     = "rhcos"
 	NfdOperatorNamespace                   = "openshift-nfd"
 	NfdCatalogSourceDefault                = "redhat-operators"
 	NfdCatalogSourceNamespace              = "openshift-marketplace"
 	NfdOperatorDeploymentName              = "nfd-controller-manager"
 	NfdPackage                             = "nfd"
 	NfdCRName                              = "nfd-instance"

-	//not related to NFD but common consts between gpu and nno
+	// Common constants
 	UndefinedValue       = "undefined"
 	OperatorVersionFile  = "operator.version"
 	OpenShiftVersionFile = "ocp.version"
 )
```
3-19
: Document the purpose of the constants.

Although these constants are self-explanatory, adding short Go doc comments for key constants can help downstream maintainers by clarifying where and how each constant should be used.
tests/nvidianetwork/deploy-nno-test.go (3)
62-62
: Consider clarifying the custom-catalog-source creation logic.

The boolean `createNNOCustomCatalogsource` can be more descriptive, for instance `enableCustomNnoCatalogSource`, to clearly convey that the catalog creation is conditional.

```diff
-	createNNOCustomCatalogsource bool = false
+	enableCustomNnoCatalogSource bool = false
```
66-70
: Avoid repeating the "NFD vars" comment multiple times.

While this comment clarifies context, the logic here is mostly consistent with the GPU flow, which also deals with NFD. Consider either removing or renaming the comment for better clarity.
787-787
: Check concurrency for parallel tests.

Because the code introduces a timed wait for the NFD operator deployment, ensure that multiple test suites running in parallel do not interfere (e.g., by creating or deleting the same operator). If parallel runs are intended, incorporate additional namespaces or concurrency guards where necessary.
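The namespace-guard idea mentioned above can be sketched as a tiny generator that gives each suite run its own namespace, so parallel runs cannot create or delete the same operator resources. Everything here is a hypothetical illustration (the repo's tests use fixed namespaces like `openshift-nfd`), not code from the PR:

```go
package main

import "fmt"

// suiteCounter stands in for any per-process unique token (build ID, PID,
// random suffix); the names here are illustrative, not the repo's API.
var suiteCounter int

// nextSuiteNamespace derives a namespace name unique to this suite
// invocation, so concurrent suites never collide on shared resources.
func nextSuiteNamespace(base string) string {
	suiteCounter++
	return fmt.Sprintf("%s-suite-%d", base, suiteCounter)
}

func main() {
	// Two suites started in the same process get distinct namespaces.
	fmt.Println(nextSuiteNamespace("openshift-nfd"))
	fmt.Println(nextSuiteNamespace("openshift-nfd"))
}
```

In a real CI setup the suffix would come from something globally unique (a build ID or random token), since separate processes do not share a counter.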
tests/nvidiagpu/deploy-gpu-test.go (2)
62-65
: Refine variable names for readability.

Variables like `createGPUCustomCatalogsource` and `createNFDCustomCatalogsource` can be slightly ambiguous. Prefer prefixes like `enable...` to make the usage more explicit (e.g., `enableGpuCustomCatalogSource`).

Also applies to: 78-83
1221-1221
: Extract repeated NFD deployment logic into a shared helper.

Both this GPU test and the NNO test use a similar `createNFDDeployment` approach. Factoring out repeated steps into a single shared utility can reduce duplication and future maintenance overhead.
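A minimal sketch of such a shared helper, with the real subscription and deployment-wait calls replaced by injected callbacks so both suites can reuse the same sequence. All names here are illustrative assumptions, not the repo's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// ensureNFD captures the deploy-then-wait sequence that the GPU and NNO
// suites currently duplicate. The callbacks stand in for the real
// subscription-creation and deployment-readiness helpers.
func ensureNFD(createSubscription func() error, deploymentReady func() bool) error {
	if err := createSubscription(); err != nil {
		return fmt.Errorf("creating NFD subscription: %w", err)
	}
	if !deploymentReady() {
		return errors.New("timed out waiting for NFD operator deployment")
	}
	return nil
}

func main() {
	err := ensureNFD(
		func() error { return nil }, // pretend the subscription was created
		func() bool { return true }, // pretend the deployment came up
	)
	fmt.Println("deployed:", err == nil)
}
```

Each suite would then pass its own client and timeouts into the shared helper instead of re-implementing the sequence.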
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- `pkg/nfd/consts.go` (1 hunks)
- `tests/nvidiagpu/deploy-gpu-test.go` (18 hunks)
- `tests/nvidianetwork/deploy-nno-test.go` (16 hunks)
🔇 Additional comments (4)
tests/nvidianetwork/deploy-nno-test.go (2)
49-54
: Approved usage of `UndefinedValue` for optional config placeholders.

Using `UndefinedValue` instead of hard-coded `"undefined"` is consistent with the newly introduced constants. This improves maintainability and standardizes how undefined states are handled.

Also applies to: 58-64, 66-70
584-584
: Validate fallback for missing subscription channel.

The logic routes to `nnoDefaultSubscriptionChannel` when `nnoSubscriptionChannel` is `UndefinedValue`. Confirm that external dependencies handle "fallback channels" properly if the real channel is not provided.

✅ Verification successful
Fallback channel handling is properly implemented ✅
The fallback mechanism is robust and safe. When `nnoSubscriptionChannel` is `UndefinedValue`, it correctly falls back to `nnoDefaultSubscriptionChannel`, which is dynamically retrieved from the package manifest's `Status.DefaultChannel`. This ensures the fallback channel is always valid and officially supported.

🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Searching usage of nnoSubscriptionChannel and nnoDefaultSubscriptionChannel
# to verify fallback logic correctness.
rg -A 5 'nnoSubscriptionChannel | nnoDefaultSubscriptionChannel'
```

Length of output: 3148
Script:
```shell
#!/bin/bash
# Search for nnoDefaultSubscriptionChannel assignments and related config
rg -B 5 -A 5 'nnoDefaultSubscriptionChannel\s*='

# Look for subscription builder implementation
ast-grep --pattern 'type $_Builder struct { $$$ }'

# Search for any config files with channel information
fd -e yaml -e json . | xargs rg -l 'channel|subscription'
```

Length of output: 3588
tests/nvidiagpu/deploy-gpu-test.go (2)
60-83
: Consistent adoption of `UndefinedValue`.

Replacing the string `"undefined"` with `UndefinedValue` standardizes the approach, making the code clearer to read and less prone to typos or mismatches. This is a well-aligned improvement.
144-144
: Smooth handling of upgrade scenarios.

Using `UndefinedValue` to skip the GPU operator upgrade path is a neat approach. Ensure that any future logic that checks these variables doesn't inadvertently compare them to empty strings or other sentinel values.

Also applies to: 183-183, 936-936, 975-975
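The sentinel pattern discussed above can be centralized in one comparison helper, so future checks never drift to `""` or other ad-hoc sentinels. This is an illustrative sketch under that assumption, not code from the PR:

```go
package main

import "fmt"

// UndefinedValue mirrors the shared sentinel introduced by this PR.
const UndefinedValue = "undefined"

// orDefault returns fallback when value is unset, keeping the sentinel
// comparison in one place instead of scattered across the tests.
func orDefault(value, fallback string) string {
	if value == UndefinedValue || value == "" {
		return fallback
	}
	return value
}

func main() {
	fmt.Println(orDefault(UndefinedValue, "stable")) // falls back
	fmt.Println(orDefault("v24.9.0", "stable"))      // keeps the explicit value
}
```

A helper like this also makes the "skip upgrade when unset" branch a single readable call site.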
… package, and adjusted imports
… also adjusted logging accordingly, so once you change the time const the logging is automatically updated
@coderabbitai review

✅ Actions performed

Review triggered.
Actionable comments posted: 0
🧹 Nitpick comments (4)
pkg/nvidiagpu/consts.go (1)
1-69
: LGTM! Well-structured constants with clear naming.

The constants are logically grouped and use appropriate types. The package provides a comprehensive set of configuration values for the NVIDIA GPU operator.

Consider adding documentation comments for the time duration constants to explain their purpose and why specific durations were chosen. For example:

```go
// MachineReadyWaitDuration is the maximum time to wait for a machine to be ready.
// This is set to 15 minutes to account for potential delays in machine provisioning
// and initialization.
MachineReadyWaitDuration = 15 * time.Minute
```

tests/nvidiagpu/deploy-gpu-test.go (3)
58-76
: Consider using structured configuration.

The large number of global variables could be encapsulated in a configuration struct for better organization and maintainability. Consider refactoring the variables into a structured configuration:

```go
type GPUTestConfig struct {
	ScaleCluster                  bool
	CatalogSource                 string
	CustomCatalogSource           string
	CreateCustomCatalogsource     bool
	CustomCatalogsourceIndexImage string
	SubscriptionChannel           string
	DefaultSubscriptionChannel    string
	OperatorUpgradeToChannel      string
	CleanupAfterTest              bool
	DeployFromBundle              bool
	OperatorBundleImage           string
	CurrentCSV                    string
	CurrentCSVVersion             string
	ClusterArchitecture           string
}
```
914-933
: Consider adding a retry mechanism for pod operations.

The pod operations could benefit from a retry mechanism to handle transient failures. Consider implementing one:

```go
func waitForPodWithRetry(pod *pod.Builder, status corev1.PodPhase, timeout time.Duration, maxRetries int) error {
	for i := 0; i < maxRetries; i++ {
		err := pod.WaitUntilInStatus(status, timeout)
		if err == nil {
			return nil
		}
		glog.V(gpuparams.GpuLogLevel).Infof("Retry %d/%d: Failed to wait for pod status: %v", i+1, maxRetries, err)
		time.Sleep(time.Second * 5)
	}
	return fmt.Errorf("failed to wait for pod status after %d retries", maxRetries)
}
```
1148-1157
: Add error details to timeout messages.

The timeout error messages could be more descriptive to aid in debugging. Enhance them with more context:

```diff
-Expect(err).ToNot(HaveOccurred(), "timeout waiting for re-deployed gpu-burn pod in "+
-	"namespace '%s' to go to Running phase: %v ", nvidiagpu.BurnNamespace, err)
+Expect(err).ToNot(HaveOccurred(), "timeout waiting for re-deployed gpu-burn pod in "+
+	"namespace '%s' to go to Running phase after %s: %v ",
+	nvidiagpu.BurnNamespace,
+	nvidiagpu.RedeployedBurnPodRunningTimeout,
+	err)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- `pkg/nfd/consts.go` (1 hunks)
- `pkg/nfd/nodefeaturediscovery.go` (11 hunks)
- `pkg/nfd/vars.go` (1 hunks)
- `pkg/nvidiagpu/consts.go` (1 hunks)
- `tests/nvidiagpu/deploy-gpu-test.go` (41 hunks)
- `tests/nvidianetwork/deploy-nno-test.go` (16 hunks)
✅ Files skipped from review due to trivial changes (2)
- pkg/nfd/vars.go
- pkg/nfd/nodefeaturediscovery.go
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/nfd/consts.go
- tests/nvidianetwork/deploy-nno-test.go
🔇 Additional comments (3)
tests/nvidiagpu/deploy-gpu-test.go (3)
44-45
: LGTM! Improved worker node selector with constants.

The worker node selector now uses constants from the nvidiagpu package, improving maintainability.

187-201

: LGTM! Improved NFD catalog source handling.

The code now properly handles NFD catalog source configuration using constants.

666-678

: LGTM! Better deployment timing management.

The code now uses well-defined constants for deployment timing, making it easier to adjust timeouts.
2. fixed NFD consts names 3. fixed NFD var bug - create instance per package, not global NFD
@coderabbitai review

✅ Actions performed

Review triggered.

@coderabbitai review

✅ Actions performed

Review triggered.
Actionable comments posted: 2
🔭 Outside diff range comments (2)
tests/nvidiagpu/deploy-gpu-test.go (2)
Line range hint
1186-1204
: Extract NFD deployment configuration into a separate function.

The NFD deployment logic contains hardcoded values for sleep and timeout durations. Consider extracting these into configuration or constants.

```diff
+const (
+	nfdOperatorDeploymentWaitTime      = 2 * time.Minute
+	nfdOperatorDeploymentTimeout       = 5 * time.Minute
+	nfdOperatorDeploymentCheckInterval = 30 * time.Second
+)

 func createNFDDeployment() bool {
 	By("Deploy NFD Subscription in NFD namespace")
 	err := deploy.CreateNFDSubscription(inittools.APIClient, Nfd.CatalogSource)
 	Expect(err).ToNot(HaveOccurred(), "error creating NFD Subscription: %v", err)

-	By("Sleep for 2 minutes to allow the NFD Operator deployment to be created")
+	By(fmt.Sprintf("Sleep for %v to allow the NFD Operator deployment to be created", nfdOperatorDeploymentWaitTime))
 	glog.V(gpuparams.GpuLogLevel).Infof("Sleep for 2 minutes to allow the NFD Operator deployment" +
 		" to be created")
-	time.Sleep(2 * time.Minute)
+	time.Sleep(nfdOperatorDeploymentWaitTime)
```
Line range hint
1209-1234
: Add error handling for pod deletion operations.

The `deleteOLMPods` function should handle potential errors more gracefully and provide better error messages.

```diff
 func deleteOLMPods(apiClient *clients.Settings) error {
+	const olmNamespace = "openshift-operator-lifecycle-manager"
+
+	deleteOpts := metav1.DeleteOptions{}
+	listOpts := metav1.ListOptions{}

-	olmNamespace := "openshift-operator-lifecycle-manager"
 	glog.V(gpuparams.GpuLogLevel).Info("Deleting catalog operator pods")
-	if err := apiClient.Pods(olmNamespace).DeleteCollection(context.TODO(),
-		metav1.DeleteOptions{},
-		metav1.ListOptions{LabelSelector: "app=catalog-operator"}); err != nil {
+	listOpts.LabelSelector = "app=catalog-operator"
+	if err := apiClient.Pods(olmNamespace).DeleteCollection(context.TODO(), deleteOpts, listOpts); err != nil {
+		return fmt.Errorf("failed to delete catalog operator pods: %w", err)
+	}

 	glog.V(gpuparams.GpuLogLevel).Info("Deleting OLM operator pods")
-	if err := apiClient.Pods(olmNamespace).DeleteCollection(
-		context.TODO(),
-		metav1.DeleteOptions{},
-		metav1.ListOptions{LabelSelector: "app=olm-operator"}); err != nil {
+	listOpts.LabelSelector = "app=olm-operator"
+	if err := apiClient.Pods(olmNamespace).DeleteCollection(context.TODO(), deleteOpts, listOpts); err != nil {
+		return fmt.Errorf("failed to delete OLM operator pods: %w", err)
+	}

 	return nil
 }
```
🧹 Nitpick comments (4)
pkg/nfd/vars.go (1)
5-11
: Add documentation for the Config struct and its fields.

The `Config` struct and its fields would benefit from documentation comments explaining their purpose and expected values. This helps other developers understand how to use this configuration correctly.

```diff
+// Config holds the configuration settings for Node Feature Discovery (NFD)
 type Config struct {
+	// CustomCatalogSourceIndexImage specifies the index image for a custom catalog source
 	CustomCatalogSourceIndexImage string
+	// CreateCustomCatalogsource indicates whether to create a custom catalog source
 	CreateCustomCatalogsource bool
+	// CustomCatalogSource specifies the name of the custom catalog source
 	CustomCatalogSource string
+	// CatalogSource specifies the name of the catalog source to use
 	CatalogSource string
+	// CleanupAfterInstall indicates whether to clean up resources after installation
 	CleanupAfterInstall bool
 }
```

tests/nvidiagpu/deploy-gpu-test.go (3)
Line range hint
41-77
: Consider grouping related variables and adding documentation.

The variable declarations could be better organized by grouping related variables together and adding documentation for each group. This improves code readability and maintainability.

```diff
+// NFD configuration
 var (
 	Nfd = nfd.NewConfig()
+)

+// GPU operator configuration
+var (
 	gpuInstallPlanApproval v1alpha1.Approval = "Automatic"

 	gpuWorkerNodeSelector = map[string]string{
 		inittools.GeneralConfig.WorkerLabel: "",
 		nvidiagpu.NvidiaGPULabel:            "true",
 	}
 	// ... other GPU-related variables
+)

+// Machine configuration
+var (
 	machineSetNamespace         = "openshift-machine-api"
 	replicas              int32 = 1
 	workerMachineSetLabel       = "machine.openshift.io/cluster-api-machine-role"
 )
```
667-679
: Extract deployment wait logic into a helper function.

The deployment wait logic is repeated in multiple places. Consider extracting it into a helper function to reduce code duplication.

```go
func waitForDeploymentCreation(apiClient *clients.Settings, deploymentName, namespace string) error {
	glog.V(gpuparams.GpuLogLevel).Infof("Wait for up to %s for %s deployment to be created",
		nvidiagpu.DeploymentCreationTimeout, deploymentName)

	created := wait.DeploymentCreated(
		apiClient,
		deploymentName,
		namespace,
		nvidiagpu.DeploymentCreationCheckInterval,
		nvidiagpu.DeploymentCreationTimeout)
	if !created {
		return fmt.Errorf("timed out waiting to deploy %s", deploymentName)
	}

	return nil
}
```
915-925
: Improve error messages with more context.

The error messages could be more descriptive by including relevant context such as timeout durations and expected states.

```diff
-Expect(err).ToNot(HaveOccurred(), "timeout waiting for gpu-burn pod in "+
-	"namespace '%s' to go to Running phase: %v ", nvidiagpu.BurnNamespace, err)
+Expect(err).ToNot(HaveOccurred(), "timeout after %s waiting for gpu-burn pod in "+
+	"namespace '%s' to transition from %s to Running phase: %v",
+	nvidiagpu.BurnPodRunningTimeout,
+	nvidiagpu.BurnNamespace,
+	gpuPodPulled.Definition.Status.Phase,
+	err)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- `pkg/global/const.go` (1 hunks)
- `pkg/nfd/consts.go` (1 hunks)
- `pkg/nfd/vars.go` (1 hunks)
- `tests/nvidiagpu/deploy-gpu-test.go` (41 hunks)
- `tests/nvidianetwork/deploy-nno-test.go` (17 hunks)
✅ Files skipped from review due to trivial changes (1)
- pkg/global/const.go
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/nfd/consts.go
- tests/nvidianetwork/deploy-nno-test.go
🔇 Additional comments (1)
pkg/nfd/vars.go (1)
13-21
: LGTM! Constructor implementation looks good.

The `NewConfig` function properly initializes all fields with appropriate default values, using the `UndefinedValue` constant for string fields and `false` for boolean fields.
@coderabbitai full review

✅ Actions performed

Full review triggered.
Actionable comments posted: 3
🔭 Outside diff range comments (1)
tests/nvidianetwork/deploy-nno-test.go (1)
Line range hint
344-363
: Improve error handling in cleanup functions.

The cleanup code ignores errors, which could mask issues during teardown. Apply this diff to handle errors:

```diff
 if Nfd.CleanupAfterInstall && cleanupAfterTest {
-	_ = deploy.NFDCRDeleteAndWait(inittools.APIClient, nfd.CRName, nfd.OperatorNamespace, 30*time.Second,
-		5*time.Minute)
+	if err := deploy.NFDCRDeleteAndWait(inittools.APIClient, nfd.CRName, nfd.OperatorNamespace, 30*time.Second,
+		5*time.Minute); err != nil {
+		glog.Errorf("Failed to delete NFD CR: %v", err)
+	}

 	By("Delete NFD CSV")
-	_ = deploy.DeleteNFDCSV(inittools.APIClient)
+	if err := deploy.DeleteNFDCSV(inittools.APIClient); err != nil {
+		glog.Errorf("Failed to delete NFD CSV: %v", err)
+	}
```
♻️ Duplicate comments (1)
tests/nvidiagpu/deploy-gpu-test.go (1)
256-261
: 🛠️ Refactor suggestion

Use constants for sleep and wait durations.

The code uses magic numbers for sleep and wait durations. Consider defining these as named constants in the `nvidiagpu` package for better maintainability.

Also applies to: 667-679, 791-794, 915-925, 1149-1158
🧹 Nitpick comments (5)
pkg/nfd/consts.go (2)
1-17
: Add package and constant documentation.

The constants would benefit from:

- Package documentation explaining the purpose and usage of these NFD-related constants
- Grouping related constants with comments (e.g., catalog source related, label related)
- Documentation for non-obvious constants like `LogLevel`

Apply this diff to improve documentation:

```diff
+// Package nfd provides constants and configuration for Node Feature Discovery (NFD)
 package nfd

 const (
+	// Catalog Source related constants
 	CustomNFDCatalogSourcePublisherName = "Red Hat"
 	CustomCatalogSourceDisplayName      = "Redhat Operators Custom"
+
+	// Node labeling related constants
 	RhcosLabel      = "feature.node.kubernetes.io/system-os_release.ID"
 	RhcosLabelValue = "rhcos"
+
+	// Namespace and deployment related constants
 	OperatorNamespace      = "openshift-nfd"
 	CatalogSourceDefault   = "redhat-operators"
 	CatalogSourceNamespace = "openshift-marketplace"
 	OperatorDeploymentName = "nfd-controller-manager"
 	Package                = "nfd"
 	CRName                 = "nfd-instance"

+	// Resource related constants
 	resourceCRD = "NodeFeatureDiscovery"

+	// LogLevel defines the verbosity level for logging (100 enables debug logs)
 	LogLevel = 100
 )
```
4-5
: Consider consolidating duplicate publisher constants.

The publisher name "Red Hat" is duplicated in the nvidiagpu package. Consider moving common constants to a shared package: create a new shared package (e.g., `pkg/shared/constants.go`) for constants used across multiple packages.

Also applies to: 23-25
pkg/nfd/vars.go (1)
5-11
: Add struct and field documentation.

The `Config` struct and its fields would benefit from documentation explaining their purpose and valid values. Apply this diff to add documentation:

```diff
+// Config holds the configuration for Node Feature Discovery (NFD)
 type Config struct {
+	// CustomCatalogSourceIndexImage is the index image for custom catalog source
 	CustomCatalogSourceIndexImage string
+	// CreateCustomCatalogsource indicates whether to create a custom catalog source
 	CreateCustomCatalogsource bool
+	// CustomCatalogSource is the name of the custom catalog source
 	CustomCatalogSource string
+	// CatalogSource is the name of the catalog source to use
 	CatalogSource string
+	// CleanupAfterInstall indicates whether to clean up resources after installation
 	CleanupAfterInstall bool
 }
```

pkg/nvidiagpu/consts.go (1)
27-68
: Consolidate and standardize time duration constants.

The time-related constants could be improved by:

- Using consistent units (minutes vs seconds)
- Consolidating similar durations
- Using more descriptive names for timeout constants

Apply this diff to standardize the durations:

```diff
-	SleepDuration        = 30 * time.Second
-	WaitDuration         = 4 * time.Minute
-	DeletionPollInterval = 30 * time.Second
+	// Common intervals
+	DefaultPollInterval    = 30 * time.Second
+	DefaultWaitDuration    = 4 * time.Minute
+	DefaultDeletionTimeout = 5 * time.Minute

-	BurnLogCollectionPeriod           = 500 * time.Second
-	RedeployedBurnLogCollectionPeriod = 500 * time.Second
+	// Log collection periods (standardized to minutes)
+	BurnLogCollectionPeriod           = 8 * time.Minute
+	RedeployedBurnLogCollectionPeriod = 8 * time.Minute
```

tests/nvidianetwork/deploy-nno-test.go (1)
Line range hint
1-817
: Consider splitting the test file into smaller files.

The test file is quite long and handles multiple responsibilities. Consider splitting it into smaller files organized by test functionality:
- NFD deployment tests
- Network operator deployment tests
- Cleanup utilities
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- `pkg/global/const.go` (1 hunks)
- `pkg/nfd/consts.go` (1 hunks)
- `pkg/nfd/nodefeaturediscovery.go` (11 hunks)
- `pkg/nfd/vars.go` (1 hunks)
- `pkg/nvidiagpu/consts.go` (1 hunks)
- `tests/nvidiagpu/deploy-gpu-test.go` (41 hunks)
- `tests/nvidianetwork/deploy-nno-test.go` (17 hunks)
🔇 Additional comments (5)
pkg/global/const.go (1)
1-8
: LGTM! Well-structured constants package.

The constants are well-documented and follow Go naming conventions. The package organization promotes code reuse by centralizing common constants.
pkg/nfd/nodefeaturediscovery.go (1)
Line range hint
33-258
: LGTM! Consistent use of LogLevel constant.

The changes improve maintainability by replacing hardcoded log level values with a centralized constant. This makes it easier to adjust logging verbosity across the entire package.
tests/nvidiagpu/deploy-gpu-test.go (3)
12-15
: LGTM! Well-organized imports.

The new imports are properly organized and use the new packages containing the extracted constants.
Line range hint
41-77
: LGTM! Improved variable declarations using constants.

The changes properly utilize constants from the global package and initialize the Nfd configuration. This improves maintainability by centralizing constant values.
915-925
: Consider using context with timeout for long-running operations.

Long-running operations like waiting for pod status should use context with timeout to ensure proper cleanup and resource management.

```diff
+ctx, cancel := context.WithTimeout(context.Background(), nvidiagpu.BurnPodRunningTimeout)
+defer cancel()

-err = gpuPodPulled.WaitUntilInStatus(corev1.PodRunning, nvidiagpu.BurnPodRunningTimeout)
+err = gpuPodPulled.WaitUntilInStatusWithContext(ctx, corev1.PodRunning)
```

Also applies to: 1149-1158
…urn test * the gpu burn test passed on a single node cluster
Actionable comments posted: 0
🧹 Nitpick comments (2)
tests/nvidiagpu/deploy-gpu-test.go (2)
259-260
: Consider using proper synchronization instead of sleep.

Using `time.Sleep` for synchronization is not ideal. Consider using a polling mechanism with context and timeout.

```diff
-By(fmt.Sprintf("Sleep for %s to allow the NFD custom catalogsource to be created", nvidiagpu.SleepDuration.String()))
-time.Sleep(nvidiagpu.SleepDuration)
+ctx, cancel := context.WithTimeout(context.Background(), nvidiagpu.SleepDuration)
+defer cancel()
+err = wait.PollUntilContextCancel(ctx, time.Second, true, func(ctx context.Context) (bool, error) {
+	// Check if catalogsource exists and is initialized
+	return true, nil
+})
```
661-663
: Replace sleep with proper wait mechanism.

Using `time.Sleep` for deployment synchronization is not reliable. Consider using a proper wait mechanism.

```diff
-By(fmt.Sprintf("Sleep for %s to allow the GPU Operator deployment to be created", nvidiagpu.OperatorDeploymentCreationDelay))
-glog.V(gpuparams.GpuLogLevel).Infof("Sleep for %s to allow the GPU Operator deployment to be created", nvidiagpu.OperatorDeploymentCreationDelay)
-time.Sleep(nvidiagpu.OperatorDeploymentCreationDelay)
+ctx, cancel := context.WithTimeout(context.Background(), nvidiagpu.OperatorDeploymentCreationDelay)
+defer cancel()
+err = wait.PollUntilContextCancel(ctx, time.Second, true, func(ctx context.Context) (bool, error) {
+	// Check if deployment exists
+	return true, nil
+})
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- `pkg/nvidiagpu/consts.go` (1 hunks)
- `pkg/nvidiagpu/gpuBurnConfig.go` (1 hunks)
- `tests/nvidiagpu/deploy-gpu-test.go` (40 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/nvidiagpu/consts.go
🔇 Additional comments (6)
pkg/nvidiagpu/gpuBurnConfig.go (2)
3-10
: Well-structured configuration struct with clear documentation.

The `GPUBurnConfig` struct is well-organized, with clear field names and comprehensive documentation explaining its purpose and domain context.

12-22

: LGTM! Constructor provides sensible defaults.

The `NewDefaultGPUBurnConfig` constructor is well-documented and provides reasonable default values for all configuration fields.

tests/nvidiagpu/deploy-gpu-test.go (4)
10-16
: LGTM! Improved organization of imports.

The imports are well-organized, with new packages for NFD and GPU-related functionality properly imported.

41-80

: Well-structured variable declarations using package constants.

The global variables are now properly using constants from dedicated packages, improving maintainability and consistency.

711-713

: Improve error handling for report writing.

The error from WriteReport is only logged. Consider failing the test or at least adding a warning, since this could indicate issues with reporting test results.

573-578

: LGTM! Well-structured cleanup logic.

The cleanup operations are properly implemented using defer statements and respect the cleanup configuration flags.
Also applies to: 740-744, 842-847
* the gpu burn test passed on a single node cluster
Actionable comments posted: 1
🧹 Nitpick comments (2)
pkg/nfd/config.go (2)
3-3
: Avoid dot import for better code clarity.

Using dot imports (`.`) can lead to naming conflicts and make it harder to track where symbols are coming from. Consider using a regular import instead:

```diff
-import . "github.com/rh-ecosystem-edge/nvidia-ci/pkg/global"
+import "github.com/rh-ecosystem-edge/nvidia-ci/pkg/global"
```

Then update references to `UndefinedValue` to `global.UndefinedValue`.
5-11
:❓ Verification inconclusive
Add documentation for the struct and its fields.
The struct lacks documentation explaining its purpose and the role of each field. Consider adding comments:
```diff
+// CustomConfig holds configuration for NFD custom catalog sources
 type CustomConfig struct {
+	// CustomCatalogSourceIndexImage specifies the index image for the custom catalog source
 	CustomCatalogSourceIndexImage string
+	// CreateCustomCatalogsource determines whether to create a custom catalog source
 	CreateCustomCatalogsource bool
+	// CustomCatalogSource specifies the name of the custom catalog source
 	CustomCatalogSource string
+	// CatalogSource specifies the name of the catalog source
 	CatalogSource string
+	// CleanupAfterInstall determines whether to clean up resources after installation
 	CleanupAfterInstall bool
 }
```

Verify required fields and methods.
Based on the reported compilation issues in `nvidianetwork` tests, there might be missing fields or methods expected by the test suite. Please verify if additional fields or methods are needed.
🏁 Script executed:
```shell
#!/bin/bash
# Search for usage of nfd.Config type to identify expected fields and methods
rg -t go "nfd\.Config\." -A 5
```

Length of output: 31
Document CustomConfig and Verify Test Integration
Please add documentation to the `CustomConfig` struct to clarify its purpose and the role of each field. For example:

```diff
+// CustomConfig holds configuration for NFD custom catalog sources.
 type CustomConfig struct {
+	// CustomCatalogSourceIndexImage specifies the index image for the custom catalog source.
 	CustomCatalogSourceIndexImage string
+	// CreateCustomCatalogsource determines whether to create a custom catalog source.
 	CreateCustomCatalogsource bool
+	// CustomCatalogSource specifies the custom catalog source's name.
 	CustomCatalogSource string
+	// CatalogSource specifies the catalog source's name.
 	CatalogSource string
+	// CleanupAfterInstall indicates whether to clean up resources after installation.
 	CleanupAfterInstall bool
 }
```

In addition, the reported compilation issues in the `nvidianetwork` tests suggest that the test suite might be expecting additional fields or methods. Since our initial automated search for references to `nfd.Config` produced no output, please manually verify whether:
- Additional fields or methods are required on this struct (or its related types) to meet the expectations of the tests.
- The refactoring to consolidate NFD constants has not inadvertently removed or altered necessary functionality.
…ncerns) pkg/nfd: Manages NFD resource creation and verification pkg/nfdcheck: Verifies that NFD is correctly installed and configured * the gpu burn test passed on a single node cluster
added comments Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
…nternal from pkg
Added these funcs: EnsureNFDIsInstalled, DeployNFDOperatorWithRetries
1. gpu - used the new funcs here
2. nno - haven't added them here yet
* the gpu burn test passed on a single-node cluster
* DeleteOLMPods needs to be put in a different package; it's currently in the nfd pkg
Actionable comments posted: 1
🔭 Outside diff range comments (1)
pkg/nfd/deploy-nfd.go (1)
Lines 411-412: ⚠️ Potential issue. Use Infof(...) instead of Info(...) for format strings.

You are passing format specifiers (`"%s"`) and multiple arguments into `glog.Info`, which does not interpret them as format strings. Use `Infof(...)` to properly format the log message.

```diff
-glog.V(gpuparams.GpuLogLevel).Info("Deleting NFD Subscription '%s' in namespace '%s'", nfdSubscriptionName, nfdOperatorNamespace)
+glog.V(gpuparams.GpuLogLevel).Infof("Deleting NFD Subscription '%s' in namespace '%s'", nfdSubscriptionName, nfdOperatorNamespace)
```
🧹 Nitpick comments (8)
pkg/nfd/utils.go (2)

Lines 88-90: Clarify the test skip scenario.
The test calls `Skip(...)` if the `nfd` package manifest does not exist in the default catalog and no custom catalogsource is allowed. Ensure that skipping here is an intentional test expectation rather than leaving the environment partially tested.

Lines 101-105: Remove or refine commented-out code.
The comment referencing "I don't need to pass fd.OperatorDeploymentName…" appears outdated or unclear. Removing or clarifying it would improve maintainability and readability.

pkg/nfd/config.go (1)

Lines 30-58: Optional extension of documentation for the NfdParams struct.
Consider adding explanations or examples of typical usage for each field (e.g., clarifying how the operator channel is selected). This can help future contributors quickly understand how to configure the NFD operator.

pkg/nfd/deploy-nfd.go (1)
Lines 483-485: Replace the fixed 2-minute sleep with a smarter wait approach.
Consider replacing the unconditional `time.Sleep(2 * time.Minute)` with a loop that checks operator status at intervals. This approach can reduce test runtime when the operator is ready sooner and still accommodate slower setups.

tests/nvidiagpu/deploy-gpu-test.go (4)
Lines 48-51: Consider using a constant for the worker node selector map.
Since the `WorkerNodeSelector` map is used for configuration and doesn't change during runtime, consider making it a constant using a `map[string]string` literal.

```diff
-WorkerNodeSelector = map[string]string{
+const WorkerNodeSelector = map[string]string{
```
Lines 219-220: Remove TODO comments.
The TODO comment about passing the Nfd instance by pointer should be addressed or removed.
Would you like me to help implement the pointer-based approach for the Nfd instance?
Lines 327-330: Consider using context for sleep operations.
Instead of using `time.Sleep`, consider using `context.WithTimeout` for better control and testability.

```diff
-if ScaleCluster {
-	glog.V(gpuparams.GpuLogLevel).Infof("Sleeping for %s to allow the newly created GPU worker node to be labeled by NFD", nvidiagpu.NodeLabelingDelay.String())
-	time.Sleep(nvidiagpu.NodeLabelingDelay)
+if ScaleCluster {
+	ctx, cancel := context.WithTimeout(context.Background(), nvidiagpu.NodeLabelingDelay)
+	defer cancel()
+	glog.V(gpuparams.GpuLogLevel).Infof("Waiting for %s to allow the newly created GPU worker node to be labeled by NFD", nvidiagpu.NodeLabelingDelay.String())
+	<-ctx.Done()
```
Lines 792-798: Improve cleanup condition readability.
The cleanup condition could be more readable by extracting the condition into a named variable.

```diff
+shouldCleanup := cleanupAfterTest && OperatorUpgradeToChannel == UndefinedValue
 defer func() {
-	if cleanupAfterTest && OperatorUpgradeToChannel == UndefinedValue {
+	if shouldCleanup {
 		_, err := gpuPodPulled.Delete()
 		Expect(err).ToNot(HaveOccurred())
 	}
 }()
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- internal/nvidiagpuconfig/config.go (1 hunks)
- internal/nvidianetworkconfig/config.go (1 hunks)
- pkg/nfd/config.go (1 hunks)
- pkg/nfd/deploy-nfd.go (7 hunks)
- pkg/nfd/utils.go (1 hunks)
- tests/nvidiagpu/deploy-gpu-test.go (28 hunks)
- tests/nvidianetwork/deploy-nno-test.go (20 hunks)
✅ Files skipped from review due to trivial changes (2)
- internal/nvidiagpuconfig/config.go
- internal/nvidianetworkconfig/config.go
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/nvidianetwork/deploy-nno-test.go
🔇 Additional comments (4)
pkg/nfd/utils.go (1)

Lines 29-35: Consider handling errors distinctly from the "not installed" scenario.
Currently, an error from `check.NFDDeploymentsReady(apiClient)` is treated the same as "NFD not installed." If the error is transient (e.g., a network issue), you might inadvertently redeploy NFD. Consider differentiating between a genuine "not installed" state and an unexpected error.

pkg/nfd/config.go (1)
Lines 17-19: Documentation is well-structured.
The docstring for `NewCustomConfig` clearly explains default values for each field, which is helpful for maintainability. Good job!

tests/nvidiagpu/deploy-gpu-test.go (2)
Lines 43-44: LGTM! Good use of constructor functions.
The initialization of `Nfd` and `burn` using constructor functions improves code organization and maintainability.
Lines 10-17: ❓ Verification inconclusive. Verify the necessity of the blank import.
The blank import of the clients package (`_ "github.com/rh-ecosystem-edge/nvidia-ci/pkg/clients"`) suggests side effects are being used. This is generally not recommended unless absolutely necessary.

🏁 Script executed:

```shell
#!/bin/bash
# Check if the clients package has init() functions or other side effects
rg -A 5 "func init\(\)" pkg/clients/
```

Length of output: 38
Blank Import Necessity Verification

Based on the search for `init()` functions in the `pkg/clients` package, no evidence of explicit side-effect code (e.g., an `init()` function) was detected. This suggests that the blank import in `tests/nvidiagpu/deploy-gpu-test.go`:

```go
_ "github.com/rh-ecosystem-edge/nvidia-ci/pkg/clients"
```

may not be triggering any initialization side effects. If the blank import was originally intended to cause side effects via implicit registration or variable initialization, please confirm that such behavior is indeed occurring. Otherwise, consider removing this import for clarity and maintainability, or add a comment clarifying its necessity.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/nvidianetwork/deploy-nno-test.go (1)

Line 252: Consider handling potential errors from the NFD installation check.
The `CheckNfdInstallation` function call doesn't handle potential errors. Consider capturing and asserting the error to ensure proper test behavior.

```diff
-nfdcheck.CheckNfdInstallation(inittools.APIClient, nfd.RhcosLabel, nfd.RhcosLabelValue, inittools.GeneralConfig.WorkerLabelMap, networkparams.LogLevel)
+err := nfdcheck.CheckNfdInstallation(inittools.APIClient, nfd.RhcosLabel, nfd.RhcosLabelValue, inittools.GeneralConfig.WorkerLabelMap, networkparams.LogLevel)
+Expect(err).ToNot(HaveOccurred(), "error checking NFD installation: %v", err)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tests/nvidianetwork/deploy-nno-test.go (17 hunks)
🔇 Additional comments (3)
tests/nvidianetwork/deploy-nno-test.go (3)

Lines 7-8: LGTM! Import changes align with package restructuring.
The new imports correctly reflect the reorganization of NFD-related code into dedicated packages.
Also applies to: 13-15

Line 32: LGTM! NFD configuration centralization.
Centralizing NFD configuration through a dedicated struct improves code organization and maintainability.

Line 214: LGTM! Consistent use of the NFD package for setup and cleanup.
The test setup and cleanup now consistently use the NFD package functions, improving code organization.
Also applies to: 228-246
/lgtm. Tested NNO testcase successfully from PR branch
```diff
@@ -16,7 +16,7 @@ type NvidiaGPUConfig struct {
 	BundleImage                        string `envconfig:"NVIDIAGPU_BUNDLE_IMAGE"`
 	OperatorUpgradeToChannel           string `envconfig:"NVIDIAGPU_SUBSCRIPTION_UPGRADE_TO_CHANNEL"`
 	GPUFallbackCatalogsourceIndexImage string `envconfig:"NVIDIAGPU_GPU_FALLBACK_CATALOGSOURCE_INDEX_IMAGE"`
-	NFDFallbackCatalogsourceIndexImage string `envconfig:"NVIDIAGPU_NFD_FALLBACK_CATALOGSOURCE_INDEX_IMAGE"`
+	NFDFallbackCatalogsourceIndexImage string `envconfig:"NVIDIAGPU_NFD_FALLBACK_CATALOGSOURCE_INDEX_IMAGE"` // should this be here?
```
No, it shouldn't. It belongs to the NFD package and shouldn't be specific to GPU; it must cover any scenario that involves NFD without ambiguities: NFD + GPU, NFD + NNO, NFD + GPU + NNO. Which means we'll have a single `NFD_FALLBACK_CATALOGSOURCE_INDEX_IMAGE` instead of `NVIDIAGPU_NFD_FALLBACK_CATALOGSOURCE_INDEX_IMAGE` and `NVIDIANETWORK_NFD_FALLBACK_CATALOGSOURCE_INDEX_IMAGE`.

For backward compatibility, we may have a translation `NVIDIAGPU_NFD_FALLBACK_CATALOGSOURCE_INDEX_IMAGE` -> `NFD_FALLBACK_CATALOGSOURCE_INDEX_IMAGE` (and a similar one for NNO) until the CI jobs are fixed.
```go
type CustomConfig struct {
	CustomCatalogSourceIndexImage string
	CreateCustomCatalogsource     bool

	CustomCatalogSource string
	CatalogSource       string
	CleanupAfterInstall bool
}

// NewCustomConfig creates a new CustomConfig instance with default settings.
// All string fields are initialized to UndefinedValue and boolean fields to false.
func NewCustomConfig() *CustomConfig {
	return &CustomConfig{
		CustomCatalogSourceIndexImage: UndefinedValue,
		CreateCustomCatalogsource:     false,

		CustomCatalogSource: UndefinedValue,
		CatalogSource:       UndefinedValue,
		CleanupAfterInstall: false,
	}
}
```
It looks like this can be shared by all operators that might need a custom catalog source, not limited to NFD.
```go
// NfdParams holds all the configuration details required to install or manage
// the Node Feature Discovery (NFD) operator on a cluster.
type NfdParams struct {
```
AFAIK this has to be `NFDParams` according to Go conventions.
```diff
 if Nfd.CleanupAfterInstall && cleanupAfterTest {
 	// Here need to check if NFD CR is deployed, otherwise Deleting a non-existing CR will throw an error
 	// skipping error check for now cause any failure before entire NFD stack
 	By("Delete NFD CR instance in NFD namespace")
-	_ = deploy.NFDCRDeleteAndWait(inittools.APIClient, nfdCRName, nfdOperatorNamespace, 30*time.Second, 5*time.Minute)
+	_ = nfd.NFDCRDeleteAndWait(inittools.APIClient, nfd.CRName, nfd.OperatorNamespace, nvidiagpu.DeletionPollInterval, nvidiagpu.DeletionTimeoutDuration)

 	By("Delete NFD CSV")
-	_ = deploy.DeleteNFDCSV(inittools.APIClient)
+	_ = nfd.DeleteNFDCSV(inittools.APIClient)

 	By("Delete NFD Subscription in NFD namespace")
-	_ = deploy.DeleteNFDSubscription(inittools.APIClient)
+	_ = nfd.DeleteNFDSubscription(inittools.APIClient)

 	By("Delete NFD OperatorGroup in NFD namespace")
-	_ = deploy.DeleteNFDOperatorGroup(inittools.APIClient)
+	_ = nfd.DeleteNFDOperatorGroup(inittools.APIClient)

 	By("Delete NFD Namespace in NFD namespace")
-	_ = deploy.DeleteNFDNamespace(inittools.APIClient)
+	_ = nfd.DeleteNFDNamespace(inittools.APIClient)
 }
```
All this functionality should be in the NFD package; GPU/NNO tests shouldn't care what a cleanup means for NFD and will just "ask" it to clean up itself.
/lgtm with comments
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AdamKaabyia, empovit

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
@AdamKaabyia do we want to merge it into main?
no, for now only in package-refactoring
/ok-to-test
/lgtm
7d2847d into rh-ecosystem-edge:package-refactoring
extracted common consts and nfd consts to the nfd package, adjusted imports
Summary by CodeRabbit
New Features
Refactor
Chores