
Nfd refractoring #55

Conversation

AdamKaabyia
Contributor

@AdamKaabyia AdamKaabyia commented Jan 24, 2025

extracted common consts and nfd consts to the nfd package, adjusted imports

Summary by CodeRabbit

  • New Features

    • Introduced a new package for Node Feature Discovery (NFD) with constants for configuration and deployment management.
    • Added functionality for GPU burn tests with a dedicated configuration struct.
    • Implemented automated checks and installation routines for NFD operator deployment.
  • Refactor

    • Streamlined configuration management for NFD and GPU functionalities.
    • Enhanced logging practices for better clarity and consistency during deployment and testing.
  • Chores

    • Upgraded external dependencies for logging and YAML processing.
    • Reorganized code structure to unify constants and configuration settings across components.

Guy Gordani and others added 6 commits December 25, 2024 09:06
Contributor

coderabbitai bot commented Jan 24, 2025

Walkthrough

The changes introduce new packages and constants for Node Feature Discovery (NFD) and NVIDIA GPU functionality. Multiple configuration structs and functions have been added to streamline deployments and testing of NFD operators, including managing PCI device configurations and custom catalog sources. Test files are refactored to utilize these new constants and configurations. Additionally, dependency versions are updated in the module files while vendor libraries (glog) receive logging and platform-specific enhancements.

Changes

Files | Change Summary
pkg/nfd/{consts.go, nodefeaturediscovery.go, config.go, deploy-nfd.go, utils.go, nodefeaturediscoveryconfig.go} | Introduced new constants, configuration structs (e.g., CustomConfig, NfdParams), and functions (e.g., UpdatePciDevices, CreateNFDDeployment, EnsureNFDIsInstalled) to manage NFD operator installations, logging, and PCI device configurations.
pkg/nvidiagpu/{consts.go, gpuBurnConfig.go} | Added a new package with constants for NVIDIA GPU operator configuration and a GPUBurnConfig struct with a helper constructor for GPU burn tests.
pkg/global/const.go | Added common constants (UndefinedValue, OperatorVersionFile, OpenShiftVersionFile) shared between GPU and NFD components.
tests/{nvidiagpu/deploy-gpu-test.go, nvidianetwork/deploy-nno-test.go} | Refactored test configurations to replace hardcoded strings with new constants and centralized structures; removed redundant function calls.
go.mod | Updated dependency versions (e.g., github.com/golang/glog from v1.2.0 to v1.2.4) and adjusted YAML dependency management.
vendor/github.com/golang/glog/{glog.go, glog_file.go, glog_file_nonwindows.go, glog_file_windows.go} | Modified logging functionality: improved file creation (with directory support), optimized log flushing, and enhanced user lookup for cross-platform support.
vendor/modules.txt | Reflects updated module versions, notably for glog.
pkg/nfdcheck/checknfd.go | Added a new function to verify NFD installation and deployment readiness.
internal/{nvidiagpuconfig/config.go, nvidianetworkconfig/config.go} | Minor comment updates without functional changes.

Suggested labels

approved, lgtm

Suggested reviewers

  • wabouhamad
  • empovit
  • TomerNewman

Poem

I’m a little rabbit with a code-filled heart,
Hopping through constants and functions—each a work of art.
NFD and GPU configs now dance in delight,
Logs and deployments shining ever so bright.
With every new line, my whiskers twitch in glee,
Celebrating these changes like carrots in a spree!
🐰 Happy coding, one hop at a time!



openshift-ci bot commented Jan 24, 2025

Hi @AdamKaabyia. Thanks for your PR.

I'm waiting for a rh-ecosystem-edge member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@AdamKaabyia
Contributor Author

@coderabbitai review

Contributor

coderabbitai bot commented Jan 24, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (7)
pkg/nfd/consts.go (2)

3-19: Consolidate related constants into cohesive sections.

All the constants are well-defined but could benefit from grouping. For example, place all NFD-related constants together, and separate the “common” constants such as UndefinedValue, OperatorVersionFile, and OpenShiftVersionFile into a distinct block for clarity.

 const (
+	// NFD constants
 	NfdCustomNFDCatalogSourcePublisherName = "Red Hat"
 	NfdCustomCatalogSourceDisplayName      = "Redhat Operators Custom"
 	NfdRhcosLabel                          = "feature.node.kubernetes.io/system-os_release.ID"
 	NfdRhcosLabelValue                     = "rhcos"
 	NfdOperatorNamespace                   = "openshift-nfd"
 	NfdCatalogSourceDefault                = "redhat-operators"
 	NfdCatalogSourceNamespace              = "openshift-marketplace"
 	NfdOperatorDeploymentName              = "nfd-controller-manager"
 	NfdPackage                             = "nfd"
 	NfdCRName                              = "nfd-instance"

-	//not related to NFD but common consts between gpu and nno
+	// Common constants
 	UndefinedValue       = "undefined"
 	OperatorVersionFile  = "operator.version"
 	OpenShiftVersionFile = "ocp.version"
 )

3-19: Document the purpose of the constants.

Although these constants are self-explanatory, adding short Go doc comments for key constants can help downstream maintainers by clarifying where and how each constant should be used.
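A minimal sketch of such doc comments, reusing two constant names from the diff above (the comment wording is illustrative):

// NfdOperatorNamespace is the namespace the NFD operator is installed into.
NfdOperatorNamespace = "openshift-nfd"

// NfdCRName is the name of the NodeFeatureDiscovery custom resource the tests create.
NfdCRName = "nfd-instance"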

tests/nvidianetwork/deploy-nno-test.go (3)

62-62: Consider clarifying the custom-catalog-source creation logic.

The boolean createNNOCustomCatalogsource can be more descriptive, for instance enableCustomNnoCatalogSource, to clearly convey that the catalog creation is conditional.

- createNNOCustomCatalogsource bool = false
+ enableCustomNnoCatalogSource bool = false

66-70: Avoid repeating the “NFD vars” comment multiple times.

While this comment clarifies context, the logic here is mostly consistent with the GPU flow that also deals with NFD. Consider either removing or renaming the comment for better clarity.


787-787: Check concurrency for parallel tests.

Because the code introduces a timed wait for the NFD operator deployment, ensure that multiple test suites running in parallel do not interfere (e.g., by creating or deleting the same operator). If parallel runs are intended, incorporate additional namespaces or concurrency guards where necessary.
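One way to add such a guard, sketched with Ginkgo v2's Serial decorator (the container and spec names are placeholders, not from this repository):

// Serial keeps this container out of any parallel schedule, so two suites
// cannot create or delete the shared NFD operator in openshift-nfd at once.
var _ = Describe("NFD operator deployment", Serial, func() {
	It("waits for the nfd-controller-manager deployment", func() {
		// ... existing timed wait on the operator deployment ...
	})
})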

tests/nvidiagpu/deploy-gpu-test.go (2)

62-65: Refine variable names for readability.

Variables like createGPUCustomCatalogsource and createNFDCustomCatalogsource can be slightly ambiguous. Prefer prefixes like enable... to make the usage more explicit (e.g., enableGpuCustomCatalogSource).

Also applies to: 78-83


1221-1221: Extract repeated NFD deployment logic into a shared helper.

Both this GPU test and the NNO test use a similar createNFDDeployment approach. Factoring out repeated steps into a single shared utility can reduce duplication and future maintenance overhead.
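A possible shape for that shared utility, kept self-contained by taking the repository's deploy and wait steps as function values (all names here are illustrative):

// ensureNFDDeployed wraps the subscription-create and deployment-wait steps
// that the GPU and NNO tests currently duplicate.
func ensureNFDDeployed(
	createSubscription func(catalogSource string) error,
	deploymentCreated func(name, namespace string, timeout time.Duration) bool,
	catalogSource string,
) error {
	if err := createSubscription(catalogSource); err != nil {
		return fmt.Errorf("creating NFD subscription: %w", err)
	}
	if !deploymentCreated("nfd-controller-manager", "openshift-nfd", 5*time.Minute) {
		return fmt.Errorf("timed out waiting for the NFD operator deployment")
	}
	return nil
}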

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6350bc8 and 73d13c8.

📒 Files selected for processing (3)
  • pkg/nfd/consts.go (1 hunks)
  • tests/nvidiagpu/deploy-gpu-test.go (18 hunks)
  • tests/nvidianetwork/deploy-nno-test.go (16 hunks)
🔇 Additional comments (4)
tests/nvidianetwork/deploy-nno-test.go (2)

49-54: Approved usage of UndefinedValue for optional config placeholders.

Using UndefinedValue instead of hard-coded "undefined" is consistent with the newly introduced constants. This improves maintainability and standardizes how undefined states are handled.

Also applies to: 58-64, 66-70


584-584: Validate fallback for missing subscription channel.

The logic routes to nnoDefaultSubscriptionChannel when nnoSubscriptionChannel is UndefinedValue. Confirm that external dependencies handle “fallback channels” properly if the real channel is not provided.

✅ Verification successful

Fallback channel handling is properly implemented

The fallback mechanism is robust and safe. When nnoSubscriptionChannel is UndefinedValue, it correctly falls back to nnoDefaultSubscriptionChannel which is dynamically retrieved from the package manifest's Status.DefaultChannel. This ensures the fallback channel is always valid and officially supported.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Searching usage of nnoSubscriptionChannel and nnoDefaultSubscriptionChannel
# to verify fallback logic correctness.
rg -A 5 'nnoSubscriptionChannel | nnoDefaultSubscriptionChannel'

Length of output: 3148


Script:

#!/bin/bash
# Search for nnoDefaultSubscriptionChannel assignments and related config
rg -B 5 -A 5 'nnoDefaultSubscriptionChannel\s*=' 

# Look for subscription builder implementation
ast-grep --pattern 'type $_Builder struct {
  $$$
}'

# Search for any config files with channel information
fd -e yaml -e json . | xargs rg -l 'channel|subscription'

Length of output: 3588

tests/nvidiagpu/deploy-gpu-test.go (2)

60-83: Consistent adoption of UndefinedValue.

Replacing the string "undefined" with UndefinedValue standardizes the approach, making the code clearer to read and less prone to typos or mismatches. This is a well-aligned improvement.


144-144: Smooth handling of upgrade scenarios.

Using UndefinedValue to skip the GPU operator upgrade path is a neat approach. Ensure that any future logic that checks these variables doesn’t inadvertently compare them to empty strings or other sentinel values.

Also applies to: 183-183, 936-936, 975-975

@AdamKaabyia
Contributor Author

@coderabbitai review

Contributor

coderabbitai bot commented Jan 25, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
pkg/nvidiagpu/consts.go (1)

1-69: LGTM! Well-structured constants with clear naming.

The constants are logically grouped and use appropriate types. The package provides a comprehensive set of configuration values for the NVIDIA GPU operator.

Consider adding documentation comments for the time duration constants to explain their purpose and why specific durations were chosen. For example:

// MachineReadyWaitDuration is the maximum time to wait for a machine to be ready
// This is set to 15 minutes to account for potential delays in machine provisioning
// and initialization
MachineReadyWaitDuration = 15 * time.Minute
tests/nvidiagpu/deploy-gpu-test.go (3)

58-76: Consider using structured configuration.

The large number of global variables could be encapsulated in a configuration struct for better organization and maintainability.

Consider refactoring the variables into a structured configuration:

type GPUTestConfig struct {
    ScaleCluster bool
    CatalogSource string
    CustomCatalogSource string
    CreateCustomCatalogsource bool
    CustomCatalogsourceIndexImage string
    SubscriptionChannel string
    DefaultSubscriptionChannel string
    OperatorUpgradeToChannel string
    CleanupAfterTest bool
    DeployFromBundle bool
    OperatorBundleImage string
    CurrentCSV string
    CurrentCSVVersion string
    ClusterArchitecture string
}

914-933: Consider adding retry mechanism for pod operations.

The pod operations could benefit from a retry mechanism to handle transient failures.

Consider implementing a retry mechanism for pod operations:

func waitForPodWithRetry(pod *pod.Builder, status corev1.PodPhase, timeout time.Duration, maxRetries int) error {
    for i := 0; i < maxRetries; i++ {
        err := pod.WaitUntilInStatus(status, timeout)
        if err == nil {
            return nil
        }
        glog.V(gpuparams.GpuLogLevel).Infof("Retry %d/%d: Failed to wait for pod status: %v", i+1, maxRetries, err)
        time.Sleep(time.Second * 5)
    }
    return fmt.Errorf("failed to wait for pod status after %d retries", maxRetries)
}

1148-1157: Add error details to timeout messages.

The timeout error messages could be more descriptive to aid in debugging.

Enhance the error messages with more context:

-Expect(err).ToNot(HaveOccurred(), "timeout waiting for re-deployed gpu-burn pod in "+
-    "namespace '%s' to go to Running phase:  %v ", nvidiagpu.BurnNamespace, err)
+Expect(err).ToNot(HaveOccurred(), "timeout waiting for re-deployed gpu-burn pod in "+
+    "namespace '%s' to go to Running phase after %s:  %v ", 
+    nvidiagpu.BurnNamespace, 
+    nvidiagpu.RedeployedBurnPodRunningTimeout,
+    err)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 73d13c8 and 6f851da.

📒 Files selected for processing (6)
  • pkg/nfd/consts.go (1 hunks)
  • pkg/nfd/nodefeaturediscovery.go (11 hunks)
  • pkg/nfd/vars.go (1 hunks)
  • pkg/nvidiagpu/consts.go (1 hunks)
  • tests/nvidiagpu/deploy-gpu-test.go (41 hunks)
  • tests/nvidianetwork/deploy-nno-test.go (16 hunks)
✅ Files skipped from review due to trivial changes (2)
  • pkg/nfd/vars.go
  • pkg/nfd/nodefeaturediscovery.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/nfd/consts.go
  • tests/nvidianetwork/deploy-nno-test.go
🔇 Additional comments (3)
tests/nvidiagpu/deploy-gpu-test.go (3)

44-45: LGTM! Improved worker node selector with constants.

The worker node selector now uses constants from the nvidiagpu package, improving maintainability.


187-201: LGTM! Improved NFD catalog source handling.

The code now properly handles NFD catalog source configuration using constants.


666-678: LGTM! Better deployment timing management.

The code now uses well-defined constants for deployment timing, making it easier to adjust timeouts.

2. fixed nfd consts names
3. fixed nfd var bug - create instance per package, not global nfd
@AdamKaabyia
Contributor Author

@coderabbitai review

Contributor

coderabbitai bot commented Jan 25, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@AdamKaabyia
Contributor Author

@coderabbitai review

Contributor

coderabbitai bot commented Jan 25, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🔭 Outside diff range comments (2)
tests/nvidiagpu/deploy-gpu-test.go (2)

Line range hint 1186-1204: Extract NFD deployment configuration into a separate function.

The NFD deployment logic contains hardcoded values for sleep and timeout durations. Consider extracting these into configuration or constants.

+const (
+    nfdOperatorDeploymentWaitTime = 2 * time.Minute
+    nfdOperatorDeploymentTimeout = 5 * time.Minute
+    nfdOperatorDeploymentCheckInterval = 30 * time.Second
+)

 func createNFDDeployment() bool {
     By("Deploy NFD Subscription in NFD namespace")
     err := deploy.CreateNFDSubscription(inittools.APIClient, Nfd.CatalogSource)
     Expect(err).ToNot(HaveOccurred(), "error creating NFD Subscription:  %v", err)

-    By("Sleep for 2 minutes to allow the NFD Operator deployment to be created")
+    By(fmt.Sprintf("Sleep for %v to allow the NFD Operator deployment to be created", nfdOperatorDeploymentWaitTime))
     glog.V(gpuparams.GpuLogLevel).Infof("Sleep for 2 minutes to allow the NFD Operator deployment" +
         " to be created")
-    time.Sleep(2 * time.Minute)
+    time.Sleep(nfdOperatorDeploymentWaitTime)

Line range hint 1209-1234: Add error handling for pod deletion operations.

The deleteOLMPods function should handle potential errors more gracefully and provide better error messages.

 func deleteOLMPods(apiClient *clients.Settings) error {
+    const olmNamespace = "openshift-operator-lifecycle-manager"
+    
+    deleteOpts := metav1.DeleteOptions{}
+    listOpts := metav1.ListOptions{}
 
-    olmNamespace := "openshift-operator-lifecycle-manager"
     glog.V(gpuparams.GpuLogLevel).Info("Deleting catalog operator pods")
-    if err := apiClient.Pods(olmNamespace).DeleteCollection(context.TODO(),
-        metav1.DeleteOptions{},
-        metav1.ListOptions{LabelSelector: "app=catalog-operator"}); err != nil {
+    listOpts.LabelSelector = "app=catalog-operator"
+    if err := apiClient.Pods(olmNamespace).DeleteCollection(context.TODO(), deleteOpts, listOpts); err != nil {
+        return fmt.Errorf("failed to delete catalog operator pods: %w", err)
+    }
 
     glog.V(gpuparams.GpuLogLevel).Info("Deleting OLM operator pods")
-    if err := apiClient.Pods(olmNamespace).DeleteCollection(
-        context.TODO(),
-        metav1.DeleteOptions{},
-        metav1.ListOptions{LabelSelector: "app=olm-operator"}); err != nil {
+    listOpts.LabelSelector = "app=olm-operator"
+    if err := apiClient.Pods(olmNamespace).DeleteCollection(context.TODO(), deleteOpts, listOpts); err != nil {
+        return fmt.Errorf("failed to delete OLM operator pods: %w", err)
+    }
 
     return nil
 }
🧹 Nitpick comments (4)
pkg/nfd/vars.go (1)

5-11: Add documentation for the Config struct and its fields.

The Config struct and its fields would benefit from documentation comments explaining their purpose and expected values. This helps other developers understand how to use this configuration correctly.

+// Config holds the configuration settings for Node Feature Discovery (NFD)
 type Config struct {
+    // CustomCatalogSourceIndexImage specifies the index image for a custom catalog source
     CustomCatalogSourceIndexImage string
+    // CreateCustomCatalogsource indicates whether to create a custom catalog source
     CreateCustomCatalogsource     bool
+    // CustomCatalogSource specifies the name of the custom catalog source
     CustomCatalogSource           string
+    // CatalogSource specifies the name of the catalog source to use
     CatalogSource                 string
+    // CleanupAfterInstall indicates whether to clean up resources after installation
     CleanupAfterInstall           bool
 }
tests/nvidiagpu/deploy-gpu-test.go (3)

Line range hint 41-77: Consider grouping related variables and adding documentation.

The variable declarations could be better organized by grouping related variables together and adding documentation for each group. This improves code readability and maintainability.

+// NFD configuration
 var (
     Nfd = nfd.NewConfig()
+)

+// GPU operator configuration
 var (
     gpuInstallPlanApproval v1alpha1.Approval = "Automatic"
     gpuWorkerNodeSelector = map[string]string{
         inittools.GeneralConfig.WorkerLabel: "",
         nvidiagpu.NvidiaGPULabel:            "true",
     }
     // ... other GPU-related variables
 )

+// Machine configuration
 var (
     machineSetNamespace = "openshift-machine-api"
     replicas       int32 = 1
     workerMachineSetLabel = "machine.openshift.io/cluster-api-machine-role"
 )

667-679: Extract deployment wait logic into a helper function.

The deployment wait logic is repeated in multiple places. Consider extracting it into a helper function to reduce code duplication.

+func waitForDeploymentCreation(apiClient *clients.Settings, deploymentName, namespace string) error {
+    glog.V(gpuparams.GpuLogLevel).Infof("Wait for up to %s for %s deployment to be created", 
+        nvidiagpu.DeploymentCreationTimeout, deploymentName)
+    created := wait.DeploymentCreated(
+        apiClient,
+        deploymentName,
+        namespace,
+        nvidiagpu.DeploymentCreationCheckInterval,
+        nvidiagpu.DeploymentCreationTimeout)
+    if !created {
+        return fmt.Errorf("timed out waiting to deploy %s", deploymentName)
+    }
+    return nil
+}

915-925: Improve error messages with more context.

The error messages could be more descriptive by including relevant context such as timeout durations and expected states.

-Expect(err).ToNot(HaveOccurred(), "timeout waiting for gpu-burn pod in "+
-    "namespace '%s' to go to Running phase:  %v ", nvidiagpu.BurnNamespace, err)
+Expect(err).ToNot(HaveOccurred(), "timeout after %s waiting for gpu-burn pod in "+
+    "namespace '%s' to transition from %s to Running phase: %v", 
+    nvidiagpu.BurnPodRunningTimeout,
+    nvidiagpu.BurnNamespace,
+    gpuPodPulled.Definition.Status.Phase,
+    err)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6f851da and b7aa1ea.

📒 Files selected for processing (5)
  • pkg/global/const.go (1 hunks)
  • pkg/nfd/consts.go (1 hunks)
  • pkg/nfd/vars.go (1 hunks)
  • tests/nvidiagpu/deploy-gpu-test.go (41 hunks)
  • tests/nvidianetwork/deploy-nno-test.go (17 hunks)
✅ Files skipped from review due to trivial changes (1)
  • pkg/global/const.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/nfd/consts.go
  • tests/nvidianetwork/deploy-nno-test.go
🔇 Additional comments (1)
pkg/nfd/vars.go (1)

13-21: LGTM! Constructor implementation looks good.

The NewConfig function properly initializes all fields with appropriate default values, using the UndefinedValue constant for string fields and false for boolean fields.

@AdamKaabyia
Copy link
Contributor Author

@coderabbitai full review

Contributor

coderabbitai bot commented Jan 25, 2025

✅ Actions performed

Full review triggered.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🔭 Outside diff range comments (1)
tests/nvidianetwork/deploy-nno-test.go (1)

Line range hint 344-363: Improve error handling in cleanup functions.

The cleanup code ignores errors which could mask issues during teardown.

Apply this diff to handle errors:

 if Nfd.CleanupAfterInstall && cleanupAfterTest {
-    _ = deploy.NFDCRDeleteAndWait(inittools.APIClient, nfd.CRName, nfd.OperatorNamespace, 30*time.Second,
-        5*time.Minute)
+    if err := deploy.NFDCRDeleteAndWait(inittools.APIClient, nfd.CRName, nfd.OperatorNamespace, 30*time.Second,
+        5*time.Minute); err != nil {
+        glog.Errorf("Failed to delete NFD CR: %v", err)
+    }

     By("Delete NFD CSV")
-    _ = deploy.DeleteNFDCSV(inittools.APIClient)
+    if err := deploy.DeleteNFDCSV(inittools.APIClient); err != nil {
+        glog.Errorf("Failed to delete NFD CSV: %v", err)
+    }
♻️ Duplicate comments (1)
tests/nvidiagpu/deploy-gpu-test.go (1)

256-261: 🛠️ Refactor suggestion

Use constants for sleep and wait durations.

The code uses magic numbers for sleep and wait durations. Consider defining these as named constants in the nvidiagpu package for better maintainability.

Also applies to: 667-679, 791-794, 915-925, 1149-1158
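A hedged sketch of what those named durations could look like in the nvidiagpu package (the constant names are illustrative, not prescriptive):

const (
	// CatalogSourceReadyDelay replaces the bare 30-second sleep after creating a catalog source.
	CatalogSourceReadyDelay = 30 * time.Second
	// OperatorDeploymentCreationDelay replaces the bare 2-minute sleep after creating the subscription.
	OperatorDeploymentCreationDelay = 2 * time.Minute
	// BurnPodRunningTimeout bounds the wait for the gpu-burn pod to reach the Running phase.
	BurnPodRunningTimeout = 8 * time.Minute
)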

🧹 Nitpick comments (5)
pkg/nfd/consts.go (2)

1-17: Add package and constant documentation.

The constants would benefit from:

  1. Package documentation explaining the purpose and usage of these NFD-related constants
  2. Grouping related constants with comments (e.g., catalog source related, label related)
  3. Documentation for non-obvious constants like LogLevel

Apply this diff to improve documentation:

+// Package nfd provides constants and configuration for Node Feature Discovery (NFD)
 package nfd

 const (
+    // Catalog Source related constants
     CustomNFDCatalogSourcePublisherName = "Red Hat"
     CustomCatalogSourceDisplayName      = "Redhat Operators Custom"
+
+    // Node labeling related constants
     RhcosLabel                          = "feature.node.kubernetes.io/system-os_release.ID"
     RhcosLabelValue                     = "rhcos"
+
+    // Namespace and deployment related constants
     OperatorNamespace                   = "openshift-nfd"
     CatalogSourceDefault                = "redhat-operators"
     CatalogSourceNamespace              = "openshift-marketplace"
     OperatorDeploymentName              = "nfd-controller-manager"
     Package                             = "nfd"
     CRName                              = "nfd-instance"

+    // Resource related constants
     resourceCRD = "NodeFeatureDiscovery"
+    // LogLevel defines the verbosity level for logging (100 enables debug logs)
     LogLevel    = 100
 )

4-5: Consider consolidating duplicate publisher constants.

The publisher name "Red Hat" is duplicated in the nvidiagpu package. Consider moving common constants to a shared package.

Create a new shared package (e.g., pkg/shared/constants.go) for constants used across multiple packages.

Also applies to: 23-25
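A sketch of that shared package, assuming pkg/shared/constants.go as the location (the reviewer's example path, not an existing file):

package shared

// RedHatPublisherName is the catalog source publisher shared by the NFD and
// NVIDIA GPU packages, replacing the duplicated "Red Hat" literals.
const RedHatPublisherName = "Red Hat"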

pkg/nfd/vars.go (1)

5-11: Add struct and field documentation.

The Config struct and its fields would benefit from documentation explaining their purpose and valid values.

Apply this diff to add documentation:

+// Config holds the configuration for Node Feature Discovery (NFD)
 type Config struct {
+    // CustomCatalogSourceIndexImage is the index image for custom catalog source
     CustomCatalogSourceIndexImage string
+    // CreateCustomCatalogsource indicates whether to create a custom catalog source
     CreateCustomCatalogsource     bool
+    // CustomCatalogSource is the name of the custom catalog source
     CustomCatalogSource           string
+    // CatalogSource is the name of the catalog source to use
     CatalogSource                 string
+    // CleanupAfterInstall indicates whether to clean up resources after installation
     CleanupAfterInstall           bool
 }
pkg/nvidiagpu/consts.go (1)

27-68: Consolidate and standardize time duration constants.

The time-related constants could be improved by:

  1. Using consistent units (minutes vs seconds)
  2. Consolidating similar durations
  3. Using more descriptive names for timeout constants

Apply this diff to standardize the durations:

-    SleepDuration = 30 * time.Second
-    WaitDuration = 4 * time.Minute
-    DeletionPollInterval = 30 * time.Second
+    // Common intervals
+    DefaultPollInterval = 30 * time.Second
+    DefaultWaitDuration = 4 * time.Minute
+    DefaultDeletionTimeout = 5 * time.Minute

-    BurnLogCollectionPeriod = 500 * time.Second
-    RedeployedBurnLogCollectionPeriod = 500 * time.Second
+    // Log collection periods (standardized to minutes)
+    BurnLogCollectionPeriod = 8 * time.Minute
+    RedeployedBurnLogCollectionPeriod = 8 * time.Minute
tests/nvidianetwork/deploy-nno-test.go (1)

Line range hint 1-817: Consider splitting the test file into smaller files.

The test file is quite long and handles multiple responsibilities. Consider splitting it into smaller files organized by test functionality:

  • NFD deployment tests
  • Network operator deployment tests
  • Cleanup utilities
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6350bc8 and b7aa1ea.

📒 Files selected for processing (7)
  • pkg/global/const.go (1 hunks)
  • pkg/nfd/consts.go (1 hunks)
  • pkg/nfd/nodefeaturediscovery.go (11 hunks)
  • pkg/nfd/vars.go (1 hunks)
  • pkg/nvidiagpu/consts.go (1 hunks)
  • tests/nvidiagpu/deploy-gpu-test.go (41 hunks)
  • tests/nvidianetwork/deploy-nno-test.go (17 hunks)
🔇 Additional comments (5)
pkg/global/const.go (1)

1-8: LGTM! Well-structured constants package.

The constants are well-documented and follow Go naming conventions. The package organization promotes code reuse by centralizing common constants.

pkg/nfd/nodefeaturediscovery.go (1)

Line range hint 33-258: LGTM! Consistent use of LogLevel constant.

The changes improve maintainability by replacing hardcoded log level values with a centralized constant. This makes it easier to adjust logging verbosity across the entire package.

tests/nvidiagpu/deploy-gpu-test.go (3)

12-15: LGTM! Well-organized imports.

The new imports are properly organized and use the new packages containing the extracted constants.


Line range hint 41-77: LGTM! Improved variable declarations using constants.

The changes properly utilize constants from the global package and initialize the Nfd configuration. This improves maintainability by centralizing constant values.


915-925: Consider using context with timeout for long-running operations.

Long-running operations like waiting for pod status should use context with timeout to ensure proper cleanup and resource management.

+ctx, cancel := context.WithTimeout(context.Background(), nvidiagpu.BurnPodRunningTimeout)
+defer cancel()
-err = gpuPodPulled.WaitUntilInStatus(corev1.PodRunning, nvidiagpu.BurnPodRunningTimeout)
+err = gpuPodPulled.WaitUntilInStatusWithContext(ctx, corev1.PodRunning)

Also applies to: 1149-1158

…urn test

* the gpu burn test passed on a single node cluster
@openshift-ci openshift-ci bot removed the lgtm label Feb 13, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
tests/nvidiagpu/deploy-gpu-test.go (2)

259-260: Consider using proper synchronization instead of sleep.

Using time.Sleep for synchronization is not ideal. Consider using a polling mechanism with context and timeout.

-By(fmt.Sprintf("Sleep for %s to allow the NFD custom catalogsource to be created", nvidiagpu.SleepDuration.String()))
-time.Sleep(nvidiagpu.SleepDuration)
+ctx, cancel := context.WithTimeout(context.Background(), nvidiagpu.SleepDuration)
+defer cancel()
+err = wait.PollUntilContextCancel(ctx, time.Second, true, func(ctx context.Context) (bool, error) {
+    // Check if catalogsource exists and is initialized
+    return true, nil
+})

661-663: Replace sleep with proper wait mechanism.

Using time.Sleep for deployment synchronization is not reliable. Consider using a proper wait mechanism.

-By(fmt.Sprintf("Sleep for %s to allow the GPU Operator deployment to be created", nvidiagpu.OperatorDeploymentCreationDelay))
-glog.V(gpuparams.GpuLogLevel).Infof("Sleep for %s to allow the GPU Operator deployment to be created", nvidiagpu.OperatorDeploymentCreationDelay)
-time.Sleep(nvidiagpu.OperatorDeploymentCreationDelay)
+ctx, cancel := context.WithTimeout(context.Background(), nvidiagpu.OperatorDeploymentCreationDelay)
+defer cancel()
+err = wait.PollUntilContextCancel(ctx, time.Second, true, func(ctx context.Context) (bool, error) {
+    // Check if deployment exists
+    return true, nil
+})
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fd2409c and ab36fbb.

📒 Files selected for processing (3)
  • pkg/nvidiagpu/consts.go (1 hunks)
  • pkg/nvidiagpu/gpuBurnConfig.go (1 hunks)
  • tests/nvidiagpu/deploy-gpu-test.go (40 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/nvidiagpu/consts.go
🔇 Additional comments (6)
pkg/nvidiagpu/gpuBurnConfig.go (2)

3-10: Well-structured configuration struct with clear documentation.

The GPUBurnConfig struct is well-organized, with clear field names and comprehensive documentation explaining its purpose and domain context.


12-22: LGTM! Constructor provides sensible defaults.

The NewDefaultGPUBurnConfig constructor is well-documented and provides reasonable default values for all configuration fields.

tests/nvidiagpu/deploy-gpu-test.go (4)

10-16: LGTM! Improved organization of imports.

The imports are well-organized, with new packages for NFD and GPU-related functionality properly imported.


41-80: Well-structured variable declarations using package constants.

The global variables are now properly using constants from dedicated packages, improving maintainability and consistency.


711-713: Improve error handling for report writing.

The error from WriteReport is only logged. Consider failing the test or at least adding a warning since this could indicate issues with reporting test results.
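A minimal sketch of the stricter handling; writeReport stands in for the repository's actual report writer, whose signature is not shown in this diff:

if err := writeReport(reportPath, reportData); err != nil {
	// Failing the spec makes broken reporting visible; downgrading this to a
	// glog warning is the softer alternative mentioned above.
	Expect(err).ToNot(HaveOccurred(), "error writing test report: %v", err)
}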


573-578: LGTM! Well-structured cleanup logic.

The cleanup operations are properly implemented using defer statements and respect the cleanup configuration flags.

Also applies to: 740-744, 842-847

* the gpu burn test passed on a single node cluster
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
pkg/nfd/config.go (2)

3-3: Avoid dot import for better code clarity.

Using dot imports (.) can lead to naming conflicts and make it harder to track where symbols are coming from. Consider using a regular import instead:

-import . "github.com/rh-ecosystem-edge/nvidia-ci/pkg/global"
+import "github.com/rh-ecosystem-edge/nvidia-ci/pkg/global"

Then update references to UndefinedValue to global.UndefinedValue.


5-11:

❓ Verification inconclusive

Add documentation for the struct and its fields.

The struct lacks documentation explaining its purpose and the role of each field. Consider adding comments:

+// CustomConfig holds configuration for NFD custom catalog sources
 type CustomConfig struct {
+    // CustomCatalogSourceIndexImage specifies the index image for the custom catalog source
     CustomCatalogSourceIndexImage string
+    // CreateCustomCatalogsource determines whether to create a custom catalog source
     CreateCustomCatalogsource     bool
+    // CustomCatalogSource specifies the name of the custom catalog source
     CustomCatalogSource           string
+    // CatalogSource specifies the name of the catalog source
     CatalogSource                 string
+    // CleanupAfterInstall determines whether to clean up resources after installation
     CleanupAfterInstall           bool
 }

Verify required fields and methods.

Based on the reported compilation issues in nvidianetwork tests, there might be missing fields or methods expected by the test suite. Please verify if additional fields or methods are needed.


🏁 Script executed:

#!/bin/bash
# Search for usage of nfd.Config type to identify expected fields and methods
rg -t go "nfd\.Config\." -A 5

Length of output: 31


Document CustomConfig and Verify Test Integration

Please add documentation to the CustomConfig struct to clarify its purpose and the role of each field. For example:

+// CustomConfig holds configuration for NFD custom catalog sources.
 type CustomConfig struct {
+    // CustomCatalogSourceIndexImage specifies the index image for the custom catalog source.
     CustomCatalogSourceIndexImage string
+    // CreateCustomCatalogsource determines whether to create a custom catalog source.
     CreateCustomCatalogsource     bool
+    // CustomCatalogSource specifies the custom catalog source's name.
     CustomCatalogSource           string
+    // CatalogSource specifies the catalog source's name.
     CatalogSource                 string
+    // CleanupAfterInstall indicates whether to clean up resources after installation.
     CleanupAfterInstall           bool
 }

In addition, the reported compilation issues in the nvidianetwork tests suggest that the test suite might be expecting additional fields or methods. Since our initial automated search for references to nfd.Config produced no output, please manually verify whether:

  • Additional fields or methods are required on this struct (or its related types) to meet the expectations of the tests.
  • The refactoring to consolidate NFD constants has not inadvertently removed or altered necessary functionality.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ab36fbb and c47b3f8.

📒 Files selected for processing (1)
  • pkg/nfd/config.go (1 hunks)

AdamKaabyia and others added 4 commits February 14, 2025 12:51
…ncerns)

pkg/nfd: Manages NFD resource creation and verification

pkg/nfdcheck: Verifies that NFD is correctly installed and configured

* the gpu burn test passed on a single node cluster
added comments

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
…nternal from pkg

added these funcs:
EnsureNFDIsInstalled
DeployNFDOperatorWithRetries

1. gpu - used the new funcs here
2. nno - haven't added them here yet

* the gpu burn test passed on a single node cluster

DeleteOLMPods - needs to be put in a different package; it's currently in the nfd pkg
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🔭 Outside diff range comments (1)
pkg/nfd/deploy-nfd.go (1)

411-412: ⚠️ Potential issue

Use Infof(...) instead of Info(...) for format strings.
You are passing format specifiers ("%s") and multiple arguments into glog.Info, which does not interpret them as format strings. Use Infof(...) to properly format the log message.

-glog.V(gpuparams.GpuLogLevel).Info("Deleting NFD Subscription '%s' in namespace '%s'", nfdSubscriptionName, nfdOperatorNamespace)
+glog.V(gpuparams.GpuLogLevel).Infof("Deleting NFD Subscription '%s' in namespace '%s'", nfdSubscriptionName, nfdOperatorNamespace)
🧹 Nitpick comments (8)
pkg/nfd/utils.go (2)

88-90: Clarify the test skip scenario.
The test calls Skip(...) if the nfd package manifest does not exist in the default catalog and no custom catalogsource is allowed. Ensure that skipping here is an intentional test expectation rather than leaving the environment partially tested.


101-105: Remove or refine commented-out code.
The comment referencing "I don't need to pass fd.OperatorDeploymentName…" appears outdated or unclear. Removing or clarifying it would improve maintainability and readability.

pkg/nfd/config.go (1)

30-58: Optional extension of documentation for NfdParams struct.
Consider adding explanations or examples of typical usage for each field (e.g., clarifying how the operator channel is selected). This can help future contributors quickly understand how to configure the NFD operator.
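For example, hedged along these lines (the channel-selection wording follows this comment; the field list is elided):

// NfdParams holds all the configuration details required to install or manage
// the Node Feature Discovery (NFD) operator on a cluster.
//
// Typical usage: populate the subscription-related fields (catalog source,
// operator channel) before calling EnsureNFDIsInstalled; fields left at their
// defaults fall back to the package manifest's default channel.
type NfdParams struct {
	// ... fields as defined in this PR ...
}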

pkg/nfd/deploy-nfd.go (1)

483-485: Replace the fixed 2-minute sleep with a smarter wait approach.
Consider replacing the unconditional time.Sleep(2 * time.Minute) with a loop that checks operator status at intervals. This approach can reduce test runtime when the operator is ready sooner and still accommodate slower setups.
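A sketch of such a loop using k8s.io/apimachinery's wait helpers; operatorDeploymentReady is a stand-in for whatever readiness check the package exposes:

err := wait.PollUntilContextTimeout(context.TODO(), 10*time.Second, 2*time.Minute, true,
	func(ctx context.Context) (bool, error) {
		// Return true as soon as the NFD operator deployment is ready,
		// instead of always paying the full two minutes.
		return operatorDeploymentReady(ctx), nil
	})
if err != nil {
	return fmt.Errorf("NFD operator deployment not ready: %w", err)
}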

tests/nvidiagpu/deploy-gpu-test.go (4)

48-51: Consider guarding the worker node selector map against mutation.

Since the WorkerNodeSelector map is used for configuration and doesn't change during runtime, it should be treated as immutable. Go does not allow constant maps, so instead of a const declaration, expose it through a function that returns a fresh copy:

-WorkerNodeSelector = map[string]string{
+// WorkerNodeSelector returns the node selector for GPU worker nodes.
+func WorkerNodeSelector() map[string]string {
+	return map[string]string{

219-220: Remove TODO comments.

The TODO comment about passing Nfd instance by pointer should be addressed or removed.

Would you like me to help implement the pointer-based approach for the Nfd instance?


327-330: Consider using context for sleep operations.

Instead of using time.Sleep, consider using context.WithTimeout for better control and testability.

-if ScaleCluster {
-  glog.V(gpuparams.GpuLogLevel).Infof("Sleeping for %s to allow the newly created GPU worker node to be labeled by NFD", nvidiagpu.NodeLabelingDelay.String())
-  time.Sleep(nvidiagpu.NodeLabelingDelay)
+if ScaleCluster {
+  ctx, cancel := context.WithTimeout(context.Background(), nvidiagpu.NodeLabelingDelay)
+  defer cancel()
+  glog.V(gpuparams.GpuLogLevel).Infof("Waiting for %s to allow the newly created GPU worker node to be labeled by NFD", nvidiagpu.NodeLabelingDelay.String())
+  <-ctx.Done()

792-798: Improve cleanup condition readability.

The cleanup condition could be more readable by extracting the condition into a named variable.

+shouldCleanup := cleanupAfterTest && OperatorUpgradeToChannel == UndefinedValue
 defer func() {
-  if cleanupAfterTest && OperatorUpgradeToChannel == UndefinedValue {
+  if shouldCleanup {
     _, err := gpuPodPulled.Delete()
     Expect(err).ToNot(HaveOccurred())
   }
 }()
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f186566 and eadfb14.

📒 Files selected for processing (7)
  • internal/nvidiagpuconfig/config.go (1 hunks)
  • internal/nvidianetworkconfig/config.go (1 hunks)
  • pkg/nfd/config.go (1 hunks)
  • pkg/nfd/deploy-nfd.go (7 hunks)
  • pkg/nfd/utils.go (1 hunks)
  • tests/nvidiagpu/deploy-gpu-test.go (28 hunks)
  • tests/nvidianetwork/deploy-nno-test.go (20 hunks)
✅ Files skipped from review due to trivial changes (2)
  • internal/nvidiagpuconfig/config.go
  • internal/nvidianetworkconfig/config.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/nvidianetwork/deploy-nno-test.go
🔇 Additional comments (4)
pkg/nfd/utils.go (1)

29-35: Consider handling errors distinctly from the “not installed” scenario.
Currently, an error from check.NFDDeploymentsReady(apiClient) is treated the same as “NFD not installed.” If the error is transient (e.g., network issue), you might inadvertently redeploy NFD. Consider differentiating between a genuine “not installed” state and an unexpected error.
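A sketch of that distinction, assuming NFDDeploymentsReady returns (bool, error) as its usage suggests; the redeploy call is a placeholder for the package's install path:

ready, err := check.NFDDeploymentsReady(apiClient)
if err != nil {
	// A transient failure (e.g., a network hiccup) surfaces as an error
	// rather than being mistaken for "NFD is not installed".
	return fmt.Errorf("checking NFD deployments: %w", err)
}
if !ready {
	// Genuine not-installed/not-ready state: proceed with deployment.
	return DeployNFDOperatorWithRetries(apiClient)
}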

pkg/nfd/config.go (1)

17-19: Documentation is well-structured.
The docstring for NewCustomConfig clearly explains default values for each field, which is helpful for maintainability. Good job!

tests/nvidiagpu/deploy-gpu-test.go (2)

43-44: LGTM! Good use of constructor functions.

The initialization of Nfd and burn using constructor functions improves code organization and maintainability.


10-17:

❓ Verification inconclusive

Verify the necessity of blank import.

The blank import of the clients package (_ "github.com/rh-ecosystem-edge/nvidia-ci/pkg/clients") suggests side effects are being used. This is generally not recommended unless absolutely necessary.


🏁 Script executed:

#!/bin/bash
# Check if the clients package has init() functions or other side effects
rg -A 5 "func init\(\)" pkg/clients/

Length of output: 38


Blank Import Necessity Verification

Based on the search for init() functions in the pkg/clients package, no evidence of explicit side-effect code (e.g., an init() method) was detected. This suggests that the blank import in tests/nvidiagpu/deploy-gpu-test.go:

_ "github.com/rh-ecosystem-edge/nvidia-ci/pkg/clients"

may not be triggering any initialization side effects. If the blank import was originally intended to cause side effects via implicit registration or variable initialization, please confirm that such behavior is indeed occurring. Otherwise, consider removing this import for clarity and maintainability, or add a comment clarifying its necessity.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/nvidianetwork/deploy-nno-test.go (1)

252-252: Consider handling potential errors from NFD installation check.

The CheckNfdInstallation function call doesn't handle potential errors. Consider capturing and asserting the error to ensure proper test behavior.

-nfdcheck.CheckNfdInstallation(inittools.APIClient, nfd.RhcosLabel, nfd.RhcosLabelValue, inittools.GeneralConfig.WorkerLabelMap, networkparams.LogLevel)
+err := nfdcheck.CheckNfdInstallation(inittools.APIClient, nfd.RhcosLabel, nfd.RhcosLabelValue, inittools.GeneralConfig.WorkerLabelMap, networkparams.LogLevel)
+Expect(err).ToNot(HaveOccurred(), "error checking NFD installation: %v", err)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eadfb14 and ba84157.

📒 Files selected for processing (1)
  • tests/nvidianetwork/deploy-nno-test.go (17 hunks)
🔇 Additional comments (3)
tests/nvidianetwork/deploy-nno-test.go (3)

7-8: LGTM! Import changes align with package restructuring.

The new imports correctly reflect the reorganization of NFD-related code into dedicated packages.

Also applies to: 13-15


32-32: LGTM! NFD configuration centralization.

Centralizing NFD configuration through a dedicated struct improves code organization and maintainability.


214-214: LGTM! Consistent use of NFD package for setup and cleanup.

The test setup and cleanup now consistently use the NFD package functions, improving code organization.

Also applies to: 228-246

Collaborator

@wabouhamad wabouhamad left a comment

/lgtm. Tested NNO testcase successfully from PR branch

@@ -16,7 +16,7 @@ type NvidiaGPUConfig struct {
 	BundleImage                        string `envconfig:"NVIDIAGPU_BUNDLE_IMAGE"`
 	OperatorUpgradeToChannel           string `envconfig:"NVIDIAGPU_SUBSCRIPTION_UPGRADE_TO_CHANNEL"`
 	GPUFallbackCatalogsourceIndexImage string `envconfig:"NVIDIAGPU_GPU_FALLBACK_CATALOGSOURCE_INDEX_IMAGE"`
-	NFDFallbackCatalogsourceIndexImage string `envconfig:"NVIDIAGPU_NFD_FALLBACK_CATALOGSOURCE_INDEX_IMAGE"`
+	NFDFallbackCatalogsourceIndexImage string `envconfig:"NVIDIAGPU_NFD_FALLBACK_CATALOGSOURCE_INDEX_IMAGE"` // should this be here?
Member

@empovit empovit Feb 24, 2025

No, it shouldn't. It belongs to the NFD package and shouldn't be specific to GPU; it must cover any scenario that involves NFD without ambiguities - NFD + GPU, NFD + NNO, NFD + GPU + NNO. Which means we'll have a single NFD_FALLBACK_CATALOGSOURCE_INDEX_IMAGE instead of NVIDIAGPU_NFD_FALLBACK_CATALOGSOURCE_INDEX_IMAGE and NVIDIANETWORK_NFD_FALLBACK_CATALOGSOURCE_INDEX_IMAGE.
For backward compatibility, we may have a translation NVIDIAGPU_NFD_FALLBACK_CATALOGSOURCE_INDEX_IMAGE > NFD_FALLBACK_CATALOGSOURCE_INDEX_IMAGE (and a similar one for NNO) until the CI jobs are fixed.
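The translation could be as small as this sketch, using plain os.LookupEnv (the envconfig-based loader would need an equivalent hook; the variable names mirror the comment above):

// nfdFallbackIndexImage prefers the unified variable and falls back to the
// legacy per-operator names until the CI jobs are updated.
func nfdFallbackIndexImage() string {
	if v, ok := os.LookupEnv("NFD_FALLBACK_CATALOGSOURCE_INDEX_IMAGE"); ok {
		return v
	}
	for _, legacy := range []string{
		"NVIDIAGPU_NFD_FALLBACK_CATALOGSOURCE_INDEX_IMAGE",
		"NVIDIANETWORK_NFD_FALLBACK_CATALOGSOURCE_INDEX_IMAGE",
	} {
		if v, ok := os.LookupEnv(legacy); ok {
			return v
		}
	}
	return ""
}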

Comment on lines +8 to +28
type CustomConfig struct {
	CustomCatalogSourceIndexImage string
	CreateCustomCatalogsource     bool

	CustomCatalogSource string
	CatalogSource       string
	CleanupAfterInstall bool
}

// NewCustomConfig creates a new CustomConfig instance with default settings.
// All string fields are initialized to UndefinedValue and boolean fields to false.
func NewCustomConfig() *CustomConfig {
	return &CustomConfig{
		CustomCatalogSourceIndexImage: UndefinedValue,
		CreateCustomCatalogsource:     false,

		CustomCatalogSource: UndefinedValue,
		CatalogSource:       UndefinedValue,
		CleanupAfterInstall: false,
	}
}
Member

It looks like this can be shared by all operators that might need a custom catalog source, not limited to NFD.


// NfdParams holds all the configuration details required to install or manage
// the Node Feature Discovery (NFD) operator on a cluster.
type NfdParams struct {
Member

AFAIK this has to be NFDParams according to Go conventions.

Comment on lines +234 to 251
 if Nfd.CleanupAfterInstall && cleanupAfterTest {
 	// Here need to check if NFD CR is deployed, otherwise Deleting a non-existing CR will throw an error
 	// skipping error check for now cause any failure before entire NFD stack
 	By("Delete NFD CR instance in NFD namespace")
-	_ = deploy.NFDCRDeleteAndWait(inittools.APIClient, nfdCRName, nfdOperatorNamespace, 30*time.Second, 5*time.Minute)
+	_ = nfd.NFDCRDeleteAndWait(inittools.APIClient, nfd.CRName, nfd.OperatorNamespace, nvidiagpu.DeletionPollInterval, nvidiagpu.DeletionTimeoutDuration)

 	By("Delete NFD CSV")
-	_ = deploy.DeleteNFDCSV(inittools.APIClient)
+	_ = nfd.DeleteNFDCSV(inittools.APIClient)

 	By("Delete NFD Subscription in NFD namespace")
-	_ = deploy.DeleteNFDSubscription(inittools.APIClient)
+	_ = nfd.DeleteNFDSubscription(inittools.APIClient)

 	By("Delete NFD OperatorGroup in NFD namespace")
-	_ = deploy.DeleteNFDOperatorGroup(inittools.APIClient)
+	_ = nfd.DeleteNFDOperatorGroup(inittools.APIClient)

 	By("Delete NFD Namespace in NFD namespace")
-	_ = deploy.DeleteNFDNamespace(inittools.APIClient)
+	_ = nfd.DeleteNFDNamespace(inittools.APIClient)
 }
Member

All this functionality should be in the NFD package, GPU/NNO tests shouldn't care what a cleanup means for NFD, and will just "ask" it to clean up itself
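Sketched below as a single entry point inside pkg/nfd, composing the delete helpers this PR already exposes (the error propagation is illustrative; the current tests deliberately discard these errors):

// Cleanup tears down the whole NFD stack so GPU/NNO callers don't need to
// know the individual steps.
func Cleanup(apiClient *clients.Settings, pollInterval, timeout time.Duration) error {
	if err := NFDCRDeleteAndWait(apiClient, CRName, OperatorNamespace, pollInterval, timeout); err != nil {
		return fmt.Errorf("deleting NFD CR: %w", err)
	}
	if err := DeleteNFDCSV(apiClient); err != nil {
		return fmt.Errorf("deleting NFD CSV: %w", err)
	}
	if err := DeleteNFDSubscription(apiClient); err != nil {
		return fmt.Errorf("deleting NFD subscription: %w", err)
	}
	if err := DeleteNFDOperatorGroup(apiClient); err != nil {
		return fmt.Errorf("deleting NFD operator group: %w", err)
	}
	if err := DeleteNFDNamespace(apiClient); err != nil {
		return fmt.Errorf("deleting NFD namespace: %w", err)
	}
	return nil
}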

@empovit
Copy link
Member

empovit commented Feb 24, 2025

/lgtm with comments
/approve


openshift-ci bot commented Feb 24, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AdamKaabyia, empovit

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@empovit
Member

empovit commented Feb 24, 2025

@AdamKaabyia do we want to merge it into main?

@AdamKaabyia
Contributor Author

@AdamKaabyia do we want to merge it into main?

no, for now only in package-refactoring

@empovit
Member

empovit commented Feb 24, 2025

/ok-to-test

@empovit
Member

empovit commented Feb 24, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Feb 24, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 7d2847d into rh-ecosystem-edge:package-refactoring Feb 24, 2025
2 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Mar 4, 2025