fix(resize): delegate RWX filesystem resize to share-manager #3267
base: master
Conversation
Walkthrough

The changes in this pull request introduce enhancements to the Longhorn CSI plugin, specifically for handling ReadWriteMany (RWX) volume expansion. A new POD_NAMESPACE environment variable allows the plugin to locate the share-manager for a volume and delegate the filesystem resize to it.
🔇 Additional comments (3)

csi/node_server.go (3)
The implementation is clean and follows good practices, and the changes properly integrate RWX volume expansion into the existing node-server flow.
Since multiple nodes can trigger expansion simultaneously, let's verify the handling of concurrent requests.

✅ Verification successful: concurrent expansion requests are safely handled by design.
🏁 Scripts executed

The following scripts were executed for the analysis:

```shell
#!/bin/bash
# Description: Check for potential race conditions in volume expansion

# Look for any error handling or synchronization mechanisms around FilesystemResize
rg -A 5 "FilesystemResize"

# Check for any existing issue reports about concurrent expansion
gh issue list --search "RWX concurrent expansion"
```
Length of output: 11257

```shell
#!/bin/bash
# Check the share manager server implementation to understand how
# concurrent FilesystemResize requests are handled.

# Search for the server-side implementation of FilesystemResize
ast-grep --pattern 'func $_(ctx context.Context, _ *emptypb.Empty) (*emptypb.Empty, error) {
  $$$
  FilesystemResize
  $$$
}'

# Look for any locking mechanisms or synchronization in the share manager
rg -A 5 "sync\.Mutex|sync\.RWMutex" vendor/github.com/longhorn/longhorn-share-manager/
```
Length of output: 217

```shell
#!/bin/bash
# Check the share manager implementation in a different way.

# Look for the share manager service implementation
rg -g "!vendor/*" -g "!*.pb.go" -A 10 "type.*ShareManager.*Service.*struct"

# Check for any filesystem resize implementation
rg -g "!vendor/*" -g "!*.pb.go" -A 10 "func.*[Ff]ilesystem[Rr]esize"

# Look for any resize-related locks or synchronization
rg -g "!vendor/*" "sync\.(Mutex|RWMutex).*(resize|expand|filesystem)"
```
Length of output: 773
Force-pushed 8a2ae97 to 077ce60
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
engineapi/share_manager.go (1)

Lines 42-44: Consider enhancing error handling and documentation

While the implementation is functionally correct, consider these improvements:

- Wrap the error with context for better debugging
- Add method documentation explaining its purpose and usage

Consider applying this enhancement:

```diff
+// FilesystemResize triggers an online filesystem resize operation through the share-manager.
+// This is specifically used for expanding ReadWriteMany (RWX) volumes.
 func (c *ShareManagerClient) FilesystemResize() error {
-	return c.grpcClient.FilesystemResize()
+	if err := c.grpcClient.FilesystemResize(); err != nil {
+		return errors.Wrap(err, "failed to resize filesystem via share-manager")
+	}
+	return nil
 }
```

csi/node_server.go (2)
Lines 708-722: Consider reusing Kubernetes clients to enhance performance

In the NodeExpandSharedVolume function, the Kubernetes clients (kubeClient and lhClient) are created on each invocation. This can introduce unnecessary overhead. Consider initializing these clients once during NodeServer initialization and storing them in the NodeServer struct for reuse.
Lines 723-733: Use context.Background() instead of context.TODO()

When making API calls to Kubernetes resources, it's preferable to use context.Background() as the base context in production code.

Apply this diff:

```diff
-	sm, err := lhClient.LonghornV1beta2().ShareManagers(lhNamespace).Get(context.TODO(), volumeName, metav1.GetOptions{})
+	sm, err := lhClient.LonghornV1beta2().ShareManagers(lhNamespace).Get(context.Background(), volumeName, metav1.GetOptions{})

-	pod, err := kubeClient.CoreV1().Pods(lhNamespace).Get(context.TODO(), podName, metav1.GetOptions{})
+	pod, err := kubeClient.CoreV1().Pods(lhNamespace).Get(context.Background(), podName, metav1.GetOptions{})
```
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

⛔ Files ignored due to path filters (6)

- go.mod is excluded by !go.mod
- go.sum is excluded by !**/*.sum, !go.sum
- vendor/github.com/longhorn/longhorn-share-manager/pkg/client/share_manager_client.go is excluded by !vendor/**
- vendor/github.com/longhorn/types/pkg/generated/smrpc/smrpc.pb.go is excluded by !**/*.pb.go, !**/generated/**, !vendor/**
- vendor/github.com/longhorn/types/pkg/generated/smrpc/smrpc_grpc.pb.go is excluded by !**/*.pb.go, !**/generated/**, !vendor/**
- vendor/modules.txt is excluded by !vendor/**

📒 Files selected for processing (3)

- csi/deployment.go (1 hunks)
- csi/node_server.go (3 hunks)
- engineapi/share_manager.go (1 hunks)
🔇 Additional comments (2)

engineapi/share_manager.go (1)

Lines 42-44: Implementation aligns with PR objectives

The new FilesystemResize method correctly delegates the resize operation to the share-manager via gRPC, which addresses the core issue of RWX PVC filesystem resize failures.
csi/deployment.go (1)

Lines 421-428: LGTM! The POD_NAMESPACE environment variable is correctly implemented.

The addition of the POD_NAMESPACE environment variable is well-structured and necessary for enabling the CSI plugin to interact with share-manager pods during RWX volume resize operations.

Let's verify the integration with share-manager:
✅ Verification successful

The POD_NAMESPACE environment variable is correctly implemented and used for RWX volume resizing. The verification results confirm that:

- The POD_NAMESPACE environment variable is used in NodeExpandSharedVolume to get the namespace for share-manager pod communication
- The implementation correctly reads the environment variable via types.EnvPodNamespace
- The share-manager integration is properly implemented with pod naming conventions and namespace-aware operations
🏁 Scripts executed

The following scripts were executed for the analysis:

```shell
#!/bin/bash
# Description: Verify that share-manager pods are deployed in the same
# namespace and the CSI plugin can access them for resize operations.

# Check if share-manager pods exist in the same namespace as the CSI plugin
rg -A 5 "share-manager.*Pod"

# Verify the usage of POD_NAMESPACE in resize-related code
ast-grep --pattern 'NodeExpandSharedVolume($$$) {
  $$$
  POD_NAMESPACE
  $$$
}'
```

Length of output: 113
```shell
#!/bin/bash
# Try a different approach to verify the integration.

# Check for share-manager related code
rg -g '*.go' -B 2 -A 2 'share-manager'

# Look for the NodeExpandSharedVolume implementation
rg -g '*.go' -B 2 -A 5 'NodeExpandSharedVolume'

# Search for POD_NAMESPACE usage in the codebase
rg -g '*.go' 'POD_NAMESPACE'

# Check for share-manager pod related code
rg -g '*.go' 'ShareManagerPod'
```

Length of output: 17106
This pull request is now in conflict. Could you fix it @james-munson? 🙏
I have some concerns:

- As we mentioned in the sync-up, we should double-check the network policy for the new connection from the Longhorn CSI plugin to the share-manager.
- With this design, we can only finish resizing when the RWX PVC is attached, right? Only when the RWX PVC is attached and used by a workload do we have a share-manager pod. What about offline RWX PVC expansion? Do you have any idea?
```go
lhNamespace := os.Getenv(types.EnvPodNamespace)
if lhNamespace == "" {
	return fmt.Errorf("failed to detect pod namespace, environment variable %v is missing", types.EnvPodNamespace)
}

config, err := rest.InClusterConfig()
if err != nil {
	return errors.Wrap(err, "failed to get client config")
}

kubeClient, err := clientset.NewForConfig(config)
if err != nil {
	return errors.Wrap(err, "failed to get k8s client")
}

lhClient, err := lhclientset.NewForConfig(config)
if err != nil {
	return errors.Wrap(err, "failed to get longhorn clientset")
}

sm, err := lhClient.LonghornV1beta2().ShareManagers(lhNamespace).Get(context.TODO(), volumeName, metav1.GetOptions{})
if err != nil {
	return errors.Wrap(err, "failed to get ShareManager CR")
}

podName := types.GetShareManagerPodNameFromShareManagerName(sm.Name)
pod, err := kubeClient.CoreV1().Pods(lhNamespace).Get(context.TODO(), podName, metav1.GetOptions{})
if err != nil {
	return errors.Wrap(err, "failed to get ShareManager pod")
}

client, err := engineapi.NewShareManagerClient(sm, pod)
if err != nil {
	return errors.Wrapf(err, "failed to launch gRPC client for share manager before resizing volume %v", volumeName)
}
```
I think we can get the IP of the share-manager service by accessing volume.ShareEndpoint, without having to look up the share-manager pod?
Looks like it is not the same IP address, though.
Yes, volume.ShareEndpoint is the IP of the service, which should eventually forward to the pod.
```go
{
	Name: "POD_NAMESPACE",
	ValueFrom: &corev1.EnvVarSource{
		FieldRef: &corev1.ObjectFieldSelector{
			FieldPath: "metadata.namespace",
		},
	},
},
```
Maybe we don't need to add the namespace if we don't need to look up the share-manager pod, as mentioned in the other comment.
Checking on the first. My test cluster has no network policies, except for a cattle-fleet-system one that allows all traffic in its namespace. For the second, offline expansion works as it did before: the resize happens when the share-manager pod is started and mounts the volume. That's why the resize code was already there: https://github.com/longhorn/longhorn-share-manager/blob/5b2d717e3ca35a727fbaa628af478af4da535fcf/pkg/server/share_manager.go#L143
Thank you. So in the meantime, the PVC will be stuck in pending expansion until the workload starts? Is that the same behavior as for an RWO PVC?
As we resolved, there is a "pending expansion" in either case, because the filesystem grow is accomplished in NodeStageVolume. However, there is no lingering Condition about it. There is already an automated test case for offline expansion, since that was previously the only way to expand automatically. There isn't one for online expansion, since it used to require a manual step.
Signed-off-by: James Munson <[email protected]>
Force-pushed 077ce60 to 1dcd787
For the point about the network policy, we agreed that it does not look to be necessary for this functionality. Better to keep things simple.
This pull request is now in conflict. Could you fix it @james-munson? 🙏
Which issue(s) this PR fixes:
Issue longhorn/longhorn#9736 aka longhorn/longhorn#8118
What this PR does / why we need it:
Special notes for your reviewer:
Additional documentation or context