Node-agent high memory usage #8582
Comments
The recommended configuration is to set "no limit", i.e. Best Effort.
Another recommendation is to use data mover backup/restore over fs-backup:
|
How would removing the memory limit impact the system as a whole? Could this mean that the node goes OOM when the Velero node agent uses too much memory? The reason we would like to set memory limits is that we are running on a resource-constrained system and we would like to avoid impact on other services.
That depends on the complexity and scale of the data being backed up. Most probably the node will not run out of memory, but the system cache will be reclaimed when memory is tight. However, this would also impact other workloads running on the same node.
If so, data mover is also recommended, because you can customize which nodes the data mover should or should not run on, but you cannot do this for fs-backup.
The volume that causes the most issues seems to be a volume that contains around 1TB of small files, between 500K and 5M in size. |
That isn't surprising. The worst-case scenario for deduplication-based backup and restore software such as restic and kopia, both used in Velero, is a large number of small files. Red Hat's recommendation for a normal-sized config is a 16GB request and a 32GB limit for restic. Your usage requirements are in line with expectations. Kopia usually uses fewer resources than restic in Velero 1.15. Switching will require new backups, as kopia and restic repositories are not compatible and have no migration path. You can check which one is in use by checking the BackupRepository object.
Out of curiosity, I have been playing around with GOMEMLIMIT and GOGC to see if I can lower the overall memory usage. I have been able to lower the peak usage from 25G to 15G. After a restore is done, the go runtime keeps 8G, which is higher than expected. |
If the repository is restic, then setting parallelFilesUpload might improve memory usage at the cost of performance. Kopia is set to do parallel uploads equal to the number of CPUs. Lowering the number of parallel data upload streams would lower memory usage at the cost of performance. I don't know if setting a CPU limit would change the reported number. @Lyndon-Li I don't suppose you've tested that. A brief check with nproc reported the same number of CPU cores regardless of the CPU request and limit. That isn't necessarily the same as what Go reports.
No, the CPU count that Golang reports is always the number of CPU cores in the node; the cgroup CPU limit doesn't affect it. So always use the backup parameter
@RobKenis |
@Lyndon-Li I am testing on a system with 16 cores and 128GB of memory
|
The option --parallel-files-upload doesn't affect Kopia, only Restic. https://github.com/vmware-tanzu/velero/blob/release-1.15/pkg/uploader/kopia/snapshot.go#L100 Should that be a separate issue to have this option apply to Kopia? |
No, on the contrary, it works for the Kopia path only: velero/pkg/uploader/kopia/snapshot.go Line 122 in 804d73c
|
Then the default concurrency in your env is 16. So, all in all, here are the recommendations:
|
@Lyndon-Li I understand the need for Data mover, this would resolve a big part of the problem. From what I understand, this requires a CSI driver to create Volume Snapshots. Is this also a possible solution when using Local Volumes as we don't use a CSI Driver? |
@Lyndon-Li Thanks for the correction, that is useful to know. @RobKenis It means you can set this value below the CPU count of your system, and with Kopia that should reduce memory usage at the cost of performance.
This allows us to enable the profiler endpoints on both the server and the node agent. This helps me in troubleshooting the high memory usage when restoring lots of small files. Refs: vmware-tanzu#8582
This allows us to enable the profiler endpoints on both the server and the node agent. This helps me in troubleshooting the high memory usage when restoring lots of small files. Refs: vmware-tanzu#8582 Signed-off-by: Rob Kenis <[email protected]>
@Lyndon-Li @msfrucht I lowered the number of parallel files using the following config in the
This makes the restore a lot slower, but memory usage still climbs very high. I would like to get more insight into this, but it seems I cannot enable the profiling endpoints on the node agent, only on the velero server. Could you please have a look at PR #8618, which enables profiling on the node agent?
See this comment #8582 (comment). The system cache takes a lot of memory while the fs-uploader reads and writes files. The cache memory won't be aggressively reclaimed as long as there is enough memory on the node, even after the backup/restore completes (since you are using fs-backup and node-agent). See these recommendations #8582 (comment) for the solution.
What steps did you take and what happened:
Velero is installed with nodeAgent enabled. Backup storage location is configured to use Azure Blob Storage.
The memory request for node-agent is set to 5Gi. Normally, the limit is also set to 5Gi. To avoid the Pod getting OOMKilled, I have removed the limit for this restore.
What did you expect to happen:
Memory usage stays around, but below, 5Gi.
Anything else you would like to add:
Environment:
velero version: v1.15.0
velero client config get features:
kubectl version: v1.28.2+k3s1
/etc/os-release: AlmaLinux 9.5 (Teal Serval)