-
Notifications
You must be signed in to change notification settings - Fork 177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CDI implementation #489
CDI implementation #489
Conversation
bde3b1a
to
a417105
Compare
Pull Request Test Coverage Report for Build 5313388064Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please note that device plugin daemonset would need to mount cdi dir
(/var/run/cdi) IIRC
b8caf78
to
cefd97b
Compare
cmd/sriovdp/manager.go
Outdated
rPool, err := rm.rFactory.GetResourcePool(rc, filteredDevices) | ||
if err != nil { | ||
glog.Errorf("initServers(): error creating ResourcePool with config %+v: %q", rc, err) | ||
return err | ||
} | ||
|
||
if rm.useCdi { | ||
err = cdi.CreateCDISpec(rm.resourcePrefix, filteredDevices, rPool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you regenerated it because the logic is in device plugin and you don't want to duplicate it
name: default-cdi | ||
- mountPath: /var/run/cdi | ||
name: dynamic-cdi | ||
- mountPath: /host//etc/pcidp/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove extra /
pkg/cdi/cdi.go
Outdated
func CreateCDISpec(resourcePrefix string, filteredDevices []types.HostDevice, rPool types.ResourcePool) error { | ||
cdiDevices := make([]cdiSpecs.Device, 0) | ||
cdiSpec := cdiSpecs.Spec{ | ||
Version: "0.5.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use constant for version in https://github.com/container-orchestrated-devices/container-device-interface/blob/v0.5.0/specs-go/config.go#L6
pkg/cdi/cdi.go
Outdated
deviceNode := cdiSpecs.DeviceNode{ | ||
Path: spec.ContainerPath, | ||
HostPath: spec.HostPath, | ||
Permissions: "rwm", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you check if the m in Permissions is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it doesn't needed for VFs. I'll check with SFs and create a separate PR to fix it over the whole project
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you check with and without RDMA CM?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I checked both for VFs
8c38780
to
87716ff
Compare
```yaml | ||
- mountPath: /etc/cdi/ | ||
name: default-cdi | ||
- mountPath: /var/run/cdi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need just /var/run/cdi imo we dont deal with statically configured files right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should align daemonsets as well
@@ -4,6 +4,7 @@ go 1.20 | |||
|
|||
require ( | |||
github.com/Mellanox/rdmamap v1.1.0 | |||
github.com/container-orchestrated-devices/container-device-interface v0.5.4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reminder to check if we prefer to:
- bump this to v0.6.0 and require containerd 1.7.5 or cri-o 1.28 or keep as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to https://github.com/containerd/containerd/blob/v1.7.5/go.mod#L10, we need to keep v0.5.4 now
pkg/resources/server.go
Outdated
@@ -324,6 +365,9 @@ func (rs *resourceServer) cleanUp() error { | |||
if err := rs.resourcePool.CleanDeviceInfoFile(rs.resourceNamePrefix); err != nil { | |||
errors = append(errors, err.Error()) | |||
} | |||
if err := rs.cdi.CleanupSpecs(); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we should:
- cleanup spec files which are related only to this resource server her
- in the resource manager we should cleanup any remaining cdi spec files before starting servers just in case on un-gracefull exit of previous device plugin instance.
thoughts ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, it's better to move this logic to resource manager only: it will cleanup all spec files before resorce servers start, so no raise condidion in this case and all outdated specs will be removed. With such behaviour we don't need to filter for orphaned spec files if config changed during device plugin restart
// Impl implements CDI interface | ||
type Impl struct { | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit : define a New() method
func New() CDI {
return &impl{}
}
and use it in resource server
also consider making Impl struct private
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
// CreateCDISpecForPool creates CDI spec file with specified devices | ||
func (c *Impl) CreateCDISpecForPool(resourcePrefix string, rPool types.ResourcePool) error { | ||
err := c.CleanupSpecs() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will delete spec files of other resource servers. see other comments related to where/how we should clean up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please loot at #489 (comment)
func (c *Impl) CleanupSpecs() error { | ||
for _, dir := range cdi.GetRegistry().GetSpecDirectories() { | ||
specs, err := filepath.Glob(filepath.Join(dir, cdiSpecPrefix+"*")) | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this one fail if dir doesnt exist ?
if yes, then for such error we need to skip, else disregard my comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glob returns the names of all files matching pattern or nil if there is no matching file. The syntax of patterns is the same as in Match. The pattern may describe hierarchical names such as /usr/*/bin/ed (assuming the Separator is '/').
Glob ignores file system errors such as I/O errors reading directories. The only possible returned error is ErrBadPattern, when pattern is malformed.
pkg/cdi/cdi_test.go
Outdated
annoKey := "cdi.k8s.io/example.com_net" | ||
annoVal := "example.com/net=0000:00:00.2" | ||
Expect(annotations[annoKey]).To(Equal(annoVal)) | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to add test for cleanup method as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final comments on my side, once addressed im LGTM.
48e6fba
to
a968db8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice work!
I left some comments
cmd/sriovdp/main.go
Outdated
func flagInit(cp *cliParams) { | ||
flag.StringVar(&cp.configFile, "config-file", defaultConfig, | ||
"JSON device pool config file location") | ||
flag.StringVar(&cp.resourcePrefix, "resource-prefix", "intel.com", | ||
"resource name prefix used for K8s extended resource") | ||
"resource name prefix used for K8s extended re"+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: why a new line here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
cmd/sriovdp/main.go
Outdated
@@ -43,7 +46,7 @@ func main() { | |||
|
|||
glog.Infof("resource manager reading configs") | |||
if err := rm.readConfig(); err != nil { | |||
glog.Errorf("error getting resources from file %v", err) | |||
glog.Error("error getting resources from file", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here we should have the errorf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
cmd/sriovdp/manager.go
Outdated
@@ -101,12 +109,16 @@ func (rm *resourceManager) readConfig() error { | |||
} | |||
|
|||
func (rm *resourceManager) initServers() error { | |||
err := rm.cleanupCDISpecs() | |||
if err != nil { | |||
glog.Infof("Unable to delete CDI specs: %s", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be error here if we return an error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
cmd/sriovdp/manager.go
Outdated
func (rm *resourceManager) cleanupCDISpecs() error { | ||
if rm.cliParams.useCdi { | ||
if err := rm.cdi.CleanupSpecs(); err != nil { | ||
return fmt.Errorf("unable to delete CDI specs: %s", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be %v here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -23,6 +23,10 @@ import ( | |||
"github.com/k8snetworkplumbingwg/sriov-network-device-plugin/pkg/types" | |||
) | |||
|
|||
const ( | |||
accelPoolType = "net-accel" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we add net-
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do it for the consistency across all device types
pkg/cdi/cdi.go
Outdated
func (c *impl) CreateCDISpecForPool(resourcePrefix string, rPool types.ResourcePool) error { | ||
err := c.CleanupSpecs() | ||
if err != nil { | ||
glog.Error("can not cleanup old spec files", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add CreateCDISpecForPool():
so we can be consistent in the logs or remove it from the log below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pkg/cdi/cdi_test.go
Outdated
annoVal := "example.com/net=0000:00:00.1" | ||
Expect(annotations[annoKey]).To(Equal(annoVal)) | ||
}) | ||
It("should not fail on non-existing device", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you sure the test is right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're righ. I'll delete this test
pkg/resources/server.go
Outdated
containerResp.Annotations, err = rs.cdi.CreateContainerAnnotations( | ||
container.DevicesIDs, rs.resourceNamePrefix, rs.resourcePool.GetCDIName()) | ||
if err != nil { | ||
return nil, fmt.Errorf("cant create container annotation: %s", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: cant -> can't
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pkg/resources/server.go
Outdated
@@ -147,8 +162,12 @@ func (rs *resourceServer) ListAndWatch(empty *pluginapi.Empty, stream pluginapi. | |||
devs = append(devs, dev) | |||
} | |||
resp.Devices = devs | |||
err := rs.updateCDISpec() | |||
if err != nil { | |||
glog.Error("cannot update CDI specs", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be errorf with %v
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -147,8 +162,12 @@ func (rs *resourceServer) ListAndWatch(empty *pluginapi.Empty, stream pluginapi. | |||
devs = append(devs, dev) | |||
} | |||
resp.Devices = devs | |||
err := rs.updateCDISpec() | |||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you can have this in the same line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, done!
pkg/cdi/cdi.go
Outdated
func (c *impl) CreateCDISpecForPool(resourcePrefix string, rPool types.ResourcePool) error { | ||
err := c.CleanupSpecs() | ||
if err != nil { | ||
glog.Error("CreateCDISpecForPool(): can not cleanup old spec files", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be glog.Errorf("....%v", err)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pkg/cdi/cdi.go
Outdated
annotations := make(map[string]string, 0) | ||
annoKey, err := cdi.AnnotationKey(resourcePrefix, resourceKind) | ||
if err != nil { | ||
glog.Error("CreateContainerAnnotations(): can't create container annotation", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be
glog.Errorf("....%v", err)
?
Same here and on line 106
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few minor comments about error management.
Beside that, LGTM
thanks, @zeeke |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just some last small nits
can you please also add a section about CDI in the main document that point to the CDI readme?
@@ -23,6 +23,10 @@ import ( | |||
"github.com/k8snetworkplumbingwg/sriov-network-device-plugin/pkg/types" | |||
) | |||
|
|||
const ( | |||
accelPoolType = "net-accel" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be just accel as the accelerators are not only network acceleration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for now we decide to leave it with the net-
@@ -25,6 +25,10 @@ import ( | |||
"github.com/k8snetworkplumbingwg/sriov-network-device-plugin/pkg/types" | |||
) | |||
|
|||
const ( | |||
auxPoolType = "net-sf" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NVIDIA use: scalable functions
Intel use: sub functions
so sf is good here :)
da521ce
to
f98ebb4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
last comment and we can merge this PR!
Great work!
memory: "200Mi" | ||
volumeMounts: | ||
- name: devicesock | ||
mountPath: /var/lib/kubelet/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please update the mounts to be the same as #500
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
This commit implements Container Device Interface [1] support. [1] https://github.com/container-orchestrated-devices/container-device-interface
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!
Thanks for the feedback, @SchSeba ! |
No description provided.