Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Ubuntu24.04] Install the driver in a single step #156

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

tariq1890
Copy link
Contributor

@tariq1890 tariq1890 commented Nov 22, 2024

This change condenses the two-step driver install into a single step.

Currently, the driver image installs the userspace components and kernel modules separately. This was to allow for signing of the kernel modules with a custom private key and then relinking the signed kernel modules as well as updating the Kernel module should the underlying kernel host be updated.

As none of these workflow apply today, we simplify the driver installation and allow for defining an API in gpu-operator where users can easily pass custom runfile installation arguments

I have tested Driver upgrades and updates (both Open and ClosedRM modules) with these changes and the driver container has been running with no issues

fi
fi
fi

export IGNORE_CC_MISMATCH=1
make -s -j ${MAX_THREADS} SYSSRC=/lib/modules/${KERNEL_VERSION}/build nv-linux.o nv-modeset-linux.o > /dev/null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By removing this code, we are no longer using the MAX_THREADS variable which was introduced here: f18cc79

The runfile installer does expose a similar option that controls the concurrency level when building the kernel modules. I am wondering if we can convert MAX_THREADS (if set) to the relevant option and add it to the list of install_args used when installing the run file below.

-j CONCURRENCY-LEVEL, --concurrency-level=CONCURRENCY-LEVEL
      Set the concurrency level for operations such as building the kernel module which may be parallelized on SMP systems. By default, this will be set to the number of detected CPUs, or to '1', if nvidia-installer fails to detect the number of CPUs. Systems with a large number of CPUs will have the default concurrency level limited to 32; setting a higher level on the command line will override this limit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, thanks!

Copy link
Contributor Author

@tariq1890 tariq1890 Nov 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been fixed. Please check now

Comment on lines -424 to -429
cd /usr/src/nvidia-${DRIVER_VERSION}
if [ -d /lib/modules/${KERNEL_VERSION}/kernel/drivers/video ]; then
rm -rf /lib/modules/${KERNEL_VERSION}/kernel/drivers/video
else
rm -rf /lib/modules/${KERNEL_VERSION}/video
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question -- is it safe to remove this? I am unfamiliar with why this logic was required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to remove this. It wouldn't work otherwise

--x-module-path=/tmp/null \
--x-library-path=/tmp/null \
--x-sysconfig-path=/tmp/null \
-m=${KERNEL_TYPE} \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of scope for this PR -- looking at this again, it is probably best to not explicitly set this option unless the user sets KERNEL_TYPE (or whatever API is exposed by the operator). Especially now that we are leveraging the nvidia-installer to perform the compilation and installation, we can depend on the defaults the nvidia-installer applies for KERNEL_TYPE -- for example, on newer driver versions it will automatically choose to install the open modules on compatible systems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants