Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Install RT kernel from RPMs included in RHCOS #34

Merged
merged 3 commits into from
Jan 10, 2020
Merged

Install RT kernel from RPMs included in RHCOS #34

merged 3 commits into from
Jan 10, 2020

Conversation

slintes
Copy link
Member

@slintes slintes commented Jan 9, 2020

With this PR we install the RT kernel from the RPMs provided in RHCOS 44.81.
This removes the need for a yum repo.

Not sure if the installation of the RT kernel should be default, but can be disabled on the CRD? Atm it's the other way around, but probably in most cases we want the RT kernel.

TODO:

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 9, 2020
@openshift-ci-robot openshift-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jan 9, 2020
echo "[INFO]:labeling worker nodes"
for node in $(${OC_TOOL} get nodes --selector='!node-role.kubernetes.io/master' -o name); do
${OC_TOOL} label $node node-role.kubernetes.io/worker-rt=""
done
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: new line

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

ignitionConfig.Systemd = igntypes.Systemd{
Units: []igntypes.Unit{
{
Contents: rtKernelService,
Enabled: pointer.BoolPtr(true),
Enabled: &enableRTKernel,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we create at all RT kernel unit if the RT disabled, also pre-tunning also depends on the RT kernel unit, so I am interesting if it will work as expected?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes, good point
I'm wondering if we need the option to not install the RT kernel at all, when other performance features depend on it (no clue if all of them do?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other than that, pre-tuning uses wants and not requires, so it would be started without rt-kernel (to my best understanding)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK it is correct, if an unit wants another unit and the latter can't be started, the former should run anyway (see: https://unix.stackexchange.com/questions/388586/systemd-requires-vs-wants - which in turn quotes manpages).
Question is however is if makes sense to do pretuning without rt-kernel, like Marc mentionedin on his grandparent comment.

@slintes slintes changed the title WIP Install RT kernel from RPMs included in RHCOS Install RT kernel from RPMs included in RHCOS Jan 9, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 9, 2020
@davidvossel
Copy link
Member

weird, the e2e test failed, but it actually looks like it happened after our e2e test suite successfully completed. I'll run it again.

2020/01/09 17:38:37 Container test in pod e2e-gcp completed successfully
2020/01/09 17:38:38 error: unable to gather container logs: [error: Unable to retrieve logs from pod container cli: container "cli" in pod "e2e-gcp" is terminated, error: Unable to retrieve logs from pod container artifacts: container "artifacts" in pod "e2e-gcp" is terminated, error: Unable to retrieve logs from pod container setup: container "setup" in pod "e2e-gcp" is terminated, error: Unable to retrieve logs from pod container teardown: container "teardown" in pod "e2e-gcp" is terminated, error: Unable to retrieve logs from pod container test: container "test" in pod "e2e-gcp" is terminated]
2020/01/09 17:38:38 error: unable to signal to artifacts container to terminate in pod e2e-gcp, triggering deletion: could not run remote command: unable to upgrade connection: container not found ("artifacts")
2020/01/09 17:38:38 error: unable to retrieve artifacts from pod e2e-gcp: could not read gzipped artifacts: unable to upgrade connection: container not found ("artifacts")

@davidvossel
Copy link
Member

/retest

Copy link
Member

@davidvossel davidvossel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes look good. I just want to see e2e pass before merging.

if err != nil {
return ""
}
defer func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use an AfterEach() instead. I'm not 100% sure what happens if you do a ctrl-c to kill execution when a defer is in use. I know that ginkgo will attempt to run the AfterEach() functions even if you try to kill the test execution early though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair point, will change it

trap exit_handler EXIT
# create and mount new container
podman create --net=none --name "$osContainerName" "$imageID" > /dev/null
kernelDir=$(podman mount "$osContainerName")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, so we're delivering kernel pkgs in containers? interesting.

Copy link
Member

@ffromani ffromani Jan 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIU this is used to access the packages already shipped in the system image (so we are not shipping anything anymore). But I may have out of date information.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is complicated, let me try to summarize:

  • the installer's bootstrap node installs an "old" os bootimage on cluster nodes, which is just new enough to be able to do everything which is needed on first boot
  • then the MCO installs a newer version of the os ("early pivot"). That one is delivered via container image. Maybe you noticed that OCP comes with a machine-os-content image: that one contains the os
  • with RHCOS version 44 the image contains the RT kernel RPMs. They were just put into the root directory of the image
  • so what I do here (stolen from MCO code): mount the image (which is not possible directly, so create (but don't run) a container from it) which contains the os version which is currently booted, and install the RPMs from the mount path.

longer version: https://github.com/openshift/machine-config-operator/blob/master/docs/OSUpgrades.md and others

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @slintes great summary. I think it would be great to record this as comment in the rt-kernel.sh script itself. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought REPO_URL was a workaround in 4.3 until we have machineconfig in 4.4 and I see you have a note here as "TEMPORARY WORKAROUND".

Are we having 2 workarounds now for 4.3 ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The operator is for 4.4 only. There we have the RT kernel RPMs already available in the RHCOS image, so no need for a yum repo anymore, but the MCO is not ready to use it yet, see WIP PR [0]. So this is a new temporary workaround for 4.4, unrelated to 4.3.

[0] openshift/machine-config-operator#1330

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fromanirh ah missed your comment
will do

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fromanirh done

@@ -0,0 +1,12 @@
#!/bin/bash
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

who calls this? only humans or also the operator proper? I guess the former, asking to be sure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use a MachineConfig for installing this script and a systemd unit, and systemd will execute the script then.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh sorry, just noticed I was looking at the wrong script
I called this one manually on test clusters

Copy link
Member

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At glance looks ok, only clarification questions inside

Copy link
Member

@davidvossel davidvossel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 10, 2020
@openshift-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidvossel, slintes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 10, 2020
@openshift-merge-robot openshift-merge-robot merged commit 49b6321 into openshift-kni:master Jan 10, 2020
dshchedr pushed a commit to dshchedr/performance-addon-operators that referenced this pull request Mar 26, 2020
remove unused references to worker-rt label
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants