OTA-861: Generate an accepted risk for Y-then-Z upgrade #1093

Open. hongkailiu wants to merge 1 commit into main from OTA-861-better-guard

hongkailiu
Member

@hongkailiu hongkailiu commented Oct 4, 2024

https://issues.redhat.com/browse/OTA-861

This one will be rebased after #1094 is merged.

@openshift-ci openshift-ci bot requested review from DavidHurta and wking October 4, 2024 00:33
@hongkailiu hongkailiu force-pushed the OTA-861-better-guard branch from 1c1ce20 to 38515f3 Compare October 4, 2024 19:19
@hongkailiu hongkailiu changed the title Block Y stream upgrade if any upgrade is in progress OTA-861: Generate an accepted risk for Y-then-Z upgrade Oct 4, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 4, 2024
@openshift-ci-robot
Contributor

openshift-ci-robot commented Oct 4, 2024

@hongkailiu: This pull request references OTA-861 which is a valid jira issue.

In response to this:

This PR adds a guard that blocks a Y-stream upgrade
if an upgrade is already in progress, regardless of whether that upgrade is Y-stream or Z-stream.

For example, it blocks the upgrade to 4.16.1 until the ongoing upgrade
4.14.35 -> 4.15.29 completes.

It also covers the case 4.14.15 -> 4.14.35 -> 4.15.29,
where the upgrade 4.14.35 -> 4.15.29 is blocked until the upgrade
4.14.15 -> 4.14.35 completes.

Note that we still allow an upgrade to 4.(y+1).z''
in the middle of the upgrade 4.y.z -> 4.(y+1).z', even though the direct upgrade
4.y.z -> 4.(y+1).z'' might not be supported.
This is because the upgrade 4.y.z -> 4.(y+1).z' might never complete
due to a bug in 4.(y+1).z' that is fixed in 4.(y+1).z''.
Retargeting to 4.(y+1).z'' is then needed to land 4.(y+1) on the cluster.

Need rebase after #1080 gets merged.
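
The rule in the quoted description can be sketched roughly as follows. This is only an illustration in Python, not the operator's code: the helper names `major_minor` and `blocks_retarget` are made up here, and comparing the new target against the in-progress target's minor version is an assumption inferred from the three examples in the description.

```python
import re
from typing import Optional, Tuple

# Leading "major.minor.patch" of a release version; any suffix is ignored.
_VERSION = re.compile(r"^(\d+)\.(\d+)\.(\d+)")


def major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) for versions such as '4.15.29' or '4.18.0-ec.4'."""
    match = _VERSION.match(version)
    if match is None:
        raise ValueError(f"not a recognizable release version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def blocks_retarget(in_progress_target: Optional[str], new_target: str) -> bool:
    """Hypothetical guard: block a new target that moves to a later Y-stream
    than the update that is currently in progress."""
    if in_progress_target is None:
        return False  # no update in progress, nothing to guard against
    return major_minor(new_target) > major_minor(in_progress_target)


# 4.14.35 -> 4.15.29 in progress, 4.16.1 requested: blocked.
print(blocks_retarget("4.15.29", "4.16.1"))
# 4.14.15 -> 4.14.35 in progress, 4.15.29 requested: blocked.
print(blocks_retarget("4.14.35", "4.15.29"))
# Retarget within the same Y-stream (4.15.30 is a made-up z''): allowed.
print(blocks_retarget("4.15.29", "4.15.30"))
```

The same-minor retarget stays allowed so that a cluster stuck on a buggy 4.(y+1).z' can still move to a fixed 4.(y+1).z''.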


@hongkailiu hongkailiu force-pushed the OTA-861-better-guard branch 3 times, most recently from f13873d to 9cd287d Compare October 4, 2024 19:28
@hongkailiu
Member Author

/test e2e-agnostic-ovn

@hongkailiu
Member Author

/test e2e-agnostic-ovn-upgrade-into-change

Member

@petr-muller petr-muller left a comment

LGTM

@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 7, 2024
@hongkailiu hongkailiu force-pushed the OTA-861-better-guard branch from 9cd287d to 0c472a5 Compare October 7, 2024 18:29
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 7, 2024
@hongkailiu hongkailiu force-pushed the OTA-861-better-guard branch 6 times, most recently from 7a6d575 to 9c35be0 Compare October 7, 2024 20:11
@hongkailiu hongkailiu force-pushed the OTA-861-better-guard branch from 9c35be0 to 0a34c11 Compare October 15, 2024 15:19
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 21, 2024
@hongkailiu hongkailiu force-pushed the OTA-861-better-guard branch from 0a34c11 to 79fee05 Compare October 24, 2024 14:33
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 24, 2024
@hongkailiu hongkailiu force-pushed the OTA-861-better-guard branch from 79fee05 to 1f15ac9 Compare October 24, 2024 14:35
@hongkailiu hongkailiu force-pushed the OTA-861-better-guard branch 3 times, most recently from e702463 to 3218ea0 Compare October 24, 2024 14:53
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 25, 2024
@hongkailiu
Member Author

/test e2e-agnostic-ovn-upgrade-out-of-change

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 27, 2024
@hongkailiu
Member Author

/retest

@hongkailiu
Member Author

/test okd-scos-e2e-aws-ovn

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 2, 2024
@hongkailiu
Member Author

Testing with 9638070
It covers only the happy path: y-then-z.

build 4.19,openshift/cluster-version-operator#1093
The job is here.

launch 4.18 gcp

$ oc get clusterversion version
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-0.nightly-2024-11-30-141716   True        False         7m30s   Cluster version is 4.18.0-0.nightly-2024-11-30-141716

### --force is needed due to the verification failure of the payload
$ oc adm upgrade --to-image registry.build05.ci.openshift.org/ci-ln-70qgc2b/release:latest --force --allow-explicit-upgrade

$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.19.0-0.test-2024-12-02-213010-ci-ln-70qgc2b-latest: 111 of 902 done (12% complete), waiting on etcd, kube-apiserver

Upgradeable=False

  Reason: UpdateInProgress
  Message: An update is already in progress and the details are in the Progressing condition

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

### https://amd64.ocp.releases.ci.openshift.org/releasestream/4.19.0-0.ci/release/4.19.0-0.ci-2024-12-02-120136
$ oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:7b60d49a7809c74533289588df47037c4543e9e0608f1c4ad6307f737474e6d3 --allow-explicit-upgrade --allow-upgrade-with-warnings

$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.19.0-0.test-2024-12-02-213010-ci-ln-70qgc2b-latest: 111 of 902 done (12% complete), waiting on etcd, kube-apiserver

Upgradeable=False

  Reason: UpdateInProgress
  Message: An update is already in progress and the details are in the Progressing condition

ReleaseAccepted=False

  Reason: RetrievePayload
  Message: Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:7b60d49a7809c74533289588df47037c4543e9e0608f1c4ad6307f737474e6d3" failure=The update cannot be verified: unable to verify sha256:7b60d49a7809c74533289588df47037c4543e9e0608f1c4ad6307f737474e6d3 against keyrings: verifier-public-key-redhat

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

$ oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:7b60d49a7809c74533289588df47037c4543e9e0608f1c4ad6307f737474e6d3 --allow-explicit-upgrade --allow-upgrade-with-warnings --force

$ oc get clusterversion version -o yaml | yq -y '.status.history[0].acceptedRisks'
'Target release version="" image="registry.ci.openshift.org/ocp/release@sha256:7b60d49a7809c74533289588df47037c4543e9e0608f1c4ad6307f737474e6d3"
  cannot be verified, but continuing anyway because the update was forced: unable
  to verify sha256:7b60d49a7809c74533289588df47037c4543e9e0608f1c4ad6307f737474e6d3
  against keyrings: verifier-public-key-redhat

  [2024-12-02T22:26:06Z: prefix sha256-7b60d49a7809c74533289588df47037c4543e9e0608f1c4ad6307f737474e6d3
  in config map signatures-managed: no more signatures to check, 2024-12-02T22:26:06Z:
  unable to retrieve signature from https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=7b60d49a7809c74533289588df47037c4543e9e0608f1c4ad6307f737474e6d3/signature-1:
  no more signatures to check, 2024-12-02T22:26:06Z: unable to retrieve signature
  from https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=7b60d49a7809c74533289588df47037c4543e9e0608f1c4ad6307f737474e6d3/signature-1:
  no more signatures to check, 2024-12-02T22:26:06Z: parallel signature store wrapping
  containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release,
  containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release:
  no more signatures to check, 2024-12-02T22:26:06Z: serial signature store wrapping
  ClusterVersion signatureStores unset, falling back to default stores, parallel signature
  store wrapping containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release,
  containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release:
  no more signatures to check, 2024-12-02T22:26:06Z: serial signature store wrapping
  config maps in openshift-config-managed with label "release.openshift.io/verification-signatures",
  serial signature store wrapping ClusterVersion signatureStores unset, falling back
  to default stores, parallel signature store wrapping containers/image signature
  store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release,
  containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release:
  no more signatures to check]

  Forced through blocking failures: Multiple precondition checks failed:

  * Precondition "ClusterVersionRollback" failed because of "LowDesiredVersion": 4.19.0-0.ci-2024-12-02-120136
  is less than the current target 4.19.0-0.test-2024-12-02-213010-ci-ln-70qgc2b-latest,
  and the only supported rollback is to the cluster''s previous version 4.18.0-0.nightly-2024-11-30-141716
  (registry.build02.ci.openshift.org/ci-ln-z83g5lt/release@sha256:31b1d1ef6aaefe25c087ba7320b8e301d3c1ca71ec574fdebdaa42c4884b4e29)

  * Precondition "ClusterVersionUpgradeable" failed because of "MinorVersionClusterUpdateInProgress":
  Retarget to 4.19.0-0.ci-2024-12-02-120136 while a minor level update from 4.18.0-0.nightly-2024-11-30-141716
  to 4.19.0-0.test-2024-12-02-213010-ci-ln-70qgc2b-latest is in progress

  * Precondition "ClusterVersionRecommendedUpdate" failed because of "NoChannel":
  Configured channel is unset, so the recommended status of updating from 4.19.0-0.test-2024-12-02-213010-ci-ln-70qgc2b-latest
  to 4.19.0-0.ci-2024-12-02-120136 is unknown.'

The relevant part is

* Precondition "ClusterVersionUpgradeable" failed because of "MinorVersionClusterUpdateInProgress":
Retarget to 4.19.0-0.ci-2024-12-02-120136 while a minor level update from 4.18.0-0.nightly-2024-11-30-141716
to 4.19.0-0.test-2024-12-02-213010-ci-ln-70qgc2b-latest is in progress

We may wait until 4.19 has an EC so that --force is not needed in the second upgrade command.
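
To make the shape of that accepted risk concrete, here is a small Python sketch that rebuilds the message from the three versions involved in the test above. It is not the operator's implementation: `minor_update_in_progress_risk` and `major_minor` are hypothetical names, and the condition used (the running update crosses a minor version and the requested target differs from it) is an assumption based on the observed output only.

```python
import re
from typing import Optional, Tuple

_VERSION = re.compile(r"^(\d+)\.(\d+)\.(\d+)")


def major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) parsed from the start of a release version."""
    match = _VERSION.match(version)
    if match is None:
        raise ValueError(f"not a recognizable release version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def minor_update_in_progress_risk(
    current: str, in_progress_target: str, new_target: str
) -> Optional[str]:
    """Return the risk message for a retarget during a minor-level update,
    or None when this sketch sees nothing to report."""
    if major_minor(in_progress_target) <= major_minor(current):
        return None  # the running update stays within the same minor version
    if new_target == in_progress_target:
        return None  # same target requested again, so not a retarget
    return (
        f"Retarget to {new_target} while a minor level update from "
        f"{current} to {in_progress_target} is in progress"
    )


print(
    minor_update_in_progress_risk(
        "4.18.0-0.nightly-2024-11-30-141716",
        "4.19.0-0.test-2024-12-02-213010-ci-ln-70qgc2b-latest",
        "4.19.0-0.ci-2024-12-02-120136",
    )
)
```

With the versions from the transcript, the returned string matches the text reported under "MinorVersionClusterUpdateInProgress" once the YAML line wrapping is undone.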

@JianLi-RH

Hi @hongkailiu, I tested it with the two paths below (the intermediate builds were created by build 4.1x,openshift/cluster-version-operator#1093):

  1. 4.17 -> 4.18 intermediate -> 4.18 ec
  2. 4.18 -> 4.19 intermediate -> 4.19 rc

The second one gives the same result as you posted above.
Here are the test steps for the first one:

  1. Create Image
build 4.18,openshift/cluster-version-operator#1093
  1. Set up an OCP 4.17 cluster
    https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/324820/
[jianl@jianl-thinkpadt14gen4 417]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.17.0-0.nightly-2024-12-02-175016   True        False         32m     Cluster version is 4.17.0-0.nightly-2024-12-02-175016
[jianl@jianl-thinkpadt14gen4 417]$ oc adm upgrade
Cluster version is 4.17.0-0.nightly-2024-12-02-175016

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.17
warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently reconciling cluster version 4.17.0-0.nightly-2024-12-02-175016 not found in the "stable-4.17" channel

[jianl@jianl-thinkpadt14gen4 417]$
  1. Upgrade to the above image
[jianl@jianl-thinkpadt14gen4 417]$ oc adm upgrade --to-image registry.build05.ci.openshift.org/ci-ln-g35izwk/release:latest --force --allow-explicit-upgrade
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image registry.build05.ci.openshift.org/ci-ln-g35izwk/release:latest
[jianl@jianl-thinkpadt14gen4 417]$ 

[jianl@jianl-thinkpadt14gen4 417]$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.18.0-0.test-2024-12-03-072424-ci-ln-g35izwk-latest: 2 of 902 done (0% complete)

Upgradeable=False

  Reason: UpdateInProgress
  Message: An update is already in progress and the details are in the Progressing condition

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.17
warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently reconciling cluster version 4.18.0-0.test-2024-12-03-072424-ci-ln-g35izwk-latest not found in the "stable-4.17" channel

[jianl@jianl-thinkpadt14gen4 417]$ 
  1. Retarget to a 4.18 ec build:
[jianl@jianl-thinkpadt14gen4 417]$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:8e097e389656b8b6e362c596b08f929d9271b4f841570f310b13b497a4f2b7d9 --allow-explicit-upgrade --allow-upgrade-with-warnings
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --allow-upgrade-with-warnings is bypassing: the cluster is already upgrading:

  Reason: ClusterOperatorsUpdating
  Message: Working towards 4.18.0-0.test-2024-12-03-072424-ci-ln-g35izwk-latest: 111 of 902 done (12% complete), waiting on etcd, kube-apiserver
Requested update to release image quay.io/openshift-release-dev/ocp-release@sha256:8e097e389656b8b6e362c596b08f929d9271b4f841570f310b13b497a4f2b7d9
[jianl@jianl-thinkpadt14gen4 417]$

Wait a few seconds:

[jianl@jianl-thinkpadt14gen4 417]$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.18.0-ec.4: 6 of 892 done (0% complete)

Upgradeable=False

  Reason: KubeletMinorVersion_KubeletMinorVersionUnsupportedNextUpgrade
  Message: Cluster operator kube-apiserver should not be upgraded between minor versions: KubeletMinorVersionUpgradeable: Kubelet minor versions on 6 nodes will not be supported in the next OpenShift minor version upgrade.

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.17
warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently reconciling cluster version 4.18.0-ec.4 not found in the "stable-4.17" channel

[jianl@jianl-thinkpadt14gen4 417]$ 

Get the latest history entry:

[jianl@jianl-thinkpadt14gen4 417]$ oc get clusterversion version -o yaml | yq -y '.status.history[0]'
acceptedRisks: 'Multiple precondition checks failed:

  * Precondition "ClusterVersionUpgradeable" failed because of "MinorVersionClusterUpdateInProgress":
  Retarget to 4.18.0-ec.4 while a minor level update from 4.17.0-0.nightly-2024-12-02-175016
  to 4.18.0-0.test-2024-12-03-072424-ci-ln-g35izwk-latest is in progress

  * Precondition "ClusterVersionRecommendedUpdate" failed because of "UnknownUpdate":
  RetrievedUpdates=False (VersionNotFound), so the recommended status of updating
  from 4.18.0-0.test-2024-12-03-072424-ci-ln-g35izwk-latest to 4.18.0-ec.4 is unknown.'
completionTime: null
image: quay.io/openshift-release-dev/ocp-release@sha256:8e097e389656b8b6e362c596b08f929d9271b4f841570f310b13b497a4f2b7d9
startedTime: '2024-12-03T08:34:29Z'
state: Partial
verified: true
version: 4.18.0-ec.4
[jianl@jianl-thinkpadt14gen4 417]$ 
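
When checking results like the one above across several clusters, it may help to extract the failed preconditions from the `acceptedRisks` string instead of reading it by eye. The following is a minimal Python sketch that assumes the message keeps the `Precondition "..." failed because of "..."` wording seen in this output; `failed_preconditions` is a helper name invented for this example.

```python
import re
from typing import List, Tuple

# Matches: Precondition "<name>" failed because of "<reason>"
# \s+ tolerates the line breaks that YAML output inserts when wrapping.
_FAILED = re.compile(r'Precondition "([^"]+)" failed because of\s+"([^"]+)"')


def failed_preconditions(accepted_risks: str) -> List[Tuple[str, str]]:
    """Return (precondition, reason) pairs found in an acceptedRisks string."""
    return _FAILED.findall(accepted_risks)


# The acceptedRisks value from the history entry above, with wrapping kept.
SAMPLE = """Multiple precondition checks failed:

* Precondition "ClusterVersionUpgradeable" failed because of "MinorVersionClusterUpdateInProgress":
Retarget to 4.18.0-ec.4 while a minor level update from 4.17.0-0.nightly-2024-12-02-175016
to 4.18.0-0.test-2024-12-03-072424-ci-ln-g35izwk-latest is in progress

* Precondition "ClusterVersionRecommendedUpdate" failed because of "UnknownUpdate":
RetrievedUpdates=False (VersionNotFound), so the recommended status of updating
from 4.18.0-0.test-2024-12-03-072424-ci-ln-g35izwk-latest to 4.18.0-ec.4 is unknown."""

for name, reason in failed_preconditions(SAMPLE):
    print(f"{name}: {reason}")
```

A check for this PR would then look for the pair `("ClusterVersionUpgradeable", "MinorVersionClusterUpdateInProgress")` in the result.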

@JianLi-RH

Continuing with test #1 above.
When the upgrade finished, I upgraded to 4.18.0-rc.0 and then checked the history:

[jianl@jianl-thinkpadt14gen4 417]$ oc get clusterversion version -o yaml | yq -y '.status.history'
- acceptedRisks: 'Precondition "ClusterVersionRecommendedUpdate" failed because of
    "UnknownUpdate": RetrievedUpdates=False (VersionNotFound), so the recommended
    status of updating from 4.18.0-ec.4 to 4.18.0-rc.0 is unknown.'
  completionTime: null
  image: quay.io/openshift-release-dev/ocp-release@sha256:054e75395dd0879e8c29cd059cf6b782742123177a303910bf78f28880431d1c
  startedTime: '2024-12-03T09:43:17Z'
  state: Partial
  verified: true
  version: 4.18.0-rc.0
- completionTime: '2024-12-03T09:39:17Z'
  image: quay.io/openshift-release-dev/ocp-release@sha256:8e097e389656b8b6e362c596b08f929d9271b4f841570f310b13b497a4f2b7d9
  startedTime: '2024-12-03T08:34:29Z'
  state: Completed
  verified: true
  version: 4.18.0-ec.4
- completionTime: '2024-12-03T08:34:29Z'
  image: registry.build05.ci.openshift.org/ci-ln-g35izwk/release:latest
  startedTime: '2024-12-03T08:32:25Z'
  state: Partial
  verified: false
  version: 4.18.0-0.test-2024-12-03-072424-ci-ln-g35izwk-latest
- completionTime: '2024-12-03T07:55:55Z'
  image: registry.ci.openshift.org/ocp/release@sha256:cf3ee19e001a98a5860d88bae68ac76cbdb7336ae5764f2147b9ea72349cb4d6
  startedTime: '2024-12-03T07:20:41Z'
  state: Completed
  verified: false
  version: 4.17.0-0.nightly-2024-12-02-175016
[jianl@jianl-thinkpadt14gen4 417]$ 
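
The Y-then-Z pattern is also visible in the history itself: a `Partial` entry that crossed a minor version, followed by an entry in the same minor version. As a rough Python sketch (the function name `y_then_z_retargets` and this detection rule are assumptions for illustration, not something the operator exposes), the history above can be scanned like this, using only the `state` and `version` fields:

```python
import re
from typing import Dict, List, Tuple

_VERSION = re.compile(r"^(\d+)\.(\d+)\.(\d+)")


def major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) parsed from the start of a release version."""
    match = _VERSION.match(version)
    if match is None:
        raise ValueError(f"not a recognizable release version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def y_then_z_retargets(history: List[Dict[str, str]]) -> List[Tuple[str, str, str]]:
    """Find (completed version, abandoned target, retarget) triples.

    `history` is ordered newest first, like .status.history."""
    found = []
    last_completed = None
    entries = list(reversed(history))  # walk from oldest to newest
    for index, entry in enumerate(entries):
        if entry["state"] == "Completed":
            last_completed = entry["version"]
            continue
        if last_completed is None or index + 1 >= len(entries):
            continue  # no baseline yet, or nothing came after this entry
        abandoned = entry["version"]
        successor = entries[index + 1]["version"]
        crossed_minor = major_minor(abandoned) > major_minor(last_completed)
        same_minor = major_minor(successor) == major_minor(abandoned)
        if crossed_minor and same_minor:
            found.append((last_completed, abandoned, successor))
    return found


# State and version of each history entry shown above, newest first.
HISTORY = [
    {"state": "Partial", "version": "4.18.0-rc.0"},
    {"state": "Completed", "version": "4.18.0-ec.4"},
    {"state": "Partial", "version": "4.18.0-0.test-2024-12-03-072424-ci-ln-g35izwk-latest"},
    {"state": "Completed", "version": "4.17.0-0.nightly-2024-12-02-175016"},
]

print(y_then_z_retargets(HISTORY))
```

For this history the sketch reports the single retarget from the 4.18 intermediate build to 4.18.0-ec.4. The later move to 4.18.0-rc.0 is not reported because it started from a completed 4.18 version.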

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 13, 2024
@hongkailiu hongkailiu force-pushed the OTA-861-better-guard branch from 9638070 to d7a583a Compare January 3, 2025 14:57
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 3, 2025
@hongkailiu hongkailiu force-pushed the OTA-861-better-guard branch from d7a583a to cb7a835 Compare January 3, 2025 15:04
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 3, 2025
Contributor

openshift-ci bot commented Jan 3, 2025

@hongkailiu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-aws-ovn | cb7a835 | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/e2e-agnostic-ovn-upgrade-out-of-change | cb7a835 | link | true | /test e2e-agnostic-ovn-upgrade-out-of-change |


@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 8, 2025
Contributor

openshift-ci bot commented Jan 8, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu, petr-muller

