{"id":168,"date":"2023-07-13T14:36:03","date_gmt":"2023-07-13T12:36:03","guid":{"rendered":"https:\/\/blog.openshift.one\/?p=168"},"modified":"2023-09-19T12:10:36","modified_gmt":"2023-09-19T10:10:36","slug":"openshift-upgrade-got-stuck-unexpected-on-disk-state-target-osimageurl-recovery","status":"publish","type":"post","link":"https:\/\/blog.openshift.one\/index.php\/2023\/07\/13\/openshift-upgrade-got-stuck-unexpected-on-disk-state-target-osimageurl-recovery\/","title":{"rendered":"OpenShift upgrade got stuck &#8211; unexpected on-disk state &#8211;  target osImageURL &#8211; recovery"},"content":{"rendered":"\n<p><strong>DISCLAIMER: This post is based on my own experience while working in my lab. In other words &#8211; <span style=\"text-decoration: underline;\">your mileage may vary<\/span>. Don&#8217;t treat it as an ultimate solution. If you have a production cluster, get in touch with Red Hat support before making any changes.<\/strong><\/p>\n\n\n\n<div style=\"height:25px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>I have a three-node compact cluster (running masters only) virtualised on a single bare-metal server. This is my lab, so explosions are likely to happen and my configuration is not supported by Red Hat in any way. <\/p>\n\n\n\n<p>I was upgrading OpenShift from 4.12.19 to 4.13.1, but the process got stuck because one of the nodes couldn&#8217;t drain: a pod disruption budget combined with an anti-affinity rule prevented a pod from being evicted. Instead of finding out which pod it was, I took a shortcut and rebooted the node. 
<strong>That was wrong<\/strong> \ud83d\ude42<\/p>\n\n\n\n<p>The node rebooted, but the Machine Config Operator (MCO) reported the master pool as degraded:<\/p>\n\n\n\n<pre class=\"wp-block-code has-background-color has-foreground-background-color has-text-color has-background has-small-font-size\"><code>$ oc get mcp\nNAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE\nmaster   rendered-master-9c7420d8aa28803bc87c59122fc855b1   False     True       True      3              2                   2                     <strong>1<\/strong>                      79d\nworker   rendered-worker-2c8c19c25eed12594bf4117d11319867   True      False      False      0              0                   0                     0                      79d<\/code><\/pre>\n\n\n\n<div style=\"height:25px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>The node list showed that one of them hadn&#8217;t been updated and was still running the old version of Kubernetes, as shown below:<\/p>\n\n\n\n<pre class=\"wp-block-code has-background-color has-foreground-background-color has-text-color has-background has-small-font-size\"><code>$ oc get nodes\nNAME       STATUS     ROLES                         AGE     VERSION\nmaster-1   Ready      control-plane,master,worker   6m27s   <strong>v1.25.8+37a9a08<\/strong>\nmaster-2   Ready      control-plane,master,worker   8d      v1.26.3+b404935\nmaster-3   Ready      control-plane,master,worker   22h     v1.26.3+b404935<\/code><\/pre>\n\n\n\n<div style=\"height:25px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>To troubleshoot the issue, I switched to the <code>openshift-machine-config-operator<\/code> project and found the machine-config-daemon pod running on the affected node:<\/p>\n\n\n\n<pre class=\"wp-block-code has-background-color has-foreground-background-color has-text-color has-background\"><code>$ oc get pods -o 
wide\nNAME                                        READY   STATUS    RESTARTS   AGE    IP                NODE       NOMINATED NODE   READINESS GATES\nmachine-config-daemon-2nf25                 2\/2     Running   0          8d     192.168.232.124   master-2   &lt;none&gt;           &lt;none&gt;\n<strong>machine-config-daemon-6lc6x                 2\/2     Running   0          8m     192.168.232.123   master-1   &lt;none&gt;           &lt;none&gt;\n<\/strong>machine-config-daemon-stsnj                 2\/2     Running   0          22h    192.168.232.122   master-3   &lt;none&gt;           &lt;none&gt;<\/code><\/pre>\n\n\n\n<div style=\"height:25px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Its log showed precisely what went wrong. The reboot caused a mismatch between what the MCO expects on the node (the already updated image, <code>quay.io\/openshift-release-dev\/ocp-v4.0-art-dev@sha256:d2aa8899d6ec5cd40bbe7b843027148b768f0a5b8ab091aa46958c4893814306<\/code>) and what it actually finds there (the node image was never updated, so it still runs the old version &#8211; <code>quay.io\/openshift-release-dev\/ocp-v4.0-art-dev@sha256:df4c3b1ad3c665bc4d7a73d78014645a63ee4518cbd515efa8bee68a83444738<\/code>).<\/p>\n\n\n\n<pre class=\"wp-block-code has-background-color has-foreground-background-color has-text-color has-background has-small-font-size\"><code>E0712 14:36:56.237472    3378 writer.go:200] Marking Degraded due to: unexpected on-disk state validating against rendered-master-280af3b80aac4ca3a83b3107bdefe409: expected target osImageURL \"quay.io\/openshift-release-dev\/ocp-v4.0-art-dev@sha256:d2aa8899d6ec5cd40bbe7b843027148b768f0a5b8ab091aa46958c4893814306\", have \"quay.io\/openshift-release-dev\/ocp-v4.0-art-dev@sha256:df4c3b1ad3c665bc4d7a73d78014645a63ee4518cbd515efa8bee68a83444738\" (\"85a1a0c0a7be436c69f743cd2d9538f5fde69ce63eb810ffe3bd9abe122aa5ff\")<\/code><\/pre>\n\n\n\n<div style=\"height:25px\" 
aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Now I needed to somehow encourage the MCO to perform the upgrade again. I found a few examples of how to do it, but none of them worked for me: I always ended up with a degraded node because of this unexpected on-disk state. <\/p>\n\n\n\n<p>Here is what worked for me:<\/p>\n\n\n\n<p>Find the rendered master MachineConfig that refers to the osImageURL currently in use on the affected node, for instance:<\/p>\n\n\n\n<pre class=\"wp-block-code has-background-color has-foreground-background-color has-text-color has-background\"><code>$ oc project openshift-machine-config-operator\nUsing project \"openshift-machine-config-operator\" on server \"https:\/\/api.ocp4.example.com:6443\".\n$ oc get mc | awk '$0 ~ \/rendered-master\/ {print $1}' | while read MC; do oc get mc ${MC} -o yaml &gt; ${MC}.yaml; done\n$ ls rendered-master-*\nrendered-master-280af3b80aac4ca3a83b3107bdefe409.yaml\trendered-master-9c7420d8aa28803bc87c59122fc855b1.yaml\nrendered-master-34cb6b8b7309d8a36043c198f3349034.yaml\trendered-master-d02ab2bac47f31a7d32b64ab43af8c8b.yaml\nrendered-master-38a19ea84a27cc9a437da101a8e61fd2.yaml\trendered-master-d0a726600ac86d0e933e5d41ec1d1ace.yaml\nrendered-master-4f43c4fd6281684dbf2920305f5df0a4.yaml\n$ grep df4c3b1ad3c665bc4d7a73d78014645a63ee4518cbd515efa8bee68a83444738 rendered-master-*\nrendered-master-38a19ea84a27cc9a437da101a8e61fd2.yaml:  osImageURL: quay.io\/openshift-release-dev\/ocp-v4.0-art-dev@sha256:df4c3b1ad3c665bc4d7a73d78014645a63ee4518cbd515efa8bee68a83444738<\/code><\/pre>\n\n\n\n<p>So I know that the last (and only) rendered-master MachineConfig referring to that image is rendered-master-38a19ea84a27cc9a437da101a8e61fd2.<\/p>\n\n\n\n<div style=\"height:25px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Go to the affected node and delete the <code>\/etc\/machine-config-daemon\/currentconfig<\/code> file:<\/p>\n\n\n\n<pre class=\"wp-block-code has-background-color 
has-foreground-background-color has-text-color has-background\"><code>$ oc debug node\/master-1\nStarting pod\/master-1-debug ...\nTo use host binaries, run `chroot \/host`\nchroot \/host\nPod IP: 192.168.1.10\nIf you don't see a command prompt, try pressing enter.\nsh-4.4# chroot \/host\nsh-5.1# rm \/etc\/machine-config-daemon\/currentconfig<\/code><\/pre>\n\n\n\n<div style=\"height:25px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Edit the node&#8217;s annotations and set the following <code>metadata.annotations<\/code> values:<\/p>\n\n\n\n<pre class=\"wp-block-code has-background-color has-foreground-background-color has-text-color has-background\"><code>    machineconfiguration.openshift.io\/currentConfig: <strong>rendered-master-38a19ea84a27cc9a437da101a8e61fd2<\/strong>\n    machineconfiguration.openshift.io\/desiredConfig: rendered-master-9c7420d8aa28803bc87c59122fc855b1\n    machineconfiguration.openshift.io\/reason: \"\"\n    machineconfiguration.openshift.io\/ssh: accessed\n    machineconfiguration.openshift.io\/state: Done<\/code><\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>machineconfiguration.openshift.io\/currentConfig<\/code> &#8211; has to be set to the MachineConfig found in the previous step (the last one whose <code>osImageURL<\/code> matches the image in use on the affected node).<\/li>\n\n\n\n<li><code>machineconfiguration.openshift.io\/desiredConfig<\/code> &#8211; most likely doesn&#8217;t have to be changed, as it points to the MachineConfig containing the new image version to be installed on the node.<\/li>\n\n\n\n<li><code>machineconfiguration.openshift.io\/reason<\/code> &#8211; set it to an empty string.<\/li>\n\n\n\n<li><code>machineconfiguration.openshift.io\/ssh<\/code> &#8211; set it to <code>accessed<\/code> if it isn&#8217;t already.<\/li>\n\n\n\n<li><code>machineconfiguration.openshift.io\/state<\/code> &#8211; set it to <code>Done<\/code>.<\/li>\n<\/ul>\n\n\n\n<div style=\"height:25px\" aria-hidden=\"true\" 
class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Go back to the node and touch the <code>\/run\/machine-config-daemon-force<\/code> file so that the machine-config-daemon re-attempts the node upgrade:<\/p>\n\n\n\n<pre class=\"wp-block-code has-background-color has-foreground-background-color has-text-color has-background\"><code>sh-5.1# touch \/run\/machine-config-daemon-force<\/code><\/pre>\n\n\n\n<div style=\"height:25px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>At this stage the machine-config-daemon should restart the node upgrade, deploy the new image and reboot the node. You can observe this in the logs of the relevant machine-config-daemon pod or directly on the node:<\/p>\n\n\n\n<pre class=\"wp-block-code has-background-color has-foreground-background-color has-text-color has-background\"><code>sh-5.1# journalctl -fl\nJul 13 08:16:43 master-1 root&#91;28206]: machine-config-daemon&#91;6691]: Skipping on-disk validation; \/run\/machine-config-daemon-force present\nJul 13 08:16:43 master-1 root&#91;28207]: machine-config-daemon&#91;6691]: Starting update from rendered-master-38a19ea84a27cc9a437da101a8e61fd2 to rendered-master-9c7420d8aa28803bc87c59122fc855b1: &amp;{osUpdate:true kargs:true fips:false passwd:false files:true units:true kernelType:false extensions:false}\nJul 13 08:16:43 master-1 root&#91;28208]: machine-config-daemon&#91;6691]: drain is already completed on this node\n(...)\nJul 13 08:17:23 master-1 root&#91;29671]: machine-config-daemon&#91;6691]: Rebooting node\nJul 13 08:17:23 master-1 root&#91;29672]: machine-config-daemon&#91;6691]: initiating reboot: Node will reboot into config rendered-master-9c7420d8aa28803bc87c59122fc855b1\nJul 13 08:17:23 master-1 systemd&#91;1]: Started machine-config-daemon: Node will reboot into config rendered-master-9c7420d8aa28803bc87c59122fc855b1.\nJul 13 08:17:23 master-1 root&#91;29675]: machine-config-daemon&#91;6691]: reboot successful\nJul 13 08:17:23 master-1 systemd-logind&#91;1197]: System is 
rebooting.<\/code><\/pre>\n\n\n\n<div style=\"height:25px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>If you&#8217;re lucky, you should see the updated node back in the cluster shortly:<\/p>\n\n\n\n<pre class=\"wp-block-code has-background-color has-foreground-background-color has-text-color has-background\"><code>NAME       STATUS   ROLES                         AGE     VERSION\nmaster-1   Ready    control-plane,master,worker   10m     <strong>v1.26.3+b404935<\/strong>\nmaster-2   Ready    control-plane,master,worker   8d      v1.26.3+b404935\nmaster-3   Ready    control-plane,master,worker   22h     v1.26.3+b404935<\/code><\/pre>\n\n\n\n<div style=\"height:25px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p><strong>If you&#8217;re unlucky and the node still reports an on-disk inconsistency<\/strong>, you may be the victim of a race condition between you and the machine-config-daemon. This isn&#8217;t confirmed or proven, but I am aware of a case where the machine-config-daemon reverted the changes to the node&#8217;s annotations after they were edited and before the node rebooted. For that reason, I recommend running two sessions, one with the editor and the other with a shell on the affected node, so that once the node annotations are saved, the reboot is triggered quickly enough that the machine-config-daemon has no chance to revert them. I will document this further once I face a similar case again.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>DISCLAIMER: This post is based on my own experience while working in my lab. In other words &#8211; your mileage may vary. Don&#8217;t treat it as an ultimate solution. If you have a production cluster, get in touch with Red Hat support before making any changes. 
I have a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[15,9,18,17,16],"class_list":["post-168","post","type-post","status-publish","format-standard","hentry","category-openshift","tag-machineconfigoperator","tag-openshift","tag-recovery","tag-stuck","tag-upgrade"],"_links":{"self":[{"href":"https:\/\/blog.openshift.one\/index.php\/wp-json\/wp\/v2\/posts\/168","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.openshift.one\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.openshift.one\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.openshift.one\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.openshift.one\/index.php\/wp-json\/wp\/v2\/comments?post=168"}],"version-history":[{"count":6,"href":"https:\/\/blog.openshift.one\/index.php\/wp-json\/wp\/v2\/posts\/168\/revisions"}],"predecessor-version":[{"id":177,"href":"https:\/\/blog.openshift.one\/index.php\/wp-json\/wp\/v2\/posts\/168\/revisions\/177"}],"wp:attachment":[{"href":"https:\/\/blog.openshift.one\/index.php\/wp-json\/wp\/v2\/media?parent=168"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.openshift.one\/index.php\/wp-json\/wp\/v2\/categories?post=168"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.openshift.one\/index.php\/wp-json\/wp\/v2\/tags?post=168"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}