Upgrading the global Cluster

This document describes how to upgrade a global cluster that runs on Immutable Infrastructure. Upgrades replace nodes with new MicroOS images managed by the Cluster API provider; in-place node upgrades are not used.

When to Use This Path

Choose this upgrade path when:

  • The global cluster was originally installed on Immutable Infrastructure. See Installing the global Cluster.
  • Your infrastructure is one of the documented providers: Huawei DCS or Huawei Cloud Stack. VMware vSphere and bare-metal support for the global cluster are planned.

For traditional-OS global clusters, use the standard upgrade path instead.

Two-Phase Upgrade Overview

Like workload clusters, the global cluster on Immutable Infrastructure follows a two-phase upgrade.

  1. Phase 1 — Distribution Version: aligned and agnostic extensions are upgraded to the target Distribution Version. The procedure is shared with workload clusters; see Upgrading Clusters for the Phase 1 mechanics.
  2. Phase 2 — Kubernetes and OS Image: nodes are replaced with new MicroOS images that contain the target Kubernetes version. This document focuses on Phase 2 for the global cluster.

Phase 1 Compatibility

Before starting Phase 2, verify that every workload cluster falls within the Compatible Versions matrix of the target Distribution Version. Workload clusters that are out of range must be upgraded first.

Common Prerequisites

  • The global cluster has completed Phase 1 (Distribution Version upgrade).
  • An etcd backup of the global cluster has been taken and verified.
  • The new MicroOS image and the matching KubeadmControlPlane and MachineDeployment versions are available on the platform's registry.
  • A maintenance window has been planned that accounts for rolling control plane replacement.

Procedure

After installation, the Cluster API controllers that manage the global cluster run on the global cluster itself. Use the global kubeconfig for the kubectl commands in this procedure.
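
Before changing anything, it can be useful to record the current Cluster API objects so that the previous image and version are at hand if a rollback is needed. The following is a minimal sketch, not a platform-provided tool: the function name, the GLOBAL_KUBECONFIG variable, and the CAPI_NAMESPACE default of cpaas-system (taken from the pool example in this document) are assumptions to adjust for your environment.

```shell
# Hypothetical helper: dump the current control plane and worker objects to a file.
# Assumes GLOBAL_KUBECONFIG points at the global kubeconfig and that the
# Cluster API objects live in cpaas-system unless CAPI_NAMESPACE says otherwise.
save_pre_upgrade_state() {
  out_file="$1"
  kubectl --kubeconfig "$GLOBAL_KUBECONFIG" \
    get kubeadmcontrolplane,machinedeployment \
    -n "${CAPI_NAMESPACE:-cpaas-system}" -o yaml > "$out_file"
}

# Usage:
#   GLOBAL_KUBECONFIG=<global-kubeconfig> save_pre_upgrade_state pre-upgrade-capi.yaml
```

The saved file is only a reference copy; it does not replace the etcd backup listed in the prerequisites.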

Step 1 — Update the global Cluster Manifest

Update the Cluster API manifests of the global cluster to reference the new MicroOS image and Kubernetes version. The manifest fields to update are provider-specific.

Huawei DCS

For DCS, create new immutable infrastructure templates instead of editing templates that are already referenced by running machines.

Update the control plane resources:

  • Create a new DCSMachineTemplate for the target image and set spec.template.spec.vmTemplateName to the MicroOS template that matches the target Kubernetes version.
  • Keep preserved node-local data, including /var/cpaas, in DCSIpHostnamePool.spec.pool[].persistentDisk. Do not move preserved disks back into DCSMachineTemplate.
  • Set KubeadmControlPlane.spec.version to the target Kubernetes version.
  • Point KubeadmControlPlane.spec.machineTemplate.infrastructureRef.name to the new DCSMachineTemplate.
  • Keep KubeadmControlPlane.spec.rolloutStrategy.rollingUpdate.maxSurge: 0 when the cluster uses pool-managed persistent disks.

Update worker node resources:

  • Create a new worker DCSMachineTemplate with the target vmTemplateName.
  • Set each MachineDeployment.spec.template.spec.version to the target Kubernetes version.
  • Point each MachineDeployment.spec.template.spec.infrastructureRef.name to the new worker DCSMachineTemplate.
  • Keep each MachineDeployment.spec.strategy.rollingUpdate.maxSurge: 0 when the worker pool uses pool-managed persistent disks.
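
The fields named in the two lists above can be read back from the cluster in one table per resource type, which makes it easier to compare the values before and after the change. This is a sketch under assumptions: the function name and the GLOBAL_KUBECONFIG variable are illustrative, and the column paths simply mirror the field names given above.

```shell
# Hypothetical helper: show version, infrastructure template, and maxSurge
# for every KubeadmControlPlane and MachineDeployment.
# Assumes GLOBAL_KUBECONFIG points at the global kubeconfig.
show_upgrade_targets() {
  kubectl --kubeconfig "$GLOBAL_KUBECONFIG" get kubeadmcontrolplane -A \
    -o custom-columns='NAME:.metadata.name,VERSION:.spec.version,TEMPLATE:.spec.machineTemplate.infrastructureRef.name,MAXSURGE:.spec.rolloutStrategy.rollingUpdate.maxSurge'
  kubectl --kubeconfig "$GLOBAL_KUBECONFIG" get machinedeployment -A \
    -o custom-columns='NAME:.metadata.name,VERSION:.spec.template.spec.version,TEMPLATE:.spec.template.spec.infrastructureRef.name,MAXSURGE:.spec.strategy.rollingUpdate.maxSurge'
}

# Usage:
#   GLOBAL_KUBECONFIG=<global-kubeconfig> show_upgrade_targets
```

A field that is not set on an object is printed as `<none>` by kubectl, which is a quick way to spot a missing maxSurge setting.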

Pool-managed persistent disks are declared on the IP pool, not on the machine template:

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: DCSIpHostnamePool
metadata:
  name: <global-pool-name>
  namespace: cpaas-system
spec:
  pool:
    - ip: <node-ip>
      hostname: <node-hostname>
      persistentDisk:
        - slot: 0
          quantityGB: 40
          datastoreName: <datastore-name>
          path: /var/cpaas
          format: xfs
          mountOptions:
            - defaults

Use the IP pool status to confirm that preserved disks are detached from old VMs and attached to replacement VMs during the rolling replacement.
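
One way to look at the pool status is to print the status field directly. The sketch below assumes that the resource can be addressed as dcsiphostnamepool (the lowercased kind), that the pool lives in cpaas-system as in the example above, and that the status is exposed under status.persistentDiskStatus. The function name and the GLOBAL_KUBECONFIG variable are illustrative.

```shell
# Hypothetical helper: print the raw persistent disk status of one IP pool.
# Assumes GLOBAL_KUBECONFIG points at the global kubeconfig.
show_pool_disk_status() {
  pool_name="$1"
  kubectl --kubeconfig "$GLOBAL_KUBECONFIG" get dcsiphostnamepool "$pool_name" \
    -n cpaas-system -o jsonpath='{.status.persistentDiskStatus}'
}

# Usage:
#   GLOBAL_KUBECONFIG=<global-kubeconfig> show_pool_disk_status <global-pool-name>
```

The output is whatever the provider writes into that field; this sketch does not interpret it.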

Step 2 — Apply the Updated Manifest

Apply the updated manifest against the global cluster.

kubectl --kubeconfig <global-kubeconfig> apply -f <updated-manifest>

The Cluster API provider begins replacing control plane and worker nodes by using the new image. When maxSurge: 0 is set, each old node is drained and deleted before its replacement can reuse the same fixed identity, IP address, or preserved disk.
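
Because applying the manifest starts node replacement, it may be worth previewing the change before running the apply command above. The sketch below uses kubectl diff and a server-side dry run; the function name and the GLOBAL_KUBECONFIG variable are assumptions, not part of the platform.

```shell
# Hypothetical helper: preview a manifest without persisting anything.
# Assumes GLOBAL_KUBECONFIG points at the global kubeconfig.
preview_manifest() {
  manifest="$1"
  rc=0
  # kubectl diff exits 0 when nothing would change, 1 when differences
  # exist, and a value greater than 1 on error.
  kubectl --kubeconfig "$GLOBAL_KUBECONFIG" diff -f "$manifest" || rc=$?
  if [ "$rc" -gt 1 ]; then
    echo "kubectl diff failed (exit $rc)" >&2
    return "$rc"
  fi
  # A server-side dry run sends the objects through API server validation
  # and admission without storing them.
  kubectl --kubeconfig "$GLOBAL_KUBECONFIG" apply --dry-run=server -f "$manifest"
}

# Usage:
#   GLOBAL_KUBECONFIG=<global-kubeconfig> preview_manifest <updated-manifest>
```

If the diff shows changes to templates that running machines already reference, revisit Step 1 before applying.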

Step 3 — Monitor the Rolling Replacement

Watch the rolling replacement until all control plane and worker nodes have been replaced.

kubectl --kubeconfig <global-kubeconfig> get machines -A -o wide
kubectl --kubeconfig <global-kubeconfig> get kubeadmcontrolplane -A

The upgrade is complete when every Machine reports the new Kubernetes version with Phase: Running, and the KubeadmControlPlane reports Ready: True at the new version.
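
The completion criterion for Machines can be checked mechanically by listing every Machine that is not yet on the target version in phase Running. This is a sketch under assumptions: the function names and the GLOBAL_KUBECONFIG variable are illustrative, and the check covers Machines only, not the KubeadmControlPlane condition.

```shell
# Hypothetical helper: print "namespace/name version phase" for every Machine.
# Assumes GLOBAL_KUBECONFIG points at the global kubeconfig.
list_machines() {
  kubectl --kubeconfig "$GLOBAL_KUBECONFIG" get machines -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name} {.spec.version} {.status.phase}{"\n"}{end}'
}

# Hypothetical helper: read the lines produced by list_machines on stdin and
# print the Machines that are not on the target version in phase Running.
pending_machines() {
  target="$1"
  awk -v target="$target" '$2 != target || $3 != "Running" { print $1 }'
}

# Usage (empty output means every Machine meets the criterion):
#   list_machines | pending_machines <target-kubernetes-version>
```

Splitting the listing from the filter keeps the filter usable on saved output, for example when comparing two points in time.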

Verification

After the rolling replacement finishes, verify that the upgraded global cluster is healthy.

kubectl --kubeconfig <global-kubeconfig> get nodes -o wide
kubectl --kubeconfig <global-kubeconfig> get clusterversionshadow -o yaml
kubectl --kubeconfig <global-kubeconfig> get pods -n cpaas-system

All nodes must report the new Kubernetes version, the ClusterVersionShadow must reflect the target Distribution Version, and core platform pods must be Running.
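
The node and pod checks can be narrowed to only the items that need attention. The sketch below is illustrative: the function names and the GLOBAL_KUBECONFIG variable are assumptions, and treating Succeeded pods (for example, completed jobs) as healthy is also an assumption that goes slightly beyond the Running requirement stated above.

```shell
# Hypothetical helper: print nodes whose kubelet version differs from the target.
# Assumes GLOBAL_KUBECONFIG points at the global kubeconfig.
nodes_not_upgraded() {
  target="$1"
  kubectl --kubeconfig "$GLOBAL_KUBECONFIG" get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name} {.status.nodeInfo.kubeletVersion}{"\n"}{end}' \
    | awk -v target="$target" '$2 != target { print $1 }'
}

# Hypothetical helper: list cpaas-system pods that are neither Running nor Succeeded.
unhealthy_platform_pods() {
  kubectl --kubeconfig "$GLOBAL_KUBECONFIG" get pods -n cpaas-system \
    --field-selector='status.phase!=Running,status.phase!=Succeeded'
}

# Usage:
#   nodes_not_upgraded <target-kubernetes-version>
#   unhealthy_platform_pods
```

A pod in phase Running can still have containers that are not ready, so the pod listing is a first filter rather than a full health check.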

Rollback Considerations

Rollback after a partial Phase 2 upgrade is provider-specific. In general:

  • If the upgrade has not yet replaced any control plane node, revert the manifest to the previous image and reapply.
  • If control plane nodes have been replaced, restore from the etcd backup taken before starting the upgrade, then revert the manifest.
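
Which of the two cases applies depends on whether any control plane node has already been replaced. As a rough indicator only, the sketch below counts control plane Machines whose spec.version already equals the target version. It assumes that control plane Machines carry the cluster.x-k8s.io/control-plane label; the function name and the GLOBAL_KUBECONFIG variable are illustrative.

```shell
# Hypothetical helper: count control plane Machines already on the target version.
# Assumes GLOBAL_KUBECONFIG points at the global kubeconfig.
replaced_control_plane_count() {
  target="$1"
  # grep -c prints 0 and exits non-zero when nothing matches, so the
  # exit status is ignored and only the printed count is used.
  kubectl --kubeconfig "$GLOBAL_KUBECONFIG" get machines -A \
    -l cluster.x-k8s.io/control-plane \
    -o jsonpath='{range .items[*]}{.spec.version}{"\n"}{end}' \
    | grep -c -x -F -- "$target" || true
}

# Usage:
#   replaced_control_plane_count <target-kubernetes-version>
```

A count of 0 suggests that no control plane Machine has been created at the target version. A Machine that is still provisioning is counted as well, so confirm against the node list before choosing a rollback path.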

Huawei DCS

For DCS clusters that use pool-managed persistent disks, confirm disk state before rollback:

  • Check DCSIpHostnamePool.status.persistentDiskStatus before deleting or recreating machines.
  • Do not delete retained DCS volumes that are listed in DCSIpHostnamePool.spec.pool[].persistentDisk.
  • Keep maxSurge: 0 while reverting to the previous machine templates so that replacement happens one node at a time.
  • If the control plane was already replaced and cluster state is inconsistent, restore from the verified etcd backup before reapplying the previous manifest.

Next Steps