Troubleshoot a DCSMachine Stuck in Deleting
Use this guide when a DCSMachine on a Huawei DCS workload cluster stays in Phase: Deleting for longer than the expected ~60 seconds, blocking node turnover during scale-down, rolling upgrade, or full-cluster teardown.
Scope
This guide covers the DCS-side failure modes that prevent cluster-api-provider-dcs from completing the reconcileDelete flow on a node. The provider's normal delete sequence is:
Each step depends on the DCS platform responding. When DCS is unresponsive or refuses a step, the controller surfaces the wait reason on DCSMachine.Status.Conditions.VMStopPending. Reading that condition first usually identifies the problem without needing controller log access.
This guide does not cover normal pre-deletion gating (CAPI Machine drain timeouts, finalizers held by other controllers) — those belong to upstream Cluster API troubleshooting.
Symptoms
If VMStopPending is absent and Machine.status.phase is still Deleting, the issue is more likely upstream of reconcileDelete (e.g., CAPI drain, kubeadm pre-delete hooks). Consult the CAPI troubleshooting docs for that case.
Reading the VMStopPending Condition
The VMStopPending condition is the primary signal for this guide.
Look for:
The Reason field is the diagnostic:
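As a minimal sketch of reading the condition programmatically, the following assumes the DCSMachine has been exported as JSON (for example with `kubectl get ... -o json`) and that its conditions use the usual Kubernetes layout of `type`, `status`, `reason`, and `message` under `status.conditions`. The helper name `find_condition` and the sample reason and message values are hypothetical placeholders; only the condition type `VMStopPending` comes from this guide.

```python
import json
from typing import Optional


def find_condition(obj: dict, cond_type: str = "VMStopPending") -> Optional[dict]:
    """Return the first condition of the given type, or None if it is absent."""
    conditions = (obj.get("status") or {}).get("conditions") or []
    for cond in conditions:
        if cond.get("type") == cond_type:
            return cond
    return None


# Hypothetical sample object. The reason and message values are placeholders,
# not real provider output.
SAMPLE = json.loads("""
{
  "kind": "DCSMachine",
  "status": {
    "conditions": [
      {"type": "Ready", "status": "False"},
      {"type": "VMStopPending", "status": "True",
       "reason": "ExampleReason", "message": "example message"}
    ]
  }
}
""")

cond = find_condition(SAMPLE)
if cond is None:
    print("VMStopPending absent: look upstream of reconcileDelete")
else:
    print(f"reason={cond.get('reason')} message={cond.get('message')}")
```

If the helper returns `None` while the Machine is still in Deleting, that matches the case described under Symptoms where the problem is more likely upstream of reconcileDelete.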
Diagnostic Flows
VM Stuck in stopping
The controller issued a graceful StopVm (calling DCS API /action/stop?mode=safe) and the VM acknowledged but never converged.
Common causes:
1. VM pvDriverStatus is not running: the guest tools (vmtools / pvdriver) needed for graceful shutdown are not responding. Verify on the DCS portal: open the VM detail page and check the pvDriverStatus field. If it shows anything other than running, the VM cannot accept a safe stop.
2. VM template predates 4.2.1: older DCS VM templates lack the guest tools required for safe stop. Confirm that the vmTemplateName in the DCSMachine matches a 4.2.1+ template (label cpaas.io/dcs-vm-template on the cpaas-system/<release>-dcs-vm-template ConfigMap, see Resolving Placeholder Values).
3. Guest OS hung: the VM kernel is stuck and is not processing the ACPI shutdown.
Resolution path:
- For (1) and (3): Manually issue a force stop from the DCS portal (VM detail → Operations → Stop → Force). Once the VM reaches stopped, the controller's next reconcile picks up the new state and continues with DeleteVm.
- For (2): The VM cannot be safely deleted by this provider on that template. Move the workloads off the node, manually power off the VM, then use cpaas.io/retain-vm to skip the controller-side delete.
VM Stuck in Mid-Transition
The condition's Message shows a status like migrating, starting, paused, or hibernated. The controller does not issue any operation on a VM in these states, because DCS would reject it, and waits indefinitely for the VM to converge to running or stopped.
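The state handling described in this guide can be summarized as a small decision table. The sketch below is illustrative only and is not the provider's code: the names `Action` and `next_delete_action` are hypothetical, and the mapping is inferred from this guide (a running VM gets a safe StopVm, a stopped VM gets DeleteVm, and every other status is waited on).

```python
from enum import Enum


class Action(Enum):
    STOP_SAFE = "issue StopVm (mode=safe)"
    DELETE = "issue DeleteVm"
    WAIT = "requeue and wait for the VM to converge"


def next_delete_action(vm_status: str) -> Action:
    """Map a DCS VM status to the next step of the delete flow (illustrative)."""
    if vm_status == "running":
        return Action.STOP_SAFE
    if vm_status == "stopped":
        return Action.DELETE
    # stopping, migrating, starting, paused, hibernated, ...:
    # no operation is issued while the VM is in one of these states.
    return Action.WAIT


for status in ("running", "stopping", "migrating", "stopped"):
    print(f"{status:10s} -> {next_delete_action(status).value}")
```

Read this way, a DCSMachine stays in Deleting for as long as the VM keeps reporting a status that maps to the wait branch.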
Controller Cannot Reach DCS API
If the VMStopPending condition is absent but Machine.status.phase is still Deleting for an extended period, the controller may be unable to talk to DCS at all. Check the controller log:
Look for one of:
DCS Account Lockout
errorCode: 10100116 indicates that the DCS portal account in the cluster's credential Secret is in a brute-force lockout window. The lockout window restarts on every failed login, so a controller stuck in a retry loop extends the lockout indefinitely.
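The arithmetic behind the self-extending lockout can be shown with a small simulation. The numbers here are assumptions chosen for illustration (a 300-second lockout window and a 60-second retry interval); the real DCS policy values are not stated in this guide, and `locked_at` is a hypothetical helper.

```python
def locked_at(t, failed_login_times, lockout_window):
    """True if the account is still locked at time t (seconds).

    Models a policy in which every failed login restarts the lockout window.
    """
    past = [f for f in failed_login_times if f <= t]
    return bool(past) and t < max(past) + lockout_window


LOCKOUT_WINDOW = 300  # seconds, assumed value
RETRY_INTERVAL = 60   # seconds, assumed value

# Controller keeps retrying every minute for 30 minutes:
# each attempt fails and pushes the unlock time further out.
retrying = list(range(0, 1800, RETRY_INTERVAL))
print(locked_at(1799, retrying, LOCKOUT_WINDOW))  # True

# Retries stopped after the first failure: unlocked once the window elapses.
print(locked_at(300, [0], LOCKOUT_WINDOW))  # False
```

Under this model the lockout can only expire if failed logins stop for at least one full window, which is why a retry interval shorter than the window never lets the account recover.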
To break the cycle:
Avoid reusing the DCS portal admin account for the provider's Secret in production. Configure a dedicated DCS account (interconnect or domain user, see Credential User Types) with the lockout policy relaxed, so a transient controller failure cannot lock the platform out for everyone.
FC Site Policy "仅能删除已停止虚拟机" ("Only stopped VMs can be deleted", errorCode 102212808)
If the workload cluster runs on a DCS site whose underlying FusionCompute platform has the "delete-only-when-stopped" safety policy enabled, DeleteVm will be rejected when the VM is still running. As of cluster-api-provider-dcs v1.0.18, the provider's normal delete flow handles this automatically by issuing safe StopVm before DeleteVm, but in rare race conditions an external actor (a human operator on the DCS portal, or another automation) can restart the VM between the controller's stop check and the DeleteVm call. The controller surfaces the race as:
This is not an error — the controller treats errorCode 102212808 as recoverable and requeues. The next reconcile re-checks VM status; if running again, the controller re-issues StopVm(safe) and proceeds. Normal recovery time is ~30 seconds.
If you see this errorCode repeatedly (more than 3 cycles), an external actor is racing the controller. Suspend the external automation, or apply cpaas.io/retain-vm to fully take control of VM lifecycle outside the controller.
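The "more than 3 cycles" rule of thumb can be expressed as a simple check. This sketch assumes you have already collected one outcome per reconcile attempt, oldest first, with `None` for an attempt that did not fail; how you collect them (for example from the controller log) is left open. The helper names are hypothetical, and counting only the trailing consecutive run is an interpretation of "repeatedly", not something this guide specifies.

```python
RACE_ERROR_CODE = 102212808  # "delete only when stopped" rejection


def consecutive_race_cycles(error_codes):
    """Length of the trailing run of reconciles that ended in RACE_ERROR_CODE."""
    run = 0
    for code in reversed(error_codes):
        if code != RACE_ERROR_CODE:
            break
        run += 1
    return run


def looks_like_external_race(error_codes, threshold=3):
    """True once the error has repeated for more than `threshold` cycles."""
    return consecutive_race_cycles(error_codes) > threshold


history = [None, RACE_ERROR_CODE, RACE_ERROR_CODE, RACE_ERROR_CODE, RACE_ERROR_CODE]
print(looks_like_external_race(history))  # True
```

A single occurrence followed by a successful delete is the normal recovery path and does not trip the check.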
Use cpaas.io/retain-vm as an Escape Hatch
When the controller cannot finish reconcileDelete for any reason and you need to release the Machine finalizer so the workload cluster can proceed (scale-down, upgrade, removal), annotate either the Machine or the DCSMachine:
Effect:
- The controller's reconcileDelete recognizes the annotation, skips both ensureVmStopped and DeleteVm, and releases the DCSMachine finalizer.
- The Machine then transitions out of Deleting and Cluster API can complete the parent operation.
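The "either the Machine or the DCSMachine" rule can be sketched as follows. This is an illustration, not the provider's code: `retain_vm_requested` is a hypothetical helper, the objects are plain dictionaries shaped like Kubernetes metadata, and the sketch checks only for the presence of the annotation key. Whether the provider also inspects the annotation value is not stated in this guide, so the value below is a placeholder.

```python
RETAIN_ANNOTATION = "cpaas.io/retain-vm"


def retain_vm_requested(machine: dict, dcs_machine: dict) -> bool:
    """True if either object carries the retain-vm annotation key.

    Assumption: only the key's presence matters; the value is ignored here.
    """
    for obj in (machine, dcs_machine):
        annotations = (obj.get("metadata") or {}).get("annotations") or {}
        if RETAIN_ANNOTATION in annotations:
            return True
    return False


machine = {"metadata": {"annotations": {RETAIN_ANNOTATION: "<value>"}}}
dcs_machine = {"metadata": {"annotations": {}}}
print(retain_vm_requested(machine, dcs_machine))  # True
```

Because the annotation on one object is enough, check both objects before concluding that a VM is still under controller-side deletion.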
Side effects (important):
- The VM is not deleted from DCS: it remains on the DCS platform, still consuming compute and storage resources, outside Cluster API's lifecycle management.
- The IP / hostname slot in DCSIpHostnamePool is also not released: the pool entry will show as still in use until you manually free it.
After applying retain-vm, plan to manually:
- Stop and delete the VM on the DCS portal (or via DCS API).
- Free the IP / hostname entry from the corresponding DCSIpHostnamePool.
- (Optional) Clean up any persistent disks that were attached.
Verify Recovery
After applying any of the resolutions above, watch the DCSMachine until the condition clears:
If the Machine is fully gone and the VM is no longer listed on the DCS portal, recovery is complete.
See Also
- Creating Clusters on Huawei DCS: full cluster manifest reference
- Cloud Credentials for Huawei DCS: Secret format and User Types
- Troubleshoot a Workload Cluster Stuck in Provisioned: symptom-adjacent guide for cluster bring-up issues
- Troubleshoot a Cluster Stuck in Deleting: top-level Cluster CR stuck in Deleting after every child machine is gone