Troubleshoot a DCSMachine Stuck in Deleting

Use this guide when a DCSMachine on a Huawei DCS workload cluster stays in Phase: Deleting for longer than the expected ~60 seconds, blocking node turnover during scale-down, rolling upgrade, or full-cluster teardown.

Scope

This guide covers the DCS-side failure modes that prevent cluster-api-provider-dcs from completing the reconcileDelete flow on a node. The provider's normal delete sequence is:

running VM  →  safe StopVm  →  stopped  →  (detach persistent disks if any)  →  DeleteVm  →  VM gone in DCS

Each step depends on the DCS platform responding. When DCS is unresponsive or refuses a step, the controller surfaces the wait reason on DCSMachine.Status.Conditions.VMStopPending. Reading that condition first usually identifies the problem without needing controller log access.

This guide does not cover normal pre-deletion gating (CAPI Machine drain timeouts, finalizers held by other controllers) — those belong to upstream Cluster API troubleshooting.

Symptoms

| Where to look | What you see |
| --- | --- |
| Machine.status.phase (on the global cluster) | Deleting for > 5 minutes |
| DCSMachine.status.conditions[?(@.type=="VMStopPending")] | Status: False with a Reason and Message describing the wait |
| DCS portal VM view | The VM is still listed (running, stopping, or some other non-stopped state) |
| Controller log on the global cluster | A cluster-api-provider-dcs-manager line such as "issued safe StopVm" or "DeleteVm rejected by FC site policy" |

If VMStopPending is absent and Machine.status.phase is still Deleting, the issue is more likely upstream of reconcileDelete (e.g., CAPI drain, kubeadm pre-delete hooks). Consult the CAPI troubleshooting docs for that case.

Reading the VMStopPending Condition

The VMStopPending condition is the primary signal for this guide.

kubectl -n <ns> describe dcsmachine <name>

Look for:

Conditions:
  Type:               VMStopPending
  Status:             False
  Severity:           Warning
  Reason:             VMMidTransition
  Message:            VM stuck in stopping; awaiting transition to stopped

The Reason field is the diagnostic:

| Reason | Severity | What it means | Next step |
| --- | --- | --- | --- |
| WaitingForStop | Info | Safe StopVm was issued; controller polls every 10s for stopped | Wait up to ~60s. If the VM stays in stopping past that, jump to VM Stuck in stopping. |
| VMMidTransition | Warning | VM is in a transient state other than running / stopped (e.g., stopping, starting, migrating, paused, hibernated) | Inspect the Message field for the specific status. See VM Stuck in Mid-Transition. |
| (condition absent) | | VM is either already stopped or already gone; the controller is past the stop phase. | Check the DCS portal for VM existence and the Machine finalizers. |
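For scripted triage, the table can be expressed as a small lookup. This is only a sketch of the mapping above; next_step is a hypothetical helper name, and any Reason not listed in the table is reported as unrecognized rather than guessed at.

```shell
# Map a VMStopPending Reason to the next step from the table above.
# Pass an empty string when the condition is absent.
next_step() {
  case "$1" in
    WaitingForStop)  echo "wait up to ~60s, then see VM Stuck in stopping" ;;
    VMMidTransition) echo "read the Message field, then see VM Stuck in Mid-Transition" ;;
    "")              echo "check the DCS portal for the VM and the Machine finalizers" ;;
    *)               echo "unrecognized reason: $1" ;;
  esac
}
```

Feed it the Reason read from the DCSMachine, for example next_step VMMidTransition.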

Diagnostic Flows

VM Stuck in stopping

The controller issued a graceful StopVm (calling DCS API /action/stop?mode=safe) and the VM acknowledged but never converged.

Common causes:

  1. VM pvDriverStatus is not running: the guest tools (vmtools / pvdriver) needed for graceful shutdown aren't responding. Verify on the DCS portal: open the VM detail page and check the pvDriverStatus field. If it shows anything other than running, the VM cannot accept a safe stop.
  2. VM template predates 4.2.1: older DCS VM templates lack the guest tools required for safe stop. Confirm that the vmTemplateName in the DCSMachine matches a 4.2.1+ template (label cpaas.io/dcs-vm-template on the cpaas-system/<release>-dcs-vm-template ConfigMap, see Resolving Placeholder Values).
  3. Guest OS hung: the VM kernel is stuck and is not processing the ACPI shutdown.
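For cause (2), the labels on the template ConfigMap can be listed from the global cluster. A minimal sketch, assuming the ConfigMap name follows the <release>-dcs-vm-template pattern given above; show_template_labels is a hypothetical helper name, and how the label value encodes the template version is not specified here, so compare it by eye against the vmTemplateName.

```shell
# List the labels on the DCS VM template ConfigMap for a given release.
# The release argument replaces the <release> placeholder in the ConfigMap name.
show_template_labels() {
  release="$1"
  kubectl -n cpaas-system get configmap "${release}-dcs-vm-template" --show-labels
}
```

Call it as show_template_labels <release> and look for the cpaas.io/dcs-vm-template label in the LABELS column.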

Resolution path:

  • For (1) and (3): Manually issue a force stop from the DCS portal (VM detail → Operations → Stop → Force). Once the VM reaches stopped, the controller's next reconcile picks up the new state and continues with DeleteVm.
  • For (2): The VM cannot be safely deleted by this provider on that template; drain the workloads off the node, power off the VM manually, then use cpaas.io/retain-vm to skip controller-side delete.

VM Stuck in Mid-Transition

The Message shows a status such as migrating, starting, paused, or hibernated. The controller does not issue any operation on a VM in these states (DCS would reject it) and waits indefinitely for the VM to converge to running or stopped.

| Status in message | Likely cause | Resolution |
| --- | --- | --- |
| migrating | Platform-initiated vMotion / live migration | Wait. Migration completes on its own; controller resumes when status reaches running or stopped. |
| starting | Platform initiated VM start after a power event | Wait until running, then deletion proceeds. |
| paused / hibernated | Platform-paused VM (does not self-recover) | Resume the VM manually from the DCS portal (Operations → Resume), or use cpaas.io/retain-vm to bypass and clean up the VM later. |

Controller Cannot Reach DCS API

If the VMStopPending condition is absent but Machine.status.phase is still Deleting for an extended period, the controller may be unable to talk to DCS at all. Check the controller log:

kubectl -n cpaas-system logs deployment/cluster-api-provider-dcs-manager --tail=50

Look for one of:

| Log keyword | Cause |
| --- | --- |
| connection refused / timeout | DCS API endpoint unreachable from the global cluster |
| errorCode: 10100116 "帐户锁定中" ("account locked") | DCS admin account locked. See DCS Account Lockout. |
| 401 Unauthorized | Credential Secret is stale; rotate per Cloud Credentials. |
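Scanning a longer log window by eye is error-prone, so the keywords above can be matched mechanically. The sketch below reads log lines from standard input and prints each matching cause once; classify_dcs_log and its output labels are hypothetical names, and the matching is plain substring matching, so an unrelated line that happens to contain "timeout" is also counted.

```shell
# Classify controller log lines (read from stdin) by the keywords in the table above.
# Prints each distinct cause once, sorted alphabetically.
classify_dcs_log() {
  while IFS= read -r line; do
    case "$line" in
      *"connection refused"*|*timeout*) echo "unreachable" ;;
      *10100116*)                       echo "account-locked" ;;
      *"401 Unauthorized"*)             echo "stale-credentials" ;;
    esac
  done | sort -u
}
```

Typical use is to pipe the controller log into it: kubectl -n cpaas-system logs deployment/cluster-api-provider-dcs-manager --tail=500 | classify_dcs_log. No output means none of the three keywords appeared in the sampled lines.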

DCS Account Lockout

errorCode: 10100116 indicates that the DCS portal account in the cluster's credential Secret is in a brute-force lockout window. The lockout window restarts on every failed login, so a controller stuck in a retry loop extends the lockout indefinitely.

To break the cycle:

# 1. Scale the manager to 0 so it stops retrying
kubectl -n cpaas-system scale deployment cluster-api-provider-dcs-manager --replicas=0

# 2. Wait 5 minutes for the DCS lockout window to expire naturally
sleep 300

# 3. Restore the manager
kubectl -n cpaas-system scale deployment cluster-api-provider-dcs-manager --replicas=1

# 4. Verify the controller authenticates again (no new errorCode 10100116 lines)
kubectl -n cpaas-system logs deployment/cluster-api-provider-dcs-manager --tail=20
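Step 4 can be made less dependent on reading the log by eye. The sketch below waits for the Deployment to finish rolling out and then checks the recent log for the lockout code; verify_manager_recovered is a hypothetical helper name, and the 120s timeout and 2-minute log window are arbitrary choices for illustration.

```shell
# Wait for the manager to come back, then check recent logs for the lockout code.
# Returns non-zero if the rollout fails or errorCode 10100116 is still logged.
verify_manager_recovered() {
  kubectl -n cpaas-system rollout status \
    deployment/cluster-api-provider-dcs-manager --timeout=120s || return 1
  if kubectl -n cpaas-system logs \
      deployment/cluster-api-provider-dcs-manager --since=2m | grep -q '10100116'; then
    echo "lockout errors still present"
    return 1
  fi
  echo "no lockout errors in the last 2 minutes"
}
```

If it reports that lockout errors are still present, scale the manager back to 0 and wait longer before retrying, since every failed login restarts the window.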

Avoid reusing the DCS portal admin account for the provider's Secret in production. Configure a dedicated DCS account (interconnect or domain user, see Credential User Types) with the lockout policy relaxed, so a transient controller failure cannot lock the platform out for everyone.

FC Site Policy "仅能删除已停止虚拟机" ("Only stopped VMs can be deleted", errorCode 102212808)

If the workload cluster runs on a DCS site whose underlying FusionCompute platform has the "delete-only-when-stopped" safety policy enabled, DeleteVm is rejected while the VM is still running. As of cluster-api-provider-dcs v1.0.18, the provider's normal delete flow handles this automatically by issuing safe StopVm before DeleteVm. In rare race conditions, however, an external actor (a human operator on the DCS portal, or another automation) can restart the VM between the controller's stop check and the DeleteVm call. The controller surfaces the race as:

Controller log:
  INFO  DeleteVm rejected by FC site policy, requeuing
        vmUrn=urn:sites:<site>:vms:<vm-id>
        errorCode=102212808

This is not an error: the controller treats errorCode 102212808 as recoverable and requeues. The next reconcile re-checks the VM status; if the VM is running again, the controller re-issues StopVm(safe) and then proceeds with DeleteVm. Normal recovery time is ~30 seconds.

If you see this errorCode repeatedly (more than 3 cycles), an external actor is racing the controller. Suspend the external automation, or apply cpaas.io/retain-vm to fully take control of VM lifecycle outside the controller.
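To judge whether the rejection is repeating, the occurrences can be counted in a log sample. This is a sketch under the assumption that each rejection logs errorCode=102212808 exactly once, as in the excerpt above; policy_race_suspected is a hypothetical helper name, and the sample should be narrowed to a single VM (for example by filtering on its vmUrn first) if several machines are being deleted at once.

```shell
# Count FC site policy rejections in log lines read from stdin.
# Succeeds (exit 0) when there are more than 3, the threshold used in this guide.
policy_race_suspected() {
  n=$(grep -c 'errorCode=102212808' || true)
  echo "rejections: $n"
  [ "$n" -gt 3 ]
}
```

For example: kubectl -n cpaas-system logs deployment/cluster-api-provider-dcs-manager --since=10m | policy_race_suspected.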

Use cpaas.io/retain-vm as an Escape Hatch

When the controller cannot finish reconcileDelete for any reason and you need to release the Machine finalizer so the workload cluster can proceed (scale-down, upgrade, removal), annotate either the Machine or the DCSMachine:

kubectl -n <ns> annotate machine <machine-name> cpaas.io/retain-vm=true
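The same annotation can be placed on the DCSMachine instead. The sketch below applies it and reads it back to confirm it was stored; retain_vm_on_dcsmachine is a hypothetical helper name, and the escaped dot in the JSONPath expression is needed because the annotation key itself contains a dot.

```shell
# Annotate a DCSMachine with cpaas.io/retain-vm=true and read the value back.
retain_vm_on_dcsmachine() {
  ns="$1"; name="$2"
  kubectl -n "$ns" annotate dcsmachine "$name" cpaas.io/retain-vm=true || return 1
  kubectl -n "$ns" get dcsmachine "$name" \
    -o jsonpath='{.metadata.annotations.cpaas\.io/retain-vm}'
}
```

A printed value of true confirms the annotation is in place; the consequences described below apply in the same way as for the Machine annotation.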

Effect:

  • The controller's reconcileDelete recognizes the annotation, skips both ensureVmStopped and DeleteVm, and releases the DCSMachine finalizer.
  • The Machine then transitions out of Deleting and Cluster API can complete the parent operation.

Side effect — important:

  • The VM is not deleted from DCS — it remains on the DCS platform, still consuming compute and storage resources, outside Cluster API's lifecycle management.
  • The IP / hostname slot in DCSIpHostnamePool is also not released — the pool entry will show as still in-use until you manually free it.

After applying retain-vm, plan to manually:

  1. Stop and delete the VM on the DCS portal (or via DCS API).
  2. Free the IP / hostname entry from the corresponding DCSIpHostnamePool.
  3. (Optional) Clean up any persistent disks that were attached.

Verify Recovery

After applying any of the resolutions above, watch the DCSMachine until the condition clears:

kubectl -n <ns> describe dcsmachine <name>
# Look for: Conditions: VMStopPending  Status: True
#           (or the condition disappears as the resource is deleted)

kubectl -n <ns> get machines | grep <machine-name>
# Phase should transition: Deleting → (Machine gone within ~30s)

If the Machine is fully gone and the VM is no longer listed on the DCS portal, recovery is complete.
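Instead of re-running get in a loop, kubectl wait can block until the Machine object is removed. A minimal sketch; wait_machine_gone is a hypothetical helper name and the 120s timeout is an arbitrary allowance above the ~30s noted above. Depending on the kubectl version, waiting on a Machine that is already gone may return immediately or report that no matching resource was found.

```shell
# Block until the Machine object is deleted, or fail after the timeout.
wait_machine_gone() {
  ns="$1"; name="$2"
  kubectl -n "$ns" wait --for=delete "machine/$name" --timeout=120s
}
```

This only confirms the Kubernetes side; the DCS portal check for the VM is still required.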

See Also