Troubleshooting
This guide covers common issues and how to diagnose them.
Checking Operator Logs
The operator logs are the first place to look for any issue.
# Find the operator pod
kubectl get pods -n openclaw-system -l app.kubernetes.io/name=openclaw-operator
# Stream logs
kubectl logs -n openclaw-system -l app.kubernetes.io/name=openclaw-operator -f
# Show logs from all containers in the operator pod
kubectl logs -n openclaw-system -l app.kubernetes.io/name=openclaw-operator --all-containers
Checking Events
Kubernetes events provide a timeline of what happened to your resources.
# Events for a specific instance
kubectl describe openclawinstance my-assistant -n openclaw
# All events in the namespace (sorted by time)
kubectl get events -n openclaw --sort-by='.lastTimestamp'
# Watch events in real time
kubectl get events -n openclaw --watch
Checking Instance Status
# Quick status overview
kubectl get openclawinstance -n openclaw
# Detailed status with conditions
kubectl get openclawinstance my-assistant -n openclaw -o yaml | grep -A 50 'status:'
# Check a specific condition
kubectl get openclawinstance my-assistant -n openclaw \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'
Common Issues
Instance Stuck in Pending
Symptoms: The instance stays in Pending phase and never transitions to Provisioning.
Possible causes and solutions:
- Operator is not running:
  kubectl get pods -n openclaw-system
  Verify the operator pod is Running and ready. If it is in CrashLoopBackOff, check its logs.
- CRD not installed or outdated:
  kubectl get crd openclawinstances.openclaw.rocks
  If the CRD is missing, install it:
  kubectl apply -f config/crd/bases/
  If you upgraded the operator but new fields (e.g. selfConfigure) are rejected as “field not declared in schema”, the CRD is outdated. Upgrade the Helm chart or apply CRDs manually:
  kubectl apply --server-side -f config/crd/bases/
- RBAC issues with the operator:
  kubectl auth can-i get openclawinstances --as=system:serviceaccount:openclaw-system:openclaw-operator -n openclaw
  Ensure the operator’s ClusterRole has the required permissions. Reinstall the Helm chart if RBAC is missing.
- Webhook blocking the request: Check for admission webhook errors in the API server logs or the operator logs. See the Webhook Errors section.
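The CRD and RBAC checks for a Pending instance can be chained into one helper. This is a minimal sketch, not part of the operator tooling: the function name pending_triage and its output strings are hypothetical, while the CRD name, service account, and namespaces are the ones used in this guide. It does not inspect the operator pod or the webhook.

```shell
# Sketch: report the first failing prerequisite for a Pending instance.
# pending_triage and its output strings are illustrative names only.
pending_triage() {
  ns="${1:-openclaw}"

  # 1. The CRD must be installed.
  if ! kubectl get crd openclawinstances.openclaw.rocks >/dev/null 2>&1; then
    echo "crd-missing"
    return 1
  fi

  # 2. The operator's service account must be allowed to read instances.
  #    "kubectl auth can-i" prints "yes" or "no".
  can_get=$(kubectl auth can-i get openclawinstances \
    --as=system:serviceaccount:openclaw-system:openclaw-operator \
    -n "$ns" 2>/dev/null)
  if [ "$can_get" != "yes" ]; then
    echo "rbac-denied"
    return 1
  fi

  echo "ok"
}
```

Calling pending_triage openclaw prints crd-missing, rbac-denied, or ok, which points to the matching item in the list above.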
Instance Stuck in Provisioning
Symptoms: The instance transitions to Provisioning but never reaches Running.
Possible causes and solutions:
- Resource creation failing silently: Check operator logs for errors:
  kubectl logs -n openclaw-system deploy/openclaw-operator | grep -i error
- Resource quota exceeded:
  kubectl describe resourcequota -n openclaw
  If quotas are preventing resource creation, either increase quotas or reduce the instance’s resource requests.
- Deployment not becoming ready: The reconciler waits for the Deployment to have ready replicas. Check the pod:
  kubectl get pods -n openclaw -l app.kubernetes.io/instance=my-assistant
  kubectl describe pod -n openclaw -l app.kubernetes.io/instance=my-assistant
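To tell a slow rollout from a stuck one, it can help to poll the phase with a deadline instead of re-running the status command by hand. This is a minimal sketch: wait_for_phase is a hypothetical helper, and it assumes the phase is exposed at .status.phase as in the other commands in this guide.

```shell
# Sketch: poll .status.phase until it matches the wanted phase or time runs out.
# Usage: wait_for_phase <name> <namespace> [phase] [timeout-seconds] [interval-seconds]
wait_for_phase() {
  name="$1"; ns="$2"
  want="${3:-Running}"; timeout="${4:-300}"; interval="${5:-5}"
  waited=0
  while :; do
    phase=$(kubectl get openclawinstance "$name" -n "$ns" \
      -o jsonpath='{.status.phase}' 2>/dev/null)
    if [ "$phase" = "$want" ]; then
      echo "$phase"
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      # Print the last phase seen so the caller knows where it stalled.
      echo "${phase:-unknown}"
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

For example, wait_for_phase my-assistant openclaw Running 300 returns non-zero and prints the last observed phase if the instance has not reached Running after five minutes.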
Instance in Failed State
Symptoms: The instance phase is Failed. The Ready condition shows status: "False" with a reason.
Diagnosis:
# Check the failure reason
kubectl get openclawinstance my-assistant -n openclaw \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'
# Check events
kubectl describe openclawinstance my-assistant -n openclaw
Common failure reasons:
- Image pull errors:
  kubectl get pods -n openclaw -l app.kubernetes.io/instance=my-assistant -o wide
  kubectl describe pod <pod-name> -n openclaw
  Look for ImagePullBackOff or ErrImagePull. Verify:
  - The image repository and tag are correct.
  - Pull secrets are configured if using a private registry.
  - The cluster has network connectivity to the registry.
- Insufficient resources:
  kubectl describe pod <pod-name> -n openclaw | grep -A 5 Events
  Look for FailedScheduling events. The cluster may not have nodes with enough CPU or memory. Reduce the resource requests or add capacity.
- ConfigMap or Secret not found: If using configMapRef, verify the referenced ConfigMap exists:
  kubectl get configmap <name> -n openclaw
  If using envFrom with a Secret, verify the Secret exists:
  kubectl get secret <name> -n openclaw
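The container waiting reason often points straight at one of the causes above. This is a minimal sketch with hypothetical helper names. The mapping of CreateContainerConfigError to a missing ConfigMap or Secret is an assumption based on standard Kubernetes behavior, not something this guide states. FailedScheduling is an event rather than a waiting reason, so it is not covered here.

```shell
# Sketch: list waiting reasons for an instance's containers, one per line.
# waiting_reasons and explain_waiting_reason are illustrative names only.
waiting_reasons() {
  kubectl get pods -n "${2:-openclaw}" -l "app.kubernetes.io/instance=$1" \
    -o jsonpath='{range .items[*].status.containerStatuses[*]}{.state.waiting.reason}{"\n"}{end}'
}

# Map a waiting reason to the matching cause in this section.
explain_waiting_reason() {
  case "$1" in
    ImagePullBackOff|ErrImagePull)
      echo "image-pull" ;;      # check repository, tag, pull secrets, registry access
    CreateContainerConfigError)
      echo "missing-config" ;;  # assumed: referenced ConfigMap or Secret not found
    "")
      echo "none" ;;            # container is not in a waiting state
    *)
      echo "other:$1" ;;
  esac
}
```

A typical use is waiting_reasons my-assistant | while read -r r; do explain_waiting_reason "$r"; done.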
Instance in Degraded State (Skill Packs Unavailable)
Symptoms: The instance phase is Degraded. The SkillPacksReady condition shows status: "False" with reason ResolutionFailed. The instance is running but without skill packs.
Diagnosis:
# Check the SkillPacksReady condition
kubectl get openclawinstance my-assistant -n openclaw \
  -o jsonpath='{.status.conditions[?(@.type=="SkillPacksReady")]}'
# Check events for details
kubectl describe openclawinstance my-assistant -n openclaw | grep SkillPack
Common causes:
- GitHub API unreachable: The operator fetches skill packs from GitHub. If GitHub is down or the cluster has no egress access, resolution fails. The instance provisions without skill packs and retries on the next reconcile (30s).
- Invalid pack reference: Verify that the pack: skill references use the valid owner/repo/path[@ref] format:
  kubectl get openclawinstance my-assistant -n openclaw -o jsonpath='{.spec.skills}'
- Missing GITHUB_TOKEN: Private skill pack repositories require a GitHub token. Verify the operator has the GITHUB_TOKEN environment variable set.
Resolution: The operator automatically retries skill pack resolution on every reconcile. Once GitHub is reachable again, the instance transitions from Degraded to Running. The operator also falls back to a stale cache: if a previous successful resolution exists, it uses that data even after the cache TTL expires.
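A quick local check can catch malformed pack references before the operator tries to resolve them. This is a minimal sketch that assumes a reference is written as pack: followed by owner/repo/path and an optional @ref. The exact grammar the operator accepts may differ, and is_valid_pack_ref is a hypothetical helper.

```shell
# Sketch: succeed only if the argument looks like pack:owner/repo/path[@ref].
# The pattern is an assumption about the format, not the operator's own parser.
is_valid_pack_ref() {
  printf '%s\n' "$1" |
    grep -Eq '^pack:[^/@[:space:]]+/[^/@[:space:]]+/[^@[:space:]]+(@[^@[:space:]]+)?$'
}
```

The function returns zero for a value such as pack:acme/skills/packs/web@v1.2.0 and non-zero for pack:acme/skills, which has no path segment. The owner, repository, and path names here are made up.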
NetworkPolicy Blocking Traffic
Symptoms: The instance is Running but cannot reach external APIs or other pods cannot reach the instance.
Diagnosis:
- Verify the NetworkPolicy exists:
  kubectl get networkpolicy -n openclaw
  kubectl describe networkpolicy my-assistant -n openclaw
- Instance cannot reach AI APIs: The default NetworkPolicy allows egress to port 443 (HTTPS) and port 53 (DNS). If the AI provider uses a non-standard port, add it to allowedEgressCIDRs or disable the NetworkPolicy temporarily to confirm:
  spec:
    security:
      networkPolicy:
        allowedEgressCIDRs:
          - "0.0.0.0/0"
- DNS resolution failing: If allowDNS was set to false, pods cannot resolve hostnames:
  spec:
    security:
      networkPolicy:
        allowDNS: true
- Other pods cannot reach the instance: By default, only pods in the same namespace can reach the instance. To allow cross-namespace traffic:
  spec:
    security:
      networkPolicy:
        allowedIngressNamespaces:
          - ingress-nginx
          - monitoring
- Verify with a test pod:
  kubectl run -n openclaw test-curl --rm -it --image=curlimages/curl -- \
    curl -v http://my-assistant:18789
Instance Stuck in BackingUp Phase
Symptoms: After deleting an instance, it remains in BackingUp phase and is not deleted.
Diagnosis:
# Check the instance status
kubectl get openclawinstance my-agent -o jsonpath='{.status.phase}'
kubectl get openclawinstance my-agent -o jsonpath='{.status.backingUpSince}'
# Check if a backup Job exists and its status
kubectl get jobs -l openclaw.rocks/instance=my-agent
kubectl describe job my-agent-backup -n <namespace>
# Check events for timeout or failure
kubectl describe openclawinstance my-agent | grep -A5 Events
Possible causes and solutions:
- Backup timeout will resolve it automatically: By default, the operator waits up to 30 minutes (spec.backup.timeout) before giving up and proceeding with deletion. Check status.backingUpSince to see when the phase started and how much time remains.
- Backup Job failed: The Job may have failed due to S3 connectivity issues, incorrect credentials, or insufficient permissions. The operator retries until the timeout elapses. Check the Job logs:
  kubectl logs job/my-agent-backup -n <namespace>
- Pods stuck terminating: The StatefulSet was scaled to 0 but pods are stuck. Check for finalizers or PodDisruptionBudgets:
  kubectl get pods -l openclaw.rocks/instance=my-agent -o yaml | grep finalizers
- Skip the backup: To bypass the backup and delete immediately:
  kubectl annotate openclawinstance my-agent openclaw.rocks/skip-backup=true
- Increase or decrease the timeout: Adjust spec.backup.timeout (min: 5m, max: 24h):
  spec:
    backup:
      timeout: "1h"
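The time left before the operator gives up on the backup is the phase start time plus the timeout, minus the current time. This is a minimal sketch, assuming status.backingUpSince is a UTC timestamp such as 2024-01-01T00:00:00Z and that date accepts -d (GNU coreutils). The helper names are hypothetical, and the timeout is passed in seconds rather than read from the spec.

```shell
# Sketch: convert a UTC timestamp like 2024-01-01T00:00:00Z to epoch seconds.
to_epoch() {
  date -u -d "$(printf '%s' "$1" | sed 's/T/ /; s/Z$//')" +%s
}

# Usage: backup_seconds_left <backingUpSince> [timeout-seconds] [now-epoch]
# Prints the seconds remaining before the timeout, never less than 0.
backup_seconds_left() {
  since=$(to_epoch "$1") || return 2
  timeout="${2:-1800}"             # default matches the 30 minute default
  now="${3:-$(date -u +%s)}"       # overridable for testing
  left=$((since + timeout - now))
  [ "$left" -gt 0 ] || left=0
  echo "$left"
}
```

A typical call is backup_seconds_left "$(kubectl get openclawinstance my-agent -o jsonpath='{.status.backingUpSince}')" 1800. A result of 0 means the timeout has already elapsed and deletion should proceed on the next reconcile.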
PVC Not Binding
Symptoms: The pod is stuck in Pending with FailedScheduling or the PVC shows Pending.
Diagnosis:
kubectl get pvc -n openclaw
kubectl describe pvc my-assistant-data -n openclaw
Possible causes:
- StorageClass does not exist:
  kubectl get storageclass
  Verify the storageClass specified in the spec exists. If omitted, the cluster’s default StorageClass is used.
- No capacity available: The storage backend may be out of capacity. Check provisioner logs.
- Access mode incompatibility: Some storage backends do not support ReadWriteMany. Use ReadWriteOnce (the default).
- Zone mismatch: In multi-zone clusters, PVs may be zone-locked. Ensure nodes and storage are in the same zone, or use a StorageClass that supports multi-zone provisioning.
Webhook Errors
Symptoms: Creating or updating an OpenClawInstance fails with a webhook error, such as failed calling webhook or connection refused.
Diagnosis:
- Webhook not enabled: The webhook is optional. Verify it is configured:
  kubectl get validatingwebhookconfigurations | grep openclaw
  kubectl get mutatingwebhookconfigurations | grep openclaw
- cert-manager not installed or certificate not ready: The webhook requires TLS certificates. If using cert-manager:
  kubectl get certificate -n openclaw-system
  kubectl describe certificate -n openclaw-system <cert-name>
- Webhook Service not reachable: Verify the webhook Service and its endpoints:
  kubectl get svc -n openclaw-system | grep webhook
  kubectl get endpoints -n openclaw-system | grep webhook
- Bypass the webhook temporarily: If the webhook is misconfigured and blocking all operations, delete the webhook configuration:
  kubectl delete validatingwebhookconfiguration openclaw-operator-validating-webhook
  kubectl delete mutatingwebhookconfiguration openclaw-operator-mutating-webhook
  Then fix the underlying issue and redeploy.
Ingress Not Working
Symptoms: The Ingress is created but traffic does not reach the instance.
Diagnosis:
- IngressClass not found:
  kubectl get ingressclass
  Verify the className in the spec matches an installed IngressClass.
- Ingress controller not installed: An Ingress resource does nothing without a controller (nginx-ingress, Traefik, etc.):
  kubectl get pods -n ingress-nginx
- TLS Secret missing: If TLS is configured, the referenced Secret must exist:
  kubectl get secret <secretName> -n openclaw
- DNS not pointing to the Ingress: Verify DNS resolution for the configured host:
  nslookup openclaw.example.com
  The DNS should point to the Ingress controller’s external IP or load balancer.
- NetworkPolicy blocking the Ingress controller: If NetworkPolicy is enabled, the Ingress controller’s namespace must be in allowedIngressNamespaces:
  spec:
    security:
      networkPolicy:
        allowedIngressNamespaces:
          - ingress-nginx
- WebSocket connectivity: The operator automatically adds WebSocket-related nginx annotations. If using a different Ingress controller, you may need to add controller-specific annotations for WebSocket support.
Control UI Shows “device identity required”
Symptoms: Connecting to the Control UI through an Ingress fails with code=1008 reason=device identity required in the OpenClaw logs.
Possible causes and solutions:
- gateway.mode: local is set in the config: This mode enforces browser-based device identity verification, which is incompatible with Kubernetes. Remove gateway.mode from your CR’s spec.config.raw. The operator defaults to server mode, which is correct for Kubernetes.
- Stale config from merge mode: If you previously had gateway.mode: local in your config and are using mergeMode: merge, the old key persists on the PVC even after removing it from the CR. Temporarily set mergeMode: replace to wipe stale keys:
  spec:
    config:
      mergeMode: replace # temporarily set, then switch back to merge
- Upstream OpenClaw bug: Even with dangerouslyDisableDeviceAuth: true (which the operator injects automatically), some OpenClaw versions still enforce device identity. Workaround: Pass the gateway token directly in the URL fragment:
  https://openclaw.example.com/#token=<your-gateway-token>
  You can find the token in the auto-generated Secret:
  kubectl get secret <instance>-gateway-token -n <namespace> -o jsonpath='{.data.token}' | base64 -d
Gateway Proxy “Connection Refused” on Startup
Symptoms: The gateway-proxy (nginx) container logs show connect() failed (111: Connection refused) immediately after pod startup.
This is expected and harmless. The nginx proxy sidecar starts before the OpenClaw gateway is fully listening. The connection refused errors resolve within a few seconds once the gateway binds to its port. No action is needed - subsequent connections will succeed.
Chromium Sidecar Issues
Symptoms: The Chromium sidecar is not starting, crashing, or browser automation fails.
Diagnosis:
- Check sidecar status:
  kubectl get pods -n openclaw -l app.kubernetes.io/instance=my-assistant -o json | \
    jq '.items[0].status.containerStatuses[] | select(.name=="chromium")'
- Check sidecar logs:
  kubectl logs -n openclaw <pod-name> -c chromium
- Insufficient shared memory (/dev/shm): Chromium requires shared memory. The operator mounts a 256Mi memory-backed emptyDir at /dev/shm. If Chromium crashes with memory errors, increase the sidecar’s memory limit:
  spec:
    chromium:
      resources:
        limits:
          memory: 4Gi
- Insufficient resources: Chromium is resource-intensive. The defaults (250m CPU, 512Mi memory request) may not be enough for heavy workloads. Increase the requests and limits:
  spec:
    chromium:
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi
- Security context restrictions: The Chromium sidecar runs as UID 1001 with a read-only root filesystem and all capabilities dropped. Some Kubernetes environments (e.g., OpenShift) may impose additional restrictions. Check for SecurityContextConstraint (SCC) violations:
  kubectl describe pod <pod-name> -n openclaw | grep -i security
Operator CrashLoopBackOff
Symptoms: The operator pod itself is restarting repeatedly.
Diagnosis:
kubectl logs -n openclaw-system deploy/openclaw-operator --previous
kubectl describe pod -n openclaw-system -l app.kubernetes.io/name=openclaw-operator
Common causes:
- Leader election failure: If another instance holds the leader lock, check for stale leases:
  kubectl get lease -n openclaw-system
- Missing CRD: If the CRD is not installed, the controller fails to start:
  kubectl get crd openclawinstances.openclaw.rocks
- Insufficient RBAC: The operator needs cluster-wide permissions for certain resources. Verify the ClusterRole and ClusterRoleBinding are in place.
- Webhook certificate issues: If the webhook is enabled but certificates are not provisioned, the server fails to start.
Useful Commands Reference
# List all OpenClaw instances across namespaces
kubectl get openclawinstance -A
# Get managed resources for an instance
kubectl get openclawinstance my-assistant -n openclaw \
  -o jsonpath='{.status.managedResources}' | jq .
# Check if the operator can reach the API server
kubectl logs -n openclaw-system deploy/openclaw-operator | head -20
# Force a reconciliation by adding an annotation
kubectl annotate openclawinstance my-assistant -n openclaw \
  force-reconcile=$(date +%s) --overwrite
# Check Prometheus metrics from the operator
kubectl port-forward -n openclaw-system svc/openclaw-operator-metrics 8443:8443
# Then: curl -k https://localhost:8443/metrics
# Dump full instance status
kubectl get openclawinstance my-assistant -n openclaw -o yaml