Troubleshooting

This guide covers common issues and how to diagnose them.

Checking Operator Logs

The operator logs are the first place to look for any issue.

# Find the operator pod
kubectl get pods -n openclaw-system -l app.kubernetes.io/name=openclaw-operator

# Stream logs
kubectl logs -n openclaw-system -l app.kubernetes.io/name=openclaw-operator -f

# Show logs with increased verbosity
kubectl logs -n openclaw-system -l app.kubernetes.io/name=openclaw-operator --all-containers

Checking Events

Kubernetes events provide a timeline of what happened to your resources.

# Events for a specific instance
kubectl describe openclawinstance my-assistant -n openclaw

# All events in the namespace (sorted by time)
kubectl get events -n openclaw --sort-by='.lastTimestamp'

# Watch events in real time
kubectl get events -n openclaw --watch

Checking Instance Status

# Quick status overview
kubectl get openclawinstance -n openclaw

# Detailed status with conditions
kubectl get openclawinstance my-assistant -n openclaw -o yaml | grep -A 50 'status:'

# Check specific condition
kubectl get openclawinstance my-assistant -n openclaw \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'

Common Issues

Instance Stuck in Pending

Symptoms: The instance stays in Pending phase and never transitions to Provisioning.

Possible causes and solutions:

Operator is not running:
Terminal window
```
kubectl get pods -n openclaw-system
```
Verify the operator pod is Running and ready. If it is in CrashLoopBackOff, check its logs.
CRD not installed or outdated:
Terminal window
```
kubectl get crd openclawinstances.openclaw.rocks
```
If the CRD is missing, install it:
Terminal window
```
kubectl apply -f config/crd/bases/
```
If you upgraded the operator but new fields (e.g. selfConfigure) are rejected as “field not declared in schema”, the CRD is outdated. Upgrade the Helm chart or apply CRDs manually:
Terminal window
```
kubectl apply --server-side -f config/crd/bases/
```
RBAC issues with the operator:
Terminal window
```
kubectl auth can-i get openclawinstances --as=system:serviceaccount:openclaw-system:openclaw-operator -n openclaw
```
Ensure the operator’s ClusterRole has the required permissions. Reinstall the Helm chart if RBAC is missing.
Webhook blocking the request: Check for admission webhook errors in the API server logs or the operator logs. See the Webhook Errors section.

Instance Stuck in Provisioning

Symptoms: The instance transitions to Provisioning but never reaches Running.

Possible causes and solutions:

Resource creation failing silently: Check operator logs for errors:
Terminal window
```
kubectl logs -n openclaw-system deploy/openclaw-operator | grep -i error
```
Resource quota exceeded:
Terminal window
```
kubectl describe resourcequota -n openclaw
```
If quotas are preventing resource creation, either increase quotas or reduce the instance’s resource requests.

Deployment not becoming ready: The reconciler waits for the Deployment to have ready replicas. Check the pod:

kubectl get pods -n openclaw -l app.kubernetes.io/instance=my-assistant
kubectl describe pod -n openclaw -l app.kubernetes.io/instance=my-assistant

Instance in Failed State

Symptoms: The instance phase is Failed. The Ready condition shows status: "False" with a reason.

Diagnosis:

# Check the failure reason
kubectl get openclawinstance my-assistant -n openclaw \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'

# Check events
kubectl describe openclawinstance my-assistant -n openclaw

Common failure reasons:

Image pull errors:
Terminal window
```
kubectl get pods -n openclaw -l app.kubernetes.io/instance=my-assistant -o wide
kubectl describe pod <pod-name> -n openclaw
```
Look for ImagePullBackOff or ErrImagePull. Verify:
- The image repository and tag are correct.
- Pull secrets are configured if using a private registry.
- Network connectivity to the registry.
Insufficient resources:
Terminal window
```
kubectl describe pod <pod-name> -n openclaw | grep -A 5 Events
```
Look for FailedScheduling events. The cluster may not have nodes with enough CPU/memory. Reduce the resource requests or add capacity.
ConfigMap or Secret not found: If using configMapRef, verify the referenced ConfigMap exists:
Terminal window
```
kubectl get configmap <name> -n openclaw
```
If using envFrom with a Secret, verify the Secret exists:
Terminal window
```
kubectl get secret <name> -n openclaw
```

Instance in Degraded State (Skill Packs Unavailable)

Symptoms: The instance phase is Degraded. The SkillPacksReady condition shows status: "False" with reason ResolutionFailed. The instance is running but without skill packs.

Diagnosis:

# Check the SkillPacksReady condition
kubectl get openclawinstance my-assistant -n openclaw \
  -o jsonpath='{.status.conditions[?(@.type=="SkillPacksReady")]}'

# Check events for details
kubectl describe openclawinstance my-assistant -n openclaw | grep SkillPack

Common causes:

GitHub API unreachable: The operator fetches skill packs from GitHub. If GitHub is down or the cluster has no egress access, resolution fails. The instance provisions without skill packs and retries on the next reconcile (30s).
Invalid pack reference: Verify the pack: skill references are valid owner/repo/path[@ref] format:
Terminal window
```
kubectl get openclawinstance my-assistant -n openclaw -o jsonpath='{.spec.skills}'
```
Missing GITHUB_TOKEN: Private skill pack repositories require a GitHub token. Verify the operator has the GITHUB_TOKEN environment variable set.

Resolution: The operator automatically retries skill pack resolution on every reconcile. Once GitHub is reachable again, the instance transitions from Degraded to Running. The operator also uses stale cache - if a previous successful resolution exists, it will use that data even after the cache TTL expires.

NetworkPolicy Blocking Traffic

Symptoms: The instance is Running but cannot reach external APIs or other pods cannot reach the instance.

Diagnosis:

Verify the NetworkPolicy exists:

kubectl get networkpolicy -n openclaw
kubectl describe networkpolicy my-assistant -n openclaw

Instance cannot reach AI APIs: The default NetworkPolicy allows egress to port 443 (HTTPS) and port 53 (DNS). If the AI provider uses a non-standard port, add it to allowedEgressCIDRs or disable the NetworkPolicy temporarily to confirm:
```
spec:
  security:
    networkPolicy:
      allowedEgressCIDRs:
        - "0.0.0.0/0"
```
DNS resolution failing: If allowDNS was set to false, pods cannot resolve hostnames:
```
spec:
  security:
    networkPolicy:
      allowDNS: true
```
Other pods cannot reach the instance: By default, only pods in the same namespace can reach the instance. To allow cross-namespace traffic:
```
spec:
  security:
    networkPolicy:
      allowedIngressNamespaces:
        - ingress-nginx
        - monitoring
```

Verify with a test pod:

kubectl run -n openclaw test-curl --rm -it --image=curlimages/curl -- \
  curl -v http://my-assistant:18789

Instance Stuck in BackingUp Phase

Symptoms: After deleting an instance, it remains in BackingUp phase and is not deleted.

Diagnosis:

# Check the instance status
kubectl get openclawinstance my-agent -o jsonpath='{.status.phase}'
kubectl get openclawinstance my-agent -o jsonpath='{.status.backingUpSince}'

# Check if a backup Job exists and its status
kubectl get jobs -l openclaw.rocks/instance=my-agent
kubectl describe job my-agent-backup -n <namespace>

# Check events for timeout or failure
kubectl describe openclawinstance my-agent | grep -A5 Events

Possible causes and solutions:

Backup timeout will resolve it automatically: By default, the operator waits up to 30 minutes (spec.backup.timeout) before giving up and proceeding with deletion. Check status.backingUpSince to see when the phase started and how much time remains.
Backup Job failed: The Job may have failed due to S3 connectivity issues, incorrect credentials, or insufficient permissions. The operator retries until the timeout elapses. Check the Job logs:
Terminal window
```
kubectl logs job/my-agent-backup -n <namespace>
```
Pods stuck terminating: The StatefulSet was scaled to 0 but pods are stuck. Check for finalizers or PodDisruptionBudgets:
Terminal window
```
kubectl get pods -l openclaw.rocks/instance=my-agent -o yaml | grep finalizers
```
Skip backup immediately: To bypass the backup and delete immediately:
Terminal window
```
kubectl annotate openclawinstance my-agent openclaw.rocks/skip-backup=true
```
Increase or decrease the timeout: Adjust spec.backup.timeout (min: 5m, max: 24h):
```
spec:
  backup:
    timeout: "1h"
```

PVC Not Binding

Symptoms: The pod is stuck in Pending with FailedScheduling or the PVC shows Pending.

Diagnosis:

kubectl get pvc -n openclaw
kubectl describe pvc my-assistant-data -n openclaw

Possible causes:

StorageClass does not exist:
Terminal window
```
kubectl get storageclass
```
Verify the storageClass specified in the spec exists. If omitted, the cluster’s default StorageClass is used.
No capacity available: The storage backend may be out of capacity. Check provisioner logs.
Access mode incompatibility: Some storage backends do not support ReadWriteMany. Use ReadWriteOnce (the default).
Zone mismatch: In multi-zone clusters, PVs may be zone-locked. Ensure nodes and storage are in the same zone, or use a StorageClass that supports multi-zone provisioning.

Webhook Errors

Symptoms: Creating or updating an OpenClawInstance fails with a webhook error, such as failed calling webhook or connection refused.

Diagnosis:

Webhook not enabled: The webhook is optional. Verify it is configured:

kubectl get validatingwebhookconfigurations | grep openclaw
kubectl get mutatingwebhookconfigurations | grep openclaw

cert-manager not installed or certificate not ready: The webhook requires TLS certificates. If using cert-manager:
Terminal window
```
kubectl get certificate -n openclaw-system
kubectl describe certificate -n openclaw-system <cert-name>
```

Webhook Service not reachable: Verify the webhook Service and its endpoints:

kubectl get svc -n openclaw-system | grep webhook
kubectl get endpoints -n openclaw-system | grep webhook

Bypass the webhook temporarily: If the webhook is misconfigured and blocking all operations, delete the webhook configuration:
Terminal window
```
kubectl delete validatingwebhookconfiguration openclaw-operator-validating-webhook
kubectl delete mutatingwebhookconfiguration openclaw-operator-mutating-webhook
```
Then fix the underlying issue and redeploy.

Ingress Not Working

Symptoms: The Ingress is created but traffic does not reach the instance.

Diagnosis:

IngressClass not found:
Terminal window
```
kubectl get ingressclass
```
Verify the className in the spec matches an installed IngressClass.
Ingress controller not installed: An Ingress resource does nothing without a controller (nginx-ingress, Traefik, etc.):
Terminal window
```
kubectl get pods -n ingress-nginx
```
TLS Secret missing: If TLS is configured, the referenced Secret must exist:
Terminal window
```
kubectl get secret <secretName> -n openclaw
```
DNS not pointing to the Ingress: Verify DNS resolution for the configured host:
Terminal window
```
nslookup openclaw.example.com
```
The DNS should point to the Ingress controller’s external IP or load balancer.
NetworkPolicy blocking the Ingress controller: If NetworkPolicy is enabled, the Ingress controller’s namespace must be in allowedIngressNamespaces:
```
spec:
  security:
    networkPolicy:
      allowedIngressNamespaces:
        - ingress-nginx
```
WebSocket connectivity: The operator automatically adds WebSocket-related nginx annotations. If using a different Ingress controller, you may need to add controller-specific annotations for WebSocket support.

Control UI Shows “device identity required”

Symptoms: Connecting to the Control UI through an Ingress fails with code=1008 reason=device identity required in the OpenClaw logs.

Possible causes and solutions:

gateway.mode: local is set in the config: This mode enforces browser-based device identity verification, which is incompatible with Kubernetes. Remove gateway.mode from your CR’s spec.config.raw - the operator defaults to server mode which is correct for K8s.
Stale config from merge mode: If you previously had gateway.mode: local in your config and are using mergeMode: merge, the old key persists on the PVC even after removing it from the CR. Temporarily set mergeMode: replace to wipe stale keys:
```
spec:
  config:
    mergeMode: replace  # temporarily set, then switch back to merge
```
Upstream OpenClaw bug: Even with dangerouslyDisableDeviceAuth: true (which the operator injects automatically), some OpenClaw versions still enforce device identity. Workaround: Pass the gateway token directly in the URL fragment:
```
https://openclaw.example.com/#token=<your-gateway-token>
```
You can find the token in the auto-generated Secret:
Terminal window
```
kubectl get secret <instance>-gateway-token -n <namespace> -o jsonpath='{.data.token}' | base64 -d
```

Gateway Proxy “Connection Refused” on Startup

Symptoms: The gateway-proxy (nginx) container logs show connect() failed (111: Connection refused) immediately after pod startup.

This is expected and harmless. The nginx proxy sidecar starts before the OpenClaw gateway is fully listening. The connection refused errors resolve within a few seconds once the gateway binds to its port. No action is needed - subsequent connections will succeed.

Chromium Sidecar Issues

Symptoms: The Chromium sidecar is not starting, crashing, or browser automation fails.

Diagnosis:

Check sidecar status:

kubectl get pods -n openclaw -l app.kubernetes.io/instance=my-assistant -o json | \
  jq '.items[0].status.containerStatuses[] | select(.name=="chromium")'

Check sidecar logs:

kubectl logs -n openclaw <pod-name> -c chromium

Insufficient shared memory (/dev/shm): Chromium requires shared memory. The operator mounts a 256Mi memory-backed emptyDir at /dev/shm. If Chromium crashes with memory errors, increase the sidecar’s memory limit:
```
spec:
  chromium:
    resources:
      limits:
        memory: 4Gi
```

Insufficient resources: Chromium is resource-intensive. The defaults (250m CPU, 512Mi memory request) may not be enough for heavy workloads. Increase the limits:

spec:
  chromium:
    resources:
      requests:
        cpu: "1"
        memory: 2Gi
      limits:
        cpu: "2"
        memory: 4Gi

Security context restrictions: The Chromium sidecar runs as UID 1001 with a read-only root filesystem and all capabilities dropped. Some Kubernetes environments (e.g., OpenShift) may impose additional restrictions. Check for SecurityContextConstraint (SCC) violations:
Terminal window
```
kubectl describe pod <pod-name> -n openclaw | grep -i security
```

Operator CrashLoopBackOff

Symptoms: The operator pod itself is restarting repeatedly.

Diagnosis:

kubectl logs -n openclaw-system deploy/openclaw-operator --previous
kubectl describe pod -n openclaw-system -l app.kubernetes.io/name=openclaw-operator

Common causes:

Leader election failure: If another instance holds the leader lock, check for stale leases:
Terminal window
```
kubectl get lease -n openclaw-system
```
Missing CRD: If the CRD is not installed, the controller fails to start:
Terminal window
```
kubectl get crd openclawinstances.openclaw.rocks
```
Insufficient RBAC: The operator needs cluster-wide permissions for certain resources. Verify the ClusterRole and ClusterRoleBinding are in place.
Webhook certificate issues: If the webhook is enabled but certificates are not provisioned, the server fails to start.

Useful Commands Reference

# List all OpenClaw instances across namespaces
kubectl get openclawinstance -A

# Get managed resources for an instance
kubectl get openclawinstance my-assistant -n openclaw \
  -o jsonpath='{.status.managedResources}' | jq .

# Check if the operator can reach the API server
kubectl logs -n openclaw-system deploy/openclaw-operator | head -20

# Force a reconciliation by adding an annotation
kubectl annotate openclawinstance my-assistant -n openclaw \
  force-reconcile=$(date +%s) --overwrite

# Check Prometheus metrics from the operator
kubectl port-forward -n openclaw-system svc/openclaw-operator-metrics 8443:8443
# Then: curl -k https://localhost:8443/metrics

# Dump full instance status
kubectl get openclawinstance my-assistant -n openclaw -o yaml