Version: v2.3.4

Enterprise Operator Crashloop (Missing gcp-storage Crossplane Endpoints)

Overview

The enterprise-operator pod enters a crashloop when the gcp-storage Crossplane provider pod is scaled down to 0 replicas. With no provider pod behind the webhook Service, conversion webhook calls for BucketIAMMember.storage.gcp.upbound.io resources fail.

Symptoms

  • enterprise-operator pod crashlooping
  • Error in logs: no endpoints available for service "gcp-storage"
  • appconfigs not syncing
  • Conversion webhook failures for storage.gcp.upbound.io/v1beta1

Error Message

failed to load initial state of resource BucketIAMMember.storage.gcp.upbound.io:
conversion webhook for storage.gcp.upbound.io/v1beta1, Kind=BucketIAMMember failed:
Post "https://gcp-storage.crossplane.svc:9443/convert?timeout=30s":
no endpoints available for service "gcp-storage"

Root Cause

The gcp-storage Crossplane provider pod was intentionally scaled to 0 replicas (typically due to operational requirements such as the Citadel soft deletion retention rule >= 8 days). When BucketIAMMember events trigger conversions, the enterprise-operator cannot reach the webhook endpoint, so the conversion fails and the operator crashloops.

Impact

  • AppConfigs fail to sync
  • Potential impact on heartbeats and communications to mothership
  • Critical: Gateway client certificate rotation may fail if enterprise-operator is down (check certificate expiry with the diagnostic command below)

Prerequisites

  • kubectl access to the cluster
  • Access to the crossplane namespace
  • Permissions to edit DeploymentRuntimeConfig resources

Diagnostic Steps

1. Verify the crashloop

Check enterprise-operator logs:

kubectl logs -n alphasense <enterprise-operator-pod> | grep "gcp-storage"

2. Check gcp-storage service endpoints

kubectl -n crossplane get svc gcp-storage -o yaml

Look for the selector:

spec:
  selector:
    pkg.crossplane.io/provider: provider-gcp-storage

3. Check if gcp-storage pods are running

kubectl -n crossplane get pods | grep gcp-storage

If no pods are returned, the provider is scaled to 0.

4. Check gateway certificate expiry (Critical)

kubectl -n mothership get secret gateway-client-certificate -o json | jq -r '.data."tls.crt"' | base64 -d | openssl x509 -text -noout | grep "Not After"

If the certificate expires within the next 3 days, the enterprise-operator must be running to rotate it.

5. Check the DeploymentRuntimeConfig

kubectl get DeploymentRuntimeConfig gcp-upbound-storage -o yaml

Look for the replicas field in spec.deploymentTemplate.spec.replicas.
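The replicas value can also be extracted directly instead of scanning the full YAML. The jsonpath command below assumes cluster access, so the runnable part of this sketch demonstrates the same extraction with jq against a minimal sample document mimicking the DeploymentRuntimeConfig shape (sample values are illustrative):

```shell
# Read just the replicas value (0 means the provider is scaled down).
# Requires cluster access; uncomment to run:
# kubectl get deploymentruntimeconfig gcp-upbound-storage \
#   -o jsonpath='{.spec.deploymentTemplate.spec.replicas}'

# Same extraction with jq on a hypothetical sample of the resource:
sample='{"spec":{"deploymentTemplate":{"spec":{"replicas":0}}}}'
echo "$sample" | jq '.spec.deploymentTemplate.spec.replicas'   # prints 0
```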

Resolution

Step 1: Get the current DeploymentRuntimeConfig

kubectl get DeploymentRuntimeConfig gcp-upbound-storage -o yaml > gcp-upbound-storage-backup.yaml

Step 2: Edit the DeploymentRuntimeConfig

kubectl edit DeploymentRuntimeConfig gcp-upbound-storage

Step 3: Change replicas: 0 to replicas: 1

spec:
  deploymentTemplate:
    spec:
      replicas: 1 # Change from 0 to 1
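For automation or non-interactive use, the same change can be applied with kubectl patch instead of an editor. This is a sketch targeting the field shown in the diagnostic steps; the patch body is validated locally with jq before the (commented, cluster-only) apply step:

```shell
# Merge patch that sets replicas to 1 (the same field edited manually above).
PATCH='{"spec":{"deploymentTemplate":{"spec":{"replicas":1}}}}'

# Sanity-check locally that the patch is valid JSON with the intended value:
echo "$PATCH" | jq -e '.spec.deploymentTemplate.spec.replicas == 1'

# Apply it (requires cluster access; uncomment to run):
# kubectl patch deploymentruntimeconfig gcp-upbound-storage --type merge -p "$PATCH"
```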

Step 4: Verify the pod is running

kubectl -n crossplane get pods | grep gcp-storage

Wait until the pod shows 1/1 Running.

Step 5: Verify enterprise-operator recovers

kubectl logs -n alphasense <enterprise-operator-pod> --tail=50

You should see successful sync messages.

Step 6: Verify heartbeat to mothership (Alphasense only)

Contact Alphasense Support to monitor heartbeat. Check monitoring/logs for successful heartbeat messages after the fix.