Version: v2.3.4

Enterprise Operator Crashloop (Missing gcp-storage Crossplane Endpoints)

Overview

The enterprise-operator pod enters a crashloop when the gcp-storage Crossplane provider pod is scaled down to 0 replicas. With no provider pod behind the webhook Service, conversion webhook calls for BucketIAMMember.storage.gcp.upbound.io resources fail.

Symptoms

  • enterprise-operator pod crashlooping
  • Error in logs: no endpoints available for service "gcp-storage"
  • appconfigs not syncing
  • Conversion webhook failures for storage.gcp.upbound.io/v1beta1

Error Message

failed to load initial state of resource BucketIAMMember.storage.gcp.upbound.io:
conversion webhook for storage.gcp.upbound.io/v1beta1, Kind=BucketIAMMember failed:
Post "https://gcp-storage.crossplane.svc:9443/convert?timeout=30s":
no endpoints available for service "gcp-storage"

Root Cause

The gcp-storage Crossplane provider pod was intentionally scaled to 0 replicas (typically due to operational requirements such as the Citadel soft deletion retention rule >= 8 days). When BucketIAMMember events trigger conversions, the enterprise-operator cannot reach the webhook endpoint, so the conversion fails and the operator crashloops.

Impact

  • AppConfigs fail to sync
  • Potential impact on heartbeats and communications to mothership
  • Critical: Gateway client certificate rotation may fail if enterprise-operator is down (check certificate expiry with the diagnostic command below)

Prerequisites

  • kubectl access to the cluster
  • Access to the crossplane namespace
  • Permissions to edit DeploymentRuntimeConfig resources

Diagnostic Steps

1. Verify the crashloop

Check enterprise-operator logs:

kubectl logs -n alphasense <enterprise-operator-pod> | grep "gcp-storage"

2. Check gcp-storage service endpoints

kubectl -n crossplane get svc gcp-storage -o yaml

Look for the selector:

spec:
  selector:
    pkg.crossplane.io/provider: provider-gcp-storage

3. Check if gcp-storage pods are running

kubectl -n crossplane get pods | grep gcp-storage

If no pods are returned, the provider is scaled to 0.

4. Check gateway certificate expiry (Critical)

kubectl -n mothership get secret gateway-client-certificate -o json | jq -r '.data."tls.crt"' | base64 -d | openssl x509 -text -noout | grep "Not After"

If the certificate expires within the next 3 days, the enterprise-operator must be running to rotate it.

5. Check the DeploymentRuntimeConfig

kubectl get DeploymentRuntimeConfig gcp-upbound-storage -o yaml

Look for the replicas field in spec.deploymentTemplate.spec.replicas.
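The replicas value can also be extracted directly instead of scanning the full YAML. The jsonpath command below assumes cluster access, so the runnable part of this sketch demonstrates the same extraction with jq against a minimal sample document mimicking the DeploymentRuntimeConfig shape (sample values are illustrative):

```shell
# Read just the replicas value (0 means the provider is scaled down).
# Requires cluster access; uncomment to run:
# kubectl get deploymentruntimeconfig gcp-upbound-storage \
#   -o jsonpath='{.spec.deploymentTemplate.spec.replicas}'

# Same extraction with jq on a hypothetical sample of the resource:
sample='{"spec":{"deploymentTemplate":{"spec":{"replicas":0}}}}'
echo "$sample" | jq '.spec.deploymentTemplate.spec.replicas'   # prints 0
```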

Resolution

Step 1: Get the current DeploymentRuntimeConfig

kubectl get DeploymentRuntimeConfig gcp-upbound-storage -o yaml > gcp-upbound-storage-backup.yaml

Step 2: Edit the DeploymentRuntimeConfig

kubectl edit DeploymentRuntimeConfig gcp-upbound-storage

Step 3: Change replicas: 0 to replicas: 1

spec:
  deploymentTemplate:
    spec:
      replicas: 1 # Change from 0 to 1
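For automation or non-interactive use, the same change can be applied with kubectl patch instead of an editor. This is a sketch targeting the field shown in the diagnostic steps; the patch body is validated locally with jq before the (commented, cluster-only) apply step:

```shell
# Merge patch that sets replicas to 1 (the same field edited manually above).
PATCH='{"spec":{"deploymentTemplate":{"spec":{"replicas":1}}}}'

# Sanity-check locally that the patch is valid JSON with the intended value:
echo "$PATCH" | jq -e '.spec.deploymentTemplate.spec.replicas == 1'

# Apply it (requires cluster access; uncomment to run):
# kubectl patch deploymentruntimeconfig gcp-upbound-storage --type merge -p "$PATCH"
```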

Step 4: Verify the pod is running

kubectl -n crossplane get pods | grep gcp-storage

Wait until the pod shows 1/1 Running.

Step 5: Verify enterprise-operator recovers

kubectl logs -n alphasense <enterprise-operator-pod> --tail=50

You should see successful sync messages.

Step 6: Verify heartbeat to mothership (Alphasense only)

Contact Alphasense Support to monitor heartbeat. Check monitoring/logs for successful heartbeat messages after the fix.