  • Operations Score: 5 / 10 (improvable to 7.5 / 10)
  • Installation Difficulty: 7 / 10
  • Troubleshooting: HIGH
  • Day-2 Complexity: MODERATE-HIGH

Deployment Complexity Assessment

Installation Difficulty: HIGH (7/10)

Phantom is not a "helm install and go" product. It's a multi-component system with an external dependency (OpenBao) that must be operational before anything else works.

Minimum Installation Steps

  1. Provision and configure OpenBao cluster in the EU — 3+ node Raft cluster, TLS certificates, auth methods, policies. This alone is a 2–5 day project for an experienced Vault/OpenBao operator. For someone unfamiliar, 1–2 weeks.
  2. Ensure egress connectivity from the target K8s cluster to OpenBao — Cloud NAT (GKE), NAT Gateway (EKS), outbound rules (AKS). On private clusters this is a networking project involving infrastructure teams, firewall rules, and potentially VPN/peering.
  3. Install the Phantom operator via Helm — the easiest part, but still requires configuring OpenBao endpoints, auth credentials, TLS trust, namespace policies, resource limits.
  4. Label namespaces and workloads — decide protection scope, configure canary vs stable channels.
  5. Validate — run pre-flight checks, verify sidecar injection, confirm secret delivery end-to-end.
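
The validation step can be sketched as a small connectivity probe. This is an illustrative pre-flight check, not part of Phantom's tooling; the function name and the default port 8200 are assumptions.

```python
# Illustrative pre-flight check: can this cluster reach OpenBao over TCP,
# and does the TLS handshake succeed? Port 8200 is OpenBao's conventional
# default, assumed here for illustration.
import socket
import ssl
from urllib.parse import urlparse

def check_openbao_endpoint(url: str, timeout: float = 5.0) -> dict:
    """Verify TCP reachability and TLS handshake against an OpenBao URL."""
    parsed = urlparse(url)
    host, port = parsed.hostname, parsed.port or 8200
    result = {"host": host, "port": port, "tcp": False, "tls": False}
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            result["tcp"] = True
            ctx = ssl.create_default_context()
            with ctx.wrap_socket(sock, server_hostname=host) as tls_sock:
                result["tls"] = True
                result["tls_version"] = tls_sock.version()
    except OSError:
        pass  # leave flags False so the caller can report which step failed
    return result
```

A real check would also verify DNS resolution, OpenBao's `/sys/health` endpoint, and auth method configuration; this sketch covers only the network layer.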

Per-Provider Deployment Differences

| Step | GKE Standard | GKE Autopilot | EKS | EKS Fargate | AKS |
|---|---|---|---|---|---|
| Egress setup | Cloud NAT | Cloud NAT | NAT Gateway | NAT Gateway | NAT GW / LB rules (mandatory after Mar 2026) |
| Webhook port | Must be 443 (private clusters) | Must be 443 | Any | Any | Any |
| Sidecar QoS | Best practice: Guaranteed | Required: Guaranteed | Flexible | Flexible (no privileged) | Flexible |
| CSI (Cloakfs) | Full support | Blocked (partner list) | Full support | No DaemonSets | Unstable (removed on upgrade) |
| eBPF DaemonSet | Supported | Restricted | Supported | No DaemonSets | Supported |
| Confidential computing | Full SEV-SNP/TDX | Limited | Nitro Enclaves only | Not available | CVM node pools (Kata CC sunsetting) |

No Single Deployment Path

Each provider needs its own runbook, and GKE Autopilot and EKS Fargate are significantly constrained environments where multiple features simply don't work.

Prerequisites & Dependencies

| Prerequisite | Complexity | Who Owns It |
|---|---|---|
| OpenBao cluster (EU, HA) | HIGH — requires Vault/OpenBao expertise | Customer or managed service |
| Network egress to OpenBao | MEDIUM — involves infra/networking team | Customer |
| TLS certificates for OpenBao | MEDIUM — PKI management | Customer |
| Kernel version >= 5.10 (eBPF) | LOW — constrains node OS choices | Customer |
| Namespace labeling strategy | LOW — requires planning | Customer |
| Prometheus/Grafana stack | LOW — likely already exists | Customer |

Time-to-Value

| Customer Profile | Time to First Protected Workload |
|---|---|
| Platform team with Vault experience, existing monitoring | 3–5 days |
| Platform team without Vault experience | 1–3 weeks |
| Small team, first K8s operator experience | 3–6 weeks |
| Enterprise with change management processes | 6–12 weeks (including approvals) |

Deployment Bottleneck

Time-to-value is dominated by OpenBao setup and network configuration, not by Phantom itself. The external trust anchor is both the product's strength and its deployment bottleneck.

Multi-Provider Feature Matrix

Feature Availability

| Feature | GKE Standard | GKE Autopilot | EKS (EC2) | EKS Fargate | AKS |
|---|---|---|---|---|---|
| Webhook injection | Yes | Yes (with exclusions) | Yes | Yes | Yes (with exclusions) |
| Sidecar injection | Yes | Yes (Guaranteed QoS) | Yes | Yes (unprivileged) | Yes |
| eBPF monitoring | Yes | Restricted | Yes (AL2023) / Degraded (AL2) | No | Yes |
| Cloakfs CSI | Yes | No | Yes | No | Unreliable |
| Confidential computing | Full (SEV-SNP + TDX) | Limited | Nitro only | No | CVM pools |
| Pre-flight check | Yes | Yes | Yes | Yes | Yes |
| External OpenBao | Requires Cloud NAT | Requires Cloud NAT | Requires NAT GW | Requires NAT GW | Requires NAT GW/LB |
| Canary injection | Yes | Yes | Yes | Yes | Yes |

Feature Degradation Summary

| Provider + Mode | Phantom Core | eBPF | Cloakfs | SpecterNet | Score |
|---|---|---|---|---|---|
| GKE Standard | Full | Full | Full | Full | 100% |
| GKE Autopilot | Full (constrained) | Degraded | None | Limited | 55% |
| EKS (EC2, AL2023) | Full | Full | Full | Different arch | 75% |
| EKS (EC2, AL2) | Full | Degraded | Full | Different arch | 65% |
| EKS Fargate | Full (unprivileged) | None | None | None | 35% |
| AKS | Full | Full | Unreliable | Degraded (Kata sunsetting) | 65% |

Only GKE Standard provides the full feature set

Every other provider/mode requires feature flags, degraded capabilities, or entirely different architectures. This means documentation must be provider-specific, testing must cover every combination, and marketing must be careful not to promise capabilities that only work on one provider.

Operational Differences Per Provider

| Operational Task | GKE | EKS | AKS |
|---|---|---|---|
| Install webhook | Watch port 443, Cloud NAT | Straightforward | Watch Admissions Enforcer |
| Upgrade operator | Standard rolling update | Standard rolling update | CSI driver may be removed |
| eBPF troubleshooting | Full tooling (COS has bpftool) | AL2: limited BTF. AL2023: full | Full tooling |
| Node OS upgrade | Auto-upgrade (may break eBPF) | Managed node group update | Node image + K8s version are separate |
| Marketplace update | Push to Artifact Registry | No hooks, no lookup (EKS add-on) | Bundle all images (CNAB) |

Day-2 Operations

Upgrade/Rollback Complexity: MODERATE-HIGH

What's Good

  • Canary injection via namespace labels is a sound pattern
  • Operator's RollingUpdate with maxUnavailable: 0 and leader election is standard practice
  • N-1 backward compatibility between operator and sidecars is the right approach

What's Concerning

  • Mixed sidecar versions unavoidable — long-running pods keep injected sidecars until restarted. N-1 becomes N-3 in practice.
  • CRD conversion webhooks add another failure mode in the same failure domain as admission webhook.
  • Rollback isn't truly instant — changing a label only affects newly created pods.
  • OpenBao upgrades are out of band — no coordination with Phantom upgrades.

Monitoring & Alerting Requirements

| Metric / Alert | Purpose | Criticality |
|---|---|---|
| phantom_secret_age_seconds | Secret freshness SLI | P1 — core SLO metric |
| phantom_secret_fetch_errors_total | OpenBao connectivity health | P1 — early warning |
| phantom_sidecar_version | Version distribution during upgrades | P2 — operational awareness |
| PhantomWebhookDegraded | Circuit breaker opened | P1 — pods stuck Pending |
| PhantomSecretStale | Serving stale secrets in degraded mode | P1 — security posture degraded |
| OpenBao health (/sys/health) | External dependency health | P1 — upstream dependency |
| Webhook latency (p99) | Performance regression | P2 — user experience |
| eBPF program CPU overhead | Kernel-level performance impact | P2 — node health |

Significant Monitoring Surface

Customers need a working Prometheus + AlertManager stack, Grafana dashboards, and on-call processes for at least 5–6 P1 alerts. This isn't unusual for a security-critical component, but it raises the operational bar.

Incident Response Scenarios

Scenario 1: Webhook Down (Circuit Breaker Open)

Impact: All new pods in protected namespaces stuck Pending.
Detection: PhantomWebhookDegraded alert.
Response: Check operator pod health, restart if needed. If prolonged, use phantom.io/circuit-breaker: "bypass" on critical namespaces (security trade-off — documented and auditable).
Recovery time: Minutes (if operator pod restarts) to hours (if underlying issue is deeper).

Scenario 2: OpenBao Unreachable

Impact: New pods can't fetch secrets. Existing pods serve from cache (5 min hot, 1 hr sealed, 15 min grace).
Critical window: 1 hour 15 minutes (sealed cache serves until the 1-hour mark, then a 15-minute grace period). After this, workloads start failing. Workloads that restart during the outage have only the 15-minute grace period.
Recovery time: Depends entirely on OpenBao recovery — could be minutes or hours.

Scenario 3: eBPF Program Causing Node Instability

Impact: Latency spikes, CPU overhead on affected nodes.
Response: Detach eBPF programs (auto-detach threshold should handle this). If not, drain the node and disable eBPF feature flag.
Recovery time: Minutes if auto-detach works. Hours if manual intervention needed.

Scenario 4: CRD Conversion Webhook Failure During Upgrade

Impact: All Phantom CRD reads/writes fail. Operator cannot reconcile.
Response: Rollback operator deployment. If CRD schema is corrupted, manual intervention required.
Recovery time: Minutes to hours.

Troubleshooting Difficulty: HIGH

The system spans multiple domains: Kubernetes admission control, gRPC, external secret management, optional eBPF, optional CSI, optional confidential computing. Debugging a "secret not available in pod" requires tracing through:

  1. Was the webhook called? (API server audit logs)
  2. Was the sidecar injected? (pod spec)
  3. Did the sidecar start before the main container? (container ordering)
  4. Can the sidecar reach OpenBao? (network, DNS, TLS)
  5. Did OpenBao authenticate the request? (auth method, policy)
  6. Is the secret cached or fetched? (cache tier)
  7. Was the secret delivered to the main container? (IPC socket)

Each step involves different logs, tools, and expertise. This is a multi-skill debugging exercise that requires Kubernetes, networking, and Vault expertise simultaneously.
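
Step 2 of the checklist is at least mechanizable. The sketch below checks a Pod manifest (as returned by `kubectl get pod -o json`) for an injected sidecar; the container name "phantom-sidecar" is an assumption for illustration, not a documented Phantom identifier.

```python
# Illustrative triage helper: was the sidecar injected into this pod?
def find_sidecar(pod: dict, sidecar_name: str = "phantom-sidecar"):
    spec = pod.get("spec", {})
    # Sidecars may appear as regular containers or, on K8s >= 1.28, as
    # native sidecars (initContainers with restartPolicy: Always).
    for field in ("containers", "initContainers"):
        for container in spec.get(field, []):
            if container.get("name") == sidecar_name:
                return field, container
    return None

pod = {"spec": {"initContainers": [{"name": "phantom-sidecar",
                                    "restartPolicy": "Always"}],
                "containers": [{"name": "app"}]}}
print(find_sidecar(pod)[0])  # initContainers
```

A fuller triage tool would continue down the checklist (container start ordering, OpenBao reachability, cache tier), but even this one check eliminates a common class of "secret not available" tickets.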

OpenBao: Single Point of Failure

The Single Biggest Operational Risk

New pods created during an OpenBao outage have no cache to fall back on. The sidecar starts, tries to fetch secrets, fails, and the pod either stays unhealthy or crashes. This means: scaling events fail, node failures cascade, and deployments fail during outages. The caching strategy protects existing pods but not the cluster's ability to heal itself.

Degradation Timeline During OpenBao Outage

| Time Since Outage | State | Impact |
|---|---|---|
| 0 – 5 min | Hot cache serving | No impact. Workloads see no difference. |
| 5 min – 1 hr | Sealed cache serving | Workloads function. New secret versions unavailable. |
| 1 hr – 1 h 15 min | Grace period (stale) | PhantomSecretStale alerts fire. Compliance posture degraded. |
| > 1 h 15 min | Hard failure | Sidecar returns errors. Workloads without proper error handling will crash. |
| New pod starts | Immediate failure | No cache exists for new pods. Pod fails immediately. HPA, node recovery, and deployments all break. |
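
The timeline can be expressed as a small state function. The tier boundaries (5 min hot, sealed until 60 min, grace until 75 min) are taken from this assessment and should be treated as configurable assumptions, not fixed product behavior.

```python
# Degradation state as a function of time since the OpenBao outage began.
def cache_state(minutes_since_outage: float, has_cache: bool = True) -> str:
    if not has_cache:                  # new pod: nothing to fall back on
        return "immediate failure"
    if minutes_since_outage <= 5:
        return "hot cache"
    if minutes_since_outage <= 60:
        return "sealed cache"
    if minutes_since_outage <= 75:
        return "grace (stale)"
    return "hard failure"

print(cache_state(30))                 # sealed cache
print(cache_state(30, has_cache=False))  # immediate failure
```

The `has_cache=False` branch is the key point: elapsed time is irrelevant for new pods, which fail at any point in the outage.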

HA Model Assessment

What Works

  • Raft consensus with 3 nodes tolerates 1 failure
  • Multi-AZ placement survives single AZ outages
  • Multiple endpoint failover in sidecar prevents routing to dead nodes
  • Health checks and connection pooling are appropriate

What's Insufficient

  • 3 nodes is the minimum, not the recommendation. For a sole trust anchor, 5 nodes across 3 AZs is more appropriate.
  • No multi-region HA discussed. A single region outage takes down all secret access.
  • Raft storage limits. At scale, snapshot size causes election timeouts.

Recovery Procedures

| Failure Mode | Recovery |
|---|---|
| Single Raft node failure | Auto-recovery via Raft. No action needed if quorum maintained. |
| Raft quorum loss (2/3 nodes) | Manual intervention: restore from snapshot or rebuild. Extended downtime. |
| Network partition (K8s ↔ OpenBao) | Fix network path. Sidecar auto-reconnects. No data loss. |
| OpenBao seal event | Manual unseal (or auto-unseal if configured). All nodes must be unsealed. |
| Storage corruption | Restore from snapshot backup. Data loss possible if backups are stale. |

Blast Radius

OpenBao is a single point of failure for the entire customer base. If one OpenBao cluster serves multiple K8s clusters, a single outage affects ALL clusters simultaneously. Centralizing the trust anchor centralizes the failure domain. Each customer should have their own OpenBao cluster.

eBPF Operational Challenges

Kernel Compatibility Matrix

| Provider / OS | Kernel | BTF | eBPF Capability |
|---|---|---|---|
| GKE COS | 5.15+ | Full | Full |
| EKS AL2 | 5.10 | Partial | Degraded |
| EKS AL2023 | 6.1+ | Full | Full |
| AKS Ubuntu 22.04 | 5.15+ | Full | Full |
| AKS Mariner 2.0 | 5.15 | Good | Full |

What's Missing From This Analysis

  1. Customer-managed node images. Enterprise customers frequently use custom AMIs/images with hardened kernels that may strip eBPF capabilities or use restrictive seccomp profiles.
  2. Kernel upgrades happen without warning. An eBPF program that works on kernel 5.15.49 might break on 5.15.107 if a BPF verifier change rejects a previously accepted program.
  3. seccomp and AppArmor policies. Many enterprise clusters block CAP_BPF and CAP_SYS_ADMIN. The eBPF DaemonSet needs privileged access — ironic for a security product.
  4. CO-RE limitations. "Compile once, run everywhere" has edge cases with raw tracepoints and certain BPF map types.

Debugging eBPF in Production Is Genuinely Hard

There is no traditional debugger — tooling is limited to bpftool and bpf_trace_printk(). Verifier rejections are cryptic. Diagnosing performance regressions requires manual correlation. Supporting this in the field requires rare and expensive eBPF expertise on the support team.

Testing Matrix Complexity

  • 48+ test scenarios
  • 16+ test environments
  • $8K–32K monthly test infra cost
  • 1 FTE for CI/CD maintenance

Minimum test matrix for CI/CD:

  • 3 cloud providers × 2+ K8s versions × 2+ node OS versions = 12+ test environments
  • Add Autopilot, Fargate variants = 16+ environments
  • Add with/without Istio, with/without existing CSI drivers = 32+ combinations
  • Add canary + stable sidecar versions during upgrade = 48+ test scenarios
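
The matrix arithmetic above can be reconstructed explicitly. The counts are planning estimates, not a definitive CI design; the version and OS labels below are placeholders.

```python
# Rough reconstruction of the test-matrix growth described above.
from itertools import product

providers = ["GKE", "EKS", "AKS"]
k8s_versions = ["n", "n-1"]       # "2+" K8s versions
node_oses = ["os-a", "os-b"]      # "2+" node OS versions

base = list(product(providers, k8s_versions, node_oses))
print(len(base))                  # 12 base environments

# Autopilot and Fargate variants (one per K8s version) push past 16.
variants = base + [("GKE-Autopilot", v, "-") for v in k8s_versions] \
                + [("EKS-Fargate", v, "-") for v in k8s_versions]
print(len(variants))              # 16 environments

# With/without Istio doubles the combinations; the canary + stable
# sidecar dimension during upgrades pushes the scenario count past 48.
mesh = [v + (istio,) for v in variants for istio in ("istio", "no-istio")]
print(len(mesh))                  # 32 combinations
```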

Realistic Assessment

Maintaining this test matrix is a full-time job. Cloud provider test clusters cost $500–2,000/month per environment. At 16 environments minimum, that's $8,000–32,000/month in test infrastructure alone, plus CI/CD engineering time.

Customer Operational Burden

What the Customer Must Operate

| Component | Responsibility | Expertise Required |
|---|---|---|
| OpenBao cluster (HA, Raft) | Customer | Vault/OpenBao administration (rare skill) |
| OpenBao TLS certificates | Customer | PKI management |
| OpenBao backup/restore | Customer | Vault ops |
| OpenBao auth method config | Customer | Vault + K8s auth integration |
| Network egress (NAT/firewall) | Customer | Cloud networking |
| Phantom operator upgrades | Shared | Helm, K8s operator patterns |
| Sidecar version rollouts | Shared | K8s namespace management |
| Monitoring/alerting | Customer | Prometheus, Grafana, AlertManager |
| Incident response | Shared | Multi-domain K8s troubleshooting |
| Node OS upgrades (eBPF) | Customer | eBPF understanding (if enabled) |
| App error handling for stale secrets | Customer's dev teams | Application-level resilience |

"Senior Platform Team" Requirement

Minimum customer team: 1 person with Vault/OpenBao experience (rare), 1 Kubernetes platform engineer, 1 network engineer, plus Prometheus/Grafana access and on-call rotation. Companies with 1–2 DevOps generalists will struggle significantly.

Support Burden Estimation

| Support Tier | Tickets/Month | Common Issues |
|---|---|---|
| L1 (basic) | 5–10 per customer | "Pod won't start" (egress/OpenBao), "secrets not injecting" (labeling) |
| L2 (technical) | 2–5 per customer | eBPF compatibility, sidecar ordering, cache behavior |
| L3 (engineering) | 0–1 per customer | Provider-specific edge cases, CRD conversion bugs |

At 50 Customers

Expect 250–500 L1 tickets/month, plus 100–250 L2 and up to 50 L3. This requires a dedicated support team of 2–3 people minimum.
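
The ticket volume follows directly from the per-customer ranges in the table; a quick sanity check:

```python
# Back-of-envelope ticket volume at 50 customers, from the table above.
customers = 50
tiers = {"L1": (5, 10), "L2": (2, 5), "L3": (0, 1)}  # per-customer ranges
totals = {tier: (lo * customers, hi * customers)
          for tier, (lo, hi) in tiers.items()}
print(totals)  # {'L1': (250, 500), 'L2': (100, 250), 'L3': (0, 50)}
```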

Self-Service vs Managed Service

This product has a natural gravity toward a managed service model. The operational burden of OpenBao + Phantom + monitoring + multi-provider nuances is too high for most customers to self-serve comfortably.

| Model | Revenue | Support Costs | Time-to-Value | Churn Risk |
|---|---|---|---|---|
| Self-service (current plan) | Lower/customer | Higher | Longer | Higher |
| Managed OpenBao + Phantom | Higher/customer | Lower | Faster | Lower |
| Hybrid | Mixed | Mixed | Mixed | Mixed |

Reliability & SLO Assessment

Proposed SLOs

| SLO | Rating | Assessment |
|---|---|---|
| Secret freshness: 99.9% within 2× TTL | Stretch | Achievable under normal conditions. Breached immediately for new pods during OpenBao outages. |
| Webhook availability: not blocked > 5 min | Ambitious | Achievable with circuit breaker. But the escape hatch degrades security — a trade-off customers must understand. |
| Sidecar injection success rate: 99.9% | Reasonable | Achievable. Compatibility check logic should prevent most injection failures. |

Single Points of Failure

| SPOF | Impact | Mitigation |
|---|---|---|
| OpenBao cluster | All secret access fails for all workloads | HA (5-node Raft), multi-region. Still a single logical dependency. |
| Phantom operator pod | Webhook down → circuit breaker → pods Pending | HA deployment (multiple replicas with leader election). |
| Conversion webhook | CRD reads/writes fail | Runs in same pods as operator. If operator is down, both fail. |
| Network path (K8s ↔ OpenBao) | Same as OpenBao down | Redundant network paths (VPN + internet, dual NAT). Customer responsibility. |
| eBPF DaemonSet | Memory monitoring stops (degraded, not hard failure) | Not a SPOF — eBPF is optional. Failure is graceful. |

Key Failure Modes

Acceptable: Raft Leader Election During Load

5–30 second secret fetch failures during leader transition. Hot cache absorbs this. Sidecar retries with backoff. Non-event for most workloads.

Manageable: Webhook Pod OOMKilled

Circuit breaker opens, new pods stuck Pending. Requires proper resource limits, HPA on webhook deployment, circuit breaker with fast recovery. Needs careful capacity planning.

HIGH RISK: OpenBao TLS Certificate Expires

Certificate expiry is the #1 cause of Vault outages in production. All sidecars fail TLS handshake. This WILL happen. Recommendation: build monitoring for "OpenBao TLS cert expires in < 30 days" into the operator.

UNACCEPTABLE: AKS Removes CSI Driver During Upgrade

Documented provider behavior. Cloakfs volumes become inaccessible until CSI driver is re-installed. Consider not supporting Cloakfs on AKS, or make it a first-class operator-managed component with rapid reconciliation.

Recovery Time Objectives

| Scenario | Target RTO | Realistic RTO |
|---|---|---|
| Webhook pod crash | < 1 min | 30–60 seconds |
| OpenBao single node failure | 0 (Raft failover) | < 5 seconds |
| OpenBao quorum loss | < 1 hour | 1–4 hours (manual) |
| Network partition | Depends on cause | 5 min to hours |
| eBPF program failure | < 1 min | Seconds (auto-detach) |
| Full regional outage | < 1 hour | Hours (if multi-region not configured) |

Scaling Operations (100+ Clusters)

Configuration Surface Area

At 100 clusters × 10 namespaces × (values + labels + CRDs + OpenBao policies) = 4,000+ configuration points. Manual management is untenable.

Problems at Scale

  1. OpenBao becomes the bottleneck — 100 clusters × 1,000 pods × 10 secrets = 1 million active leases. Raft performance degrades with lease volume.
  2. Configuration drift — 100 clusters with different K8s versions, node OS versions, and provider configs. Helm values diverge.
  3. Version skew — 5+ operator versions in production. Sidecar versions even more diverse. Need to support N-2 operator and N-3 sidecar versions.
  4. Monitoring at scale — 100 clusters × 8+ metrics. Need multi-cluster dashboards (Thanos/Mimir) and alert deduplication.
  5. Marketplace version management — 3 marketplaces × staggered review timelines = version chaos.
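
The scale arithmetic above (configuration points and lease volume), made explicit. All inputs are this assessment's planning assumptions, not measured figures.

```python
# Configuration surface area at fleet scale.
clusters, namespaces, config_kinds = 100, 10, 4  # values, labels, CRDs, policies
config_points = clusters * namespaces * config_kinds
print(config_points)        # 4000

# Active lease volume hitting OpenBao (point 1 above).
pods_per_cluster, secrets_per_pod = 1_000, 10
active_leases = clusters * pods_per_cluster * secrets_per_pod
print(active_leases)        # 1000000
```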

Missing: Centralized Control Plane

At scale, customers need a single pane of glass, centralized policy management, fleet-wide upgrade orchestration, and cross-cluster secret inventory. This is effectively a separate product (fleet management) requiring 3–6 months with 2–3 engineers.

Key Operational Risks (Ranked)

| # | Risk | Severity | Likelihood | Mitigation |
|---|---|---|---|---|
| 1 | OpenBao outage cascades to all workloads | Critical | Medium | 5-node HA, multi-region, managed OpenBao tier |
| 2 | New pods can't start during OpenBao outage | High | Medium | Bootstrap secret mode or longer-lived sealed cache |
| 3 | OpenBao TLS cert expiry | High | High | Built-in cert expiry monitoring and alerting |
| 4 | Customer can't set up OpenBao HA | High | High | Managed OpenBao offering, detailed runbooks |
| 5 | AKS removes CSI driver on upgrade | High | High | Don't rely on custom CSI on AKS |
| 6 | Multi-provider test matrix exceeds capacity | Medium | High | Focus on one provider first |
| 7 | eBPF kernel compat break after upgrade | Medium | Medium | Make eBPF optional and default-off |
| 8 | Marketplace release delays critical patches | Medium | Medium | Direct-install as primary distribution |
| 9 | Support burden exceeds capacity at scale | Medium | High | Managed service tier, better self-service |
| 10 | CRD conversion webhook failure | Medium | Low | HA operator, rollback runbooks |

Recommendations for Simplification

1. Offer Managed OpenBao as a Core Product Tier

This single change eliminates the top 4 operational risks for customers and accelerates time-to-value from weeks to hours. Pricing: Starter (shared, included) → Professional (dedicated 3-node, +€500/mo) → Enterprise (5-node multi-region, +€2,000/mo).

2. Single Provider First (GKE Standard)

"Works perfectly on one provider" beats "works partially on three." Reduces test matrix from 48+ to 6 scenarios. CI/CD cost drops from $8K–32K/month to $1K–3K/month. Expand to EKS at month 7–9, AKS at month 10–12.

3. Make eBPF Explicitly Optional and Advanced

Don't make it a default feature. The operational complexity (kernel compatibility, debugging difficulty, performance impact) is disproportionate. Most customers care about secrets not being in etcd — they don't need kernel-level memory monitoring.

4. Defer Cloakfs

CSI driver landscape is too fragmented across providers. AKS is unreliable, Autopilot blocks it, Fargate can't run it. Ship Phantom (secrets) first, prove the market, then tackle at-rest encryption.

5. Direct Helm Install Before Marketplace

Marketplace listings add 2–3 months of packaging/certification work per provider. Ship directly first, add marketplace presence as a growth channel, not a launch requirement.

6. Build an Installation Wizard (phantom-cli)

A CLI tool that walks customers through: OpenBao setup → network connectivity → Phantom install → validation. Replaces 10 pages of docs with a guided experience and could plausibly cut L1 support tickets in half.
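
The wizard flow can be sketched as a sequence of gated steps; the step names and structure below are hypothetical, not a spec for phantom-cli.

```python
# Illustrative wizard skeleton: stop at the first failing prerequisite.
STEPS = [
    ("openbao", "Validate OpenBao endpoint, auth method, and policies"),
    ("network", "Verify egress connectivity from the cluster"),
    ("install", "Helm-install the Phantom operator"),
    ("validate", "Run pre-flight checks and confirm secret delivery"),
]

def run_wizard(checks: dict) -> str:
    """checks maps step id -> bool result of that step's probe."""
    for step_id, description in STEPS:
        print(f"[{step_id}] {description}")
        if not checks.get(step_id, False):
            return f"stopped at: {step_id}"
    return "complete"

print(run_wizard({"openbao": True, "network": False}))
# runs the first two steps, then reports: stopped at: network
```

Stopping at the first failure matters: it turns a vague "pod won't start" ticket into a named prerequisite the customer can fix themselves.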

7. Add "Bootstrap Secret" Mode

Allow critical secrets to be pre-provisioned so new pods can start during an OpenBao outage, then refresh when available. Addresses the "single biggest operational risk" identified in this assessment.
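
A minimal sketch of the fallback logic, assuming a hypothetical fetch interface; function and parameter names are illustrative, not Phantom APIs.

```python
# Bootstrap-secret fallback: try the live fetch, fall back to a
# pre-provisioned value, and tell the caller a refresh is still owed.
from typing import Callable, Dict, Tuple

def get_secret(name: str,
               fetch_live: Callable[[str], str],
               bootstrap: Dict[str, str]) -> Tuple[str, bool]:
    """Return (value, is_bootstrap). Raises only if no fallback exists."""
    try:
        return fetch_live(name), False
    except ConnectionError:
        if name in bootstrap:
            # Serve the pre-provisioned value; the caller should schedule
            # a refresh once OpenBao is reachable again.
            return bootstrap[name], True
        raise

def failing_fetch(name: str) -> str:
    raise ConnectionError("OpenBao unreachable")

print(get_secret("db-password", failing_fetch, {"db-password": "bootstrap-v1"}))
# ('bootstrap-v1', True)
```

The `is_bootstrap` flag is the important design choice: serving a bootstrap value silently would hide the degraded state that PhantomSecretStale is meant to surface.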

8. Monitor OpenBao TLS from Inside the Operator

Don't rely on the customer to monitor certificate expiry. Have the sidecar or operator check the certificate chain and alert when expiry is < 30 days.

Improvement Roadmap: 5/10 → 7.5/10

Targeted improvements that are high-impact but moderate-effort — moving the DevOps score from 5/10 to 7.5/10.

| Improvement | Score Impact | Effort | Risks Addressed |
|---|---|---|---|
| A1. Managed OpenBao-as-a-Service | +1.5 | 6–8 weeks | #1, #3, #4 |
| A2. phantom-cli bootstrap tool | +0.5 | 2–3 weeks | Time-to-value, L1 tickets |
| A3. Bootstrap secret mode | +0.75 | 2–3 weeks | #2 (biggest operational risk) |
| A4. Single-provider launch (GKE) | +0.5 | 1 week (planning) | #6 |
| A5. Operational runbook templates | +0.5 | 1 week | Troubleshooting, MTTR |
| A6. SaaS observability dashboard | +0.5 | 4–6 weeks | Monitoring burden |
| A7. Auto-upgrade operator | +0.25 | 3–4 weeks | Day-2 complexity, version skew |
| Total | +4.5 (capped at 7.5/10) | ~16–24 weeks | 8 of top 10 risks |
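
A quick sanity check on the roadmap arithmetic: the per-item impacts sum to +4.5, and the resulting score is capped at 7.5.

```python
# Verify the roadmap totals from the table above.
impacts = {"A1": 1.5, "A2": 0.5, "A3": 0.75, "A4": 0.5,
           "A5": 0.5, "A6": 0.5, "A7": 0.25}
total = sum(impacts.values())
print(total)                    # 4.5
print(min(5.0 + total, 7.5))    # 7.5 (uncapped would be 9.5)
```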

Priority Order

  1. A1 + A3 (Managed OpenBao + Bootstrap Secrets) — address the two highest-severity risks
  2. A2 + A5 (CLI + Runbooks) — fastest to build, immediate support burden reduction
  3. A4 (GKE-first strategy) — a planning decision, not engineering work
  4. A6 + A7 (SaaS dashboard + Auto-upgrade) — polish for Day-2 experience

Operational Feasibility: 5/10 → 7.5/10 (with improvements)

The architecture is sound but operationally demanding. Designed by people who understand Kubernetes security deeply, but the operational burden may exceed what most customers can handle. The multi-provider ambition multiplies every challenge by three. With the recommended improvements — particularly managed OpenBao and a GKE-first strategy — this becomes operationally manageable for teams with standard Kubernetes skills.

Realistic vs Aspirational

What's Realistic (5–7 person team)

  • Phantom core (webhook + sidecar + secrets) on a single provider
  • Pre-flight connectivity checks
  • Circuit breaker pattern
  • Canary injection
  • 3-tier secret caching (gap: new pods during outages)

What's Aspirational

  • Full feature parity across 3 providers + Autopilot + Fargate
  • eBPF monitoring as a standard (not optional) feature
  • Cloakfs CSI across all providers
  • Simultaneous marketplace listings on all three
  • Customer self-service at scale without managed OpenBao