  • Operations Score: 5 / 10 (improvable to 7.5 / 10)
  • Installation Difficulty: 7 / 10
  • Troubleshooting: HIGH
  • Day-2 Complexity: MODERATE-HIGH

Deployment Complexity Assessment

Installation Difficulty: HIGH (7/10)

Phantom is not a "helm install and go" product. It's a multi-component system with an external dependency (OpenBao) that must be operational before anything else works.

Minimum Installation Steps

  1. Provision and configure OpenBao cluster in the EU — 3+ node Raft cluster, TLS certificates, auth methods, policies. This alone is a 2–5 day project for an experienced Vault/OpenBao operator. For someone unfamiliar, 1–2 weeks.
  2. Ensure egress connectivity from the target K8s cluster to OpenBao — Cloud NAT (GKE), NAT Gateway (EKS), outbound rules (AKS). On private clusters this is a networking project involving infrastructure teams, firewall rules, and potentially VPN/peering.
  3. Install the Phantom operator via Helm — the easiest part, but still requires configuring OpenBao endpoints, auth credentials, TLS trust, namespace policies, resource limits.
  4. Label namespaces and workloads — decide protection scope, configure canary vs stable channels.
  5. Validate — run pre-flight checks, verify sidecar injection, confirm secret delivery end-to-end.
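
The validation step can be sketched as a small connectivity probe. This is an illustrative pre-flight check, not part of Phantom's tooling; the function name and the default port 8200 are assumptions.

```python
# Illustrative pre-flight check: can this cluster reach OpenBao over TCP,
# and does the TLS handshake succeed? Port 8200 is OpenBao's conventional
# default, assumed here for illustration.
import socket
import ssl
from urllib.parse import urlparse

def check_openbao_endpoint(url: str, timeout: float = 5.0) -> dict:
    """Verify TCP reachability and TLS handshake against an OpenBao URL."""
    parsed = urlparse(url)
    host, port = parsed.hostname, parsed.port or 8200
    result = {"host": host, "port": port, "tcp": False, "tls": False}
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            result["tcp"] = True
            ctx = ssl.create_default_context()
            with ctx.wrap_socket(sock, server_hostname=host) as tls_sock:
                result["tls"] = True
                result["tls_version"] = tls_sock.version()
    except OSError:
        pass  # leave flags False so the caller can report which step failed
    return result
```

A real check would also verify DNS resolution, OpenBao's `/sys/health` endpoint, and auth method configuration; this sketch covers only the network layer.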

Per-Provider Deployment Differences

| Step | GKE Standard | GKE Autopilot | EKS | EKS Fargate | AKS |
|---|---|---|---|---|---|
| Egress setup | Cloud NAT | Cloud NAT | NAT Gateway | NAT Gateway | NAT GW / LB rules (mandatory after Mar 2026) |
| Webhook port | Must be 443 (private clusters) | Must be 443 | Any | Any | Any |
| Sidecar QoS | Best practice: Guaranteed | Required: Guaranteed | Flexible | Flexible (no privileged) | Flexible |
| CSI (Cloakfs) | Full support | Blocked (partner list) | Full support | No DaemonSets | Unstable (removed on upgrade) |
| eBPF DaemonSet | Supported | Restricted | Supported | No DaemonSets | Supported |
| Confidential computing | Full SEV-SNP/TDX | Limited | Nitro Enclaves only | Not available | CVM node pools (Kata CC sunsetting) |

No Single Deployment Path

Each provider needs its own runbook, and GKE Autopilot and EKS Fargate are significantly constrained environments where multiple features simply don't work.

Prerequisites & Dependencies

| Prerequisite | Complexity | Who Owns It |
|---|---|---|
| OpenBao cluster (EU, HA) | HIGH — requires Vault/OpenBao expertise | Customer or managed service |
| Network egress to OpenBao | MEDIUM — involves infra/networking team | Customer |
| TLS certificates for OpenBao | MEDIUM — PKI management | Customer |
| Kernel version >= 5.10 (eBPF) | LOW — constrains node OS choices | Customer |
| Namespace labeling strategy | LOW — requires planning | Customer |
| Prometheus/Grafana stack | LOW — likely already exists | Customer |

Time-to-Value

| Customer Profile | Time to First Protected Workload |
|---|---|
| Platform team with Vault experience, existing monitoring | 3–5 days |
| Platform team without Vault experience | 1–3 weeks |
| Small team, first K8s operator experience | 3–6 weeks |
| Enterprise with change management processes | 6–12 weeks (including approvals) |

Deployment Bottleneck

Time-to-value is dominated by OpenBao setup and network configuration, not by Phantom itself. The external trust anchor is both the product's strength and its deployment bottleneck.

Multi-Provider Feature Matrix

Feature Availability

| Feature | GKE Standard | GKE Autopilot | EKS (EC2) | EKS Fargate | AKS |
|---|---|---|---|---|---|
| Webhook injection | Yes | Yes (with exclusions) | Yes | Yes | Yes (with exclusions) |
| Sidecar injection | Yes | Yes (Guaranteed QoS) | Yes | Yes (unprivileged) | Yes |
| eBPF monitoring | Yes | Restricted | Yes (AL2023) / Degraded (AL2) | No | Yes |
| Cloakfs CSI | Yes | No | Yes | No | Unreliable |
| Confidential computing | Full (SEV-SNP + TDX) | Limited | Nitro only | No | CVM pools |
| Pre-flight check | Yes | Yes | Yes | Yes | Yes |
| External OpenBao | Requires Cloud NAT | Requires Cloud NAT | Requires NAT GW | Requires NAT GW | Requires NAT GW/LB |
| Canary injection | Yes | Yes | Yes | Yes | Yes |

Feature Degradation Summary

| Provider + Mode | Phantom Core | eBPF | Cloakfs | SpecterNet | Score |
|---|---|---|---|---|---|
| GKE Standard | Full | Full | Full | Full | 100% |
| GKE Autopilot | Full (constrained) | Degraded | None | Limited | 55% |
| EKS (EC2, AL2023) | Full | Full | Full | Different arch | 75% |
| EKS (EC2, AL2) | Full | Degraded | Full | Different arch | 65% |
| EKS Fargate | Full (unprivileged) | None | None | None | 35% |
| AKS | Full | Full | Unreliable | Degraded (Kata sunsetting) | 65% |

Only GKE Standard provides the full feature set

Every other provider/mode requires feature flags, degraded capabilities, or entirely different architectures. This means documentation must be provider-specific, testing must cover every combination, and marketing must be careful not to promise capabilities that only work on one provider.

Operational Differences Per Provider

| Operational Task | GKE | EKS | AKS |
|---|---|---|---|
| Install webhook | Watch port 443, Cloud NAT | Straightforward | Watch Admissions Enforcer |
| Upgrade operator | Standard rolling update | Standard rolling update | CSI driver may be removed |
| eBPF troubleshooting | Full tooling (COS has bpftool) | AL2: limited BTF. AL2023: full | Full tooling |
| Node OS upgrade | Auto-upgrade (may break eBPF) | Managed node group update | Node image + K8s version are separate |
| Marketplace update | Push to Artifact Registry | No hooks, no lookup (EKS add-on) | Bundle all images (CNAB) |

Day-2 Operations

Upgrade/Rollback Complexity: MODERATE-HIGH

What's Good

  • Canary injection via namespace labels is a sound pattern
  • Operator's RollingUpdate with maxUnavailable: 0 and leader election is standard practice
  • N-1 backward compatibility between operator and sidecars is the right approach

What's Concerning

  • Mixed sidecar versions unavoidable — long-running pods keep injected sidecars until restarted. N-1 becomes N-3 in practice.
  • CRD conversion webhooks add another failure mode in the same failure domain as admission webhook.
  • Rollback isn't truly instant — changing a label only affects newly created pods.
  • OpenBao upgrades are out of band — no coordination with Phantom upgrades.

Monitoring & Alerting Requirements

| Metric / Alert | Purpose | Criticality |
|---|---|---|
| phantom_secret_age_seconds | Secret freshness SLI | P1 — core SLO metric |
| phantom_secret_fetch_errors_total | OpenBao connectivity health | P1 — early warning |
| phantom_sidecar_version | Version distribution during upgrades | P2 — operational awareness |
| PhantomWebhookDegraded | Circuit breaker opened | P1 — pods stuck Pending |
| PhantomSecretStale | Serving stale secrets in degraded mode | P1 — security posture degraded |
| OpenBao health (/sys/health) | External dependency health | P1 — upstream dependency |
| Webhook latency (p99) | Performance regression | P2 — user experience |
| eBPF program CPU overhead | Kernel-level performance impact | P2 — node health |

Significant Monitoring Surface

Customers need a working Prometheus + AlertManager stack, Grafana dashboards, and on-call processes for at least 5–6 P1 alerts. This isn't unusual for a security-critical component, but it raises the operational bar.

Incident Response Scenarios

Scenario 1: Webhook Down (Circuit Breaker Open)

Impact: All new pods in protected namespaces stuck Pending.
Detection: PhantomWebhookDegraded alert.
Response: Check operator pod health, restart if needed. If prolonged, use phantom.io/circuit-breaker: "bypass" on critical namespaces (security trade-off — documented and auditable).
Recovery time: Minutes (if operator pod restarts) to hours (if underlying issue is deeper).

Scenario 2: OpenBao Unreachable

Impact: New pods can't fetch secrets. Existing pods serve from cache (5 min hot, 1 hr sealed, 15 min grace).
Critical window: 1 hour 15 minutes (sealed cache serves until the 1-hour mark, then a 15-minute grace period). After this, workloads start failing. Workloads that restart during the outage have only the 15-minute grace period.
Recovery time: Depends entirely on OpenBao recovery — could be minutes or hours.

Scenario 3: eBPF Program Causing Node Instability

Impact: Latency spikes, CPU overhead on affected nodes.
Response: Detach eBPF programs (auto-detach threshold should handle this). If not, drain the node and disable eBPF feature flag.
Recovery time: Minutes if auto-detach works. Hours if manual intervention needed.

Scenario 4: CRD Conversion Webhook Failure During Upgrade

Impact: All Phantom CRD reads/writes fail. Operator cannot reconcile.
Response: Rollback operator deployment. If CRD schema is corrupted, manual intervention required.
Recovery time: Minutes to hours.

Troubleshooting Difficulty: HIGH

The system spans multiple domains: Kubernetes admission control, gRPC, external secret management, optional eBPF, optional CSI, optional confidential computing. Debugging a "secret not available in pod" requires tracing through:

  1. Was the webhook called? (API server audit logs)
  2. Was the sidecar injected? (pod spec)
  3. Did the sidecar start before the main container? (container ordering)
  4. Can the sidecar reach OpenBao? (network, DNS, TLS)
  5. Did OpenBao authenticate the request? (auth method, policy)
  6. Is the secret cached or fetched? (cache tier)
  7. Was the secret delivered to the main container? (IPC socket)

Each step involves different logs, tools, and expertise. This is a multi-skill debugging exercise that requires Kubernetes, networking, and Vault expertise simultaneously.
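
Step 2 of the checklist is at least mechanizable. The sketch below checks a Pod manifest (as returned by `kubectl get pod -o json`) for an injected sidecar; the container name "phantom-sidecar" is an assumption for illustration, not a documented Phantom identifier.

```python
# Illustrative triage helper: was the sidecar injected into this pod?
def find_sidecar(pod: dict, sidecar_name: str = "phantom-sidecar"):
    spec = pod.get("spec", {})
    # Sidecars may appear as regular containers or, on K8s >= 1.28, as
    # native sidecars (initContainers with restartPolicy: Always).
    for field in ("containers", "initContainers"):
        for container in spec.get(field, []):
            if container.get("name") == sidecar_name:
                return field, container
    return None

pod = {"spec": {"initContainers": [{"name": "phantom-sidecar",
                                    "restartPolicy": "Always"}],
                "containers": [{"name": "app"}]}}
print(find_sidecar(pod)[0])  # initContainers
```

A fuller triage tool would continue down the checklist (container start ordering, OpenBao reachability, cache tier), but even this one check eliminates a common class of "secret not available" tickets.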

OpenBao: Single Point of Failure

The Single Biggest Operational Risk

New pods created during an OpenBao outage have no cache to fall back on. The sidecar starts, tries to fetch secrets, fails, and the pod either stays unhealthy or crashes. This means: scaling events fail, node failures cascade, and deployments fail during outages. The caching strategy protects existing pods but not the cluster's ability to heal itself.

Degradation Timeline During OpenBao Outage

| Time Since Outage | State | Impact |
|---|---|---|
| 0 – 5 min | Hot cache serving | No impact. Workloads see no difference. |
| 5 min – 1 hr | Sealed cache serving | Workloads function. New secret versions unavailable. |
| 1 hr – 1 h 15 min | Grace period (stale) | PhantomSecretStale alerts fire. Compliance posture degraded. |
| > 1 h 15 min | Hard failure | Sidecar returns errors. Workloads without proper error handling will crash. |
| New pod starts | Immediate failure | No cache exists for new pods. Pod fails immediately. HPA, node recovery, and deployments all break. |
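
The timeline can be expressed as a small state function. The tier boundaries (5 min hot, sealed until 60 min, grace until 75 min) are taken from this assessment and should be treated as configurable assumptions, not fixed product behavior.

```python
# Degradation state as a function of time since the OpenBao outage began.
def cache_state(minutes_since_outage: float, has_cache: bool = True) -> str:
    if not has_cache:                  # new pod: nothing to fall back on
        return "immediate failure"
    if minutes_since_outage <= 5:
        return "hot cache"
    if minutes_since_outage <= 60:
        return "sealed cache"
    if minutes_since_outage <= 75:
        return "grace (stale)"
    return "hard failure"

print(cache_state(30))                 # sealed cache
print(cache_state(30, has_cache=False))  # immediate failure
```

The `has_cache=False` branch is the key point: elapsed time is irrelevant for new pods, which fail at any point in the outage.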

HA Model Assessment

What Works

  • Raft consensus with 3 nodes tolerates 1 failure
  • Multi-AZ placement survives single AZ outages
  • Multiple endpoint failover in sidecar prevents routing to dead nodes
  • Health checks and connection pooling are appropriate

What's Insufficient

  • 3 nodes is the minimum, not the recommendation. For a sole trust anchor, 5 nodes across 3 AZs is more appropriate.
  • No multi-region HA discussed. A single region outage takes down all secret access.
  • Raft storage limits. At scale, snapshot size causes election timeouts.

Recovery Procedures

| Failure Mode | Recovery |
|---|---|
| Single Raft node failure | Auto-recovery via Raft. No action needed if quorum maintained. |
| Raft quorum loss (2/3 nodes) | Manual intervention: restore from snapshot or rebuild. Extended downtime. |
| Network partition (K8s ↔ OpenBao) | Fix network path. Sidecar auto-reconnects. No data loss. |
| OpenBao seal event | Manual unseal (or auto-unseal if configured). All nodes must be unsealed. |
| Storage corruption | Restore from snapshot backup. Data loss possible if backups are stale. |

Blast Radius

OpenBao is a single point of failure for the entire customer base. If one OpenBao cluster serves multiple K8s clusters, a single outage affects ALL clusters simultaneously. Centralizing the trust anchor centralizes the failure domain. Each customer should have their own OpenBao cluster.

eBPF Operational Challenges

Kernel Compatibility Matrix

| Provider / OS | Kernel | BTF | eBPF Capability |
|---|---|---|---|
| GKE COS | 5.15+ | Full | Full |
| EKS AL2 | 5.10 | Partial | Degraded |
| EKS AL2023 | 6.1+ | Full | Full |
| AKS Ubuntu 22.04 | 5.15+ | Full | Full |
| AKS Mariner 2.0 | 5.15 | Good | Full |

What's Missing From This Analysis

  1. Customer-managed node images. Enterprise customers frequently use custom AMIs/images with hardened kernels that may strip eBPF capabilities or use restrictive seccomp profiles.
  2. Kernel upgrades happen without warning. An eBPF program that works on kernel 5.15.49 might break on 5.15.107 if a BPF verifier change rejects a previously accepted program.
  3. seccomp and AppArmor policies. Many enterprise clusters block CAP_BPF and CAP_SYS_ADMIN. The eBPF DaemonSet needs privileged access — ironic for a security product.
  4. CO-RE limitations. "Compile once, run everywhere" has edge cases with raw tracepoints and certain BPF map types.

Debugging eBPF in Production Is Genuinely Hard

There is no traditional debugger — tooling is limited to bpftool and bpf_trace_printk(). Verifier rejections are cryptic. Diagnosing performance regressions requires manual correlation. Supporting this in the field requires rare and expensive eBPF expertise on the support team.

Testing Matrix Complexity

  • 48+ test scenarios
  • 16+ test environments
  • $8K–32K monthly test infra cost
  • 1 FTE for CI/CD maintenance

Minimum test matrix for CI/CD:

  • 3 cloud providers × 2+ K8s versions × 2+ node OS versions = 12+ test environments
  • Add Autopilot, Fargate variants = 16+ environments
  • Add with/without Istio, with/without existing CSI drivers = 32+ combinations
  • Add canary + stable sidecar versions during upgrade = 48+ test scenarios
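
The matrix arithmetic above can be reconstructed explicitly. The counts are planning estimates, not a definitive CI design; the version and OS labels below are placeholders.

```python
# Rough reconstruction of the test-matrix growth described above.
from itertools import product

providers = ["GKE", "EKS", "AKS"]
k8s_versions = ["n", "n-1"]       # "2+" K8s versions
node_oses = ["os-a", "os-b"]      # "2+" node OS versions

base = list(product(providers, k8s_versions, node_oses))
print(len(base))                  # 12 base environments

# Autopilot and Fargate variants (one per K8s version) push past 16.
variants = base + [("GKE-Autopilot", v, "-") for v in k8s_versions] \
                + [("EKS-Fargate", v, "-") for v in k8s_versions]
print(len(variants))              # 16 environments

# With/without Istio doubles the combinations; the canary + stable
# sidecar dimension during upgrades pushes the scenario count past 48.
mesh = [v + (istio,) for v in variants for istio in ("istio", "no-istio")]
print(len(mesh))                  # 32 combinations
```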

Realistic Assessment

Maintaining this test matrix is a full-time job. Cloud provider test clusters cost $500–2,000/month per environment. At 16 environments minimum, that's $8,000–32,000/month in test infrastructure alone, plus CI/CD engineering time.

Customer Operational Burden

What the Customer Must Operate

| Component | Responsibility | Expertise Required |
|---|---|---|
| OpenBao cluster (HA, Raft) | Customer | Vault/OpenBao administration (rare skill) |
| OpenBao TLS certificates | Customer | PKI management |
| OpenBao backup/restore | Customer | Vault ops |
| OpenBao auth method config | Customer | Vault + K8s auth integration |
| Network egress (NAT/firewall) | Customer | Cloud networking |
| Phantom operator upgrades | Shared | Helm, K8s operator patterns |
| Sidecar version rollouts | Shared | K8s namespace management |
| Monitoring/alerting | Customer | Prometheus, Grafana, AlertManager |
| Incident response | Shared | Multi-domain K8s troubleshooting |
| Node OS upgrades (eBPF) | Customer | eBPF understanding (if enabled) |
| App error handling for stale secrets | Customer's dev teams | Application-level resilience |

"Senior Platform Team" Requirement

Minimum customer team: 1 person with Vault/OpenBao experience (rare), 1 Kubernetes platform engineer, 1 network engineer, plus Prometheus/Grafana access and on-call rotation. Companies with 1–2 DevOps generalists will struggle significantly.

Support Burden Estimation

| Support Tier | Tickets/Month | Common Issues |
|---|---|---|
| L1 (basic) | 5–10 per customer | "Pod won't start" (egress/OpenBao), "secrets not injecting" (labeling) |
| L2 (technical) | 2–5 per customer | eBPF compatibility, sidecar ordering, cache behavior |
| L3 (engineering) | 0–1 per customer | Provider-specific edge cases, CRD conversion bugs |

At 50 Customers

Expect 250–500 L1 tickets/month, plus 100–250 L2 and up to 50 L3. This requires a dedicated support team of 2–3 people minimum.
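
The ticket volume follows directly from the per-customer ranges in the table; a quick sanity check:

```python
# Back-of-envelope ticket volume at 50 customers, from the table above.
customers = 50
tiers = {"L1": (5, 10), "L2": (2, 5), "L3": (0, 1)}  # per-customer ranges
totals = {tier: (lo * customers, hi * customers)
          for tier, (lo, hi) in tiers.items()}
print(totals)  # {'L1': (250, 500), 'L2': (100, 250), 'L3': (0, 50)}
```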

Self-Service vs Managed Service

This product has a natural gravity toward a managed service model. The operational burden of OpenBao + Phantom + monitoring + multi-provider nuances is too high for most customers to self-serve comfortably.

| Model | Revenue | Support Costs | Time-to-Value | Churn Risk |
|---|---|---|---|---|
| Self-service (current plan) | Lower/customer | Higher | Longer | Higher |
| Managed OpenBao + Phantom | Higher/customer | Lower | Faster | Lower |
| Hybrid | Mixed | Mixed | Mixed | Mixed |

Reliability & SLO Assessment

Proposed SLOs

| SLO | Rating | Assessment |
|---|---|---|
| Secret freshness: 99.9% within 2× TTL | Stretch | Achievable under normal conditions. Breached immediately for new pods during OpenBao outages. |
| Webhook availability: not blocked > 5 min | Ambitious | Achievable with circuit breaker. But the escape hatch degrades security — a trade-off customers must understand. |
| Sidecar injection success rate: 99.9% | Reasonable | Achievable. Compatibility check logic should prevent most injection failures. |

Single Points of Failure

| SPOF | Impact | Mitigation |
|---|---|---|
| OpenBao cluster | All secret access fails for all workloads | HA (5-node Raft), multi-region. Still a single logical dependency. |
| Phantom operator pod | Webhook down → circuit breaker → pods Pending | HA deployment (multiple replicas with leader election). |
| Conversion webhook | CRD reads/writes fail | Runs in same pods as operator. If operator is down, both fail. |
| Network path (K8s ↔ OpenBao) | Same as OpenBao down | Redundant network paths (VPN + internet, dual NAT). Customer responsibility. |
| eBPF DaemonSet | Memory monitoring stops (degraded, not hard failure) | Not a SPOF — eBPF is optional. Failure is graceful. |

Key Failure Modes

Acceptable: Raft Leader Election During Load

5–30 second secret fetch failures during leader transition. Hot cache absorbs this. Sidecar retries with backoff. Non-event for most workloads.

Manageable: Webhook Pod OOMKilled

Circuit breaker opens, new pods stuck Pending. Requires proper resource limits, HPA on webhook deployment, circuit breaker with fast recovery. Needs careful capacity planning.

HIGH RISK: OpenBao TLS Certificate Expires

Certificate expiry is the #1 cause of Vault outages in production. All sidecars fail TLS handshake. This WILL happen. Recommendation: build monitoring for "OpenBao TLS cert expires in < 30 days" into the operator.

UNACCEPTABLE: AKS Removes CSI Driver During Upgrade

Documented provider behavior. Cloakfs volumes become inaccessible until CSI driver is re-installed. Consider not supporting Cloakfs on AKS, or make it a first-class operator-managed component with rapid reconciliation.

Recovery Time Objectives

| Scenario | Target RTO | Realistic RTO |
|---|---|---|
| Webhook pod crash | < 1 min | 30–60 seconds |
| OpenBao single node failure | 0 (Raft failover) | < 5 seconds |
| OpenBao quorum loss | < 1 hour | 1–4 hours (manual) |
| Network partition | Depends on cause | 5 min to hours |
| eBPF program failure | < 1 min | Seconds (auto-detach) |
| Full regional outage | < 1 hour | Hours (if multi-region not configured) |

Scaling Operations (100+ Clusters)

Configuration Surface Area

At 100 clusters × 10 namespaces × (values + labels + CRDs + OpenBao policies) = 4,000+ configuration points. Manual management is untenable.

Problems at Scale

  1. OpenBao becomes the bottleneck — 100 clusters × 1,000 pods × 10 secrets = 1 million active leases. Raft performance degrades with lease volume.
  2. Configuration drift — 100 clusters with different K8s versions, node OS versions, and provider configs. Helm values diverge.
  3. Version skew — 5+ operator versions in production. Sidecar versions even more diverse. Need to support N-2 operator and N-3 sidecar versions.
  4. Monitoring at scale — 100 clusters × 8+ metrics. Need multi-cluster dashboards (Thanos/Mimir) and alert deduplication.
  5. Marketplace version management — 3 marketplaces × staggered review timelines = version chaos.
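
The scale arithmetic above (configuration points and lease volume), made explicit. All inputs are this assessment's planning assumptions, not measured figures.

```python
# Configuration surface area at fleet scale.
clusters, namespaces, config_kinds = 100, 10, 4  # values, labels, CRDs, policies
config_points = clusters * namespaces * config_kinds
print(config_points)        # 4000

# Active lease volume hitting OpenBao (point 1 above).
pods_per_cluster, secrets_per_pod = 1_000, 10
active_leases = clusters * pods_per_cluster * secrets_per_pod
print(active_leases)        # 1000000
```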

Missing: Centralized Control Plane

At scale, customers need a single pane of glass, centralized policy management, fleet-wide upgrade orchestration, and cross-cluster secret inventory. This is effectively a separate product (fleet management) requiring 3–6 months with 2–3 engineers.

Key Operational Risks (Ranked)

| # | Risk | Severity | Likelihood | Mitigation |
|---|---|---|---|---|
| 1 | OpenBao outage cascades to all workloads | Critical | Medium | 5-node HA, multi-region, managed OpenBao tier |
| 2 | New pods can't start during OpenBao outage | High | Medium | Bootstrap secret mode or longer-lived sealed cache |
| 3 | OpenBao TLS cert expiry | High | High | Built-in cert expiry monitoring and alerting |
| 4 | Customer can't set up OpenBao HA | High | High | Managed OpenBao offering, detailed runbooks |
| 5 | AKS removes CSI driver on upgrade | High | High | Don't rely on custom CSI on AKS |
| 6 | Multi-provider test matrix exceeds capacity | Medium | High | Focus on one provider first |
| 7 | eBPF kernel compat break after upgrade | Medium | Medium | Make eBPF optional and default-off |
| 8 | Marketplace release delays critical patches | Medium | Medium | Direct-install as primary distribution |
| 9 | Support burden exceeds capacity at scale | Medium | High | Managed service tier, better self-service |
| 10 | CRD conversion webhook failure | Medium | Low | HA operator, rollback runbooks |

Recommendations for Simplification

1. Offer Managed OpenBao as a Core Product Tier

This single change eliminates the top 4 operational risks for customers and accelerates time-to-value from weeks to hours. Pricing: Starter (shared, included) → Professional (dedicated 3-node, +€500/mo) → Enterprise (5-node multi-region, +€2,000/mo).

2. Single Provider First (GKE Standard)

"Works perfectly on one provider" beats "works partially on three." Reduces test matrix from 48+ to 6 scenarios. CI/CD cost drops from $8K–32K/month to $1K–3K/month. Expand to EKS at month 7–9, AKS at month 10–12.

3. Make eBPF Explicitly Optional and Advanced

Don't make it a default feature. The operational complexity (kernel compatibility, debugging difficulty, performance impact) is disproportionate. Most customers care about secrets not being in etcd — they don't need kernel-level memory monitoring.

4. Defer Cloakfs

CSI driver landscape is too fragmented across providers. AKS is unreliable, Autopilot blocks it, Fargate can't run it. Ship Phantom (secrets) first, prove the market, then tackle at-rest encryption.

5. Direct Helm Install Before Marketplace

Marketplace listings add 2–3 months of packaging/certification work per provider. Ship directly first, add marketplace presence as a growth channel, not a launch requirement.

6. Build an Installation Wizard (phantom-cli)

A CLI tool that walks customers through: OpenBao setup → network connectivity → Phantom install → validation. Replaces 10 pages of docs with a guided experience and could plausibly cut L1 support tickets in half.
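
The wizard flow can be sketched as a sequence of gated steps; the step names and structure below are hypothetical, not a spec for phantom-cli.

```python
# Illustrative wizard skeleton: stop at the first failing prerequisite.
STEPS = [
    ("openbao", "Validate OpenBao endpoint, auth method, and policies"),
    ("network", "Verify egress connectivity from the cluster"),
    ("install", "Helm-install the Phantom operator"),
    ("validate", "Run pre-flight checks and confirm secret delivery"),
]

def run_wizard(checks: dict) -> str:
    """checks maps step id -> bool result of that step's probe."""
    for step_id, description in STEPS:
        print(f"[{step_id}] {description}")
        if not checks.get(step_id, False):
            return f"stopped at: {step_id}"
    return "complete"

print(run_wizard({"openbao": True, "network": False}))
# runs the first two steps, then reports: stopped at: network
```

Stopping at the first failure matters: it turns a vague "pod won't start" ticket into a named prerequisite the customer can fix themselves.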

7. Add "Bootstrap Secret" Mode

Allow critical secrets to be pre-provisioned so new pods can start during an OpenBao outage, then refresh when available. Addresses the "single biggest operational risk" identified in this assessment.
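
A minimal sketch of the fallback logic, assuming a hypothetical fetch interface; function and parameter names are illustrative, not Phantom APIs.

```python
# Bootstrap-secret fallback: try the live fetch, fall back to a
# pre-provisioned value, and tell the caller a refresh is still owed.
from typing import Callable, Dict, Tuple

def get_secret(name: str,
               fetch_live: Callable[[str], str],
               bootstrap: Dict[str, str]) -> Tuple[str, bool]:
    """Return (value, is_bootstrap). Raises only if no fallback exists."""
    try:
        return fetch_live(name), False
    except ConnectionError:
        if name in bootstrap:
            # Serve the pre-provisioned value; the caller should schedule
            # a refresh once OpenBao is reachable again.
            return bootstrap[name], True
        raise

def failing_fetch(name: str) -> str:
    raise ConnectionError("OpenBao unreachable")

print(get_secret("db-password", failing_fetch, {"db-password": "bootstrap-v1"}))
# ('bootstrap-v1', True)
```

The `is_bootstrap` flag is the important design choice: serving a bootstrap value silently would hide the degraded state that PhantomSecretStale is meant to surface.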

8. Monitor OpenBao TLS from Inside the Operator

Don't rely on the customer to monitor certificate expiry. Have the sidecar or operator check the certificate chain and alert when expiry is < 30 days.

Improvement Roadmap: 5/10 → 7.5/10

Targeted improvements that are high-impact but moderate-effort — moving the DevOps score from 5/10 to 7.5/10.

| Improvement | Score Impact | Effort | Risks Addressed |
|---|---|---|---|
| A1. Managed OpenBao-as-a-Service | +1.5 | 6–8 weeks | #1, #3, #4 |
| A2. phantom-cli bootstrap tool | +0.5 | 2–3 weeks | Time-to-value, L1 tickets |
| A3. Bootstrap secret mode | +0.75 | 2–3 weeks | #2 (biggest operational risk) |
| A4. Single-provider launch (GKE) | +0.5 | 1 week (planning) | #6 |
| A5. Operational runbook templates | +0.5 | 1 week | Troubleshooting, MTTR |
| A6. SaaS observability dashboard | +0.5 | 4–6 weeks | Monitoring burden |
| A7. Auto-upgrade operator | +0.25 | 3–4 weeks | Day-2 complexity, version skew |
| Total | +4.5 (capped at 7.5/10) | ~16–24 weeks | 8 of top 10 risks |
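
A quick sanity check on the roadmap arithmetic: the per-item impacts sum to +4.5, and the resulting score is capped at 7.5.

```python
# Verify the roadmap totals from the table above.
impacts = {"A1": 1.5, "A2": 0.5, "A3": 0.75, "A4": 0.5,
           "A5": 0.5, "A6": 0.5, "A7": 0.25}
total = sum(impacts.values())
print(total)                    # 4.5
print(min(5.0 + total, 7.5))    # 7.5 (uncapped would be 9.5)
```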

Priority Order

  1. A1 + A3 (Managed OpenBao + Bootstrap Secrets) — address the two highest-severity risks
  2. A2 + A5 (CLI + Runbooks) — fastest to build, immediate support burden reduction
  3. A4 (GKE-first strategy) — a planning decision, not engineering work
  4. A6 + A7 (SaaS dashboard + Auto-upgrade) — polish for Day-2 experience

Operational Feasibility: 5/10 → 7.5/10 (with improvements)

The architecture is sound but operationally demanding. Designed by people who understand Kubernetes security deeply, but the operational burden may exceed what most customers can handle. The multi-provider ambition multiplies every challenge by three. With the recommended improvements — particularly managed OpenBao and a GKE-first strategy — this becomes operationally manageable for teams with standard Kubernetes skills.

Realistic vs Aspirational

What's Realistic (5–7 person team)

  • Phantom core (webhook + sidecar + secrets) on a single provider
  • Pre-flight connectivity checks
  • Circuit breaker pattern
  • Canary injection
  • 3-tier secret caching (gap: new pods during outages)

What's Aspirational

  • Full feature parity across 3 providers + Autopilot + Fargate
  • eBPF monitoring as a standard (not optional) feature
  • Cloakfs CSI across all providers
  • Simultaneous marketplace listings on all three
  • Customer self-service at scale without managed OpenBao