Operational Feasibility
DevOps assessment from a Senior Platform Engineer with production experience across GKE, EKS, AKS, and multi-cluster Kubernetes environments.
Deployment Complexity Assessment
Installation Difficulty: HIGH (7/10)
Phantom is not a "helm install and go" product. It's a multi-component system with an external dependency (OpenBao) that must be operational before anything else works.
Minimum Installation Steps
- Provision and configure OpenBao cluster in the EU — 3+ node Raft cluster, TLS certificates, auth methods, policies. This alone is a 2–5 day project for an experienced Vault/OpenBao operator. For someone unfamiliar, 1–2 weeks.
- Ensure egress connectivity from the target K8s cluster to OpenBao — Cloud NAT (GKE), NAT Gateway (EKS), outbound rules (AKS). On private clusters this is a networking project involving infrastructure teams, firewall rules, and potentially VPN/peering.
- Install the Phantom operator via Helm — the easiest part, but still requires configuring OpenBao endpoints, auth credentials, TLS trust, namespace policies, resource limits.
- Label namespaces and workloads — decide protection scope, configure canary vs stable channels.
- Validate — run pre-flight checks, verify sidecar injection, confirm secret delivery end-to-end.
Per-Provider Deployment Differences
| Step | GKE Standard | GKE Autopilot | EKS | EKS Fargate | AKS |
|---|---|---|---|---|---|
| Egress setup | Cloud NAT | Cloud NAT | NAT Gateway | NAT Gateway | NAT GW / LB rules (mandatory after Mar 2026) |
| Webhook port | Must be 443 (private clusters) | Must be 443 | Any | Any | Any |
| Sidecar QoS | Best practice: Guaranteed | Required: Guaranteed | Flexible | Flexible (no privileged) | Flexible |
| CSI (Cloakfs) | Full support | Blocked (partner list) | Full support | No DaemonSets | Unstable (removed on upgrade) |
| eBPF DaemonSet | Supported | Restricted | Supported | No DaemonSets | Supported |
| Confidential computing | Full SEV-SNP/TDX | Limited | Nitro Enclaves only | Not available | CVM node pools (Kata CC sunsetting) |
No Single Deployment Path
Each provider needs its own runbook, and GKE Autopilot and EKS Fargate are significantly constrained environments where multiple features simply don't work.
Prerequisites & Dependencies
| Prerequisite | Complexity | Who Owns It |
|---|---|---|
| OpenBao cluster (EU, HA) | HIGH — requires Vault/OpenBao expertise | Customer or managed service |
| Network egress to OpenBao | MEDIUM — involves infra/networking team | Customer |
| TLS certificates for OpenBao | MEDIUM — PKI management | Customer |
| Kernel version >= 5.10 (eBPF) | LOW — constrains node OS choices | Customer |
| Namespace labeling strategy | LOW — requires planning | Customer |
| Prometheus/Grafana stack | LOW — likely already exists | Customer |
Time-to-Value
| Customer Profile | Time to First Protected Workload |
|---|---|
| Platform team with Vault experience, existing monitoring | 3–5 days |
| Platform team without Vault experience | 1–3 weeks |
| Small team, first K8s operator experience | 3–6 weeks |
| Enterprise with change management processes | 6–12 weeks (including approvals) |
Deployment Bottleneck
Time-to-value is dominated by OpenBao setup and network configuration, not by Phantom itself. The external trust anchor is both the product's strength and its deployment bottleneck.
Multi-Provider Feature Matrix
Feature Availability
| Feature | GKE Standard | GKE Autopilot | EKS (EC2) | EKS Fargate | AKS |
|---|---|---|---|---|---|
| Webhook injection | Yes | Yes (with exclusions) | Yes | Yes | Yes (with exclusions) |
| Sidecar injection | Yes | Yes (Guaranteed QoS) | Yes | Yes (unprivileged) | Yes |
| eBPF monitoring | Yes | Restricted | Yes (AL2023) / Degraded (AL2) | No | Yes |
| Cloakfs CSI | Yes | No | Yes | No | Unreliable |
| Confidential computing | Full (SEV-SNP + TDX) | Limited | Nitro only | No | CVM pools |
| Pre-flight check | Yes | Yes | Yes | Yes | Yes |
| External OpenBao | Requires Cloud NAT | Requires Cloud NAT | Requires NAT GW | Requires NAT GW | Requires NAT GW/LB |
| Canary injection | Yes | Yes | Yes | Yes | Yes |
Feature Degradation Summary
| Provider + Mode | Phantom Core | eBPF | Cloakfs | Specter | Net Score |
|---|---|---|---|---|---|
| GKE Standard | Full | Full | Full | Full | 100% |
| GKE Autopilot | Full (constrained) | Degraded | None | Limited | 55% |
| EKS (EC2, AL2023) | Full | Full | Full | Different arch | 75% |
| EKS (EC2, AL2) | Full | Degraded | Full | Different arch | 65% |
| EKS Fargate | Full (unprivileged) | None | None | None | 35% |
| AKS | Full | Full | Unreliable | Degraded (Kata sunsetting) | 65% |
Only GKE Standard provides the full feature set
Every other provider/mode requires feature flags, degraded capabilities, or entirely different architectures. This means documentation must be provider-specific, testing must cover every combination, and marketing must be careful not to promise capabilities that only work on one provider.
Operational Differences Per Provider
| Operational Task | GKE | EKS | AKS |
|---|---|---|---|
| Install webhook | Watch port 443, Cloud NAT | Straightforward | Watch Admissions Enforcer |
| Upgrade operator | Standard rolling update | Standard rolling update | CSI driver may be removed |
| eBPF troubleshooting | Full tooling (COS has bpftool) | AL2: limited BTF. AL2023: full | Full tooling |
| Node OS upgrade | Auto-upgrade (may break eBPF) | Managed node group update | Node image + K8s version are separate |
| Marketplace update | Push to Artifact Registry | No hooks, no lookup (EKS add-on) | Bundle all images (CNAB) |
Day-2 Operations
Upgrade/Rollback Complexity: MODERATE-HIGH
What's Good
- Canary injection via namespace labels is a sound pattern
- Operator's `RollingUpdate` with `maxUnavailable: 0` and leader election is standard practice
- N-1 backward compatibility between operator and sidecars is the right approach
What's Concerning
- Mixed sidecar versions unavoidable — long-running pods keep injected sidecars until restarted. N-1 becomes N-3 in practice.
- CRD conversion webhooks add another failure mode in the same failure domain as admission webhook.
- Rollback isn't truly instant — changing a label only affects newly created pods.
- OpenBao upgrades are out of band — no coordination with Phantom upgrades.
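The N-1 policy and its erosion into N-3 can be made concrete with a small version-skew check. This is a sketch of the policy as described, not the operator's actual compatibility logic.

```python
def sidecar_compatible(operator_minor: int, sidecar_minor: int,
                       supported_skew: int = 1) -> bool:
    """N-1 policy: a sidecar is supported when its minor version is at most
    `supported_skew` releases behind the operator, and never ahead of it."""
    skew = operator_minor - sidecar_minor
    return 0 <= skew <= supported_skew

def unsupported_sidecars(operator_minor: int, sidecar_minors: list[int]) -> list[int]:
    """Long-running pods keep their injected sidecar until restarted, so the
    observed fleet can contain versions well outside the supported window."""
    return [v for v in sidecar_minors if not sidecar_compatible(operator_minor, v)]
```

A fleet report like this is what the `phantom_sidecar_version` metric would feed during an upgrade: any non-empty `unsupported_sidecars` result means pods need restarting before the next operator bump.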
Monitoring & Alerting Requirements
| Metric / Alert | Purpose | Criticality |
|---|---|---|
| `phantom_secret_age_seconds` | Secret freshness SLI | P1 — core SLO metric |
| `phantom_secret_fetch_errors_total` | OpenBao connectivity health | P1 — early warning |
| `phantom_sidecar_version` | Version distribution during upgrades | P2 — operational awareness |
| `PhantomWebhookDegraded` | Circuit breaker opened | P1 — pods stuck Pending |
| `PhantomSecretStale` | Serving stale secrets in degraded mode | P1 — security posture degraded |
| OpenBao health (`/sys/health`) | External dependency health | P1 — upstream dependency |
| Webhook latency (p99) | Performance regression | P2 — user experience |
| eBPF program CPU overhead | Kernel-level performance impact | P2 — node health |
Significant Monitoring Surface
Customers need a working Prometheus + AlertManager stack, Grafana dashboards, and on-call processes for at least 5–6 P1 alerts. This isn't unusual for a security-critical component, but it raises the operational bar.
Incident Response Scenarios
Scenario 1: Webhook Down (Circuit Breaker Open)
Impact: All new pods in protected namespaces stuck Pending.
Detection: PhantomWebhookDegraded alert.
Response: Check operator pod health, restart if needed. If prolonged, use `phantom.io/circuit-breaker: "bypass"` on critical namespaces (security trade-off — documented and auditable).
Recovery time: Minutes (if operator pod restarts) to hours (if underlying issue is deeper).
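The fail-open behavior in this scenario can be sketched as a small circuit-breaker state machine. The threshold and reset window are illustrative, not the product's actual defaults.

```python
import time

class WebhookCircuitBreaker:
    """Fail-open breaker sketch for the admission webhook: after `threshold`
    consecutive failures it opens, letting pods admit without injection
    (the documented security trade-off) rather than stay stuck Pending."""

    def __init__(self, threshold: int = 5, reset_after: float = 60.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold and self.opened_at is None:
                self.opened_at = self.clock()

    def is_open(self) -> bool:
        """True while requests should bypass injection."""
        if self.opened_at is None:
            return False
        # After the reset window, close the breaker and probe again.
        if self.clock() - self.opened_at >= self.reset_after:
            self.opened_at = None
            self.failures = 0
            return False
        return True
```

The injectable `clock` keeps the state machine testable; in Kubernetes terms, the open state corresponds to flipping the webhook's effective behavior to fail-open until the operator recovers.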
Scenario 2: OpenBao Unreachable
Impact: New pods can't fetch secrets. Existing pods serve from cache (5 min hot, 1 hr sealed, 15 min grace).
Critical window: 1 hour 15 minutes (1hr sealed cache + 15min grace). After this, workloads start failing. Workloads that restart during the outage have only the 15min grace period.
Recovery time: Depends entirely on OpenBao recovery — could be minutes or hours.
Scenario 3: eBPF Program Causing Node Instability
Impact: Latency spikes, CPU overhead on affected nodes.
Response: Detach eBPF programs (auto-detach threshold should handle this). If not, drain the node and disable eBPF feature flag.
Recovery time: Minutes if auto-detach works. Hours if manual intervention needed.
Scenario 4: CRD Conversion Webhook Failure During Upgrade
Impact: All Phantom CRD reads/writes fail. Operator cannot reconcile.
Response: Rollback operator deployment. If CRD schema is corrupted, manual intervention required.
Recovery time: Minutes to hours.
Troubleshooting Difficulty: HIGH
The system spans multiple domains: Kubernetes admission control, gRPC, external secret management, optional eBPF, optional CSI, optional confidential computing. Debugging a "secret not available in pod" requires tracing through:
- Was the webhook called? (API server audit logs)
- Was the sidecar injected? (pod spec)
- Did the sidecar start before the main container? (container ordering)
- Can the sidecar reach OpenBao? (network, DNS, TLS)
- Did OpenBao authenticate the request? (auth method, policy)
- Is the secret cached or fetched? (cache tier)
- Was the secret delivered to the main container? (IPC socket)
Each step involves different logs, tools, and expertise. This is a multi-skill debugging exercise that requires Kubernetes, networking, and Vault expertise simultaneously.
OpenBao: Single Point of Failure
The Single Biggest Operational Risk
New pods created during an OpenBao outage have no cache to fall back on. The sidecar starts, tries to fetch secrets, fails, and the pod either stays unhealthy or crashes. This means: scaling events fail, node failures cascade, and deployments fail during outages. The caching strategy protects existing pods but not the cluster's ability to heal itself.
Degradation Timeline During OpenBao Outage
| Time Since Outage | State | Impact |
|---|---|---|
| 0 – 5 min | Hot cache serving | No impact. Workloads see no difference. |
| 5 min – 1 hr | Sealed cache serving | Workloads function. New secret versions unavailable. |
| 1 hr – 1h15m | Grace period (stale) | PhantomSecretStale alerts fire. Compliance posture degraded. |
| > 1h15m | Hard failure | Sidecar returns errors. Workloads without proper error handling will crash. |
| New pod starts | Immediate failure | No cache exists for new pods. Pod fails immediately. HPA, node recovery, and deployments all break. |
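The timeline above reduces to a simple state function over elapsed outage time. The tier durations (5 min hot, 1 hr sealed, 15 min grace) are taken from this assessment; the state names are illustrative.

```python
def cache_state(minutes_since_outage: float, new_pod: bool = False) -> str:
    """Map elapsed OpenBao outage time to the sidecar's serving state.

    New pods have no cache to fall back on, so per the table above they
    fail immediately regardless of how recent the outage is.
    """
    if new_pod:
        return "failed"
    if minutes_since_outage <= 5:
        return "hot-cache"       # no visible impact
    if minutes_since_outage <= 60:
        return "sealed-cache"    # functioning, but no new secret versions
    if minutes_since_outage <= 75:
        return "grace-stale"     # PhantomSecretStale firing
    return "hard-failure"        # sidecar returns errors
```

Framing it this way makes the asymmetry explicit: `minutes_since_outage` only matters for pods that existed before the outage, which is why scaling and self-healing break long before the 75-minute hard-failure point.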
HA Model Assessment
What Works
- Raft consensus with 3 nodes tolerates 1 failure
- Multi-AZ placement survives single AZ outages
- Multiple endpoint failover in sidecar prevents routing to dead nodes
- Health checks and connection pooling are appropriate
What's Insufficient
- 3 nodes is the minimum, not the recommendation. For a sole trust anchor, 5 nodes across 3 AZs is more appropriate.
- No multi-region HA discussed. A single region outage takes down all secret access.
- Raft storage limits. At scale, snapshot size causes election timeouts.
Recovery Procedures
| Failure Mode | Recovery |
|---|---|
| Single Raft node failure | Auto-recovery via Raft. No action needed if quorum maintained. |
| Raft quorum loss (2/3 nodes) | Manual intervention: restore from snapshot or rebuild. Extended downtime. |
| Network partition (K8s ↔ OpenBao) | Fix network path. Sidecar auto-reconnects. No data loss. |
| OpenBao seal event | Manual unseal (or auto-unseal if configured). All nodes must be unsealed. |
| Storage corruption | Restore from snapshot backup. Data loss possible if backups are stale. |
Blast Radius
OpenBao is a single point of failure for the entire customer base. If one OpenBao cluster serves multiple K8s clusters, a single outage affects ALL clusters simultaneously. Centralizing the trust anchor centralizes the failure domain. Each customer should have their own OpenBao cluster.
eBPF Operational Challenges
Kernel Compatibility Matrix
| Provider / OS | Kernel | BTF | eBPF Capability |
|---|---|---|---|
| GKE COS | 5.15+ | Full | Full |
| EKS AL2 | 5.10 | Partial | Degraded |
| EKS AL2023 | 6.1+ | Full | Full |
| AKS Ubuntu 22.04 | 5.15+ | Full | Full |
| AKS Mariner 2.0 | 5.15 | Good | Full |
What's Missing From This Analysis
- Customer-managed node images. Enterprise customers frequently use custom AMIs/images with hardened kernels that may strip eBPF capabilities or use restrictive `seccomp` profiles.
- Kernel upgrades happen without warning. An eBPF program that works on kernel 5.15.49 might break on 5.15.107 if a BPF verifier change rejects a previously accepted program.
- seccomp and AppArmor policies. Many enterprise clusters block `CAP_BPF` and `CAP_SYS_ADMIN`. The eBPF DaemonSet needs privileged access — ironic for a security product.
- CO-RE limitations. "Compile once, run everywhere" has edge cases with raw tracepoints and certain BPF map types.
Debugging eBPF in Production Is Genuinely Hard
No traditional debugger — limited to `bpftool` and `bpf_trace_printk()`. Verifier rejections are cryptic. Performance regression diagnosis requires manual correlation. Support burden requires rare and expensive eBPF expertise on the support team.
Testing Matrix Complexity
Minimum test matrix for CI/CD:
- 3 cloud providers × 2+ K8s versions × 2+ node OS versions = 12+ test environments
- Add Autopilot, Fargate variants = 16+ environments
- Add with/without Istio, with/without existing CSI drivers = 32+ combinations
- Add canary + stable sidecar versions during upgrade = 48+ test scenarios
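The combinatorics above can be checked with a quick enumeration. The specific K8s and node-OS versions here are placeholders; only the axis counts matter.

```python
from itertools import product

providers = ["GKE", "EKS", "AKS"]
k8s_versions = ["1.29", "1.30"]        # illustrative: any 2+ supported versions
node_os = ["default", "alternate"]     # e.g. COS vs Ubuntu, AL2 vs AL2023

# Base matrix: every provider x K8s version x node OS combination.
base = list(product(providers, k8s_versions, node_os))

# Constrained modes (Autopilot, Fargate) are whole extra environments,
# not just another value on an existing axis.
modes = (base
         + [("GKE-Autopilot", v, "managed") for v in k8s_versions]
         + [("EKS-Fargate", v, "managed") for v in k8s_versions])

print(len(base), len(modes))  # 12 16
```

Each further binary axis (Istio yes/no, existing CSI yes/no, canary-vs-stable during upgrade) doubles or worse from here, which is how 16 environments becomes 48+ test scenarios.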
Realistic Assessment
Maintaining this test matrix is a full-time job. Cloud provider test clusters cost $500–2,000/month per environment. At 16 environments minimum, that's $8,000–32,000/month in test infrastructure alone, plus CI/CD engineering time.
Customer Operational Burden
What the Customer Must Operate
| Component | Responsibility | Expertise Required |
|---|---|---|
| OpenBao cluster (HA, Raft) | Customer | Vault/OpenBao administration (rare skill) |
| OpenBao TLS certificates | Customer | PKI management |
| OpenBao backup/restore | Customer | Vault ops |
| OpenBao auth method config | Customer | Vault + K8s auth integration |
| Network egress (NAT/firewall) | Customer | Cloud networking |
| Phantom operator upgrades | Shared | Helm, K8s operator patterns |
| Sidecar version rollouts | Shared | K8s namespace management |
| Monitoring/alerting | Customer | Prometheus, Grafana, AlertManager |
| Incident response | Shared | Multi-domain K8s troubleshooting |
| Node OS upgrades (eBPF) | Customer | eBPF understanding (if enabled) |
| App error handling for stale secrets | Customer's dev teams | Application-level resilience |
"Senior Platform Team" Requirement
Minimum customer team: 1 person with Vault/OpenBao experience (rare), 1 Kubernetes platform engineer, 1 network engineer, plus Prometheus/Grafana access and on-call rotation. Companies with 1–2 DevOps generalists will struggle significantly.
Support Burden Estimation
| Support Tier | Tickets/Month | Common Issues |
|---|---|---|
| L1 (basic) | 5–10 per customer | "Pod won't start" (egress/OpenBao), "secrets not injecting" (labeling) |
| L2 (technical) | 2–5 per customer | eBPF compatibility, sidecar ordering, cache behavior |
| L3 (engineering) | 0–1 per customer | Provider-specific edge cases, CRD conversion bugs |
At 50 Customers
Expect 250–500 L1 tickets/month, plus 100–250 L2, for 350–750 tickets/month overall. This requires a dedicated support team of 2–3 people minimum.
Self-Service vs Managed Service
This product has a natural gravity toward a managed service model. The operational burden of OpenBao + Phantom + monitoring + multi-provider nuances is too high for most customers to self-serve comfortably.
| Model | Revenue | Support Costs | Time-to-Value | Churn Risk |
|---|---|---|---|---|
| Self-service (current plan) | Lower/customer | Higher | Longer | Higher |
| Managed OpenBao + Phantom | Higher/customer | Lower | Faster | Lower |
| Hybrid | Mixed | Mixed | Mixed | Mixed |
Reliability & SLO Assessment
Proposed SLOs
| SLO | Target | Achievable? |
|---|---|---|
| Secret freshness: 99.9% within 2× TTL | Stretch | Achievable under normal conditions. Breached immediately for new pods during OpenBao outages. |
| Webhook availability: not blocked > 5 min | Ambitious | Achievable with circuit breaker. But the escape hatch degrades security — a trade-off customers must understand. |
| Sidecar injection success rate: 99.9% | Reasonable | Achievable. Compatibility check logic should prevent most injection failures. |
Single Points of Failure
| SPOF | Impact | Mitigation |
|---|---|---|
| OpenBao cluster | All secret access fails for all workloads | HA (Raft 5-node), multi-region. Still a single logical dependency. |
| Phantom operator pod | Webhook down → circuit breaker → pods Pending | HA deployment (multiple replicas with leader election). |
| Conversion webhook | CRD reads/writes fail | Runs in same pods as operator. If operator is down, both fail. |
| Network path (K8s ↔ OpenBao) | Same as OpenBao down | Redundant network paths (VPN + internet, dual NAT). Customer responsibility. |
| eBPF DaemonSet | Memory monitoring stops (degraded, not hard failure) | Not a SPOF — eBPF is optional. Failure is graceful. |
Key Failure Modes
Acceptable: Raft Leader Election During Load
5–30 second secret fetch failures during leader transition. Hot cache absorbs this. Sidecar retries with backoff. Non-event for most workloads.
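The "retries with backoff" behavior can be sketched as exponential backoff with full jitter; the base delay, cap, and attempt count here are illustrative, not the sidecar's actual tuning.

```python
import random

def backoff_schedule(base: float = 0.5, cap: float = 30.0,
                     attempts: int = 6) -> list[float]:
    """Exponential backoff with full jitter for secret-fetch retries.

    A 5-30 second Raft leader transition is absorbed by the hot cache
    while these retries run in the background; jitter prevents every
    sidecar from hammering the new leader at the same instant.
    """
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0.0, ceiling))
    return delays
```

Full jitter (uniform over the whole window rather than a fixed exponential delay) matters here precisely because a leader election synchronizes failures across thousands of sidecars at once.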
Manageable: Webhook Pod OOMKilled
Circuit breaker opens, new pods stuck Pending. Requires proper resource limits, HPA on webhook deployment, circuit breaker with fast recovery. Needs careful capacity planning.
HIGH RISK: OpenBao TLS Certificate Expires
Certificate expiry is the #1 cause of Vault outages in production. All sidecars fail TLS handshake. This WILL happen. Recommendation: build monitoring for "OpenBao TLS cert expires in < 30 days" into the operator.
UNACCEPTABLE: AKS Removes CSI Driver During Upgrade
Documented provider behavior. Cloakfs volumes become inaccessible until CSI driver is re-installed. Consider not supporting Cloakfs on AKS, or make it a first-class operator-managed component with rapid reconciliation.
Recovery Time Objectives
| Scenario | Target RTO | Realistic RTO |
|---|---|---|
| Webhook pod crash | < 1 min | 30–60 seconds |
| OpenBao single node failure | 0 (Raft failover) | < 5 seconds |
| OpenBao quorum loss | < 1 hour | 1–4 hours (manual) |
| Network partition | Depends on cause | 5 min to hours |
| eBPF program failure | < 1 min | Seconds (auto-detach) |
| Full regional outage | < 1 hour | Hours (if multi-region not configured) |
Scaling Operations (100+ Clusters)
Configuration Surface Area
At 100 clusters × 10 namespaces × (values + labels + CRDs + OpenBao policies) = 4,000+ configuration points. Manual management is untenable.
Problems at Scale
- OpenBao becomes the bottleneck — 100 clusters × 1,000 pods × 10 secrets = 1 million active leases. Raft performance degrades with lease volume.
- Configuration drift — 100 clusters with different K8s versions, node OS versions, and provider configs. Helm values diverge.
- Version skew — 5+ operator versions in production. Sidecar versions even more diverse. Need to support N-2 operator and N-3 sidecar versions.
- Monitoring at scale — 100 clusters × 8+ metrics. Need multi-cluster dashboards (Thanos/Mimir) and alert deduplication.
- Marketplace version management — 3 marketplaces × staggered review timelines = version chaos.
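The figures above are back-of-envelope arithmetic; all inputs are the illustrative per-cluster counts used in this assessment.

```python
clusters = 100
namespaces_per_cluster = 10
config_artifacts = 4  # Helm values, labels, CRDs, OpenBao policies

config_points = clusters * namespaces_per_cluster * config_artifacts
print(config_points)   # 4000 configuration points

pods_per_cluster = 1_000
secrets_per_pod = 10
active_leases = clusters * pods_per_cluster * secrets_per_pod
print(active_leases)   # 1,000,000 active leases on one OpenBao cluster
```

The lease number is the one to watch: it scales with the product of fleet size, pod density, and secret count, so it grows much faster than the configuration surface.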
Missing: Centralized Control Plane
At scale, customers need a single pane of glass, centralized policy management, fleet-wide upgrade orchestration, and cross-cluster secret inventory. This is effectively a separate product (fleet management) requiring 3–6 months with 2–3 engineers.
Key Operational Risks (Ranked)
| # | Risk | Severity | Likelihood | Mitigation |
|---|---|---|---|---|
| 1 | OpenBao outage cascades to all workloads | Critical | Medium | 5-node HA, multi-region, managed OpenBao tier |
| 2 | New pods can't start during OpenBao outage | High | Medium | Bootstrap secret mode or longer-lived sealed cache |
| 3 | OpenBao TLS cert expiry | High | High | Built-in cert expiry monitoring and alerting |
| 4 | Customer can't set up OpenBao HA | High | High | Managed OpenBao offering, detailed runbooks |
| 5 | AKS removes CSI driver on upgrade | High | High | Don't rely on custom CSI on AKS |
| 6 | Multi-provider test matrix exceeds capacity | Medium | High | Focus on one provider first |
| 7 | eBPF kernel compat break after upgrade | Medium | Medium | Make eBPF optional and default-off |
| 8 | Marketplace release delays critical patches | Medium | Medium | Direct-install as primary distribution |
| 9 | Support burden exceeds capacity at scale | Medium | High | Managed service tier, better self-service |
| 10 | CRD conversion webhook failure | Medium | Low | HA operator, rollback runbooks |
Recommendations for Simplification
1. Offer Managed OpenBao as a Core Product Tier
This single change eliminates the top 4 operational risks for customers and accelerates time-to-value from weeks to hours. Pricing: Starter (shared, included) → Professional (dedicated 3-node, +€500/mo) → Enterprise (5-node multi-region, +€2,000/mo).
2. Single Provider First (GKE Standard)
"Works perfectly on one provider" beats "works partially on three." Reduces test matrix from 48+ to 6 scenarios. CI/CD cost drops from $8K–32K/month to $1K–3K/month. Expand to EKS at month 7–9, AKS at month 10–12.
3. Make eBPF Explicitly Optional and Advanced
Don't make it a default feature. The operational complexity (kernel compatibility, debugging difficulty, performance impact) is disproportionate. Most customers care about secrets not being in etcd — they don't need kernel-level memory monitoring.
4. Defer Cloakfs
CSI driver landscape is too fragmented across providers. AKS is unreliable, Autopilot blocks it, Fargate can't run it. Ship Phantom (secrets) first, prove the market, then tackle at-rest encryption.
5. Direct Helm Install Before Marketplace
Marketplace listings add 2–3 months of packaging/certification work per provider. Ship directly first, add marketplace presence as a growth channel, not a launch requirement.
6. Build an Installation Wizard (phantom-cli)
A CLI tool that walks customers through: OpenBao setup → network connectivity → Phantom install → validation. Replaces 10 pages of docs with a guided experience. Reduces L1 support tickets by 50%.
7. Add "Bootstrap Secret" Mode
Allow critical secrets to be pre-provisioned so new pods can start during an OpenBao outage, then refresh when available. Addresses the "single biggest operational risk" identified in this assessment.
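A minimal sketch of the fallback path, assuming a hypothetical on-disk layout for pre-provisioned secrets; `openbao_fetch` stands in for the real client call.

```python
import json
from pathlib import Path

def fetch_secret(name: str, openbao_fetch, bootstrap_dir: Path) -> dict:
    """Bootstrap-secret mode: prefer a live OpenBao fetch, fall back to a
    pre-provisioned file so new pods can start during an outage.

    Fallback reads should be flagged as stale in metrics and refreshed
    as soon as OpenBao is reachable again.
    """
    try:
        return openbao_fetch(name)
    except ConnectionError:
        path = bootstrap_dir / f"{name}.json"  # hypothetical layout
        if path.exists():
            return json.loads(path.read_text())
        raise  # no bootstrap copy: surface the outage to the workload
```

The key design choice is that bootstrap copies are explicitly opted in per secret: the trade-off is a secret at rest on the node versus a cluster that cannot heal itself, and not every secret warrants it.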
8. Monitor OpenBao TLS from Inside the Operator
Don't rely on the customer to monitor certificate expiry. Have the sidecar or operator check the certificate chain and alert when expiry is < 30 days.
Improvement Roadmap: 5/10 → 7.5/10
Targeted improvements that are high-impact but moderate-effort — moving the DevOps score from 5/10 to 7.5/10.
| Improvement | Score Impact | Effort | Risks Addressed |
|---|---|---|---|
| A1. Managed OpenBao-as-a-Service | +1.5 | 6–8 weeks | #1, #3, #4 |
| A2. phantom-cli bootstrap tool | +0.5 | 2–3 weeks | Time-to-value, L1 tickets |
| A3. Bootstrap secret mode | +0.75 | 2–3 weeks | #2 (biggest operational risk) |
| A4. Single-provider launch (GKE) | +0.5 | 1 week (planning) | #6 |
| A5. Operational runbook templates | +0.5 | 1 week | Troubleshooting, MTTR |
| A6. SaaS observability dashboard | +0.5 | 4–6 weeks | Monitoring burden |
| A7. Auto-upgrade operator | +0.25 | 3–4 weeks | Day-2 complexity, version skew |
| Total | +4.5 (capped at 7.5/10) | ~16–24 weeks | 8 of top 10 risks |
Priority Order
- A1 + A3 (Managed OpenBao + Bootstrap Secrets) — address the two highest-severity risks
- A2 + A5 (CLI + Runbooks) — fastest to build, immediate support burden reduction
- A4 (GKE-first strategy) — a planning decision, not engineering work
- A6 + A7 (SaaS dashboard + Auto-upgrade) — polish for Day-2 experience
Realistic vs Aspirational
What's Realistic (5–7 person team)
- Phantom core (webhook + sidecar + secrets) on a single provider
- Pre-flight connectivity checks
- Circuit breaker pattern
- Canary injection
- 3-tier secret caching (gap: new pods during outages)
What's Aspirational
- Full feature parity across 3 providers + Autopilot + Fargate
- eBPF monitoring as a standard (not optional) feature
- Cloakfs CSI across all providers
- Simultaneous marketplace listings on all three
- Customer self-service at scale without managed OpenBao