What Going Containerized Committed You To

Mon, Apr 8, 2024
9-minute read

The original board memo cited two reasons for the container migration: deployment consistency and faster shipping. Four years later, the bill arrived in five separate programs (orchestration, networking, security, observability, a platform team), none of which had been on the slide. Every line item was justifiable. None of them was in the original commitment.

The platform team grew by one engineer at a time. The service mesh arrived as a P0 fix. The logging bill compounded quarter by quarter until finance was the one who escalated. No single line item triggered an architectural review on its own. The pattern that holds across most container migrations is exactly this: the running total stays invisible because no one is assembling it.

flowchart TB
    Sale["Original sale<br/>consistency,<br/>portability, parity"] --> Real["What shipped"]
    Real --> Orch[Orchestration program]
    Real --> Net[Networking model]
    Real --> Sec[New security posture]
    Real --> Obs[Observability rewrite]
    Real --> Team[Platform team]
    style Sale fill:#eaf2fa
    style Real fill:#fff5e0
    style Orch fill:#fff5e0
    style Net fill:#fff5e0
    style Sec fill:#fff5e0
    style Obs fill:#fff5e0
    style Team fill:#fff5e0

Figure 1. The original conversation listed two or three reasons. The bill arrived in five separate programs, each owned by a different team, none of them prompting an architectural review on its own. The lack of a single reflection point is what kept the running total invisible.

The promise and the bill

The promise was clean. Run the same image in dev, staging, and production. Stop debugging environment drift. Ship faster. Match where the industry was going. Most of those promises were kept. What didn’t get discussed at the time was what the receipt looked like a few years out.

The receipt is rarely a single check. It is a platform team that wasn’t in the original budget, a service mesh that nobody planned to need, a security program that had to be rebuilt from a different starting point, and an observability stack that costs more than the underlying infrastructure. Each line item arrived gradually, owned by a different team, with no single moment that triggered a review.

I’m not arguing those costs were wrong. I’m arguing they should have been visible at the start. The decision frame “containers, yes or no” couldn’t carry that weight. The honest frame, “are we ready to run a platform organization,” would have surfaced the same conversation three years earlier.

The orchestration commitment

You didn’t adopt containers. You adopted Kubernetes, or a managed equivalent that’s still Kubernetes underneath. The day someone in the org standardized on kubectl, the company became a Kubernetes shop, and that’s a different commitment than running Docker on a VM.

The platform team that wasn’t in the original budget arrives within two years. The control-plane upgrade treadmill, the deprecated APIs, the IRSA-and-CRD vocabulary that didn’t exist in the previous VM-based stack: all of this requires engineers who have already learned that material somewhere else. The engineers you had at adoption time are rarely the engineers who want to spend the next several years on cluster operations.

The hiring market shift is the part that surprises leadership most. The engineers who do cluster operations well are scarce and expensive, and the price doesn’t go down. The org chart looks different than it did at adoption. That’s not a failure. It’s the commitment, made visible.

Below is a representative Deployment spec that illustrates the operational floor for a production workload on Kubernetes: resource requests and limits, liveness and readiness probes, and a rolling-update strategy. Kubernetes will accept a Deployment without any of them, but they aren’t optional refinements; they’re the minimum responsible configuration for a workload running at production scale.

# kubernetes/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: commerce
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: registry.example.com/order-service:v2.4.1
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10

The networking model shift

Containers turned east-west traffic into a first-class problem. The old architecture had a load balancer, a few VLANs, and a handful of internal services calling each other over well-known DNS names. The new architecture has hundreds of pods talking to each other over an overlay network that the previous network team didn’t design.

A service mesh follows. Not because anyone wanted one, but because mTLS, retries, traffic policy, and circuit-breaking have to live somewhere, and the application teams aren’t going to implement them in every service. CNI plugins, IP exhaustion in flat networks, pod-to-pod policy: each of these arrived as a problem the platform team had to learn on the way to solving. The Dockerfile-era reasoning the org had built up turned out to cover a small fraction of what containerized networking demanded. The architectural decisions hidden in a Dockerfile compound at the cluster level, and the team often discovers the compounding only after the first cross-namespace incident.
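To make "pod-to-pod policy" concrete, here is a minimal sketch of what it looks like for the order-service Deployment shown earlier. It is an illustration, not a recommended baseline: the caller label `app: checkout` and both policy names are hypothetical, and the policies only take effect if the cluster's CNI plugin enforces NetworkPolicy.

```yaml
# Sketch only: hypothetical policies for the "commerce" namespace.
# Assumes a CNI plugin that enforces NetworkPolicy; otherwise these
# objects are stored by the API server but have no effect on traffic.
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress        # hypothetical name
  namespace: commerce
spec:
  podSelector: {}                   # empty selector: every pod in the namespace
  policyTypes:
    - Ingress                       # no ingress rules listed, so inbound traffic is denied
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-checkout-to-order-service   # hypothetical name
  namespace: commerce
spec:
  podSelector:
    matchLabels:
      app: order-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: checkout         # hypothetical caller label
      ports:
        - protocol: TCP
          port: 8080
```

Two objects cover one caller and one callee. Multiply that by every service pair in the cluster and the case for a team that owns this layer starts to make itself. Note that this handles reachability only; mTLS, retries, and circuit-breaking are outside what NetworkPolicy expresses.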

I’m not arguing service meshes are wrong. I’m arguing they’re the cost of microservice traffic at this scale, not a separable decision. A team that says “we adopted containers but not a service mesh” usually means “we adopted containers and have not yet experienced the incident that would have required one.”

The new security model

The security model that worked for VMs (host-level controls, network segmentation, a security agent on every box) doesn’t carry over. Container scanning, image provenance, runtime detection, admission controllers: these are net-new programs, not extensions of existing ones.

The vocabulary shift alone is expensive. Security teams that spent a decade on host-based control catalogs find themselves learning pod security standards, CRD-level admission policy, and the difference between cluster-level, namespace-level, and pod-level controls. Some senior engineers retire and don’t retrain. The teams that replace them don’t have the institutional context for which workloads are sensitive and which are noise.
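As one small example of that vocabulary, this is roughly what a namespace-level control looks like with the built-in Pod Security admission labels. The choice of levels is an assumption for illustration; the right levels depend on what the workloads in the namespace actually need.

```yaml
# Sketch only: Pod Security Standards applied at the namespace level.
# The levels chosen here are illustrative assumptions, not a recommendation.
apiVersion: v1
kind: Namespace
metadata:
  name: commerce
  labels:
    # Reject pods that violate the "baseline" profile.
    pod-security.kubernetes.io/enforce: baseline
    # Record audit annotations and show warnings for anything that would
    # fail the stricter "restricted" profile, without blocking it.
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```

None of this maps onto a host-based control catalog, which is the point: the control is a label on a namespace, enforced by an admission step, not an agent on a box.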

The shared-kernel question is the one most teams under-examine. For most workloads it doesn’t matter. For the ones that hold customer data or carry regulatory weight, “containers run on a shared kernel” is a sentence the security team should have a clean answer to. Most don’t. Whether the choice of container runtime matters for those workloads depends on a threat model the org rarely revisits after adoption.
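If the threat model does call for stronger isolation on specific workloads, one mechanism Kubernetes offers is a RuntimeClass. The sketch below assumes the nodes already have a sandboxed runtime (gVisor is one example) configured under the handler name `runsc`; the class name, pod name, and image are hypothetical.

```yaml
# Sketch only: route a sensitive workload to a sandboxed runtime.
# Assumes the node's container runtime is configured with a handler
# named "runsc"; without that, pods using this class will not start.
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: sandboxed                   # hypothetical class name
handler: runsc                      # assumption: handler name configured on the nodes
---
apiVersion: v1
kind: Pod
metadata:
  name: customer-data-worker        # hypothetical workload
  namespace: commerce
spec:
  runtimeClassName: sandboxed       # opt this pod into the sandboxed runtime
  containers:
    - name: worker
      image: registry.example.com/customer-data-worker:v1.0.0   # hypothetical image
```

The manifest is the easy part. Deciding which workloads belong in the sandbox, and accepting the performance and compatibility trade-offs, is the threat-model conversation itself.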

Secrets management got harder, not easier. The fact that secrets injection is a solved problem in container platforms hides the harder work: who can read which secret, what gets rotated, what happens when the secret backend has an outage. The old model put the secret on disk and hoped the box wasn’t compromised. The new model multiplies the moving parts and makes the audit story harder, not simpler.
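The "who can read which secret" question usually ends up expressed as RBAC. A minimal sketch, assuming a Secret named `order-db-credentials` and a ServiceAccount named `order-service` (both hypothetical):

```yaml
# Sketch only: allow one service account to read one named secret
# through the Kubernetes API. All names here are hypothetical.
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: read-order-db-credentials
  namespace: commerce
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["order-db-credentials"]   # restrict to a single secret
    verbs: ["get"]                            # no list or watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: order-service-reads-db-credentials
  namespace: commerce
subjects:
  - kind: ServiceAccount
    name: order-service
    namespace: commerce
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: read-order-db-credentials
```

This covers API reads only. It says nothing about rotation, about secrets mounted into pods as volumes, or about what happens when the backend is down, which is where the audit story gets hard.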

The observability rewrite

Logs, metrics, and traces are where the bill lands hardest. None of the previous tooling assumed ephemeral workloads, dynamic IPs, and high cardinality as the default state of the world.
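On the metrics side, cardinality control often comes down to dropping labels before they are stored. The fragment below is a sketch that assumes Prometheus as the metrics backend, which is my assumption for illustration; the job name and the label names are hypothetical.

```yaml
# Sketch only: a Prometheus scrape job that drops per-request labels
# at ingestion. Assumes Prometheus; label names are hypothetical.
scrape_configs:
  - job_name: kubernetes-pods       # hypothetical job name
    kubernetes_sd_configs:
      - role: pod                   # discover scrape targets from pods
    metric_relabel_configs:
      # Remove labels whose names match the regex from every scraped sample.
      - action: labeldrop
        regex: "request_id|session_id"
```

Dropping a label is only safe if the remaining labels still identify each series uniquely; otherwise samples collide. Working that out per metric is why a cleanup like this tends to be a project rather than a config change.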

A SaaS company I worked with, one that had otherwise run the Kubernetes migration well, saw their log spend grow by an order of magnitude in the first year and a half. Finance noticed before engineering did. The cardinality cleanup project that followed consumed a couple of engineers for roughly half a year, and the savings paid back the project within a quarter. The lesson wasn’t “logging is too expensive.” It was that “we’ll figure out logging later” is a budget item that compounds quietly until someone in finance is the one who escalates.

The skills shift inside the operations team is the part that doesn’t show up on a budget line. The senior operator who used to read syslog now has to read distributed traces. The dashboards they used for years are useless against a fleet of pods that shifts with every deploy. Some of those operators retrain. Some leave. The cost of replacing them is in the platform team’s hiring budget, not the observability budget, but it’s the same bill.

What each commitment cost

The table below assembles the bill each program presented, measured against the original commitment language. None of these line items appeared in the migration proposal.

Program	Original frame	Actual commitment	Common surprise
Orchestration	“Docker on managed clusters”	Kubernetes expertise, upgrade treadmill, dedicated platform team	Hiring cost: Kubernetes engineers are expensive and scarce
Networking	“Same load balancer, new instances”	Service mesh, CNI selection, pod-to-pod policy, IP planning	Mesh arrives as a P0 after an incident, not a planned project
Security	“Extend current host controls”	Net-new programs: scanning, admission, runtime detection	Security team has to retrain or be replaced
Observability	“Move logs to a new backend”	Full-stack rewrite for ephemeral workloads; cardinality planning	Log spend grows before the team notices
Platform team	“DevOps helps coordinate”	Dedicated team, often 4-10 engineers, within 18-24 months	Budget appears one headcount at a time, never as a line item

flowchart TD
    Decision["Container adoption<br/>approved"] --> Orch["Orchestration<br/>complexity"]
    Decision --> Net["Networking<br/>redesign"]
    Decision --> Sec["Security<br/>model gap"]
    Decision --> Obs["Observability<br/>rewrite"]
    Orch --> PT["Platform team<br/>funded"]
    Net --> Mesh["Service mesh<br/>P0 incident"]
    Sec --> Prog["Net-new<br/>security program"]
    Obs --> Fin["Finance<br/>escalates cost"]
    PT --> Bill["Running total<br/>visible"]
    Mesh --> Bill
    Prog --> Bill
    Fin --> Bill
    style Decision fill:#eaf2fa
    style Bill fill:#fff5e0
    style Mesh fill:#fdd
    style Fin fill:#fdd

Figure 2. Each program arrives through a different trigger: a staffing conversation, a P0 incident, a finance escalation. Because no single trigger prompts a consolidated review, the running total stays invisible until someone assembles it deliberately.

Reading this for the next decision

Serverless, edge computing, AI platforms. Each of these will arrive with the same shape of bill: new programs you don’t currently run, new skills you don’t currently have, new cost surfaces you can’t currently model, new incident shapes the on-call rotation hasn’t seen before.

Before adopting any of them, ask the questions the container conversation didn’t ask. What new programs does this commit us to running? What capabilities does it require that we don’t have? What does an incident look like, and who’s on call for it? What does the bill look like five years from now, not six months from now?

The honest answer is sometimes “not yet,” and that answer is a sign of organizational health, not conservatism. The teams that adopt every new platform on the same cadence end up running five platform programs, none of them well. The teams that adopt one at a time, and read the bill each time, end up with one program they understand.

The CTOs who finish reviewing their container migration usually end the conversation by asking what the AI infrastructure conversation should look like. The honest answer: it should look like the container conversation they wish they’d had four years earlier. Name the bill before you sign for it.
