Blue-Green Deployments: How to Ship Without Downtime

The Blue-Green Pattern

Blue-green deployment maintains two identical production environments. At any time, one (blue) is serving traffic and the other (green) is idle or running the new version. A traffic switch flips users from blue to green; rollback flips them back.

The model is conceptually simple, which is the entire appeal. Compared to in-place rolling updates, there’s no extended period when some users are on the old version and some on the new: for new requests, the switch looks atomic from the user’s perspective.

Implementation Approaches

At the DNS level: change a CNAME or weighted record to point to the new environment. Simple but slow — DNS TTLs and resolver caching delay propagation by minutes or longer.

At the load balancer level: shift target groups or backend pools. Faster than DNS, atomic from the LB’s perspective, and easy to roll back.

At the service mesh level: shift traffic between deployments inside Kubernetes via Istio, Linkerd, or similar. Most flexible and integrates with canary patterns.
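
The load balancer variant can be pictured as a single pointer flip. The sketch below is a toy in-memory model, not any real load balancer API; the `BlueGreenRouter` class, its method names, and the backend addresses are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BlueGreenRouter:
    """Toy model of a load balancer that sends all new traffic to one pool."""

    pools: dict[str, list[str]]  # pool name -> backend addresses (made up)
    active: str = "blue"
    history: list[str] = field(default_factory=list)

    def route(self) -> list[str]:
        # Every new request goes to the backends of the active pool.
        return self.pools[self.active]

    def switch_to(self, target: str) -> None:
        if target not in self.pools:
            raise ValueError(f"unknown pool: {target}")
        if not self.pools[target]:
            raise ValueError(f"pool {target} has no backends")
        self.history.append(self.active)
        self.active = target  # this single assignment is the "atomic" flip

    def rollback(self) -> None:
        if not self.history:
            raise RuntimeError("nothing to roll back to")
        self.active = self.history.pop()


router = BlueGreenRouter(pools={"blue": ["10.0.0.1"], "green": ["10.0.1.1"]})
router.switch_to("green")
```

Calling `router.rollback()` restores the previous pool, which is why the old environment has to stay registered and healthy for as long as rollback should be fast.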

The Database Problem

Stateless services fit the blue-green model cleanly. Stateful services do not. Two environments sharing one database means every schema change has to be backward-compatible with both code versions for as long as both can receive traffic.

The standard pattern: expand-contract migrations. Expand first: add new columns or tables without removing old ones. Deploy new code that writes to both the old and new structures and can read from either. Once traffic is fully cut over and confidence is established, contract: remove the old columns or tables in a separate deployment. The Google SRE workbook covers the patterns for managing state during deployments.
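
The expand phase can be sketched with SQLite from the Python standard library. The `users` table, the `display_name` column, and the helper functions are assumptions made up for illustration; a real migration would go through whatever migration tool the team already uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")

# Expand: add the new column as nullable so the old (blue) code, which
# never mentions it, keeps working unchanged.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def old_code_insert(name: str) -> None:
    # Blue: only knows about the original column.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def new_code_insert(name: str) -> None:
    # Green: writes both columns so either version can read the row.
    conn.execute(
        "INSERT INTO users (name, display_name) VALUES (?, ?)", (name, name)
    )


def new_code_read(user_id: int) -> str:
    # Green: prefers the new column and falls back to the old one for
    # rows written by blue or not yet backfilled.
    row = conn.execute(
        "SELECT COALESCE(display_name, name) FROM users WHERE id = ?",
        (user_id,),
    ).fetchone()
    return row[0]


old_code_insert("Grace Hopper")
new_code_insert("Alan Turing")

# Backfill rows the old code wrote. Dropping `name` (the contract step)
# belongs in a later, separate deployment once blue is retired.
conn.execute("UPDATE users SET display_name = name WHERE display_name IS NULL")
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE display_name IS NULL"
).fetchone()[0]
```

Both versions can insert and read at every point in this sequence, which is the property that keeps a rollback to blue safe.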

When Blue-Green Is the Wrong Choice

Blue-green doubles infrastructure cost during the switch window. For some workloads — large clusters, expensive accelerators — that’s prohibitive.

Canary deployments solve a different problem: testing new versions on a small percentage of traffic before full rollout. Blue-green is binary; canary is gradual. For risk-sensitive changes, canary is usually better.

Rolling updates are the default in Kubernetes for a reason: they don’t require double infrastructure and they handle most low-risk deployments fine.

Operational Considerations

Practice rollbacks. A blue-green setup that can deploy but has never been rolled back is likely to fail the first time you actually need the rollback.

Verify the green environment before switching. Run smoke tests, integration checks, and synthetic transactions. The switch is fast; the verification before it is what makes blue-green safe.

Connection Draining and Long-Lived Connections

The atomic-switch claim of blue-green deployments hides a real complication: long-lived connections. WebSockets, gRPC streams, and database connection pools don’t move with a load balancer switch. Existing connections continue to hit the old environment until they’re closed.

The practical pattern: configure connection draining on the load balancer to allow in-flight connections to complete, set a maximum drain duration (5-10 minutes is common), and accept that some connections will be forcibly closed at the boundary. Clients with reconnect logic handle this gracefully; clients without it see errors at switchover.
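
The effect of a maximum drain duration can be sketched in a few lines. This is a simplified model, not load balancer configuration; the `drain` function and the sample connection durations are hypothetical.

```python
def drain(remaining_seconds: list[float], max_drain_seconds: float) -> tuple[int, int]:
    """Return (completed, force_closed) counts for one drain window.

    remaining_seconds holds, for each in-flight connection at switchover,
    how much longer it would run if left alone.
    """
    completed = sum(1 for r in remaining_seconds if r <= max_drain_seconds)
    force_closed = len(remaining_seconds) - completed
    return completed, force_closed


# Hypothetical connections: short requests, a slow download, and a
# WebSocket that would otherwise stay open for an hour.
in_flight = [0.2, 3.0, 45.0, 590.0, 3600.0]
result = drain(in_flight, max_drain_seconds=600.0)
```

With a 10-minute window, only the hour-long connection is cut, and that is the client that needs reconnect logic.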

Cost Optimization for Blue-Green

Maintaining two full production environments full-time is expensive. The standard cost optimization: keep the old environment scaled down to minimal capacity after cutover. It’s there for rollback but not consuming full production resources.

More aggressive: tear down the old environment entirely after a confidence period (24-48 hours of stable green operation). Rollback then requires redeploying the old version, which is slower but cheaper. The right choice depends on rollback frequency and cost sensitivity.
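
A rough arithmetic sketch makes the trade-off concrete. The hourly rate, the 10% standby fraction, and the function name are made-up illustration values, not figures from any provider.

```python
def standby_cost(full_hourly: float, hours: float, standby_fraction: float) -> float:
    """Cost of keeping the old environment at a fraction of full capacity."""
    if not 0.0 <= standby_fraction <= 1.0:
        raise ValueError("standby_fraction must be between 0 and 1")
    return full_hourly * standby_fraction * hours


# Assumed: the full environment costs 100 currency units per hour and the
# confidence period is 48 hours.
full_size = standby_cost(100.0, 48, 1.0)   # old environment left at full size
scaled_down = standby_cost(100.0, 48, 0.1)  # old environment at 10% capacity
torn_down = standby_cost(100.0, 48, 0.0)    # old environment removed
```

The numbers only cover infrastructure; the slower rollback after a teardown has its own cost that this sketch does not model.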

DNS-Based Switching Gotchas

DNS-based blue-green switching is appealingly simple: change a record, traffic moves. The reality is more complex. DNS TTLs are not reliably honored; some resolvers and clients cache records longer than the TTL specifies. Connection pools holding TCP connections to the old environment don’t notice DNS changes at all.

The practical implication: DNS switches are slow and partial. Some traffic moves immediately, some moves over minutes, some never moves until clients reconnect. For atomic-feeling switches, the load balancer layer is a better choice.

Where DNS works fine: switches that don’t need to be instant, especially across regions or providers. With AWS Route 53 weighted records, adjusting the weights in steps shifts traffic gradually over hours, which is useful for some migrations.
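
A gradual weighted shift comes down to a schedule of weight pairs applied one step at a time. The sketch below only computes that schedule; it does not call any DNS provider API, and the `weight_schedule` function and the linear ramp are assumptions chosen for illustration.

```python
def weight_schedule(steps: int, total_weight: int = 100) -> list[tuple[int, int]]:
    """Return (blue_weight, green_weight) pairs for a linear shift to green."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    schedule = []
    for i in range(steps + 1):
        green = round(total_weight * i / steps)
        schedule.append((total_weight - green, green))
    return schedule


# Four steps: start fully on blue, end fully on green.
plan = weight_schedule(4)
```

Each pair would be applied to the two weighted records, with a pause between steps long enough for cached answers to expire and for metrics to show a problem.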

Testing the Green Environment

The window between green-deployed and green-live is where verification happens. Smoke tests cover basic functionality. Synthetic transactions exercise critical user paths. Performance benchmarks check for regressions.

Run these against the green environment directly (via internal DNS or specific routing). Compare results to the same tests against blue. Differences that exceed thresholds block the switch.
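
The comparison step can be expressed as a small gate. This is a minimal sketch that assumes lower-is-better metrics; the metric names, threshold values, and the `blocking_regressions` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    max_relative_increase: float  # 0.10 means green may be up to 10% worse


def blocking_regressions(
    blue: dict[str, float], green: dict[str, float], thresholds: list[Threshold]
) -> list[str]:
    """Return the metrics where green is worse than blue beyond its threshold."""
    failures = []
    for t in thresholds:
        baseline, candidate = blue[t.metric], green[t.metric]
        if baseline == 0:
            # No meaningful ratio: treat any non-zero value as a regression.
            regressed = candidate > 0
        else:
            regressed = (candidate - baseline) / baseline > t.max_relative_increase
        if regressed:
            failures.append(t.metric)
    return failures


# Made-up results from running the same test suite against both environments.
blue_results = {"p95_latency_ms": 120.0, "error_rate": 0.002}
green_results = {"p95_latency_ms": 150.0, "error_rate": 0.002}
thresholds = [Threshold("p95_latency_ms", 0.10), Threshold("error_rate", 0.50)]
failures = blocking_regressions(blue_results, green_results, thresholds)
```

A non-empty `failures` list is the signal to block the switch; here the 25% latency increase exceeds the 10% allowance.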

Some teams also do shadow traffic — mirror production traffic to green without serving responses to users. Useful for validating performance and behavior at real load without user impact.
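
The core rule of shadow traffic is that the green response is observed but never returned. The sketch below shows that rule with plain functions standing in for the two environments; the handlers and `serve_with_shadow` are hypothetical, and a real setup would usually mirror asynchronously at the proxy or mesh layer rather than inline.

```python
from typing import Callable

Handler = Callable[[str], str]


def serve_with_shadow(
    request: str, blue: Handler, green: Handler, mismatches: list[str]
) -> str:
    """Serve from blue; send a copy to green and only record differences."""
    live_response = blue(request)
    try:
        shadow_response = green(request)
        if shadow_response != live_response:
            mismatches.append(request)
    except Exception:
        # A failing green must never affect the user-facing response.
        mismatches.append(request)
    return live_response


def blue_handler(request: str) -> str:
    return request.upper()


def green_handler(request: str) -> str:
    if request == "checkout":
        raise RuntimeError("simulated bug in the new version")
    return request.upper()


mismatches: list[str] = []
responses = [
    serve_with_shadow(r, blue_handler, green_handler, mismatches)
    for r in ["home", "checkout"]
]
```

Mirroring is only safe for requests without side effects, or when green writes to an isolated data store; otherwise every mirrored write is applied twice.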

Release Notes and Changelog Generation

Automated release notes from commit history close the loop between code changes and user-facing communication. Tools like release-drafter, semantic-release, and changesets generate changelogs from conventional commits or PR labels.

The discipline of writing PR titles and commit messages for downstream consumption pays back here. PR titles become changelog entries; clear titles make for clear changelogs.

For libraries with external users, automated semver bumping based on commit type (feat: minor, fix: patch, breaking change: major) reduces manual version management. The same tooling can publish to npm, PyPI, or other package registries on merge.
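
The bump rule can be sketched as a function over commit messages. This is a simplified illustration and not how any of the tools named above is implemented; it ignores pre-1.0 conventions and custom release rules, and `next_version` is a hypothetical name.

```python
import re

# Matches headers such as "fix: ...", "feat(api): ...", or "feat!: ...".
_HEADER = re.compile(r"^(?P<type>\w+)(\([^)]*\))?(?P<bang>!)?: ")


def next_version(current: str, commit_messages: list[str]) -> str:
    """Pick the next semver from the highest-impact commit in the list."""
    major, minor, patch = (int(part) for part in current.split("."))
    level = 0  # 0 = no release, 1 = patch, 2 = minor, 3 = major
    for message in commit_messages:
        match = _HEADER.match(message)
        if match is None:
            continue  # not a conventional commit; ignored in this sketch
        if match["bang"] or "BREAKING CHANGE:" in message:
            level = max(level, 3)
        elif match["type"] == "feat":
            level = max(level, 2)
        elif match["type"] == "fix":
            level = max(level, 1)
    if level == 3:
        return f"{major + 1}.0.0"
    if level == 2:
        return f"{major}.{minor + 1}.0"
    if level == 1:
        return f"{major}.{minor}.{patch + 1}"
    return current


bumped = next_version("1.4.2", ["fix: handle empty body", "feat(api): add export"])
```

The highest-impact commit wins, so one `feat` among several `fix` commits still yields a minor bump.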

Security in CI/CD

CI/CD systems hold significant power: they can build code, sign images, push to registries, and deploy to production. Securing them matters.

Standard hardening: least-privilege credentials for each step, signed artifacts at each stage, audit logs of all pipeline executions, separation between build environments and production credentials.

Supply chain attacks via compromised CI are a real and growing threat. SLSA (Supply-chain Levels for Software Artifacts) provides a framework for thinking about CI/CD security maturity. Most organizations land at SLSA level 1-2; reaching level 3 requires real investment but provides meaningful guarantees.

Pipeline Templating and Reuse

At scale, copy-pasted CI configuration becomes a maintenance burden. Every change to the standard pipeline requires touching dozens of repos.

Templating mechanisms vary by platform: GitHub Actions composite actions and reusable workflows, GitLab CI includes and templates, Jenkins shared libraries. Each provides a path to defining pipeline logic once and consuming it from many repositories.

The pattern that works: a small platform team maintains pipeline templates; service teams consume them by reference. Service-specific customization happens via variables and minimal local overrides. Template changes can be reviewed and tested centrally before propagating.

Build Cache and Performance

Build performance compounds at scale. A 30-second improvement on every pipeline run translates to hours per day across an organization.
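
The arithmetic behind that claim is short. The run count below is a made-up figure for illustration; substitute your own pipeline volume.

```python
def hours_saved_per_day(seconds_saved_per_run: float, runs_per_day: int) -> float:
    """Total engineer-visible pipeline time saved per day, in hours."""
    return seconds_saved_per_run * runs_per_day / 3600


# Assumed: 600 pipeline runs per day across the organization.
saved = hours_saved_per_day(30, 600)
```

At that assumed volume, a 30-second improvement adds up to 5 hours of pipeline time per day.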

Caching strategies matter most: dependency caches (npm, Maven, pip), Docker layer caches, intermediate build artifacts. Each cache type has different invalidation rules and storage requirements.
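
Invalidation for a dependency cache usually comes down to how the cache key is derived. A minimal sketch, assuming the key is built from a hash of the lockfile plus a few extra inputs; the `cache_key` function and the sample inputs are hypothetical.

```python
import hashlib


def cache_key(prefix: str, lockfile_bytes: bytes, *extra: str) -> str:
    """Build a cache key that changes only when one of its inputs changes."""
    digest = hashlib.sha256()
    digest.update(lockfile_bytes)
    for part in extra:
        # The separator keeps ("ab", "c") from colliding with ("a", "bc").
        digest.update(b"\x00" + part.encode("utf-8"))
    return f"{prefix}-{digest.hexdigest()[:16]}"


# Made-up lockfile contents and runner details.
lockfile = b"requests==2.32.3\nurllib3==2.2.2\n"
key = cache_key("pip", lockfile, "linux-x86_64", "python3.12")
```

Including the operating system and runtime version in the key prevents a cache built on one runner image from being restored onto an incompatible one.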

Remote caches shared across runners deliver the biggest improvement for monorepos and matrix builds. Bazel remote cache, Turborepo Remote Cache, and Nx Cloud all provide this for their respective ecosystems. Build times dropping from 10 minutes to 1 minute aren’t unusual.

Putting It Into Practice

The patterns described throughout this article aren’t all equally important for every team. The right starting point depends on current state.

For teams without consistent CI/CD: focus first on basic pipeline reliability and speed. Inconsistent or slow pipelines undermine every other improvement you might try later.

For teams with working pipelines but high change failure rate: invest in better testing, smaller deployments, and explicit rollback procedures. The shift from ‘shipping is scary’ to ‘shipping is routine’ transforms how teams operate.

For teams with reliable CI/CD looking to advance: progressive delivery, deployment frequency improvement, and DORA metric tracking are the natural next steps. Each builds on the foundation rather than replacing it.

The advancement isn’t linear, and not every team needs every capability. Match the practices to the team’s actual constraints and let the rest wait.

Frequently Asked Questions

How much extra infrastructure does blue-green require?

Roughly 2x during the switch window. After cutover, the old environment is reclaimed or kept warm for fast rollback.

Can I do blue-green for a stateful service?

Yes, with expand-contract migrations. The schema needs to support both code versions simultaneously.

Blue-green vs canary?

Blue-green is an atomic switch. Canary is gradual percentage-based shifting. Canary is safer for risky changes; blue-green is simpler when you can validate green first.

How do I roll back from green to blue?

Reverse the traffic switch. The blue environment should be kept warm for at least one deployment cycle to make this fast.