SRE On-Call Best Practices: Building Sustainable Incident Response

The Goal of On-Call

On-call exists to ensure that someone responds to production problems within a defined time window. That’s the entire purpose. Everything else (escalation paths, severity definitions, post-incident reviews) supports that goal. The “Being On-Call” chapter of the Google SRE book is the canonical description of healthy on-call practice and a good starting point.

Healthy on-call rotations are characterized by alerts that fire on real problems, runbooks that resolve common issues quickly, and shifts that don’t dominate the rest of an engineer’s life.

Rotation Design

Weekly rotations are the most common. They’re long enough for the on-call engineer to context-switch into the role and short enough that fatigue is bounded. Daily rotations create handoff churn; monthly rotations create burnout.

A rotation needs enough engineers that each person’s turn comes around no more than once every five or six weeks. With fewer people, each engineer is on-call too often and the fatigue accumulates. Pair every primary with a secondary who picks up the page if the primary can’t respond.
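The rotation described above can be sketched as a simple round-robin. This is a minimal illustration, not a replacement for a scheduling tool: the engineer names and start date are made up, and picking the next person in the list as secondary is an assumption of this sketch (it means this week's secondary becomes next week's primary and already has context).

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Return one (week_start, primary, secondary) tuple per weekly shift."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary + secondary")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        # Assumption: the secondary is the next person in the list.
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule


# Hypothetical six-person team: each person is primary once every six weeks.
team = ["ana", "bo", "chen", "dee", "eli", "fay"]
schedule = build_rotation(team, date(2024, 1, 1), weeks=8)

for week_start, primary, secondary in schedule:
    print(week_start.isoformat(), "primary:", primary, "secondary:", secondary)
```

With one primary per week, the gap between a person's primary turns equals the team size, so the list length is the knob that controls how often each engineer carries the pager.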

Alert Hygiene

Every alert that wakes someone up should require human action that couldn’t have been automated. Alerts that fire on causes (high CPU, low disk) without symptoms (degraded user experience) should be tickets, not pages.
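The rule above can be written down as a small decision function. This is only a sketch of the reasoning; the three boolean inputs are assumptions about how a team might label its alerts, not fields from any particular alerting tool.

```python
def route_alert(is_user_facing_symptom, needs_immediate_action, can_be_automated):
    """Decide what an alert should become: 'automate', 'page', or 'ticket'."""
    if can_be_automated:
        # If a script could do the response, a human should not be woken up.
        return "automate"
    if is_user_facing_symptom and needs_immediate_action:
        return "page"
    # Causes without symptoms, or anything that can wait, go to the queue.
    return "ticket"


# High CPU with no user impact: a cause, not a symptom.
print(route_alert(False, False, False))
# Elevated error rate seen by users, no automated fix available.
print(route_alert(True, True, False))
```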

Track page volume. If your team gets paged more than twice per shift on average, the alerts are the problem, not the systems. Spend a sprint deleting or downgrading low-value alerts. The team’s quality of life improves immediately. The Prometheus alerting documentation covers alert routing and grouping patterns.
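A minimal sketch of tracking page volume against the two-pages-per-shift threshold. The page log format, shift labels, and alert names below are hypothetical; in practice the data would come from an export of your paging tool.

```python
from collections import Counter

PAGE_BUDGET_PER_SHIFT = 2  # threshold from the text above


def average_pages_per_shift(pages, shift_count):
    """Average pages per shift; shift_count includes shifts with zero pages."""
    if shift_count <= 0:
        raise ValueError("shift_count must be positive")
    return len(pages) / shift_count


def noisiest_alerts(pages, top_n=3):
    """Alerts that paged most often: first candidates to delete or downgrade."""
    return Counter(page["alert"] for page in pages).most_common(top_n)


# Hypothetical page log covering three weekly shifts.
pages = [
    {"shift": "2024-W01", "alert": "HighCPU"},
    {"shift": "2024-W01", "alert": "HighCPU"},
    {"shift": "2024-W01", "alert": "CheckoutErrorRate"},
    {"shift": "2024-W02", "alert": "HighCPU"},
    {"shift": "2024-W02", "alert": "DiskLow"},
    {"shift": "2024-W03", "alert": "HighCPU"},
    {"shift": "2024-W03", "alert": "HighCPU"},
]

average = average_pages_per_shift(pages, shift_count=3)
over_budget = average > PAGE_BUDGET_PER_SHIFT
print(f"average pages per shift: {average:.2f}, over budget: {over_budget}")
print("noisiest alerts:", noisiest_alerts(pages))
```

Passing the shift count explicitly matters: quiet shifts produce no rows in the page log, and leaving them out would inflate the average.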

Runbooks That Work Under Stress

Runbooks are written for the engineer at 3 AM with a half-loaded mental model. They should contain: the symptom that triggers them, the most likely causes, the commands to diagnose, the commands to remediate, and when to escalate.
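One way to keep runbooks complete is to treat the five sections listed above as required fields and check for empty ones. The sketch below assumes runbooks can be represented as structured data; the class, field names, and example content are illustrative, and the commands are left as placeholders rather than invented.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Runbook:
    symptom: str = ""
    likely_causes: list[str] = field(default_factory=list)
    diagnose_commands: list[str] = field(default_factory=list)
    remediate_commands: list[str] = field(default_factory=list)
    escalate_when: str = ""


def missing_sections(runbook):
    """Return the names of sections that are still empty."""
    return [f.name for f in fields(runbook) if not getattr(runbook, f.name)]


# Hypothetical draft that forgot to say when to escalate.
draft = Runbook(
    symptom="Checkout error rate above 5% for 10 minutes",
    likely_causes=["Payment provider outage", "Database connection pool exhausted"],
    diagnose_commands=["<command to check payment provider status>"],
    remediate_commands=["<command to fail over to the backup provider>"],
)

print("missing sections:", missing_sections(draft))
```

A check like this could run in code review for the runbook repository, so an incomplete runbook is caught before someone needs it at 3 AM.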

Treat every incident as a test of the runbook. After every paging event, the on-call engineer updates the runbook with anything that was missing or wrong. Stale runbooks are worse than no runbooks: they send the responder down the wrong path and sap confidence in the rest.

Compensation and Recovery

On-call is work. Compensate it explicitly with on-call pay, comp time, or both. Teams that pretend on-call is free pay for it in attrition.

After a rough shift — multiple pages, especially overnight — let the engineer take comp time. After a major incident, the on-call engineer leads the post-incident review the following day, not the same night.

Handoffs and Documentation

The transition between on-call engineers is where context gets lost. A 5-minute Slack handoff at the end of every shift — open incidents, ongoing concerns, things to watch — bridges the gap. Some teams use formal handoff documents; for many, an asynchronous Slack message is sufficient.

What gets handed off matters more than the format: active incidents with their current status, recurring issues that haven’t yet been root-caused, maintenance windows scheduled during the next shift, and new deployments going out. The incoming engineer should know what they’re walking into before the pager moves.
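A handoff message covering those four items can be generated from a small template. This is a sketch with hypothetical names and example entries; it only builds the text and leaves posting it (to Slack or anywhere else) to whatever tooling the team already has.

```python
def format_handoff(outgoing, incoming, active_incidents, recurring_issues,
                   maintenance_windows, deployments):
    """Build a plain-text handoff note; empty sections are stated explicitly."""
    sections = [
        ("Active incidents", active_incidents),
        ("Recurring issues (no root cause yet)", recurring_issues),
        ("Maintenance windows next shift", maintenance_windows),
        ("Deployments going out", deployments),
    ]
    lines = [f"On-call handoff: {outgoing} -> {incoming}"]
    for title, items in sections:
        lines.append(f"{title}:")
        if items:
            lines.extend(f"  - {item}" for item in items)
        else:
            # Writing "none" shows the section was considered, not forgotten.
            lines.append("  - none")
    return "\n".join(lines)


message = format_handoff(
    outgoing="ana",
    incoming="bo",
    active_incidents=["Elevated checkout latency, mitigated, monitoring"],
    recurring_issues=[],
    maintenance_windows=["Database patching, Tuesday 02:00-03:00"],
    deployments=["Search service release, Wednesday"],
)
print(message)
```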

Compensation Models

On-call compensation varies enormously. The most common model is a flat per-shift payment ($100-$300 per weeknight, $200-$500 per weekend night), with additional pay or comp time for actual pages handled. Some companies do percentage-of-salary stipends instead.
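As a worked example of the flat per-shift model, the sketch below totals one week of on-call pay. The nightly rates are picked from inside the ranges quoted above; the per-page bonus and the split of five weeknights and two weekend nights are assumptions for illustration only.

```python
def weekly_on_call_pay(weeknight_rate=200, weekend_night_rate=350,
                       per_page_bonus=50, pages_handled=0,
                       weeknights=5, weekend_nights=2):
    """Flat nightly payments plus a bonus for each page actually handled."""
    base = weeknights * weeknight_rate + weekend_nights * weekend_night_rate
    return base + pages_handled * per_page_bonus


quiet_week = weekly_on_call_pay()
rough_week = weekly_on_call_pay(pages_handled=3)
print("quiet week:", quiet_week)  # 5*200 + 2*350 = 1700
print("rough week:", rough_week)  # 1700 + 3*50 = 1850
```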

What matters more than the specific number is that compensation exists explicitly. Teams that treat on-call as ‘part of the job’ with no separate recognition see higher attrition. Compensation also signals organizational seriousness about reliability work — paying for it implies valuing it.

Practicing Without Real Incidents

Real incidents are the worst training ground. Tabletop exercises, where the team walks through hypothetical incidents and their responses, build skills without the pressure of a real outage. Run them quarterly. Pick scenarios from past incidents, near-misses, or imagined failure modes. To judge whether the practice is paying off, watch restoration time: the DORA research at dora.dev tracks it as a key metric and provides benchmark data across industries.

The exercises surface gaps: outdated runbooks, missing contact information, escalation paths nobody knows. Each gap surfaced in a tabletop is a gap that won’t bite during a real incident.

Game days extend the concept further: live failure injection in pre-production environments. Teams respond using real tools and runbooks, then debrief. The practice translates directly to better real-incident response.

Burnout Detection and Recovery

On-call burnout is real and predictable. Warning signs: an unusually heavy page load for one engineer over the past month, repeated negative feedback in on-call retrospectives, declining engagement in incident reviews, and increased sick days.

Recovery requires reducing load, not waiting for it to pass. Take the burning-out engineer off rotation for a cycle or two. Investigate why their shifts have been hard and fix the underlying alert noise or system fragility.

Organizations that don’t actively manage on-call wellbeing pay for it in attrition. The replacement cost for an experienced SRE far exceeds the cost of reducing their on-call load.

Team Culture and Practices

The tooling matters; the culture matters more. Teams with strong DevOps practices and middling tools usually outperform teams with state-of-the-art tools and weak culture.

Core practices: shared ownership of production reliability, blameless incident response, regular retrospectives, deliberate investment in developer experience. None require specific tools; all require sustained leadership attention.

Maturity grows gradually. A team that adopts blameless postmortems but still has weekly all-hands-on-deck deployments hasn’t internalized the practice. Watch the behaviors during stress, not the documented procedures.

Continuous Improvement Cadence

The DevOps movement’s emphasis on continuous improvement isn’t a slogan; it’s a practical requirement. Systems decay, requirements change, and tools age. Maintaining a healthy engineering organization requires deliberate, ongoing investment.

Quarterly retrospectives at the team and organization level surface what’s working and what isn’t. The output is concrete commitments to change — not abstract aspirations.

Track changes from retrospectives. Teams that don’t follow through on retrospective actions eventually stop running retrospectives. Demonstrated follow-through builds the trust that makes future retrospectives valuable.

Hiring and Team Building

DevOps and platform engineering hiring is competitive. The job market pays well; experienced engineers are in demand. Building teams requires investment in both compensation and culture.

What attracts strong candidates: meaningful work on systems they can directly impact, clear ownership boundaries, modern tooling, sustainable on-call practices, growth opportunities. What drives them away: legacy systems with no migration plan, unclear ownership, oppressive on-call, no investment in their growth.

Internal mobility matters too. Engineers in adjacent disciplines (backend development, networking, security) often become strong platform engineers with appropriate support. Building hiring pipelines that include internal transfers expands the talent pool.

Vendor Selection and Tool Procurement

DevOps organizations buy many tools. Each tool selection is a multi-year commitment with implicit ongoing costs (licenses, training, integration). The selection process deserves more attention than it usually gets.

Standard evaluation criteria: feature fit, total cost of ownership over three years, exit cost (how hard is it to migrate away later), vendor stability, and integration with existing tooling.
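The cost criteria lend themselves to a quick calculation. The sketch below compares two hypothetical vendors on three-year total cost of ownership, then again with exit cost included. All figures and the cost breakdown (license, training, one-time integration) are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Quote:
    annual_license: int
    annual_training: int
    one_time_integration: int
    exit_cost: int  # estimated cost of migrating away later


def total_cost_of_ownership(quote, years=3):
    recurring = years * (quote.annual_license + quote.annual_training)
    return recurring + quote.one_time_integration


def cost_including_exit(quote, years=3):
    return total_cost_of_ownership(quote, years) + quote.exit_cost


# Hypothetical quotes.
quotes = {
    "vendor_a": Quote(30_000, 5_000, 20_000, 40_000),
    "vendor_b": Quote(40_000, 2_000, 5_000, 10_000),
}

cheapest_tco = min(quotes, key=lambda name: total_cost_of_ownership(quotes[name]))
cheapest_with_exit = min(quotes, key=lambda name: cost_including_exit(quotes[name]))
print("cheapest on 3-year TCO:", cheapest_tco)
print("cheapest including exit cost:", cheapest_with_exit)
```

In this made-up case the ranking flips once exit cost is counted, which is the reason to evaluate it alongside the purchase price.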

Avoid the trap of evaluating only on features in initial demos. The features matter; so do the rough edges that surface in week three of actual use. Trial periods and reference customers in similar environments surface what marketing doesn’t.

Practical Next Steps

For teams beginning their DevOps or platform engineering journey, the temptation is to adopt every recommended practice at once. The teams that succeed tend to focus on one or two specific improvements at a time, build the habit, and then expand scope.

Concrete next steps worth considering: instrument deployment frequency and lead time, even informally. Run a blameless retrospective after the next incident. Document the platform decisions that have been made implicitly. Survey the team about pain points and tackle the top two.
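Informal instrumentation of deployment frequency and lead time can start from a list of timestamps. The sketch below assumes each deployment is recorded as a (commit time, deploy time) pair; the record format and the sample data are hypothetical, and lead time is measured here from commit to production deploy.

```python
from datetime import datetime
from statistics import median


def deployments_per_week(deploys, period_days):
    """Deployment frequency over an observation window of period_days."""
    return len(deploys) / (period_days / 7)


def median_lead_time_hours(deploys):
    """Median hours from commit to production deploy."""
    return median((deployed - committed).total_seconds() / 3600
                  for committed, deployed in deploys)


# Hypothetical two-week log of (commit time, deploy time) pairs.
deploys = [
    (datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 4, 15, 0)),    # 6 h
    (datetime(2024, 3, 5, 10, 0), datetime(2024, 3, 6, 10, 0)),   # 24 h
    (datetime(2024, 3, 7, 8, 0), datetime(2024, 3, 7, 20, 0)),    # 12 h
    (datetime(2024, 3, 12, 9, 0), datetime(2024, 3, 14, 9, 0)),   # 48 h
]

frequency = deployments_per_week(deploys, period_days=14)
lead_time = median_lead_time_hours(deploys)
print(f"deployments per week: {frequency:.1f}")
print(f"median lead time: {lead_time:.1f} hours")
```

The median is used rather than the mean so that one slow release does not dominate the number.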

Each of these is a small investment with compounding returns. The team that runs retrospectives quarterly accumulates institutional learning that the team without them doesn’t. The team that tracks DORA metrics over a year sees trends they would otherwise miss.

The work doesn’t end. Engineering organizations are living systems that decay without active maintenance. The practices described throughout this article are tools for sustained improvement, not destinations to reach and stop.

Frequently Asked Questions

How big does a rotation need to be?

Six to eight people minimum for a comfortable weekly rotation. Fewer than that and you’re burning people out.

Should developers be on-call for their own services?

Yes, with caveats. They should be on-call only after a service has stable observability, runbooks, and alert hygiene.

How do I reduce alert fatigue?

Track page volume per engineer per shift. Anything above two pages per shift on average means the alerts need work.

What’s the difference between a page and a ticket?

Pages require immediate action. Tickets can wait until business hours. If something can wait, it shouldn’t page.