Infrastructure as Code Best Practices: Writing Maintainable Terraform

Why Terraform Codebases Decay

Most Terraform decay follows a predictable pattern. Early on, everything lives in one root module because there’s not much yet. The team adds resources directly. State grows. Someone realizes they need modules. They modularize inconsistently. Variables proliferate. Naming drifts. Two years later, a simple change requires reading half the repo to understand the blast radius.

The work to prevent this is mostly upfront: a module structure that anticipates growth, a state layout that scales, and naming conventions that make sense before there are 500 resources.

Module Structure

The useful unit of modularity is a deployable thing: a service, an environment, an application. Don’t modularize every resource — wrapping every aws_iam_role in a one-resource module is overhead without benefit.

A three-tier structure works for most teams: (1) building-block modules for common patterns (a VPC, a Kubernetes cluster, a managed database), (2) composition modules that combine building blocks for an environment, (3) root modules that instantiate composition modules per environment.
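As a rough sketch of how the three tiers fit together, assume a hypothetical repository with modules/ for building blocks, compositions/ for composition modules, and envs/ for root modules. Every path, module name, and input below is illustrative rather than taken from a real codebase, and the two files are shown in one listing only for brevity.

```hcl
# File: envs/prod/main.tf
# Root module (tier 3): one per environment. All names are hypothetical.
module "payments" {
  source = "../../compositions/service-env"

  environment = "prod"
  service     = "payments"
  vpc_cidr    = "10.20.0.0/16"
}

# File: compositions/service-env/main.tf
# Composition module (tier 2): wires building blocks together for one
# service in one environment.
variable "environment" {
  type = string
}

variable "service" {
  type = string
}

variable "vpc_cidr" {
  type = string
}

module "network" {
  source = "../../modules/vpc" # building block (tier 1)

  name = "${var.environment}-${var.service}"
  cidr = var.vpc_cidr
}

module "database" {
  source = "../../modules/database" # building block (tier 1)

  name       = "${var.environment}-${var.service}-db"
  subnet_ids = module.network.private_subnet_ids # assumed output of the vpc module
}
```

The root module stays close to a list of inputs, which keeps the per-environment diff small and easy to review.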

State Management

One state file per deployable unit. Splitting state too finely creates orchestration overhead; not splitting enough means a single terraform apply takes 20 minutes and blocks the whole team.
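Once state is split, one unit usually needs values from another. One common way to pass them is the terraform_remote_state data source, sketched here under the assumption that a separately managed network state publishes a vpc_id output. The bucket, key, and resource names are placeholders.

```hcl
# In the application's root module: read outputs published by the
# separately managed network state. Bucket, key, and output names
# are placeholders.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state"
    key    = "prod/network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_security_group" "api" {
  name   = "prod-payments-api"
  vpc_id = data.terraform_remote_state.network.outputs.vpc_id
}
```

Only root-level outputs of the other state are visible this way, so the set of outputs becomes the interface between the two units.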

Remote state with locking is non-negotiable: S3 with DynamoDB locking on AWS (recent Terraform releases can also lock natively in S3), GCS on GCP, or HCP Terraform if you want it hosted. Local state has no place in any team workflow.
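For the AWS case, a minimal backend block might look like the following. The bucket and table names are placeholders, and both are assumed to exist already. Recent Terraform releases also support S3-native locking, so check the backend documentation for the version you run.

```hcl
terraform {
  backend "s3" {
    bucket  = "example-terraform-state"         # placeholder bucket name
    key     = "prod/payments/terraform.tfstate" # one key per deployable unit
    region  = "us-east-1"
    encrypt = true

    # Placeholder table name; the table needs a string hash key named "LockID".
    dynamodb_table = "example-terraform-locks"
  }
}
```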

Variables, Outputs, and Naming

Module inputs should be the things that change between instantiations. Hardcode the things that don’t. A module with 40 input variables is usually wrong — most of those variables only ever take one value.
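A small sketch of that split, for a hypothetical database building block: the two variables are things that plausibly differ per instantiation, while decisions that never vary sit in locals. The specific names and values are assumptions for illustration.

```hcl
# File: modules/database/variables.tf (hypothetical building-block module)

# Inputs: only what differs between instantiations.
variable "name" {
  type        = string
  description = "Name prefix, for example prod-payments-db."
}

variable "instance_class" {
  type        = string
  description = "Instance size; typically smaller outside production."
  default     = "db.t3.medium"
}

# Fixed decisions live in locals instead of variables nobody overrides.
locals {
  engine                  = "postgres"
  backup_retention_period = 7
  storage_encrypted       = true
}
```

If a fixed value later needs to vary, promoting a local to a variable is a small, reviewable change.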

Naming conventions: pick one and stick with it. The most common are kebab-case for resource names and snake_case for Terraform identifiers. Prefix resource names with environment and service: prod-payments-api-alb, not just alb.
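One way to make the convention hard to break is to build names from validated inputs. The sketch below assumes three environments and hypothetical variable names; the resource is only there to show where the prefix is used.

```hcl
variable "environment" {
  type = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "The environment value must be one of dev, staging, prod."
  }
}

variable "service" {
  type = string

  validation {
    # kebab-case: lowercase letters and digits separated by single hyphens
    condition     = can(regex("^[a-z0-9]+(-[a-z0-9]+)*$", var.service))
    error_message = "The service value must be kebab-case, for example payments-api."
  }
}

variable "subnet_ids" {
  type = list(string)
}

locals {
  name_prefix = "${var.environment}-${var.service}"
}

resource "aws_lb" "api" { # snake_case Terraform identifier
  name               = "${local.name_prefix}-alb" # kebab-case resource name
  load_balancer_type = "application"
  subnets            = var.subnet_ids
}
```

With environment set to prod and service set to payments-api, the load balancer name comes out as prod-payments-api-alb.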

Testing and Drift Detection

Running terraform plan in CI catches syntax and configuration errors and shows a preview of the change before merge. Static analysis tools like tflint, and policy checks with OPA or Sentinel, extend that by flagging problems the plan alone won’t.

Drift detection — running terraform plan against production on a schedule and flagging changes — catches manual changes that bypassed code. It’s cheap to set up and pays back the first time someone clicks a button in the AWS console.

Testing Infrastructure Code

Terraform code tested only by running terraform apply against production is untested code. The testing layers worth investing in: static analysis (tflint, checkov, tfsec), policy as code (OPA, Sentinel), and integration tests (Terratest).

Static analysis catches the obvious issues — missing tags, deprecated resource arguments, security misconfigurations. Policy as code enforces organizational rules at plan time. Integration tests spin up real infrastructure in a sandbox account, verify it works, and tear it down. They’re slow and expensive but catch problems static analysis can’t.
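Terratest tests are written in Go. As a lighter-weight illustration in HCL, here is a sketch using Terraform's built-in test framework (terraform test, available from Terraform 1.6) instead. It is a different tool from the ones named above, and it assumes a module under test that declares environment and service variables and an aws_lb.api resource, all hypothetical.

```hcl
# File: tests/naming.tftest.hcl (run with `terraform test`)
# Assumes the module under test declares variables "environment" and
# "service" and a resource aws_lb.api; all three are hypothetical.

variables {
  environment = "dev"
  service     = "payments-api"
}

run "alb_name_follows_convention" {
  command = plan # plan only: no real infrastructure is created

  assert {
    condition     = aws_lb.api.name == "dev-payments-api-alb"
    error_message = "The ALB name should be <environment>-<service>-alb."
  }
}
```

A plan-only run like this is fast but only checks values known at plan time. Switching the command to apply creates and destroys real resources, which is closer to the sandbox-account integration tests described above.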

Module Versioning and Distribution

Internal Terraform modules deserve the same versioning discipline as application code. Tag every module change with semver. Pin module versions in consuming code; never reference ‘main’ or ‘master’ branches.
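Pinning looks slightly different depending on where the module comes from. Both snippets below are sketches; the repository URL, registry namespace, and version numbers are placeholders.

```hcl
# Git source pinned to a semver tag (repository URL is a placeholder).
module "vpc" {
  source = "git::https://git.example.com/platform/terraform-modules.git//vpc?ref=v1.4.2"

  name = "prod-payments"
  cidr = "10.20.0.0/16"
}

# Registry source with a version constraint (namespace is a placeholder).
# "~> 1.4.0" allows patch releases (1.4.x) but not 1.5.0.
module "database" {
  source  = "app.terraform.io/example-org/database/aws"
  version = "~> 1.4.0"

  name = "prod-payments-db"
}
```

The version argument only applies to registry sources; for git sources the ref query parameter does the pinning.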

Distribution mechanisms: a private Terraform registry (HCP Terraform, formerly Terraform Cloud; Spacelift; or an open-source option such as Tfregistry), git tags (the simplest path), or an internal artifact registry. Each works; pick one and standardize.

Working Across Environments

The dev/staging/prod split is the most common multi-environment pattern. Each environment gets its own Terraform workspace or root module, with shared module definitions and per-environment variable values.

The pattern fails when teams parameterize too aggressively. A module that needs 30 inputs because each environment differs in 30 ways is unmaintainable. Either accept some duplication (each environment has its own config that looks similar to others) or refactor to make the variation explicit.

Promotion between environments works best when the difference is just variables — same code, different inputs. Promotion patterns that require code changes between environments leak environment-specific logic into modules and create test gaps.
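In practice, "same code, different inputs" often means one variable file per environment. The two files below are a sketch with made-up variable names and values, shown together for comparison; the root module is assumed to declare matching variables.

```hcl
# File: envs/dev.tfvars (hypothetical values)
environment    = "dev"
instance_class = "db.t3.small"
min_replicas   = 1

# File: envs/prod.tfvars (hypothetical values)
environment    = "prod"
instance_class = "db.r6g.large"
min_replicas   = 3
```

Each environment is then planned with the same code and its own file, for example terraform plan -var-file=envs/prod.tfvars, so a promotion is a change of inputs and nothing else.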

Terraform vs Pulumi vs CDK

Terraform remains the broad default for infrastructure as code, with documentation that runs from getting started to advanced module patterns. Pulumi lets you define infrastructure in general-purpose languages such as Python, TypeScript, and Go: more flexible and more familiar to developers, but with a less mature ecosystem than Terraform. AWS CDK targets AWS specifically, using languages such as TypeScript or Python and synthesizing CloudFormation templates underneath.

Each has tradeoffs. Terraform’s declarative HCL is verbose but easy to review. Pulumi’s full-programming-language approach is powerful, but it also makes it easier to write code that is hard to review. CDK is specific to AWS and inherits CloudFormation’s limitations.

For most teams, Terraform’s broad community support and multi-cloud capability tip the balance. Teams already deep in TypeScript and AWS-only might prefer CDK; teams that want testing as a first-class concern often choose Pulumi.

Team Culture and Practices

The tooling matters; the culture matters more. Teams with strong DevOps practices and middling tools usually outperform teams with state-of-the-art tools and weak culture.

Core practices: shared ownership of production reliability, blameless incident response, regular retrospectives, deliberate investment in developer experience. None require specific tools; all require sustained leadership attention.

Maturity grows gradually. A team that adopts blameless postmortems but still has weekly all-hands-on-deck deployments hasn’t internalized the practice. Watch the behaviors during stress, not the documented procedures.

Continuous Improvement Cadence

The DevOps movement’s emphasis on continuous improvement isn’t a slogan; it’s a practical requirement. Systems decay, requirements change, and tools age. Maintaining a healthy engineering organization requires deliberate, ongoing investment.

Quarterly retrospectives at the team and organization level surface what’s working and what isn’t. The output is concrete commitments to change — not abstract aspirations.

Track changes from retrospectives. Teams that don’t follow through on retrospective actions eventually stop running retrospectives. Demonstrated follow-through builds the trust that makes future retrospectives valuable.

Hiring and Team Building

DevOps and platform engineering hiring is competitive. The job market pays well; experienced engineers are in demand. Building teams requires investment in both compensation and culture.

What attracts strong candidates: meaningful work on systems they can directly impact, clear ownership boundaries, modern tooling, sustainable on-call practices, growth opportunities. What drives them away: legacy systems with no migration plan, unclear ownership, oppressive on-call, no investment in their growth.

Internal mobility matters too. Engineers in adjacent disciplines (backend development, networking, security) often become strong platform engineers with appropriate support. Building hiring pipelines that include internal transfers expands the talent pool.

Vendor Selection and Tool Procurement

DevOps organizations buy many tools. Each tool selection is a multi-year commitment with implicit ongoing costs (licenses, training, integration). The selection process deserves more attention than it usually gets.

Standard evaluation criteria: feature fit, total cost of ownership over three years, exit cost (how hard is it to migrate away later), vendor stability, and integration with existing tooling.

Avoid the trap of evaluating only on features in initial demos. The features matter; so do the rough edges that surface in week three of actual use. Trial periods and reference customers in similar environments surface what marketing doesn’t.

Practical Next Steps

For teams beginning their DevOps or platform engineering journey, the temptation is to adopt every recommended practice at once. The teams that succeed tend to focus on one or two specific improvements at a time, build the habit, and then expand scope.

Concrete next steps worth considering: instrument deployment frequency and lead time, even informally. Run a blameless retrospective after the next incident. Document the platform decisions that have been made implicitly. Survey the team about pain points and tackle the top two.

Each of these is a small investment with compounding returns. The team that runs retrospectives quarterly accumulates institutional learning that the team without them doesn’t. The team that tracks DORA metrics such as deployment frequency and lead time over a year sees trends they would otherwise miss.

The work doesn’t end. Engineering organizations are living systems that decay without active maintenance. The practices described throughout this article are tools for sustained improvement, not destinations to reach and stop.

Frequently Asked Questions

How big should a Terraform state file get?

Above a few hundred resources or 10+ minute applies, split. Below that, splitting is overhead.

Should I use Terragrunt?

Worth evaluating if you have many environments with similar structures. Adds complexity for small teams.

How do I handle secrets in Terraform?

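Don’t put them in code, and keep them out of state where you can. Reference them from a secret manager at plan/apply time, but note that any value Terraform reads or passes to a resource argument is stored in state in plain text, so encrypt the state backend and restrict who can read it. Setting sensitive = true on variables and outputs hides values from CLI output; it does not remove them from state.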
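A minimal sketch of the secret-manager approach on AWS, assuming a secret that was created outside Terraform. The secret name and database settings are placeholders. Keep in mind that the value read here still ends up in the state file, which is why the state backend itself needs encryption and tight access control.

```hcl
# Secret name is a placeholder; the secret is created outside Terraform.
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/payments/db-password"
}

resource "aws_db_instance" "payments" {
  identifier        = "prod-payments-db"
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 20
  username          = "payments"
  password          = data.aws_secretsmanager_secret_version.db_password.secret_string
}

output "db_password" {
  value     = data.aws_secretsmanager_secret_version.db_password.secret_string
  sensitive = true # redacts CLI output only; the value is still written to state
}
```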

Terraform or OpenTofu?

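OpenTofu is the open-source fork created after HashiCorp moved Terraform to the Business Source License in 2023. The two remain largely compatible for core workflows, though each has since added features the other lacks (OpenTofu’s client-side state encryption, for example). Choose based on tooling ecosystem and license preference, and check provider and tool support for whichever you pick.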