Kubernetes Networking: CNI Plugins Compared (Calico, Flannel, Cilium)

What CNI Does

The Container Network Interface (CNI) is the spec that Kubernetes uses to set up pod networking. Every cluster has a CNI plugin; the choice affects how pods get IPs, how traffic is routed, what network policies are enforceable, and what overhead the data plane carries. The Kubernetes networking documentation explains the network model all CNI plugins must satisfy.

All major plugins satisfy the basic Kubernetes networking model: every pod gets an IP, every pod can talk to every other pod by default, and services route via kube-proxy or equivalent. The differences are in performance, policy, and operational characteristics.

Flannel: Simple and Limited

Flannel is the simplest CNI plugin. It assigns a pod CIDR to each node, encapsulates cross-node traffic in VXLAN (or, with the host-gw backend, routes it directly when the nodes share an L2 segment), and gets out of the way.
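The per-node CIDR idea can be sketched in a few lines of Python. This is only an illustration of carving node subnets out of one cluster range, not Flannel's actual allocator; the 10.244.0.0/16 cluster range, the /24 per-node size, the node names, and the function name are all assumptions made for the example.

```python
import ipaddress


def allocate_node_subnets(cluster_cidr, node_names, node_prefix=24):
    """Give each node its own pod subnet carved out of the cluster CIDR."""
    network = ipaddress.ip_network(cluster_cidr)
    # subnets() lazily yields consecutive blocks of the requested prefix length.
    subnets = network.subnets(new_prefix=node_prefix)
    allocation = {}
    for name in node_names:
        try:
            allocation[name] = next(subnets)
        except StopIteration:
            raise ValueError(
                f"{cluster_cidr} has no free /{node_prefix} left for {name}"
            ) from None
    return allocation


if __name__ == "__main__":
    # Hypothetical node names; the CIDR is an assumed cluster pod range.
    nodes = ["node-a", "node-b", "node-c"]
    for node, subnet in allocate_node_subnets("10.244.0.0/16", nodes).items():
        print(node, subnet)
```

With these assumed values, a /16 split into /24 blocks leaves room for 256 nodes with roughly 254 usable pod addresses each, which is the kind of sizing trade-off worth checking before picking a pod range.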

Pros: easy to operate, lightweight, no extra moving parts. Cons: no NetworkPolicy enforcement, no advanced observability, fewer features overall. Flannel is reasonable for small clusters where you don’t need policy enforcement; for most production use cases, the lack of policy is a hard limit.

Calico: Policy-First

Calico is policy-first. It implements NetworkPolicy comprehensively and adds Calico-specific policy resources for more advanced rules. The default data plane uses standard Linux networking (iptables), and recent versions also offer an eBPF data plane.

Pros: rich policy support, mature, widely deployed, supports BGP for routing without encapsulation. Cons: more components than Flannel, learning curve for the policy model. Calico is the default choice for many teams that need real network policy enforcement.

Cilium: eBPF-Native

Cilium uses eBPF to implement the data plane in the kernel directly. The result is high performance, deep observability, and capabilities that go beyond the standard Kubernetes networking model — L7 policy, transparent encryption, service mesh features without a sidecar. The Cilium documentation covers installation, network policies, and Hubble observability.

Pros: best-in-class observability (Hubble), L7-aware policy, can replace kube-proxy with eBPF-based service routing. Cons: newer than Calico, eBPF means kernel version dependencies, larger conceptual surface area. Cilium has become a common choice for new clusters that want modern features.

Choosing Between Them

Flannel: small clusters, no policy requirements, want minimum operational overhead.

Calico: mature, broadly compatible, policy enforcement matters, BGP integration with on-prem networks.

Cilium: greenfield clusters, want best-in-class observability and modern features, kernel version supports eBPF reliably.

Cloud-managed alternatives: EKS uses the Amazon VPC CNI by default; GKE has its own; AKS uses Azure CNI. They integrate tightly with the underlying cloud network but vary in policy capability.

NetworkPolicy Patterns

NetworkPolicy syntax is awkward. The standard mental shortcut: start with a default-deny policy in the namespace, then add explicit allow rules. The Kubernetes NetworkPolicy API reference documents every field.
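As a minimal sketch of the default-deny starting point, the snippet below builds the manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The policy name, the `backend` namespace, and the helper function are hypothetical; the field names come from the `networking.k8s.io/v1` NetworkPolicy API.

```python
import json


def default_deny_policy(namespace):
    """NetworkPolicy that selects every pod in the namespace and allows nothing."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # An empty podSelector matches all pods in the namespace.
            "podSelector": {},
            # Both directions are listed with no allow rules, so both are denied.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


if __name__ == "__main__":
    # "backend" is a hypothetical namespace name.
    print(json.dumps(default_deny_policy("backend"), indent=2))
```

The policy only has an effect when the cluster's CNI plugin enforces NetworkPolicy; on a plugin without enforcement the object is accepted by the API server and silently ignored.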

Common patterns: allow ingress from specific namespaces (frontend can talk to backend), allow egress to specific external destinations (apps can talk to the database but nothing else), deny all egress to the internet (forces all traffic through a controlled egress proxy). Build the library of patterns once and reuse.
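Two of these patterns can be sketched as small manifest builders. This is one possible shape for a reusable library, not a canonical one: the function names, the `app: api` label, the namespace names, the port numbers, and the 10.20.30.40/32 database address are all assumptions. The namespace selector relies on the `kubernetes.io/metadata.name` label that recent Kubernetes versions set on every namespace.

```python
import json


def allow_ingress_from_namespace(namespace, source_namespace, pod_labels, port):
    """Allow pods in source_namespace to reach the selected pods on one TCP port."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-from-{source_namespace}", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": pod_labels},
            "policyTypes": ["Ingress"],
            "ingress": [{
                "from": [{"namespaceSelector": {"matchLabels": {
                    "kubernetes.io/metadata.name": source_namespace}}}],
                "ports": [{"protocol": "TCP", "port": port}],
            }],
        },
    }


def allow_egress_to_cidr(namespace, name, pod_labels, cidr, port):
    """Allow the selected pods to open TCP connections to one CIDR and port."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": pod_labels},
            "policyTypes": ["Egress"],
            "egress": [{
                "to": [{"ipBlock": {"cidr": cidr}}],
                "ports": [{"protocol": "TCP", "port": port}],
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical values: frontend namespace may call app=api pods on 8080,
    # and those pods may reach a database at 10.20.30.40 on 5432.
    policies = [
        allow_ingress_from_namespace("backend", "frontend", {"app": "api"}, 8080),
        allow_egress_to_cidr("backend", "allow-db", {"app": "api"}, "10.20.30.40/32", 5432),
    ]
    print(json.dumps(policies, indent=2))
```

One thing to keep in mind when egress is restricted: pods usually still need DNS, so an additional egress rule for the cluster DNS service is typically required alongside rules like these.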

eBPF and the Future of Kubernetes Networking

eBPF moves data plane logic into the Linux kernel itself. The performance gains over iptables-based networking are real, and the capability gains (L7 awareness without proxies, transparent encryption, deep observability) are substantial.

Cilium leads this transition. As eBPF support matures across kernel versions and distributions, expect kube-proxy replacement and sidecar-less service mesh to become the default rather than the exotic option.

Migration Between CNI Plugins

Replacing the CNI plugin on a running cluster is painful. Most teams replace the whole cluster instead. The new cluster runs alongside the old one; workloads migrate; the old cluster is decommissioned.

For migration-in-place situations (which are rare and high-risk), drain nodes, replace the CNI configuration, and rejoin them one at a time. Test thoroughly in staging — different CNI plugins have different pod CIDR ranges, IP allocation behaviors, and policy semantics. Surprises are common.
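One cheap pre-flight check for the CIDR differences mentioned above is to list the address ranges involved and flag any that overlap. The sketch below uses only Python's standard `ipaddress` module; the range names and values are made up for illustration, and whether a given overlap is actually a problem depends on the plugins involved.

```python
import ipaddress
from itertools import combinations


def find_overlaps(ranges):
    """Return pairs of named CIDR ranges that overlap each other."""
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(parsed.items(), 2)
        if net_a.overlaps(net_b)
    ]


if __name__ == "__main__":
    # Hypothetical ranges for a cluster being moved to a new plugin.
    ranges = {
        "old-pod-cidr": "10.244.0.0/16",
        "new-pod-cidr": "10.0.0.0/8",
        "service-cidr": "10.96.0.0/12",
        "node-network": "192.168.0.0/24",
    }
    for pair in find_overlaps(ranges):
        print("overlap:", pair)
```

In this made-up example the proposed new pod range swallows both the old pod range and the service range, which is the sort of surprise that is far easier to catch on paper than on a half-migrated cluster.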

Plan CNI choice carefully upfront. Switching costs are high enough that the initial decision deserves real evaluation.

Observability Through CNI

Modern CNI plugins (especially Cilium with Hubble) expose detailed network observability — per-flow visibility, identity-aware traffic visualization, policy decision tracing. This is dramatically more useful than basic iptables-rule-based debugging.

Hubble specifically lets you see, for example, ‘service A tried to talk to service B and was blocked by network policy X.’ That insight is hard to extract from older tooling.

For organizations debugging frequent network-level issues, the observability story is one of the strongest arguments for Cilium. Calico is catching up with its eBPF data plane and Whisker observability tooling; the gap is narrowing.

Container Image Provenance

Knowing where your container images come from is foundational. Pinning by digest (not by tag) gives immutability. Signing with Sigstore or Notary provides authenticity.
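A simple way to enforce digest pinning is a check in CI that rejects image references without a digest. The sketch below is a minimal version of that idea, assuming SHA-256 digests; the function name and the example registry are hypothetical, and it only inspects the reference string, it does not verify signatures.

```python
import re

# A digest-pinned reference ends with "@sha256:" followed by 64 hex characters.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_pinned_by_digest(image_ref):
    """Return True if the image reference is pinned to a sha256 digest."""
    return DIGEST_RE.search(image_ref) is not None


if __name__ == "__main__":
    # Hypothetical references; the digest is a placeholder, not a real image.
    fake_digest = "sha256:" + "ab" * 32
    for ref in [
        "registry.example.com/team/app:1.4.2",
        f"registry.example.com/team/app@{fake_digest}",
    ]:
        print(ref, is_pinned_by_digest(ref))
```

A tag can be moved to point at a different image after the fact, while a digest identifies the image content itself, which is why the check looks for the digest and ignores any tag.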

Build provenance — recording how the image was built, from what source, by which CI system — adds an additional layer. SLSA attestations capture this in a standardized format.

For organizations subject to executive orders or regulatory frameworks requiring software supply chain controls, provenance becomes mandatory rather than optional. Building the practice into normal CI early is cheaper than retrofitting under audit pressure.

Observability for Kubernetes Workloads

Standard observability for Kubernetes includes: pod metrics (cAdvisor exposed via kubelet), node metrics (node-exporter), API server and controller metrics, and application metrics via service annotations or ServiceMonitor.

The kube-prometheus-stack Helm chart bundles all of this with pre-built dashboards and alerts. Most clusters that want quick observability install it and customize from there. For deeper observability — distributed tracing across pods, application-level instrumentation — OpenTelemetry layers on top.

Logs follow a similar pattern: Fluent Bit or Vector runs as the agent and ships to a centralized log store (Loki, Elasticsearch, CloudWatch). Per-pod metadata enrichment makes logs searchable by deployment, namespace, and pod labels.

Capacity Planning and Right-Sizing

Kubernetes capacity planning has two layers: cluster capacity (how many nodes, what types) and workload capacity (resource requests and limits). Both deserve attention.

For cluster capacity, observe peak utilization and plan headroom. 70-80% peak utilization is a healthy target — below that, you’re paying for idle capacity; above that, autoscaling lag and burst patterns can cause issues.
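The headroom arithmetic is simple enough to write down. The sketch below sizes a node pool so that observed peak demand lands at a chosen utilization target; the function name and the example numbers are assumptions, and it looks at CPU only, whereas a real plan would repeat the calculation for memory and take the larger result.

```python
import math


def nodes_needed(peak_cpu_cores, cores_per_node, target_utilization=0.75):
    """Number of nodes so that peak demand sits at the target utilization."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Each node contributes only target_utilization of its cores at peak.
    usable_per_node = cores_per_node * target_utilization
    return math.ceil(peak_cpu_cores / usable_per_node)


if __name__ == "__main__":
    # Hypothetical workload: 180 cores at peak on nodes with 16 allocatable cores.
    for target in (0.70, 0.75, 0.80):
        print(target, nodes_needed(180, 16, target))
```

Using allocatable cores rather than raw node capacity for `cores_per_node` keeps the estimate honest, since the kubelet and system daemons reserve part of every node.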

For workload capacity, right-sizing tools (the Vertical Pod Autoscaler in recommendation mode is one option) surface candidates. Schedule quarterly right-sizing reviews. Service growth and traffic pattern changes mean yesterday’s right-size is today’s waste or saturation.

Image Optimization for Production

Beyond best practices in the Dockerfile, image optimization at the repository level pays back across many services. Standardize on a small set of base images, share optimization patterns across teams, and centralize the security-update process for those base images.

Internal base images that wrap upstream images with organization-specific additions (corporate certs, common tools, security agents) reduce per-service complexity. Build them with the same discipline as application images — pinned dependencies, signed, scanned.

Image size impacts pull time, which impacts pod startup, which impacts autoscaling responsiveness and rolling deploy duration. The end-to-end effect is larger than the per-image savings suggest.

Operational Recommendations

For teams running production Kubernetes workloads, a small set of disciplines pays back across nearly every dimension of cluster operation. Define resource requests and limits for all production workloads. Establish a network policy posture that defaults to deny. Run regular cluster upgrades on a defined cadence. Monitor cluster health alongside application health.
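The first of these disciplines is easy to check mechanically. The sketch below walks a Deployment-style manifest, already loaded as a Python dictionary, and reports containers with no requests or no limits; the function name and the sample manifest are hypothetical, and a production setup would more likely enforce this with an admission policy than with a script.

```python
def containers_missing_resources(workload):
    """Return (container_name, missing_field) pairs for a Deployment-style manifest."""
    pod_spec = workload["spec"]["template"]["spec"]
    problems = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources") or {}
        for field in ("requests", "limits"):
            # An absent or empty mapping both count as "not set".
            if not resources.get(field):
                problems.append((container["name"], field))
    return problems


if __name__ == "__main__":
    # Hypothetical Deployment with one well-defined and one bare container.
    deployment = {
        "kind": "Deployment",
        "metadata": {"name": "api"},
        "spec": {"template": {"spec": {"containers": [
            {"name": "api", "resources": {
                "requests": {"cpu": "250m", "memory": "256Mi"},
                "limits": {"memory": "512Mi"},
            }},
            {"name": "sidecar"},
        ]}}},
    }
    for name, field in containers_missing_resources(deployment):
        print(f"{name}: missing {field}")
```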

These aren’t novel recommendations; they appear in every Kubernetes best-practices guide. Yet they’re rarely fully implemented in production clusters that grew organically. The work of bringing existing clusters to this baseline is significant but worthwhile.

For new clusters, build these in from the start. Templates and operators can enforce the baseline; documentation captures the intent. Each new service onboarded gets the right defaults rather than requiring later remediation.

Operational maturity in Kubernetes is incremental. Pick the next improvement, implement it, move on. The compounded effect over time is what separates well-operated clusters from clusters that work but feel fragile.

Frequently Asked Questions

Can I switch CNI plugins after cluster creation?

Painfully. Most teams replace the whole cluster rather than migrate in place. Plan the choice upfront.

Does Cilium require a specific Linux kernel?

Cilium needs reasonably recent kernels for full feature support (5.10+ for most features). Modern node OS images cover this.
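A quick way to check a node against that threshold is to parse its kernel release string. The sketch below treats the 5.10 figure from the answer above as the default minimum; the function name is hypothetical, and the exact requirement for a given Cilium release and feature should be confirmed against the Cilium documentation.

```python
import platform
import re


def kernel_at_least(release, minimum=(5, 10)):
    """Compare the major.minor part of a kernel release string to a minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


if __name__ == "__main__":
    # platform.release() returns the same string as `uname -r` on Linux.
    release = platform.release()
    print(release, kernel_at_least(release))
```

Comparing integer tuples rather than strings matters here: as text, "5.4" sorts after "5.10", which would give the wrong answer.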

What about Weave or kube-router?

Both are functional but less actively developed than the big three; Weave Net in particular has seen little maintenance since Weaveworks shut down in 2024. Neither is a good pick for new clusters.

Is the cloud-managed CNI fine?

For many workloads, yes. The trade-off is policy capability and feature richness vs operational simplicity.