Always-On AI Is Becoming a Network Engineering Problem
Trend Signals
- ITmedia highlighted joint efforts to address traffic growth caused by always-on AI systems.
- Cloudflare engineering posts emphasized transport resilience and client behavior in modern SASE paths.
- Teams on HN increasingly report “network-shaped” incidents in AI-assisted workflows.
Why AI Traffic Is Different
Traditional web traffic has relatively predictable burst patterns. Always-on AI introduces:
- Longer-lived sessions with higher request complexity
- Token-streaming behavior that amplifies tail latency sensitivity
- Multi-hop chains (retrieval, tools, policy checks) per user action
- Greater dependence on transport quality for UX continuity
As a result, AI reliability is no longer only about model serving. It is about end-to-end traffic choreography.
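The tail-latency amplification from multi-hop chains can be illustrated with a small Monte Carlo sketch. All figures here are assumptions for illustration: each hop is modeled as a lognormal latency with a ~50 ms median, and a four-hop chain stands in for a gateway → retrieval → policy check → model path.

```python
import math
import random

random.seed(7)

def hop_latency():
    # Assumed per-hop latency: lognormal, median ~50 ms, heavy tail.
    return random.lognormvariate(math.log(0.050), 0.6)

def p95(samples):
    """Nearest-rank 95th percentile."""
    s = sorted(samples)
    return s[math.ceil(0.95 * len(s)) - 1]

N = 20_000
single = [hop_latency() for _ in range(N)]
# Four hops per user action (assumed chain depth).
chained = [sum(hop_latency() for _ in range(4)) for _ in range(N)]

print(f"one hop   P95: {p95(single) * 1000:.0f} ms")
print(f"four hops P95: {p95(chained) * 1000:.0f} ms")
```

Even though the hops are independent, the chained P95 grows far faster than the per-hop median suggests, which is why per-service dashboards can all look fine while user-visible latency does not.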
The Three-Layer Bottleneck Model
1) Edge and Client Path
- MTU mismatches, packet loss, and protocol fallback can quietly degrade generation latency.
- Mobile and enterprise VPN clients create asymmetric path quality.
2) Service Mesh / Internal East-West
- Retrieval and tool calls multiply service-to-service traffic.
- Timeout defaults designed for CRUD APIs fail for streaming workloads.
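One way to fix the CRUD-timeout mismatch is to split a single request timeout into two separate budgets: an idle timeout between chunks and a generous total deadline. The sketch below assumes the stream arrives as an iterable of text chunks; the function name and defaults are illustrative, not any particular framework's API.

```python
import time

class StreamTimeout(Exception):
    pass

def read_stream(chunks, idle_timeout=10.0, total_deadline=300.0):
    """Drain a token stream under two separate budgets:
    - idle_timeout: max silence between chunks (catches a stalled upstream)
    - total_deadline: overall cap, set high enough for long generations
    A single CRUD-style timeout either kills healthy long streams or lets
    stalled ones hang until the client gives up."""
    start = last = time.monotonic()
    parts = []
    for chunk in chunks:
        now = time.monotonic()
        if now - last > idle_timeout:
            raise StreamTimeout("stream stalled between chunks")
        if now - start > total_deadline:
            raise StreamTimeout("total deadline exceeded")
        last = now
        parts.append(chunk)
    return "".join(parts)
```

The design point is that the two budgets fail differently: the idle timeout detects a wedged upstream within seconds, while the total deadline bounds cost without punishing long but healthy generations.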
3) Model Runtime Tier
- Queueing effects dominate during soft saturation.
- GPU/accelerator utilization can look “healthy” while user latency collapses.
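The "healthy utilization, collapsing latency" effect falls straight out of basic queueing theory. A minimal M/M/1 sketch (service rate is an assumed figure) shows mean time in system blowing up as utilization approaches 1:

```python
def mm1_time_in_system(mu, lam):
    """Mean time in system for an M/M/1 queue: W = 1 / (mu - lam)."""
    assert lam < mu, "queue is unstable at or above full utilization"
    return 1.0 / (mu - lam)

mu = 10.0  # assumed service rate: requests/sec one replica can handle
for util in (0.50, 0.80, 0.90, 0.95, 0.99):
    lam = util * mu
    w_ms = mm1_time_in_system(mu, lam) * 1000
    print(f"utilization {util:.0%} -> mean latency {w_ms:6.0f} ms")
```

Between 50% and 99% utilization, mean latency grows by a factor of 50, while a utilization dashboard reports a number that looks comfortably below saturation. Real serving systems are not M/M/1, but the qualitative cliff is the same.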
Operational Controls That Work
Introduce AI-aware SLOs
- First-token latency (P95)
- Stream interruption rate
- Tool-chain completion latency
- Retrieval miss-to-fallback ratio
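The first three SLOs above can be computed from per-request records. The record shape and function names below are assumptions for illustration, not an existing library:

```python
import math
from dataclasses import dataclass

@dataclass
class Request:
    first_token_s: float   # time to first streamed token
    interrupted: bool      # stream ended abnormally
    toolchain_s: float     # end-to-end tool-chain latency

def p95(samples):
    """Nearest-rank 95th percentile."""
    s = sorted(samples)
    return s[math.ceil(0.95 * len(s)) - 1]

def slo_report(reqs):
    return {
        "first_token_p95_s": p95([r.first_token_s for r in reqs]),
        "stream_interruption_rate": sum(r.interrupted for r in reqs) / len(reqs),
        "toolchain_p95_s": p95([r.toolchain_s for r in reqs]),
    }
```

The key shift is the unit of measurement: these are per-user-action metrics, not per-endpoint metrics, so one slow tool hop shows up in the SLO even when every individual service meets its own target.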
Build traffic classes
- Interactive premium (strict latency budget)
- Standard interactive
- Deferred batch inference
Enforce class-based admission during spikes to protect critical UX.
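A minimal sketch of class-based admission, assuming a normalized load signal in [0, 1] and shed thresholds chosen for illustration: batch work is shed first, premium interactive last.

```python
from enum import IntEnum

class TrafficClass(IntEnum):
    INTERACTIVE_PREMIUM = 0
    STANDARD_INTERACTIVE = 1
    DEFERRED_BATCH = 2

# Assumed shed thresholds: lower classes are rejected earlier as load rises.
SHED_AT = {
    TrafficClass.DEFERRED_BATCH: 0.70,
    TrafficClass.STANDARD_INTERACTIVE: 0.90,
    TrafficClass.INTERACTIVE_PREMIUM: 0.98,
}

def admit(traffic_class, load):
    """Admit a request only while load is below its class threshold."""
    return load < SHED_AT[traffic_class]
```

Rejected batch requests would typically be queued for retry rather than dropped; the point is that premium interactive traffic keeps its latency budget through the spike.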
Engineer graceful degradation
- Compress retrieval breadth before model quality drops
- Switch from multi-tool to single-tool plans when congestion rises
- Return concise mode under severe saturation
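The three degradation steps above can be expressed as a ladder keyed on a congestion signal. Thresholds, field names, and document counts here are illustrative assumptions:

```python
def plan_for_congestion(congestion):
    """Degradation ladder: trade features before model quality.
    congestion is a normalized load signal in [0, 1] (assumed)."""
    if congestion < 0.70:
        return {"retrieval_docs": 20, "tools": "multi", "response": "full"}
    if congestion < 0.85:
        # Step 1: compress retrieval breadth first.
        return {"retrieval_docs": 8, "tools": "multi", "response": "full"}
    if congestion < 0.95:
        # Step 2: collapse multi-tool plans to a single tool.
        return {"retrieval_docs": 4, "tools": "single", "response": "full"}
    # Step 3: concise mode under severe saturation.
    return {"retrieval_docs": 2, "tools": "single", "response": "concise"}
```

Because each rung is a deliberate product decision, degraded responses stay predictable, which is far easier to explain to users than random timeouts.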
Capacity Planning Playbook
- Model token demand by workflow, not by endpoint.
- Simulate monthly and incident-time surges.
- Add transport chaos tests (loss, jitter, PMTU mismatch).
- Validate failover behavior for model and retrieval tiers independently.
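Modeling token demand by workflow rather than by endpoint can be as simple as the sketch below. The workflow names and all figures are invented for illustration; the structure is what matters: one user action fans out into several model calls, so per-endpoint request counts undercount real demand.

```python
# Assumed workflow profiles: actions/min, model calls per action, tokens/call.
WORKFLOWS = {
    "support_chat": {"actions_per_min": 120, "calls_per_action": 3,
                     "tokens_per_call": 900},
    "code_assist":  {"actions_per_min": 300, "calls_per_action": 2,
                     "tokens_per_call": 1500},
}

def tokens_per_second(workflows, surge_factor=1.0):
    """Aggregate token demand across workflows, scaled for surges."""
    total = 0.0
    for w in workflows.values():
        total += ((w["actions_per_min"] / 60.0)
                  * w["calls_per_action"]
                  * w["tokens_per_call"])
    return total * surge_factor

print(tokens_per_second(WORKFLOWS))        # steady state
print(tokens_per_second(WORKFLOWS, 3.0))   # assumed incident-time surge
```

Running the surge factor through the same model makes the monthly and incident-time capacity cases directly comparable.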
Common Failure Pattern
Many teams scale inference nodes but ignore path instability and downstream fan-out. The visible symptom is "model slowdown," but the root cause sits in the network and orchestration layers. Fixing this requires a joint SRE + platform + ML operations routine.
Bottom Line
Always-on AI is effectively creating a new discipline: LLM traffic engineering. Organizations that formalize it now will avoid a year of misdirected model blame and expensive overprovisioning.