Workers AI + Kimi K2.5: Enterprise Blueprint for Session-Aware Agent Platforms
Cloudflare’s announcement that Workers AI now runs frontier-scale models such as Kimi K2.5, with emphasis on prefix caching and x-session-affinity, is more than a model catalog expansion. It is a signal that agent workloads are shifting from “prototype APIs” to full-stack platform engineering.
Reference: https://blog.cloudflare.com/workers-ai-large-models/.
Architectural implication
Treat agent execution as a stateful distributed system with explicit controls for:
- session locality
- context reuse
- tool-call isolation
- cost predictability
Without those controls, large context windows become a tax on both latency and budget.
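The four controls above can be made concrete as a per-session policy plus a context-budget enforcer. This is a sketch under assumptions: the type and function names (`SessionPolicy`, `enforceContextBudget`) are illustrative, not part of any Cloudflare API.

```typescript
// Hypothetical per-session policy capturing the four explicit controls.
interface SessionPolicy {
  affinityKey: string;        // session locality: route to the same backend
  maxContextTokens: number;   // context reuse budget before summarization
  allowedTools: Set<string>;  // tool-call isolation
  costCeilingUsd: number;     // cost predictability per session
}

// Enforce the context budget: evict the oldest turns once the budget is
// exceeded, always preserving the system prompt (index 0) and at least
// one recent turn.
function enforceContextBudget(
  turns: { role: string; tokens: number }[],
  maxContextTokens: number,
): { role: string; tokens: number }[] {
  let total = turns.reduce((sum, t) => sum + t.tokens, 0);
  const kept = [...turns];
  // Evict from index 1 so the system prompt survives.
  while (total > maxContextTokens && kept.length > 2) {
    const evicted = kept.splice(1, 1)[0];
    total -= evicted.tokens;
  }
  return kept;
}
```

Without an explicit budget like this, the eviction decision is made implicitly by the model's context-window limit, at the worst possible time.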
Recommended deployment topology
- Workers for request admission, auth, and policy checks
- Durable Objects for per-session state and affinity keys
- Workers AI for inference execution
- Workflows for long-running orchestration
- R2/KV for artifacts, summaries, and checkpoints
This architecture keeps state, compute, and governance in one observable surface.
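The per-session piece of that topology can be sketched as a plain class: in production this would be a Durable Object whose storage survives restarts and whose name provides the affinity key; here an in-memory stand-in models the same contract. All names are illustrative, and the `Map` stands in for R2/KV.

```typescript
// In-memory stand-in for a per-session Durable Object.
class SessionState {
  readonly affinityKey: string;
  private turns: string[] = [];

  constructor(tenantId: string, sessionId: string) {
    // One stable name per (tenant, session) pair: the Worker would use
    // this name to route every request in the session to the same object.
    // Tenant is included so session IDs cannot collide across customers.
    this.affinityKey = `${tenantId}:${sessionId}`;
  }

  appendTurn(turn: string): number {
    this.turns.push(turn);
    return this.turns.length;
  }

  // Checkpoint a summary to an artifact store (R2/KV in the real
  // topology) and truncate the in-object history to cap context growth.
  checkpoint(store: Map<string, string>, summary: string): void {
    store.set(`checkpoints/${this.affinityKey}`, summary);
    this.turns = [];
  }

  turnCount(): number {
    return this.turns.length;
  }
}
```

The design point is that affinity, state, and checkpointing live behind one addressable object per session rather than being scattered across stateless handlers.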
Prefix cache as a product metric
Most teams treat caching as a backend implementation detail. For agent workloads, cache behavior should be a first-class product KPI:
- cache hit ratio per workflow
- cached-token percentage over time
- TTFT variance by session class
If these are invisible, cost anomalies are discovered too late.
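A minimal KPI aggregator for these metrics might look like the following. The `UsageRecord` field names are assumptions for illustration, not a Workers AI response schema; the p50 here is the cheap upper-median of the sorted TTFT samples.

```typescript
// Per-request usage record; cachedTokens counts prompt tokens served
// from the prefix cache (field names are assumed, not an API schema).
interface UsageRecord {
  workflow: string;
  cachedTokens: number;
  promptTokens: number;
  ttftMs: number;
}

// Aggregate two cache KPIs per workflow: cached-token percentage and
// median time-to-first-token.
function cacheKpis(records: UsageRecord[]) {
  const byWorkflow = new Map<string, UsageRecord[]>();
  for (const r of records) {
    const bucket = byWorkflow.get(r.workflow) ?? [];
    bucket.push(r);
    byWorkflow.set(r.workflow, bucket);
  }
  const report = new Map<string, { cachedPct: number; ttftP50: number }>();
  for (const [wf, rs] of byWorkflow) {
    const cached = rs.reduce((s, r) => s + r.cachedTokens, 0);
    const prompt = rs.reduce((s, r) => s + r.promptTokens, 0);
    const ttfts = rs.map((r) => r.ttftMs).sort((a, b) => a - b);
    report.set(wf, {
      cachedPct: prompt === 0 ? 0 : (100 * cached) / prompt,
      ttftP50: ttfts[Math.floor(ttfts.length / 2)],
    });
  }
  return report;
}
```

Emitting this report per workflow, rather than per request, is what turns cache behavior into something a product owner can put on a dashboard.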
Reliability patterns
- Idempotent tool execution with request IDs.
- Deadline-aware retries, with separate retry budgets for prefill-heavy and generation-heavy tasks.
- Summarization checkpoints every N turns to cap context growth.
- Region-aware routing for data-boundary requirements.
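The first pattern, idempotent tool execution, can be sketched as a result cache keyed by request ID: retries with the same ID replay the stored result instead of re-running a side-effecting tool. The class and field names are illustrative.

```typescript
// Idempotent tool execution keyed by request ID. The first call with a
// given ID executes the tool; subsequent calls replay the stored result.
class IdempotentExecutor {
  private results = new Map<string, string>();
  executions = 0; // observable count of real tool runs

  run(requestId: string, tool: () => string): string {
    const prior = this.results.get(requestId);
    if (prior !== undefined) return prior; // retry path: no re-execution
    this.executions++;
    const result = tool();
    this.results.set(requestId, result);
    return result;
  }
}
```

In the topology above, this map would live in the session's Durable Object storage rather than process memory, so a retry after a Worker restart is still deduplicated.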
Security controls
- scoped credentials per tool adapter
- outbound allowlists for web/data access
- structured redaction before prompt/response logging
- immutable policy decisions attached to session records
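Structured redaction, the third control, reduces to running every prompt/response record through an ordered pattern table before it reaches logs. The patterns below are a minimal illustrative set, not an exhaustive redaction policy.

```typescript
// Ordered redaction table: each pattern is replaced by a typed label so
// logs stay debuggable without leaking the original value.
const REDACTIONS: [RegExp, string][] = [
  [/[\w.+-]+@[\w-]+\.[\w.]+/g, "[email]"],   // email addresses
  [/\bsk-[A-Za-z0-9]{8,}\b/g, "[api-key]"],  // API-key-shaped tokens
  [/\b\d{3}-\d{2}-\d{4}\b/g, "[ssn]"],       // US SSN pattern
];

function redact(text: string): string {
  return REDACTIONS.reduce(
    (t, [pattern, label]) => t.replace(pattern, label),
    text,
  );
}
```

Typed labels (`[email]`, `[api-key]`) beat blanket masking because an incident responder can still see what kind of data leaked into a prompt without seeing the value itself.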
60-day migration approach
- Phase 1 (Weeks 1–2): baseline token, latency, and failure profiles.
- Phase 2 (Weeks 3–4): activate session affinity and cache reporting.
- Phase 3 (Weeks 5–6): split workloads by risk and latency class.
- Phase 4 (Weeks 7–8): enforce policy-driven routing and formal SLOs.
Closing
Large-model access is table stakes; operational discipline is differentiation. Teams that operationalize session affinity, context budgets, and policy telemetry will ship faster without uncontrolled AI spend.