Latency, Cost, and Optimization

The Latency and Cost Constraints in Agentic Design

Moving from static chatbot applications to autonomous agentic loops shifts the engineering profile from simple request-response operations to multi-turn execution flows. In an agentic system, every turn of the loop requires an LLM inference call. If an agent goes through 5 reasoning steps and tool calls to satisfy a single user query, it incurs 5x the latency and 5x the token cost of a single turn.

To design production-grade agents, engineers must balance two main operational constraints:

Latency Budgets: Humans expect responsive interfaces. While a 2-second delay for a search engine query is acceptable, a multi-agent system executing sequentially for 30 seconds to draft an email will suffer high user abandonment rates.
Token Budgets (Financial Limits): LLMs charge per input and output token. Long system prompts, extensive database schemas in tool descriptions, and growing conversation history compound costs quadratically if not actively managed.

Under the Rule of Parsimony, we should use the least autonomous pattern (Workflow -> Router -> Agent) that satisfies the requirements. When an autonomous agent loop is necessary, we must apply specific engineering patterns to optimize latency and cost.

Prompt Caching: Reusing Context Across Turns

In an agentic loop, the system prompt, tool definitions, and few-shot examples remain identical across every turn of the conversation. Re-sending this static context on every iteration is highly inefficient.

Prompt Caching allows the LLM provider to cache the prefix of the prompt (the system instructions and tool definitions) on their servers. When subsequent turns are submitted, the model inspects the cached prefix, bypassing the need to re-process those tokens.

Latency Savings: Time to First Token (TTFT) for cached prompts drops from ~800ms to <100ms.
Financial Savings: Providers typically offer a 90% discount on cached input tokens compared to standard input tokens.
Implementation Rule: Cache matches are prefix-based. Therefore, engineers must place static instructions and tool definitions at the beginning of the prompt, and dynamic components (like the user query and variable session history) at the end.

Parallel and Concurrent Tool Execution

By default, naive agent loops process tool calls sequentially. If an agent decides it needs to lookup a user's subscription and query their recent transactions, it calls the first tool, waits for the result, feeds it back to the model, and then calls the second tool.

Concurrent Tool Execution optimizes this by allowing the agent to generate multiple tool calls in a single completion turn. The runtime host parses these requests and executes the independent operations in parallel (e.g., using asyncio in Python or worker threads).

Latency and Cost Optimization Patterns

As shown in the timeline, running tools concurrently cuts the number of required round-trips to the LLM and processes independent network-bound tool requests in parallel, reducing overall turn latency by 40% or more.

Dynamic Token Pruning

As the conversation progresses, the history of messages (turns) accumulates in the prompt context. If left unchecked, this history eventually exceeds the context window or bloats the cost of every execution.

Engineers must implement dynamic token pruning strategies:

Sliding Window: Keep only the last $N$ turns of the conversation in active context, discarding older turns.
Summarization (Pruning): When conversation history exceeds a token limit, trigger a background task to summarize the oldest turns. Replace those raw turns with the concise summary block.
Semantic Retrieval (Conversation Memory): Store older turns in a vector database and retrieve only the relevant conversation snippets matching the current user turn, keeping the active context size minimal.

Small-Model Routing

Not all reasoning steps in an agent loop require high-tier reasoning engines (e.g., Claude 3.5 Sonnet or GPT-4o). Simple tasks - such as classifying user intent, extracting parameters from a raw query, or formatting tool outputs - can be routed to smaller, cheaper models (e.g., Claude 3.5 Haiku, GPT-4o-mini, or Gemini Flash).

A Router model evaluates the incoming request and determines the routing path:

Low Complexity / Triage: Dispatched directly to a fast, cost-effective small model.
High Complexity: Dispatched to the full reasoning agent model.

By offloading triage and straightforward sub-tasks to small models, engineers preserve their token budget and minimize latency for the majority of execution paths.