End-to-End Trajectory Testing

Conversational Trajectories and Traces

In simple single-turn API applications, auditing quality is as straightforward as inspecting the final output. For agentic systems, however, the path taken to reach an answer is just as critical as the answer itself. We refer to this multi-turn path of thoughts, actions, tool calls, and model outputs as the execution trajectory (or trace).

Auditing conversational trajectories matters because an agent can reach its goal while still violating operational bounds. For example, a customer service agent might successfully refund an order, but only after calling three redundant lookup tools, consuming excess tokens, and keeping the user waiting for 15 seconds.

End-to-end (E2E) evaluations track the entire loop, from the initial user query to the final response, and inspect three things:

Functional Success: Did the agent resolve the user's core request?
Trajectory Efficiency: Did the agent run into repetitive loops or exhibit "turn bloat"?
Operational Safety: Did the agent violate any safety boundaries during intermediate tool calls?
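As a rough illustration, the three checks above could be run over a recorded trajectory along these lines. The Step and Trajectory types, the tool names, and the FORBIDDEN_TOOLS policy are all hypothetical; they are not taken from any particular agent framework.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Step:
    """One turn in the trajectory (hypothetical schema)."""
    kind: str          # "thought", "tool_call", or "output"
    name: str = ""     # tool name, for tool calls
    args: tuple = ()   # tool arguments, kept hashable so calls can be counted


@dataclass
class Trajectory:
    query: str
    steps: list = field(default_factory=list)
    resolved: bool = False  # did the final state satisfy the user's request?


# Assumed policy: tools the agent must never invoke on its own.
FORBIDDEN_TOOLS = {"delete_account"}


def redundant_calls(traj):
    """Count tool calls that repeat an earlier call with identical arguments."""
    calls = Counter((s.name, s.args) for s in traj.steps if s.kind == "tool_call")
    return sum(n - 1 for n in calls.values())


def safety_violations(traj):
    """List forbidden tools that were invoked anywhere in the trajectory."""
    return [s.name for s in traj.steps
            if s.kind == "tool_call" and s.name in FORBIDDEN_TOOLS]


def audit(traj):
    """Bundle the three checks into one report."""
    return {
        "functional_success": traj.resolved,
        "redundant_calls": redundant_calls(traj),
        "safety_violations": safety_violations(traj),
    }


# A refund that succeeds, but only after repeating the same lookup.
refund = Trajectory(
    query="Refund order 1001",
    steps=[
        Step("tool_call", "lookup_order", ("1001",)),
        Step("tool_call", "lookup_order", ("1001",)),
        Step("tool_call", "lookup_order", ("1001",)),
        Step("tool_call", "lookup_order", ("1001",)),
        Step("tool_call", "issue_refund", ("1001",)),
        Step("output"),
    ],
    resolved=True,
)
report = audit(refund)
```

Here the report marks the run as functionally successful while still surfacing the repeated lookups, which an output-only check would miss.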

Analyzing Trajectory Efficiency and Turn Bloat

One of the primary metrics in agent engineering is the Turn Count (the number of conversational rounds between the agent and tools).

An unexpected spike in turn count usually signals turn bloat, which tends to come from one of two failure modes:

Tool Loops: The model executes a tool, receives an error, and retries the same tool with minor argument variations, repeating this loop indefinitely.
Planner Indecision: The agent drafts plans, rejects them, and drafts new plans without executing any concrete actions.

Measuring turn efficiency lets developers set a max_turns limit that is high enough for legitimate tasks to finish but low enough to cap latency and cost when the agent gets stuck.
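The two failure modes can be flagged with simple heuristics over the recorded steps. The sketch below assumes each step is a dict with a "type" field ("plan", "tool_call", or "output"), plus "tool" and "error" fields on tool calls; that schema and the default thresholds are illustrative assumptions, not a standard.

```python
def longest_failed_tool_streak(steps):
    """Longest run of consecutive failed calls to the same tool."""
    best = run = 0
    prev = None
    for step in steps:
        if step["type"] == "tool_call" and step.get("error"):
            run = run + 1 if step["tool"] == prev else 1
            prev = step["tool"]
        else:
            run, prev = 0, None
        best = max(best, run)
    return best


def longest_plan_streak(steps):
    """Longest run of consecutive plan steps with no action in between."""
    best = run = 0
    for step in steps:
        run = run + 1 if step["type"] == "plan" else 0
        best = max(best, run)
    return best


def diagnose(steps, max_turns=8, loop_threshold=3):
    """Return the list of efficiency flags raised by a trajectory."""
    flags = []
    if len(steps) > max_turns:
        flags.append("turn_bloat")
    if longest_failed_tool_streak(steps) >= loop_threshold:
        flags.append("tool_loop")
    if longest_plan_streak(steps) >= loop_threshold:
        flags.append("planner_indecision")
    return flags


# Same tool retried with slightly different arguments after each error.
looping = [
    {"type": "tool_call", "tool": "search_orders", "args": {"id": "1001"}, "error": True},
    {"type": "tool_call", "tool": "search_orders", "args": {"id": "01001"}, "error": True},
    {"type": "tool_call", "tool": "search_orders", "args": {"id": "#1001"}, "error": True},
    {"type": "output"},
]

# Plans drafted and discarded without any tool call.
dithering = [{"type": "plan"}, {"type": "plan"}, {"type": "plan"}, {"type": "output"}]

loop_flags = diagnose(looping)
plan_flags = diagnose(dithering)
```

Running these checks over a batch of traces gives a distribution of turn counts, which is one reasonable basis for choosing max_turns rather than guessing.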

End-to-End Evaluation Flow

The diagram below details the E2E evaluation loop, which has three stages:

[Figure: E2E Trajectory Evaluation]

Trace Collection: Every turn is appended to a structured log file or telemetry database.
Metric Extraction: Scripted parsers calculate the final turn count, response latency, and API costs.
LLM Grading: A simulated or real LLM judge reads the complete transcript and grades its qualitative alignment against a rubric.
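The first two stages, trace collection and metric extraction, might look like the following minimal sketch. It assumes a JSON Lines log in which each turn records start and end timestamps and a token count; the field names and the USD_PER_1K_TOKENS rate are placeholders, since real schemas and prices vary by provider.

```python
import io
import json

# Assumed price; real per-token rates depend on the provider and model.
USD_PER_1K_TOKENS = 0.002


def append_turn(log, turn):
    """Trace collection: write one turn as a JSON line."""
    log.write(json.dumps(turn) + "\n")


def extract_metrics(log_text):
    """Metric extraction: derive turn count, latency, and cost from the log."""
    turns = [json.loads(line) for line in log_text.splitlines() if line.strip()]
    if not turns:
        return {"turn_count": 0, "latency_s": 0.0, "cost_usd": 0.0}
    total_tokens = sum(t["tokens"] for t in turns)
    return {
        "turn_count": len(turns),
        # Wall-clock time from the start of the first turn to the end of the last.
        "latency_s": round(turns[-1]["end_s"] - turns[0]["start_s"], 3),
        "cost_usd": round(total_tokens / 1000 * USD_PER_1K_TOKENS, 6),
    }


# An in-memory buffer stands in for a log file or telemetry database.
log = io.StringIO()
append_turn(log, {"role": "agent", "start_s": 0.0, "end_s": 1.2, "tokens": 300})
append_turn(log, {"role": "tool", "start_s": 1.2, "end_s": 1.9, "tokens": 0})
append_turn(log, {"role": "agent", "start_s": 1.9, "end_s": 3.1, "tokens": 200})
metrics = extract_metrics(log.getvalue())
```

Keeping the parser separate from the agent means the same log can later be handed, unchanged, to the judge for grading.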

Interactive Playground: Writing the Eval Loop

In this playground, you will write a complete E2E evaluation loop.

We have provided:

A Golden Dataset containing customer service queries.
A Support Agent function (run_support_agent) that simulates multi-turn execution.
A Simulated LLM Judge (run_simulated_llm_judge) that evaluates the trajectory and outputs a grade.

Your task is to write the evaluation loop: run each query, record its trajectory, invoke the LLM judge, and compute the summary scorecard (average turns, token cost, and success rate as a percentage).
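One possible shape for such a loop is sketched below. The playground's real run_support_agent and run_simulated_llm_judge are not shown here, so both are replaced by stubs whose signatures and return values are assumptions made purely for illustration, as is the three-item dataset. Token usage stands in for token cost.

```python
from statistics import mean

# Stand-in golden dataset; the playground's real dataset is not shown here.
GOLDEN_DATASET = [
    {"id": "q1", "query": "Where is my order?"},
    {"id": "q2", "query": "I want a refund."},
    {"id": "q3", "query": "Cancel my subscription."},
]


def run_support_agent(query):
    """Stub with an ASSUMED return shape: a list of turns with token counts."""
    n_turns = 2 if "order" in query else 4
    return [{"turn": i + 1, "tokens": 150} for i in range(n_turns)]


def run_simulated_llm_judge(query, trajectory):
    """Stub with an ASSUMED return shape: a pass/fail grade plus a reason."""
    passed = len(trajectory) <= 4 and "subscription" not in query
    return {"passed": passed, "reason": "stubbed rubric"}


def run_eval(dataset):
    """Run every query, grade its trajectory, and aggregate a scorecard."""
    records = []
    for case in dataset:
        trajectory = run_support_agent(case["query"])
        verdict = run_simulated_llm_judge(case["query"], trajectory)
        records.append({
            "id": case["id"],
            "turns": len(trajectory),
            "tokens": sum(t["tokens"] for t in trajectory),
            "passed": verdict["passed"],
        })
    scorecard = {
        "avg_turns": mean(r["turns"] for r in records),
        "avg_tokens": mean(r["tokens"] for r in records),
        "success_rate_pct": 100 * sum(r["passed"] for r in records) / len(records),
    }
    return records, scorecard


records, scorecard = run_eval(GOLDEN_DATASET)
```

Returning the per-query records alongside the scorecard makes it easier to trace a drop in the aggregate back to the specific queries that caused it.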


E2E evaluations serve as the final gate before an agent is deployed. By combining quantitative metrics such as turn count and token cost with semantic assessments from an LLM judge, we can build continuous integration quality gates that monitor performance and catch regressions.
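A quality gate of this kind can be as small as a threshold check over the summary scorecard. In the sketch below, the metric names and threshold values are hypothetical and would need tuning against a team's own baseline runs.

```python
# Hypothetical thresholds: ("max", x) means the metric must not exceed x,
# ("min", x) means it must not fall below x.
THRESHOLDS = {
    "avg_turns": ("max", 5.0),
    "avg_tokens": ("max", 2000),
    "success_rate_pct": ("min", 90.0),
}


def check_gate(scorecard, thresholds=THRESHOLDS):
    """Return a list of human-readable failures; an empty list means pass."""
    failures = []
    for metric, (kind, limit) in thresholds.items():
        value = scorecard[metric]
        ok = value <= limit if kind == "max" else value >= limit
        if not ok:
            failures.append(f"{metric}={value} violates {kind} {limit}")
    return failures


healthy = {"avg_turns": 3.2, "avg_tokens": 900, "success_rate_pct": 96.0}
regressed = {"avg_turns": 7.5, "avg_tokens": 900, "success_rate_pct": 80.0}

healthy_failures = check_gate(healthy)
regressed_failures = check_gate(regressed)
```

In a CI job, a non-empty failure list would typically be printed and turned into a non-zero exit code so that the build fails.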