Model-Graded Evaluation Methods

Why Programmatic Checks Fail for Semantics

While programmatic assertions are excellent at checking JSON structures, API response codes, and formatting rules, they are blind to the actual meaning and quality of natural language. An agent can easily produce output that is grammatically valid and structurally complete, yet factually incorrect, rude, or unhelpful.

For example, if an angry user asks "Why is my shipment late?", a programmatic rule can confirm that the agent returned a string containing a tracking number. It cannot easily determine whether the agent's tone was empathetic, whether it apologized appropriately, or whether it hallucinated a policy.
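A minimal sketch of that limitation follows. The tracking-number format and both sample responses are invented for illustration; they do not come from any real carrier or dataset.

```python
import re

# Hypothetical tracking-number format (two letters + nine digits),
# assumed purely for illustration.
TRACKING_PATTERN = re.compile(r"\b[A-Z]{2}\d{9}\b")


def contains_tracking_number(response: str) -> bool:
    """Structural check: does the response mention a tracking number?"""
    return TRACKING_PATTERN.search(response) is not None


polite = "I'm sorry for the delay. Your tracking number is AB123456789."
rude = "Not our problem. Tracking number AB123456789. Stop asking."

# Both responses satisfy the rule, so the check cannot tell them apart.
checks = [contains_tracking_number(polite), contains_tracking_number(rude)]
print(checks)
```

The rule passes for both responses, which is exactly the gap a semantic evaluator is meant to close.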

To evaluate open-ended natural language, we use Model-Graded Evaluation (LLM-as-a-Judge).

Model-Graded Evaluation (LLM-as-a-Judge)

Model-graded evaluation uses a separate language model (the "Judge"), typically larger and more capable than the target agent, to evaluate that agent's output. The judge receives the user's input query, any retrieved context documents, the agent's response, and a strict rubric defining the quality criteria.
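One way to assemble those four inputs into a single judge prompt is sketched below. The `JudgeInput` class, the `build_judge_prompt` function, and the section labels are all hypothetical names chosen for this example, not part of any particular library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class JudgeInput:
    """Everything the judge sees for one evaluation (hypothetical container)."""
    query: str
    response: str
    rubric: str
    context_docs: List[str] = field(default_factory=list)


def build_judge_prompt(item: JudgeInput) -> str:
    """Lay out the rubric, query, context, and response in labeled sections."""
    context = "\n".join(f"- {doc}" for doc in item.context_docs) or "(none)"
    return (
        f"RUBRIC:\n{item.rubric}\n\n"
        f"USER QUERY:\n{item.query}\n\n"
        f"RETRIEVED CONTEXT:\n{context}\n\n"
        f"AGENT RESPONSE:\n{item.response}\n"
    )


example = JudgeInput(
    query="Why is my shipment late?",
    response="I'm sorry about the delay. A storm held up the truck.",
    rubric="PASS only if the response is polite and consistent with the context.",
    context_docs=["Carrier notice: weather delays in the region."],
)
print(build_judge_prompt(example))
```

Keeping each input in its own labeled section makes it easier for the judge to separate what the user asked from what the agent claimed.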

Using LLM judges offers several advantages:

Semantic Understanding: The judge understands synonyms, paraphrases, and context, allowing it to verify if an answer is factually correct without requiring an exact string match.
Qualitative Auditing: The judge can evaluate abstract traits like politeness, tone, clarity, and safety alignment.
Scalability: Automated judges approximate human scoring at a fraction of the cost and time, allowing developers to run hundreds of evaluations in continuous integration pipelines.

However, model-graded evals also introduce challenges: they add token cost and latency, and the judge is itself subject to the same nondeterminism it is designed to test. To make LLM judges reliable, we write highly structured prompts that guide the model to output a grade (e.g. GRADE: PASS or GRADE: FAIL) followed by a step-by-step reasoning chain.
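A structured verdict is only useful if the calling code parses it strictly. The sketch below shows one possible parser; the function name `parse_verdict` and the choice to return `None` for unparseable output are assumptions made for this example.

```python
import re
from typing import Optional, Tuple

# Matches a line that starts with "GRADE: PASS" or "GRADE: FAIL".
GRADE_PATTERN = re.compile(r"^\s*GRADE:\s*(PASS|FAIL)\b", re.MULTILINE)


def parse_verdict(judge_output: str) -> Tuple[Optional[bool], str]:
    """Return (passed, reasoning); passed is None if no grade line is found."""
    match = GRADE_PATTERN.search(judge_output)
    if match is None:
        # The judge ignored the format; surface that instead of guessing.
        return None, judge_output.strip()
    passed = match.group(1) == "PASS"
    reasoning = judge_output[match.end():].strip()
    return passed, reasoning


print(parse_verdict("GRADE: FAIL\nThe agent never apologized."))
print(parse_verdict("Seems okay to me."))
```

Treating a missing grade as its own outcome, rather than as a pass or a fail, keeps judge formatting errors visible in the results.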

Interactive Playground: Politeness & Tone Auditor

In this exercise, you will complete the LLM-as-a-judge validation logic.

We have a dataset containing simulated customer interactions. You must implement the quality auditor inside run_llm_judge to:

Draft a Rubric System Prompt: Write instructions telling the judge model to evaluate if the agent's response was polite and professional.
Make the Completions Call: Invoke the completions API proxy to obtain the verdict. The judge must output exactly: GRADE: PASS or GRADE: FAIL, followed by its reason.

Run the playground to see the judge outputs, then complete the implementation to audit the agent's tone.
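As a starting point, here is one possible shape for the auditor. The signature of `run_llm_judge`, the `complete(system_prompt, user_prompt)` callable standing in for the completions API proxy, and the `fake_complete` stub are all assumptions; the playground's real interface may differ.

```python
from typing import Callable, Dict

# Rubric system prompt: states the criterion and the required output format.
RUBRIC_SYSTEM_PROMPT = (
    "You are a quality auditor for customer-support conversations.\n"
    "Decide whether the agent's response is polite and professional.\n"
    "Reply with exactly 'GRADE: PASS' or 'GRADE: FAIL' on the first line, "
    "then give your reason on the following lines."
)


def run_llm_judge(
    user_query: str,
    agent_response: str,
    complete: Callable[[str, str], str],  # hypothetical stand-in for the API proxy
) -> Dict[str, str]:
    """Ask the judge for a verdict and split it into grade and reason."""
    user_prompt = (
        f"Customer message:\n{user_query}\n\n"
        f"Agent response:\n{agent_response}"
    )
    raw = complete(RUBRIC_SYSTEM_PROMPT, user_prompt).strip()
    first_line, _, reason = raw.partition("\n")
    verdicts = {"GRADE: PASS": "PASS", "GRADE: FAIL": "FAIL"}
    # Anything other than the two allowed first lines is marked INVALID.
    grade = verdicts.get(first_line.strip(), "INVALID")
    return {"grade": grade, "reason": reason.strip()}


def fake_complete(system_prompt: str, user_prompt: str) -> str:
    """Offline stub, not a real model: flags one hard-coded rude phrase."""
    agent_part = user_prompt.split("Agent response:\n", 1)[1]
    if "not our problem" in agent_part.lower():
        return "GRADE: FAIL\nThe response is dismissive."
    return "GRADE: PASS\nThe response is courteous."


print(run_llm_judge("Why is my shipment late?", "Not our problem.", fake_complete))
```

Passing the completion function in as an argument lets the same auditor run against a stub in tests and against the real proxy in the playground.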

[Figure: Model-Graded Auditor Flow]
