Programmatic Evaluation Methods

The Agentic Testing Dilemma

Traditional software engineering relies on deterministic unit testing where a function given a static input must return a predictable output. We assert on exact string equality, database schemas, or API status codes. For language model agents, these assumptions fail because of three architectural realities.

First, language models are probabilistic engines. Even at a temperature of 0.0, running the same prompt repeatedly can produce different intermediate reasoning, minor variations in tool parameter formatting, or different phrasing in the final response. Second, a user query like "Help me reconcile my billing discrepancies" has no single correct textual response; a correct summary can be worded in many different ways. Third, an agent does not just output text; it interacts with databases, schedules cron jobs, and issues API payloads. Testing only the final message ignores critical intermediate states, such as whether the agent executed an unnecessary loop or mutated a database record incorrectly.
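The third point can be made concrete with a check that inspects the agent's recorded steps rather than its final message. The following is a minimal sketch, assuming the run is logged as a list of dictionaries with hypothetical "type", "tool", and "args" keys; real agent frameworks record traces in their own formats.

```python
import json
from collections import Counter


def find_repeated_tool_calls(trace, max_repeats=1):
    """Return tool calls issued more than max_repeats times with identical arguments."""
    counts = Counter(
        # Serialize arguments with sorted keys so equal dicts compare equal.
        (step["tool"], json.dumps(step["args"], sort_keys=True))
        for step in trace
        if step.get("type") == "tool_call"
    )
    return [
        {"tool": tool, "args": json.loads(args), "count": n}
        for (tool, args), n in counts.items()
        if n > max_repeats
    ]


# Hypothetical trace: the agent fetched the same invoice twice.
trace = [
    {"type": "tool_call", "tool": "get_invoice", "args": {"id": 17}},
    {"type": "tool_call", "tool": "get_invoice", "args": {"id": 17}},
    {"type": "tool_call", "tool": "get_payment", "args": {"invoice_id": 17}},
    {"type": "message", "text": "Invoice 17 was paid twice."},
]

repeated = find_repeated_tool_calls(trace)
print(repeated)
```

A check like this passes or fails regardless of how the final message is worded, which is why it stays stable across runs that phrase the answer differently.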

To build production-grade agentic systems, we must transition from classical unit assertions to systematic evaluation pipelines.

The Evaluation Matrix

To understand how to test agents, we map evaluations along two main axes: Method (how the grading is executed) and Scope (what part of the system is being tested). This creates a four-quadrant evaluation matrix.

[Figure: The Evaluation Matrix]

Programmatic Assertions

Programmatic assertions are deterministic checks written directly in application code to validate structured outputs. They are fast, require no model calls, and return the same verdict for the same input every time, making them well suited to enforcing safety boundaries and system constraints.

Developers use programmatic assertions to enforce three main types of constraints:

Schema Enforcement: Checking that a tool call matches a target JSON schema before executing it.
State Verification: Running assertions on database tables to verify that an agent action occurred correctly.
Constraint Checking: Restricting outputs using regular expressions, such as blocking SQL write keywords or ensuring system secrets are never leaked.
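The three constraint types above can each be sketched in a few lines of standard-library Python. Everything named here is an assumption made for illustration: the refund tool's fields, the refunds table, and the list of write keywords are hypothetical, and a production system would more likely use a full JSON Schema validator for the first check.

```python
import re
import sqlite3

# Hypothetical argument schema for a refund tool: field name -> expected type.
REQUIRED_FIELDS = {"invoice_id": int, "amount": float}

# Hypothetical list of SQL keywords that indicate a write operation.
WRITE_KEYWORDS = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE)\b", re.IGNORECASE
)


def check_tool_call(args):
    """Schema enforcement: return a list of problems (empty means valid)."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in args:
            problems.append(f"missing field: {name}")
        elif not isinstance(args[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    for name in args:
        if name not in REQUIRED_FIELDS:
            problems.append(f"unexpected field: {name}")
    return problems


def refund_was_recorded(conn, invoice_id):
    """State verification: exactly one refund row exists for the invoice."""
    row = conn.execute(
        "SELECT COUNT(*) FROM refunds WHERE invoice_id = ?", (invoice_id,)
    ).fetchone()
    return row[0] == 1


def is_read_only_sql(query):
    """Constraint checking: reject queries containing write keywords."""
    return WRITE_KEYWORDS.search(query) is None


# In-memory database standing in for the state the agent was meant to change.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE refunds (invoice_id INTEGER, amount REAL)")
conn.execute("INSERT INTO refunds VALUES (17, 42.5)")

print(check_tool_call({"invoice_id": "17"}))
print(refund_was_recorded(conn, 17))
print(is_read_only_sql("drop table refunds"))
```

The keyword filter is deliberately coarse: it would also reject a harmless SELECT whose string literal happens to contain the word "update". For a safety boundary, a false rejection is usually the cheaper mistake.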

While programmatic assertions are excellent at checking structure and safety boundaries, they are blind to semantic validity, conversational flow, and task success. We will cover how to audit those qualitative traits in subsequent lessons.

Interactive Playground: Validating Parsed Data

In this exercise, you will write programmatic validation checks to audit a simulated AI parser.

The AI parser extracts structured numbers and dates from raw email text. Because AI models are probabilistic, they can occasionally return dates in relative formats (e.g., "yesterday"), extract invalid calendar dates (e.g., month 13), or output negative amounts. Your programmatic assertions inside validate_parsed_data must enforce:

Format Constraints: The date string must match the ISO format YYYY-MM-DD using a regular expression pattern.
Calendar Constraints: The parsed date components must represent a valid calendar month (between 1 and 12) and day (between 1 and 31).
Boundary Constraints: The parsed transaction amount must be a non-negative number (greater than or equal to 0).

Run the playground to see the initial validation failures, then implement the assertions to make all test cases pass.
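For reference, one possible shape for these assertions is sketched below. The playground's actual signature for validate_parsed_data is not shown here, so this version assumes the function receives a dictionary with hypothetical "date" and "amount" keys and raises AssertionError on the first violated constraint.

```python
import re

# Four digits, two digits, two digits, separated by hyphens.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def validate_parsed_data(parsed):
    """Raise AssertionError if the parsed record violates a constraint."""
    date_text = parsed["date"]
    amount = parsed["amount"]

    # Format constraint: the whole string must be YYYY-MM-DD.
    assert isinstance(date_text, str) and ISO_DATE.fullmatch(date_text), (
        f"date is not in YYYY-MM-DD format: {date_text!r}"
    )

    # Calendar constraint: month in 1..12 and day in 1..31.
    year, month, day = (int(part) for part in date_text.split("-"))
    assert 1 <= month <= 12, f"month out of range: {month}"
    assert 1 <= day <= 31, f"day out of range: {day}"

    # Boundary constraint: amount is a number greater than or equal to 0.
    assert isinstance(amount, (int, float)) and amount >= 0, (
        f"amount must be a non-negative number: {amount!r}"
    )
    return True


# Hypothetical parser outputs, one valid and three invalid.
samples = [
    {"date": "2024-03-15", "amount": 120.0},
    {"date": "yesterday", "amount": 120.0},
    {"date": "2024-13-01", "amount": 120.0},
    {"date": "2024-03-15", "amount": -5},
]

results = []
for sample in samples:
    try:
        validate_parsed_data(sample)
        results.append("ok")
    except AssertionError as err:
        results.append(str(err))

for line in results:
    print(line)
```

The range check follows the exercise as stated, so it accepts impossible dates such as 2024-02-30. If stricter validation is wanted, parsing with datetime.date.fromisoformat would reject those as well. Note also that Python skips assert statements when run with the -O flag, so checks that guard production behavior are often written as explicit raises instead.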

[Figure: Programmatic Parser Validation Flow]

Try It Yourself