Capstone Phase 3: Self-Correction and Safety Gates

The Principle of Defensive Agent Engineering

Giving an autonomous agent the ability to execute actions requires defense in depth. Developers cannot rely on the model to follow system prompt rules (e.g. "Do not refund more than $100"), because prompt-level guidelines can be bypassed through prompt injection by the user or ignored when the model hallucinates.

To protect the business from unauthorized or erroneous disbursements, the organization enforces a strict refund limit: refunds of $100.00 or less are automatically approved and processed, while any refund exceeding $100.00 requires manual manager review. In an agentic system, this constraint must be enforced programmatically within the tool itself, so that the model cannot bypass it by generating a tool execution payload directly.

The execution environment enforces these safety boundaries programmatically through three mechanisms:

Encapsulated REST API Schemas: Restricting the model to specific read-only API endpoints (like get_order_details) means it never writes SQL or issues write operations itself, which removes the model as a vector for SQL injection or arbitrary data modification. The endpoint must still validate its own inputs.
Transactional Limits: Execution functions contain hardcoded validation checks (such as capping refunds at $100.00) and raise structured exceptions when a limit is breached.
Critique-Reflection Loops: When a tool raises a safety exception, the error message is fed back to the model as context. The model reflects on the failure, corrects its logic, and generates a revised request or explains the constraint to the customer.
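
The first two mechanisms can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the playground's actual source: the in-memory _ORDERS table, the order IDs, and the RefundLimitError class are hypothetical, and only the tool names get_order_details and request_refund and the $100.00 limit come from the text above.

```python
from decimal import Decimal
from types import MappingProxyType

# Hypothetical in-memory order store; a real system would sit behind an API.
_ORDERS = MappingProxyType({
    "ORD-150": {"order_id": "ORD-150", "amount": Decimal("150.00")},
    "ORD-50": {"order_id": "ORD-50", "amount": Decimal("50.00")},
})

REFUND_LIMIT = Decimal("100.00")  # limit stated in the text


class RefundLimitError(ValueError):
    """Structured exception (hypothetical name) carrying the rejected amount."""

    def __init__(self, amount: Decimal, limit: Decimal) -> None:
        super().__init__(
            f"Refund of ${amount} exceeds the ${limit} limit; "
            "manual manager review is required."
        )
        self.amount = amount
        self.limit = limit


def get_order_details(order_id: str) -> dict:
    """Read-only lookup: returns a copy so callers cannot mutate the store."""
    order = _ORDERS.get(order_id)
    if order is None:
        raise KeyError(f"Unknown order: {order_id}")
    return dict(order)


def request_refund(order_id: str, amount: Decimal) -> dict:
    """Approve refunds up to REFUND_LIMIT; raise for anything larger."""
    amount = Decimal(amount)
    if amount <= 0:
        raise ValueError("Refund amount must be positive.")
    if amount > REFUND_LIMIT:
        raise RefundLimitError(amount, REFUND_LIMIT)
    return {"order_id": order_id, "refunded": amount, "status": "approved"}
```

Subclassing ValueError keeps the sketch compatible with a caller that only catches ValueError, while the extra attributes give the reflection step structured data to work with. The limit lives in the tool, so no prompt wording can raise it.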

Safety Gate Architecture

The sequence diagram below details the safety validation and reflection cycle:

[Diagram: Safety Gates & Reflection Loop]

LLM Agent: Formulates a tool execution payload (such as a high-value refund or database lookup).
Safety Gate: Validates input parameters against programmatic security rules. If invalid, it raises a native exception.
DB / Executor: Only executes operations that pass all safety checks.
Reflection Loop: Intercepted exceptions are returned as tool feedback, prompting the model to revise its actions.
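
One way to connect these four components is a thin dispatcher that sits between the model and the tools. The sketch below is an assumption about the shape of such a gate, not the playground's code: safety_gate, TOOLS, and the feedback dictionary format are hypothetical names, and the refund tool is reduced to a stub so the block stands alone.

```python
from typing import Any, Callable

REFUND_LIMIT = 100.00  # limit stated in the text


def request_refund(order_id: str, amount: float) -> dict:
    """Stub refund tool: rejects amounts above the limit."""
    if amount > REFUND_LIMIT:
        raise ValueError(
            f"Refund of ${amount:.2f} exceeds the ${REFUND_LIMIT:.2f} limit; "
            "manual manager review is required."
        )
    return {"order_id": order_id, "refunded": amount}


# Allowlist of tools the model may call (hypothetical registry).
TOOLS: dict[str, Callable[..., Any]] = {"request_refund": request_refund}


def safety_gate(tool_name: str, arguments: dict) -> dict:
    """Run one tool call and always return a feedback message for the model."""
    tool = TOOLS.get(tool_name)
    if tool is None:
        return {"ok": False, "error": f"Tool '{tool_name}' is not permitted."}
    try:
        result = tool(**arguments)
    except (ValueError, TypeError) as exc:
        # The intercepted exception becomes tool feedback instead of a crash.
        return {"ok": False, "error": str(exc)}
    return {"ok": True, "result": result}
```

In this sketch the executor is reached only through safety_gate, and a rejected call returns the exception text as feedback, which the agent loop would append to the model's context for the next turn.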

Interactive Playground: Safety Gates and Critique

The diagram below traces the safety validation paths and the critique-reflection loop of the playground execution:

[Diagram: Safety Gates Reflection Loop Flow]

The playground below implements the defensive support agent. The database is exposed via a read-only endpoint, and the refund tool programmatically rejects transactions above $100.00.

Observe how the agent reacts when a customer requests a refund for a high-value order ($150.00) versus a low-value order ($50.00).

Try It Yourself

In Scenario A, the agent queries the get_order_details API and then requests a refund of $150.00. The request_refund tool raises a ValueError because the amount exceeds the $100.00 limit. The safety interceptor catches the exception and feeds its message back to the agent. The agent reads the error, reflects, and informs the user that a manual review is required.

In Scenario B, the order amount is $50.00, which satisfies the validation check. The safety gate permits execution, and the refund is successfully completed.
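
For readers without access to the playground, both scenarios can be approximated offline. The sketch below replaces the model with a deterministic stand-in, so it shows only the control flow, not real model behavior; scripted_agent, the order IDs, and the message wording are all hypothetical.

```python
ORDERS = {"A-150": 150.00, "B-50": 50.00}  # hypothetical order IDs and totals
REFUND_LIMIT = 100.00  # limit stated in the text


def get_order_details(order_id: str) -> dict:
    """Read-only lookup of an order total."""
    return {"order_id": order_id, "amount": ORDERS[order_id]}


def request_refund(order_id: str, amount: float) -> str:
    """Refund tool that enforces the limit itself."""
    if amount > REFUND_LIMIT:
        raise ValueError(
            f"Refund of ${amount:.2f} exceeds the ${REFUND_LIMIT:.2f} limit."
        )
    return f"Refund of ${amount:.2f} processed for order {order_id}."


def scripted_agent(order_id: str) -> list[str]:
    """Deterministic stand-in for the model; returns a trace of one turn."""
    trace = []
    order = get_order_details(order_id)
    trace.append(f"lookup: {order['order_id']} total ${order['amount']:.2f}")
    try:
        trace.append("tool: " + request_refund(order_id, order["amount"]))
    except ValueError as exc:
        # Safety interceptor: the error is fed back and the agent responds to it.
        trace.append(f"tool error: {exc}")
        trace.append("reply: This refund needs manual manager review.")
    return trace


for scenario, order_id in (("A", "A-150"), ("B", "B-50")):
    print(f"Scenario {scenario}:")
    for line in scripted_agent(order_id):
        print("  " + line)
```

A real agent would produce the reply itself after reading the tool error; the hardcoded reply here only marks where that reflection step occurs.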

The final phase covers Systematic Evaluation and Tuning, where we benchmark our finished agent against a validation dataset.