Evaluation
Overview
Agent development doesn’t end after the first working version. To ensure reliability, performance, and accuracy, Chicory AI provides an evaluation framework where you can bring your own validation dataset, define checks, and evolve your agent through iterative testing.
This forms the Agent Development Life Cycle (ADLC):
1. Build → 2. Evaluate → 3. Evolve → 4. Deploy → 5. Monitor
Dataset Format
When you upload a validation dataset to the Chicory platform, it must follow a structured format (a template is downloadable in the platform UI):
Input: Example prompts, PR diffs, pipeline queries, or tasks you expect the agent to handle.
Expected Output: Ground truth answer, SQL snippet, or feedback the agent should produce.
Evaluation Guideline: Tags like difficulty, domain, or test category.
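The three fields above can be illustrated as a small JSONL file. This is only a sketch: the field names (`input`, `expected_output`, `evaluation_guideline`) are assumptions for illustration, and the authoritative column names come from the template downloadable in the platform UI.

```python
import json

# Hypothetical rows showing the three dataset fields described above.
# Field names are illustrative; use the template from the Chicory UI.
rows = [
    {
        "input": "Write a SQL query returning the top 5 customers by revenue.",
        "expected_output": "SELECT customer_id, SUM(revenue) AS total ...",
        "evaluation_guideline": {"difficulty": "easy", "domain": "sql"},
    },
]

# JSONL: one JSON object per line, a common format for validation sets.
with open("validation_dataset.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```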

Define Evaluation Criteria
Before running tests, think through what matters most for your use case:
Accuracy – Does the output match the ground truth?
Completeness – Did the agent address all parts of the task?
Efficiency – Was the SQL or code optimized?
Clarity – Is the explanation useful to the user?
Reliability – Does the agent behave consistently across runs?
Chicory recommends starting small (accuracy + efficiency), then layering in more criteria as the agent matures.
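Following that advice, a minimal starting point is one programmatic check per criterion. The sketch below assumes simple string-based checks for accuracy (exact match) and completeness (keyword coverage); these helper names are illustrative, not part of the Chicory API.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    accuracy: float      # 1.0 if output matches ground truth exactly
    completeness: float  # fraction of required parts covered

def exact_match(output: str, expected: str) -> float:
    """Accuracy: normalized exact match against the ground truth."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def keyword_coverage(output: str, required: list[str]) -> float:
    """Completeness: fraction of required keywords present in the output."""
    if not required:
        return 1.0
    hits = sum(1 for k in required if k.lower() in output.lower())
    return hits / len(required)

def evaluate(output: str, expected: str, required: list[str]) -> EvalResult:
    return EvalResult(
        accuracy=exact_match(output, expected),
        completeness=keyword_coverage(output, required),
    )
```

As the agent matures, the same structure extends naturally: add a field and a check for efficiency, clarity, or reliability.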
Evolve the Agent
Based on results, you can:
Tune prompts to improve accuracy
Add rules or guardrails for reliability
Refactor code/output templates for efficiency
Re-train on weak spots using examples from failed cases
After each adjustment, rerun the evaluation to confirm improvements. Over multiple iterations, this process evolves your agent to production-ready quality.
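The rerun-and-compare step above can be sketched as a small loop. Here `run_agent` and `grade` are hypothetical stand-ins for your agent call and your chosen checks; the function returns the overall pass rate plus the failed cases you would re-train on.

```python
def grade(output: str, expected: str) -> bool:
    # Stand-in check; swap in your own criteria (accuracy, efficiency, ...).
    return output.strip() == expected.strip()

def run_eval(run_agent, dataset):
    """Run every dataset case through the agent; collect failures."""
    failures = []
    passed = 0
    for case in dataset:
        output = run_agent(case["input"])
        if grade(output, case["expected_output"]):
            passed += 1
        else:
            failures.append(case)  # weak spots to re-train on
    return passed / len(dataset), failures

# After tuning prompts or adding guardrails, call run_eval again and
# compare pass rates to confirm the adjustment was an improvement.
```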
Key Takeaways
Validation datasets + LLM Graders create a feedback loop.
Evaluation helps you quantify quality, not just eyeball outputs.
Iterative cycles = evolve before you deploy.
Once satisfied, the agent can move confidently into production usage.