Evaluation


Overview

Agent development doesn’t end after the first working version. To ensure reliability, performance, and accuracy, Chicory AI provides an evaluation framework where you can bring your own validation dataset, define checks, and evolve your agent through iterative testing.

This forms the Agent Development Life Cycle (ADLC):

  1. Build → 2. Evaluate → 3. Evolve → 4. Deploy → 5. Monitor


Dataset Format

When you upload a validation dataset on the Chicory platform, it must follow a structured format (a template is downloadable in the platform UI):

  • Input: Example prompts, PR diffs, pipeline queries, or tasks you expect the agent to handle.

  • Expected Output: Ground truth answer, SQL snippet, or feedback the agent should produce.

  • Evaluation Guideline: Tags like difficulty, domain, or test category.
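The official template is downloadable in the platform UI; as an illustrative sketch only, a dataset row covering the three fields above might be written as JSONL like this (the field names and file layout here are assumptions, not the official schema):

```python
import json

# Hypothetical validation dataset rows. The keys mirror the three columns
# described above (Input, Expected Output, Evaluation Guideline) but are
# illustrative, not Chicory's documented schema.
rows = [
    {
        "input": "Write a SQL query returning daily active users for the last 7 days.",
        "expected_output": "SELECT day, COUNT(DISTINCT user_id) AS dau FROM events GROUP BY day",
        "evaluation_guideline": {"difficulty": "medium", "domain": "analytics"},
    },
]

# Write one JSON object per line (JSONL) so large datasets stream easily.
with open("validation_dataset.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```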

1. Define Evaluation Criteria

Before running tests, think through what matters most for your use case:

  • Accuracy – Does the output match the ground truth?

  • Completeness – Did the agent address all parts of the task?

  • Efficiency – Was the SQL or code optimized?

  • Clarity – Is the explanation useful to the user?

  • Reliability – Does the agent behave consistently across runs?

Chicory recommends starting small (accuracy + efficiency), then layering in more criteria as the agent matures.
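One way to picture criteria is as named scoring functions applied to each (expected, actual) pair. The sketch below is a minimal stand-in, assuming string outputs and trivially simple scoring logic; Chicory's actual graders (including LLM Graders) are far richer:

```python
# Minimal rubric sketch: each criterion is a function scoring (expected,
# actual) in [0, 1]. The scoring logic here is deliberately naive and
# illustrative only.

def accuracy(expected: str, actual: str) -> float:
    # Exact-match stand-in for a real grader.
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

def completeness(expected: str, actual: str) -> float:
    # Fraction of the expected answer's words that appear in the output.
    words = set(expected.lower().split())
    return len(words & set(actual.lower().split())) / len(words) if words else 1.0

# Start small (per the recommendation above), then layer in more criteria.
CRITERIA = {"accuracy": accuracy, "completeness": completeness}

def score_case(expected: str, actual: str) -> dict:
    return {name: fn(expected, actual) for name, fn in CRITERIA.items()}
```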

2. Run Evaluation

  • Upload or select your validation dataset.

  • Choose the agent you want to test.

  • Select evaluation criteria.

  • Run the evaluation.

Chicory will score each input/output pair and generate a detailed report.
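The platform's internals aren't shown here, but conceptually the run loop looks like the following sketch: call the agent on each dataset input, grade the output against the ground truth, and collect a per-case record. The agent and grader below are stand-ins, not Chicory's implementation:

```python
# Conceptual evaluation run: agent and grader are injected so the loop
# itself stays generic. Record shape is an assumption for illustration.

def run_evaluation(dataset, agent, grader):
    report = []
    for case in dataset:
        actual = agent(case["input"])
        report.append({
            "input": case["input"],
            "expected": case["expected_output"],
            "actual": actual,
            "score": grader(case["expected_output"], actual),
        })
    return report

def stub_agent(prompt: str) -> str:
    # Stand-in for a real agent call.
    return "4"

def exact_grader(expected: str, actual: str) -> float:
    return 1.0 if expected == actual else 0.0

dataset = [{"input": "What is 2 + 2?", "expected_output": "4"}]
report = run_evaluation(dataset, stub_agent, exact_grader)
```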


3. Analyze Results

The evaluation output includes:

  • Scores by criterion (accuracy %, efficiency rating, etc.)

  • Detailed grader feedback for each test case

  • Aggregate metrics for quick comparison between agent versions
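Aggregate metrics are what make version-to-version comparison quick. A minimal sketch of rolling per-case, per-criterion scores up into one summary per agent version (the score shapes are assumptions carried over for illustration):

```python
from statistics import mean

# Roll per-case criterion scores up into one mean per criterion, giving a
# single summary row per agent version for quick comparison.

def aggregate(case_scores: list[dict]) -> dict:
    criteria = case_scores[0].keys()
    return {c: mean(s[c] for s in case_scores) for c in criteria}

scores = [
    {"accuracy": 1.0, "efficiency": 0.8},
    {"accuracy": 0.0, "efficiency": 0.6},
]
summary = aggregate(scores)
```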

4. Evolve the Agent

Based on results, you can:

  • Tune prompts to improve accuracy

  • Add rules or guardrails for reliability

  • Refactor code/output templates for efficiency

  • Re-train on weak spots using examples from failed cases

After each adjustment, rerun the evaluation to confirm improvements. Over multiple iterations, this process evolves your agent to production-ready quality.
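Re-training on weak spots starts with isolating the failed cases. A small sketch, assuming the per-case report records used earlier (the `score` field and threshold are illustrative):

```python
# Pull low-scoring cases out of an evaluation report so they can seed the
# next round of prompt tuning or re-training. Report shape and the 0.5
# cutoff are assumptions for illustration.

def failed_cases(report: list[dict], threshold: float = 0.5) -> list[dict]:
    return [case for case in report if case["score"] < threshold]

report = [
    {"input": "daily active users query", "score": 1.0},
    {"input": "churn cohort query", "score": 0.2},
]
weak = failed_cases(report)
```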

5. Promotion to Deployment

When the agent consistently meets your benchmarks:

  • Promote the agent to deployment using the REST API or MCP Gateway
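For the REST route, a promotion call might be built like the sketch below. The endpoint path, payload fields, and auth scheme here are placeholders, not Chicory's documented API; consult the platform's API reference for the real interface.

```python
import json
import urllib.request

# Hypothetical promotion request. Route, payload, and headers are
# placeholders; only the general shape (authenticated POST) is implied
# by the text above.

def build_promotion_request(base_url: str, agent_id: str, token: str):
    payload = json.dumps({"agent_id": agent_id, "target": "production"}).encode()
    return urllib.request.Request(
        f"{base_url}/agents/{agent_id}/promote",  # placeholder route
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_promotion_request("https://api.example.com", "agent-123", "TOKEN")
# urllib.request.urlopen(req) would send it; omitted here.
```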


Key Takeaways

  • Validation datasets + LLM Graders create a feedback loop.

  • Evaluation helps you quantify quality, not just eyeball outputs.

  • Iterative evaluation cycles let you evolve the agent before you deploy it.

  • Once satisfied, the agent can move confidently into production usage.
