Now accepting early access signups

Behavioral testing
for AI agents

Script conversations. Run evaluations. Catch regressions before your users do. Like Playwright — but for agents that talk back.

Free for founding members · No credit card required

The problem

AI agent testing is broken

Benchmarks lie

MMLU and HumanEval measure capability, not behavior. An agent can ace both and still derail in production conversations.

Manual testing doesn't scale

You test by chatting with your agent. Then a teammate changes a prompt and nobody re-tests. Regressions ship silently.

Observability ≠ testing

LangSmith and Arize log what happened. They don’t tell you whether the agent should have said that. Logging is not evaluation.

How it works

Script → Run → Evaluate

Define conversation trees with fuzzy expectations. Run them across agents and personas. Review results in a graph that lights up with eval scores.

onboarding-flow×hostile-persona
Legend: pass · warn · fail · freeball
01

Define in YAML or the graph editor

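# onboarding-flow.yaml
nodes:
  - id: greeting
    # Scripted message sent to the agent under test
    prompt: "Build me a website"
    # Minimum evaluator score required for each behavior
    expect:
      - asks_clarifying_questions: ">= 0.8"
      - does_not_hallucinate_stack: ">= 0.9"
    # The next node depends on how the agent responds
    branches:
      - on: clarifies_scope
        goto: scope
      - on: jumps_to_code
        goto: architecture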

  - id: scope
    prompt: "A language learning platform"
    expect:
      - proposes_concrete_structure: ">= 0.7"
      - asks_about_audience: ">= 0.5"

  - id: architecture
    prompt: "What stack would you use?"
    expect:
      - justifies_choices: ">= 0.8"
      - avoids_overengineering: ">= 0.6"
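
To make the branching concrete, here is a minimal Python sketch of how a runner might walk the tree above. It is an illustration under stated assumptions, not CodeFresh's implementation: the `NODES` dict mirrors the YAML by hand, and `walk` and the `classify` callback (a stand-in for whatever decides which branch label fits the agent's reply) are hypothetical names.

```python
from typing import Callable, Dict, List

# Hand-written mirror of onboarding-flow.yaml (expectations omitted for brevity).
NODES: Dict[str, dict] = {
    "greeting": {
        "prompt": "Build me a website",
        "branches": [
            {"on": "clarifies_scope", "goto": "scope"},
            {"on": "jumps_to_code", "goto": "architecture"},
        ],
    },
    "scope": {"prompt": "A language learning platform", "branches": []},
    "architecture": {"prompt": "What stack would you use?", "branches": []},
}


def walk(nodes: Dict[str, dict], start: str, classify: Callable[[str], str]) -> List[str]:
    """Return the node ids visited, following the first branch whose `on` label matches."""
    path = [start]
    current = start
    while True:
        # Hypothetical step: label the agent's reply at the current node.
        label = classify(current)
        target = next(
            (b["goto"] for b in nodes[current]["branches"] if b["on"] == label),
            None,
        )
        # Stop at a leaf, on an unmatched label, or if a branch would revisit a node.
        if target is None or target in path:
            return path
        path.append(target)
        current = target


# An agent that asks clarifying questions is routed from greeting to scope.
path_taken = walk(NODES, "greeting", lambda node_id: "clarifies_scope")
print(path_taken)
```

An agent that jumps straight to code would instead be routed to the `architecture` node, so one script covers both behaviors.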
02

Run across agents & personas

$ codefresh run onboarding-flow.yaml \
    --agents gpt-4o,claude-3.5 \
    --personas hostile,novice,expert \
    --parallel

Running 6 evaluations (2 agents × 3 personas)...

  gpt-4o × hostile     ████████████████ done  2.3s
  gpt-4o × novice      ████████████████ done  1.8s
  gpt-4o × expert      ████████████░░░░ ...   1.2s
  claude-3.5 × hostile  ████████████████ done  2.1s
  claude-3.5 × novice   ████████████████ done  1.6s
  claude-3.5 × expert   ████████░░░░░░░░ ...   0.9s
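
The run count above is the product of the two flag lists. The Python sketch below shows one way such a matrix could be expanded and run concurrently. It assumes nothing about the CLI's internals: `run_one` is a hypothetical placeholder that only formats a label instead of calling an agent.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

agents = ["gpt-4o", "claude-3.5"]
personas = ["hostile", "novice", "expert"]

# Every agent is paired with every persona: 2 x 3 = 6 evaluations.
matrix = list(product(agents, personas))


def run_one(pair):
    """Hypothetical placeholder for a single evaluation run."""
    agent, persona = pair
    return f"{agent} × {persona}: done"


# Executor.map returns results in input order, even when runs finish out of order.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(run_one, matrix))

for line in results:
    print(line)
```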
03

Evaluate

Click any node in the graph to explore its conversation script, expectations, and branches. Add notes to track your observations.

See it in action

Three voices, instantly scannable

Every conversation has three participants: the script author, the agent under test, and the evaluator. Each gets a distinct typographic treatment so you can scan transcripts without reading a word.

onboarding-flow×hostile-persona
✓ 0.92

Script

"Design a website for learning a second language"

Agent

That's an interesting project. Before I start designing, I'd like to understand a few things about the target audience and scope.

What age group are we targeting? Are we building for adults learning conversationally, or students in a formal education setting? And should the platform support multiple source languages, or is it primarily English-to-X?

Evaluator

asks_clarifying_questions (0.92)

doesnt_correct_grammar (1.00)

proposes_concrete_structure (0.45)

Script

"A language learning platform for adults, English to Spanish. Focus on conversational fluency."

Agent

Great, conversational Spanish for adults. I'd suggest structuring the platform around three core experiences: daily micro-lessons (5-10 minutes), interactive dialogue practice with AI conversation partners, and a spaced repetition vocabulary system.

Evaluator

proposes_concrete_structure (0.88)

asks_about_audience (0.95)

Script prompts render in JetBrains Mono at 13px on a recessed background. Agent responses use Plus Jakarta Sans at 15px with 1.7 line-height for readable prose. Evaluator annotations are 12px, 500 weight, in copper.

Features

Everything you need to test agent behavior

Graph-based scripts

Conversations branch. Your tests should too. Define non-linear conversation trees with conditional paths.

Fuzzy expectations

Not pass/fail — scored. Set thresholds like ‘asks clarifying questions ≥ 0.8’ and let LLM-as-judge evaluate.
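
As a rough illustration of the threshold idea, the Python sketch below parses a threshold string in the style of the YAML example (such as ">= 0.8") and compares it with a judge score. The `meets` helper and the set of supported operators are assumptions made for this example, not CodeFresh's actual parser.

```python
import operator
import re

# Comparison operators this sketch understands (an assumption, not a spec).
_OPS = {">=": operator.ge, ">": operator.gt, "<=": operator.le, "<": operator.lt}
_PATTERN = re.compile(r"^\s*(>=|<=|>|<)\s*([0-9]*\.?[0-9]+)\s*$")


def meets(threshold: str, score: float) -> bool:
    """Return True if `score` satisfies a threshold string such as '>= 0.8'."""
    match = _PATTERN.match(threshold)
    if match is None:
        raise ValueError(f"unrecognised threshold: {threshold!r}")
    op, bound = match.groups()
    return _OPS[op](score, float(bound))


# Scores taken from the sample transcript on this page.
clarifying_ok = meets(">= 0.8", 0.92)  # asks_clarifying_questions
structure_ok = meets(">= 0.7", 0.45)   # proposes_concrete_structure
print(clarifying_ok, structure_ok)
```

With those sample scores, the first expectation clears its threshold and the second falls short of it.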

Persona testing

Run the same script through hostile, novice, adversarial, and confused personas. One script, many perspectives.

The Freeball Engine

When agents go off-script, CodeFresh doesn’t fail the run; it improvises, auto-generating evaluation nodes for each deviation.

CI/CD gates

Run behavioral tests in your pipeline. Block deploys when agent behavior regresses. Ship with confidence.
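
A pipeline gate of this kind usually comes down to an exit code. The Python sketch below shows the general pattern under stated assumptions: the `gate` and `failing_checks` helpers, the score dictionary, and the minimum values are all hypothetical and do not describe CodeFresh's real output format.

```python
from typing import Dict, List


def failing_checks(scores: Dict[str, float], minimums: Dict[str, float]) -> List[str]:
    """Names of expectations whose score is below its minimum (missing scores count as 0)."""
    return sorted(
        name for name, minimum in minimums.items() if scores.get(name, 0.0) < minimum
    )


def gate(scores: Dict[str, float], minimums: Dict[str, float]) -> int:
    """Return a process exit code: 0 lets the deploy continue, 1 blocks it."""
    failed = failing_checks(scores, minimums)
    for name in failed:
        print(f"REGRESSION {name}: {scores.get(name, 0.0):.2f} < {minimums[name]:.2f}")
    return 1 if failed else 0


# Illustrative values only.
minimums = {"asks_clarifying_questions": 0.8, "proposes_concrete_structure": 0.7}
scores = {"asks_clarifying_questions": 0.92, "proposes_concrete_structure": 0.45}

exit_code = gate(scores, minimums)
print("exit code:", exit_code)
# In a pipeline step, this value would be passed to sys.exit().
```

Most CI systems treat a non-zero exit code as a failed step, which is what stops the deploy.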

Any agent, any framework

HTTP adapters for OpenAI, Anthropic, LangChain, CrewAI, AutoGen, and custom endpoints. Bring your own agent.

Stop shipping untested agents

CodeFresh is building the behavioral testing framework AI engineers actually need. Get early access and shape the product.

Founding members get lifetime access to Pro features