Script conversations. Run evaluations. Catch regressions before your users do. Like Playwright — but for agents that talk back.
Free for founding members · No credit card required
The problem
MMLU and HumanEval measure capability, not behavior. Your agent aces benchmarks but derails in production conversations.
You test by chatting with your agent. Then a teammate changes a prompt and nobody re-tests. Regressions ship silently.
LangSmith and Arize log what happened. They don’t tell you whether the agent should have said that. Logging is not evaluation.
How it works
Define conversation trees with fuzzy expectations. Run them across agents and personas. Review results in a graph that lights up with eval scores.
# onboarding-flow.yaml
nodes:
  - id: greeting
    prompt: "Build me a website"
    expect:
      - asks_clarifying_questions: ">= 0.8"
      - does_not_hallucinate_stack: ">= 0.9"
    branches:
      - on: clarifies_scope
        goto: scope
      - on: jumps_to_code
        goto: architecture
  - id: scope
    prompt: "A language learning platform"
    expect:
      - proposes_concrete_structure: ">= 0.7"
      - asks_about_audience: ">= 0.5"
  - id: architecture
    prompt: "What stack would you use?"
    expect:
      - justifies_choices: ">= 0.8"
      - avoids_overengineering: ">= 0.6"

$ codefresh run onboarding-flow.yaml \
    --agents gpt-4o,claude-3.5 \
    --personas hostile,novice,expert \
    --parallel
Running 6 evaluations (2 agents × 3 personas)...
  gpt-4o × hostile      ████████████████  done  2.3s
  gpt-4o × novice       ████████████████  done  1.8s
  gpt-4o × expert       ████████████░░░░  ...   1.2s
  claude-3.5 × hostile  ████████████████  done  2.1s
  claude-3.5 × novice   ████████████████  done  1.6s
  claude-3.5 × expert   ████████░░░░░░░░  ...   0.9s

Evaluate: Click any node above to explore its conversation script, expectations, and branches. Add notes to track your observations.
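To make the branching in onboarding-flow.yaml concrete, here is a minimal Python sketch of how a runner might follow one branch per node. It is an illustration only, not CodeFresh's implementation: the `NODES` dictionary, the `walk` function, and the `classify` callback are all hypothetical names, and the expectations are left out to keep the focus on routing.

```python
# Hypothetical in-memory form of onboarding-flow.yaml (illustration only;
# not CodeFresh's actual data model).
NODES = {
    "greeting": {
        "prompt": "Build me a website",
        "branches": {"clarifies_scope": "scope", "jumps_to_code": "architecture"},
    },
    "scope": {"prompt": "A language learning platform", "branches": {}},
    "architecture": {"prompt": "What stack would you use?", "branches": {}},
}


def walk(nodes, start, classify):
    """Return the list of node ids visited, following one branch per node.

    `classify` is an assumed callback: given a node id, it returns the
    branch label the agent's reply matched (or None to stop).
    """
    path = []
    current = start
    while current is not None and current not in path:  # guard against cycles
        path.append(current)
        label = classify(current)
        current = nodes[current]["branches"].get(label)
    return path


# An agent that asks clarifying questions is routed to the "scope" node.
print(walk(NODES, "greeting", lambda node_id: "clarifies_scope"))
# An agent that jumps straight to code is routed to "architecture".
print(walk(NODES, "greeting", lambda node_id: "jumps_to_code"))
```

In a real run the branch label would presumably come from an evaluator judging the agent's reply; here it is hard-coded so the routing is easy to follow.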
See it in action
Every conversation has three participants: the script author, the agent under test, and the evaluator. Each gets a distinct typographic treatment so you can scan transcripts without reading a word.
Script
Agent
That's an interesting project. Before I start designing, I'd like to understand a few things about the target audience and scope.
What age group are we targeting? Are we building for adults learning conversationally, or students in a formal education setting? And should the platform support multiple source languages, or is it primarily English-to-X?
Evaluator
✓ asks_clarifying_questions (0.92)
✓ doesnt_correct_grammar (1.00)
⚠ proposes_concrete_structure (0.45)
Script
Agent
Great, conversational Spanish for adults. I'd suggest structuring the platform around three core experiences: daily micro-lessons (5-10 minutes), interactive dialogue practice with AI conversation partners, and a spaced repetition vocabulary system.
Evaluator
✓ proposes_concrete_structure (0.88)
✓ asks_about_audience (0.95)
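The ✓ and ⚠ marks above can be read as a score compared against its threshold. The Python sketch below shows one way that comparison might work, assuming threshold strings follow the ">= 0.8" form used in the YAML example. The `meets` and `annotate` helpers are hypothetical and say nothing about how CodeFresh computes the scores themselves.

```python
import operator

# Comparison operators a threshold string may start with (assumed syntax,
# modeled on the ">= 0.8" strings in the YAML example).
_OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}


def meets(threshold: str, score: float) -> bool:
    """Check a judge score against a threshold string such as '>= 0.8'."""
    op_text, value_text = threshold.split()
    return _OPS[op_text](score, float(value_text))


def annotate(expectations: dict, scores: dict) -> list:
    """Render one line per expectation, mirroring the transcript annotations."""
    lines = []
    for name, threshold in expectations.items():
        mark = "✓" if meets(threshold, scores[name]) else "⚠"
        lines.append(f"{mark} {name} ({scores[name]:.2f})")
    return lines


# Thresholds from the YAML example, scores from the first transcript turn.
expectations = {
    "asks_clarifying_questions": ">= 0.8",
    "proposes_concrete_structure": ">= 0.7",
}
scores = {"asks_clarifying_questions": 0.92, "proposes_concrete_structure": 0.45}
print("\n".join(annotate(expectations, scores)))
```

With these inputs, a score of 0.45 against a 0.7 threshold is flagged rather than treated as a hard failure, which matches the scored, not pass/fail, framing.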
Script prompts render in JetBrains Mono at 13px on a recessed background. Agent responses use Plus Jakarta Sans at 15px with 1.7 line-height for readable prose. Evaluator annotations are 12px, 500 weight, in copper.
Features
Conversations branch. Your tests should too. Define non-linear conversation trees with conditional paths.
Scored, not pass/fail. Set thresholds like ‘asks clarifying questions ≥ 0.8’ and let an LLM-as-judge score each response.
Run the same script through hostile, novice, adversarial, and confused personas. One script, many perspectives.
When agents go off-script, CodeFresh doesn’t fail; it improvises, auto-generating evaluation nodes for the deviations.
Run behavioral tests in your pipeline. Block deploys when agent behavior regresses. Ship with confidence.
HTTP adapters for OpenAI, Anthropic, LangChain, CrewAI, AutoGen, and custom endpoints. Bring your own agent.
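Two of the features above, the persona matrix and the deploy gate, can be sketched together in a few lines of Python. This is a rough illustration under stated assumptions, not CodeFresh's API: the `exit_code` function, the shape of `results`, and the single `minimum` cutoff are hypothetical, and the scores are placeholders rather than real evaluation results.

```python
from itertools import product

agents = ["gpt-4o", "claude-3.5"]
personas = ["hostile", "novice", "expert"]

# Every agent is paired with every persona: 2 x 3 = 6 runs.
runs = list(product(agents, personas))


def exit_code(results: dict, minimum: float) -> int:
    """Return 0 if every run scored at least `minimum`, else 1.

    `results` maps (agent, persona) to an aggregate score; both the shape
    and the single `minimum` cutoff are assumptions for this sketch.
    """
    failing = [run for run, score in results.items() if score < minimum]
    for agent, persona in failing:
        print(f"regression: {agent} x {persona}")
    return 1 if failing else 0


# Made-up scores purely to exercise the gate; not real evaluation results.
results = {run: 0.9 for run in runs}
results[("gpt-4o", "hostile")] = 0.4
print(len(runs), exit_code(results, minimum=0.8))
```

A pipeline step could pass the returned value to `sys.exit`, so that a non-zero code blocks the deploy whenever any agent and persona pair regresses.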
CodeFresh is building the behavioral testing framework AI engineers actually need. Get early access and shape the product.
Founding members get lifetime access to Pro features