Iterate on LLMs faster
Measure LLM quality and catch regressions
Used by 10,000+ developers to rapidly improve prompts and evaluate models
Simple, declarative config
# Compare prompts...
prompts:
  - "Summarize this in {{language}}: {{document}}"
  - "Summarize this in {{language}}, concisely and professionally: {{document}}"

# And models...
providers:
  - openai:gpt-4-0125-preview
  - anthropic:claude-3-opus
  - mistral:mistral-large-latest

# ... using these tests
tests:
  - vars:
      language: French
      document: "To be or not to be, that is the question..."
    assert:
      - type: contains
        value: "Être ou ne pas être"
      - type: cost
        threshold: 0.01
      - type: latency
        threshold: 1000
      - type: llm-rubric
        value: does not apologize
  - # ...
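To run the comparison, save this as promptfooconfig.yaml and run npx promptfoo@latest eval; then run npx promptfoo@latest view to browse the outputs side by side in the local web viewer.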
How it works
Create a test dataset
Use a representative sample of user inputs to reduce subjectivity when tuning prompts.
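For example, test cases can be enumerated as vars in the config, or kept in an external file; the inputs below are illustrative placeholders:

# Illustrative test cases; variable names match the prompt templates above.
tests:
  - vars:
      language: French
      document: "To be or not to be, that is the question..."
  - vars:
      language: Spanish
      document: "All the world's a stage..."
# Larger datasets can be loaded from a file instead, e.g.:
# tests: file://tests.csv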
Set up evaluation metrics
Use built-in metrics, LLM-graded evals, or define your own custom metrics.
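As a sketch, a config can mix built-in checks, LLM-graded rubrics, and custom code in the same assert block; the value strings below are illustrative, and defaultTest applies its assertions to every test case:

defaultTest:
  assert:
    # Built-in deterministic check
    - type: contains
      value: "summary"
    # LLM-graded rubric
    - type: llm-rubric
      value: is concise and professional
    # Custom metric in JavaScript; `output` is the model's response text
    - type: javascript
      value: output.length < 500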
Select the best prompt & model
Compare prompts and model outputs side-by-side, or integrate the library into your existing test/CI workflow.
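For CI, one option is to run the eval on every pull request so that a failing assertion fails the check. A minimal GitHub Actions sketch, assuming the config is checked in as promptfooconfig.yaml (the workflow name, path, and trigger are illustrative):

# .github/workflows/llm-eval.yml (hypothetical path)
name: LLM eval
on: pull_request
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # promptfoo eval exits non-zero when assertions fail, failing the job
      - run: npx promptfoo@latest eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}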