# Test Suites
Run multiple tests from a YAML configuration file for comprehensive LLM benchmarking.
## Overview
Test suites allow you to define a collection of tests in a YAML file and run them all at once, with parallel execution support and comprehensive reporting.
## Creating a Test Suite
Create a YAML file with your tests:
```yaml
# tests.yaml
tests:
  - name: "math_test"
    prompt: "What is 15 * 23?"
    expected: "345"  # Optional: for objective comparison

  - name: "python_test"
    language: "python"  # Use plugin evaluator
    prompt: "Write Python factorial function"
    expected: "120"

  - name: "creative_test"
    prompt: "Write a short story about a robot"
    # No expected field - subjective task

  - name: "model_specific_test"
    prompt: "Explain quantum physics"
    model: "gpt-4o"
```
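A suite like the one above is just a YAML mapping with a `tests` list, so it is easy to sanity-check before running. The sketch below validates a parsed suite dict; the field names mirror the example, but `validate_suite` is a hypothetical helper, not part of praisonaibench's API, and the real loader may enforce a different schema.

```python
# Minimal validation sketch, assuming the YAML has already been parsed
# (e.g. with PyYAML's yaml.safe_load) into a plain dict.
suite = {
    "tests": [
        {"name": "math_test", "prompt": "What is 15 * 23?", "expected": "345"},
        {"name": "creative_test", "prompt": "Write a short story about a robot"},
    ]
}

def validate_suite(suite):
    tests = suite.get("tests", [])
    for test in tests:
        # name and prompt are required; expected, language, and model are optional
        missing = [field for field in ("name", "prompt") if field not in test]
        if missing:
            raise ValueError(f"test {test!r} missing fields: {missing}")
    return tests

tests = validate_suite(suite)
print([t["name"] for t in tests])  # ['math_test', 'creative_test']
```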
## Running Suites
### Basic Usage
```bash
# Run entire test suite
praisonaibench --suite tests.yaml

# Run a specific test from the suite by name
praisonaibench --suite tests.yaml --test-name "math_test"

# Run suite with a specific model (overrides individual test models)
praisonaibench --suite tests.yaml --model xai/grok-code-fast-1
```
### Parallel Execution
```bash
# Run tests in parallel (3 concurrent workers)
praisonaibench --suite tests.yaml --concurrent 3
```
## Global Configuration
Set global LLM configuration that applies to all tests:
```yaml
# Global LLM configuration
config:
  max_tokens: 4000
  temperature: 0.7
  top_p: 0.9
  frequency_penalty: 0.0
  presence_penalty: 0.0

tests:
  - name: "creative_writing"
    prompt: "Write a detailed sci-fi story"
    model: "gpt-4o"
```
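One natural way to combine the global `config` block with a per-test entry is a dict merge where per-test keys win. The sketch below shows that pattern; the per-test `temperature` override is hypothetical, and praisonaibench's actual merge logic may differ.

```python
# Sketch: per-test settings override the global config block.
global_config = {
    "max_tokens": 4000,
    "temperature": 0.7,
    "top_p": 0.9,
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
}

# Hypothetical per-test temperature override for illustration.
test = {
    "name": "creative_writing",
    "prompt": "Write a detailed sci-fi story",
    "model": "gpt-4o",
    "temperature": 1.0,
}

# Structural fields are not LLM parameters, so exclude them from the merge.
reserved = {"name", "prompt", "expected", "language"}
overrides = {k: v for k, v in test.items() if k not in reserved}
effective = {**global_config, **overrides}

print(effective["temperature"], effective["model"])  # 1.0 gpt-4o
```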
## Using the Expected Field
| Use Case | Include Expected? |
|---|---|
| Factual questions | ✅ Yes |
| Math problems | ✅ Yes |
| Code output | ✅ Yes |
| Deterministic tasks | ✅ Yes |
| Creative tasks | ❌ No |
| Open-ended questions | ❌ No |
| Visual/interactive content | ❌ No |
**Scoring Impact:**
- When `expected` is provided: adds 20% objective scoring based on similarity to the response.
- When `expected` is omitted: weights automatically normalize, so subjective tests incur no penalty.
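The weight normalization described above can be sketched as follows, assuming component scores in `[0, 1]` and the 20% objective weight; the component names and the similarity metric itself are assumptions, and praisonaibench's exact scoring may differ.

```python
# Assumed split: 80% subjective quality, 20% objective similarity.
WEIGHTS = {"subjective": 0.8, "objective": 0.2}

def final_score(scores):
    # Keep only the weights for components that were actually scored,
    # then renormalize so omitting `expected` carries no penalty.
    active = {k: w for k, w in WEIGHTS.items() if k in scores}
    total = sum(active.values())
    return sum(scores[k] * w / total for k, w in active.items())

print(round(final_score({"subjective": 0.9, "objective": 1.0}), 2))  # 0.92 (with expected)
print(round(final_score({"subjective": 0.9}), 2))                    # 0.9 (without expected)
```

With `expected` present the objective component contributes its 20%; without it, the subjective weight renormalizes to 100% and the score is unchanged.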