Evaluation System

Research-backed hybrid evaluation providing comprehensive quality assessment.

Overview

PraisonAI Bench uses a multi-component evaluation system that combines static validation, functional testing, and AI-powered quality assessment.

Evaluation Components

| Component | What It Does | Score Weight |
|---|---|---|
| 📝 HTML Validation | Static structure validation, DOCTYPE, required tags | 15% |
| 🌐 Functional | Browser rendering, console errors, render time | 40% |
| 🎯 Expected Result | Objective comparison (optional, for factual tasks) | 20%* |
| 🎨 Quality (LLM) | Code quality, completeness, best practices | 25% |
| 📊 Overall | Combined score (0-100) with pass/fail (≥70) | 100% |

*When the expected field is not provided, the remaining weights are automatically renormalized to sum to 100% (HTML: 18.75%, Functional: 50%, LLM: 31.25%).
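The weighting and renormalization described above can be sketched in a few lines of Python. This is an illustrative sketch, not PraisonAI Bench's actual implementation: the names `WEIGHTS`, `PASS_THRESHOLD`, `overall_score`, and `passed` are assumptions made for this example, and only the weights and the threshold of 70 come from the table.

```python
# Weights taken from the table above, as fractions of the overall score.
# The dictionary keys are hypothetical names chosen for this sketch.
WEIGHTS = {"html": 0.15, "functional": 0.40, "expected": 0.20, "llm": 0.25}
PASS_THRESHOLD = 70


def overall_score(scores):
    """Combine component scores (0-100) into one weighted score.

    Components that are missing from `scores` or set to None are skipped,
    and the remaining weights are rescaled so that they sum to 1.
    """
    active = {name: w for name, w in WEIGHTS.items() if scores.get(name) is not None}
    total_weight = sum(active.values())
    if total_weight == 0:
        raise ValueError("no component scores provided")
    return sum(scores[name] * w / total_weight for name, w in active.items())


def passed(scores):
    """Apply the pass/fail rule: overall score of at least 70."""
    return overall_score(scores) >= PASS_THRESHOLD


with_expected = {"html": 90, "functional": 85, "expected": 95, "llm": 80}
without_expected = {"html": 90, "functional": 85, "expected": None, "llm": 80}

print(round(overall_score(with_expected), 3))     # weighted over all four components
print(round(overall_score(without_expected), 3))  # weights rescaled to 18.75/50/31.25
print(passed(without_expected))
```

With the component scores used in the Example Output section, the sketch yields about 86.5 with an expected result and about 84.4 without one. How the tool rounds these values for display is not specified here, so the sketch returns the unrounded score.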

Example Output

With Expected Result:

HTML Validation: 90/100 ✅ Valid structure
Functional: 85/100 (renders ✅, 1 error, <1s)
Expected: 95/100 (95% similarity with expected result)
Quality: 80/100 (good structure, minor issues)
Overall: 87/100 ✅ PASSED

Without Expected Result:

HTML Validation: 90/100 ✅ Valid structure
Functional: 85/100 (renders ✅, 1 error, <1s)
Expected: N/A (not provided)
Quality: 80/100 (good structure, minor issues)
Overall: 85/100 ✅ PASSED

HTML Validation

Checks for:

  • Valid DOCTYPE declaration
  • Required HTML structure
  • Proper tag nesting
  • Semantic HTML usage
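As a rough illustration of the first two checks, the sketch below uses Python's standard `html.parser` module to detect a DOCTYPE declaration and a set of required tags. The `REQUIRED_TAGS` tuple, the `StructureChecker` class, and the `check_structure` function are assumptions made for this example; the validator's actual rules and scoring are not described here, and the sketch does not cover tag nesting or semantic HTML.

```python
from html.parser import HTMLParser

# Assumed set of "required" tags; the real validator may use a different list.
REQUIRED_TAGS = ("html", "head", "title", "body")


class StructureChecker(HTMLParser):
    """Records whether a DOCTYPE was seen and which start tags appear."""

    def __init__(self):
        super().__init__()
        self.has_doctype = False
        self.seen_tags = set()

    def handle_decl(self, decl):
        # Called for declarations such as <!DOCTYPE html>.
        if decl.lower().startswith("doctype"):
            self.has_doctype = True

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        self.seen_tags.add(tag)


def check_structure(html):
    """Return a small report: DOCTYPE presence and any missing required tags."""
    checker = StructureChecker()
    checker.feed(html)
    checker.close()
    missing = [tag for tag in REQUIRED_TAGS if tag not in checker.seen_tags]
    return {"has_doctype": checker.has_doctype, "missing_tags": missing}


page = "<!DOCTYPE html><html><head><title>Demo</title></head><body><p>Hi</p></body></html>"
print(check_structure(page))
```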

Functional Testing

Browser-based testing that checks:

  • Page renders without errors
  • JavaScript executes correctly
  • Console errors detected
  • Render time measured

LLM Quality Assessment

AI-powered code review evaluating:

  • Code quality and cleanliness
  • Feature completeness
  • Best practices adherence
  • Error handling

Pass/Fail Threshold

  • Pass: Overall score ≥ 70
  • Fail: Overall score < 70

Retry Logic

Each test includes automatic retry:

  • Default: 3 attempts
  • Configurable via settings
  • Best result is used
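The retry behaviour can be pictured with a minimal sketch like the one below. The function `run_with_retries`, its parameters, and the shape of the result dictionary are hypothetical. Stopping early once an attempt reaches the pass threshold is also an assumption; the only documented behaviour is that up to 3 attempts are made by default and the best result is kept.

```python
def run_with_retries(run_once, max_attempts=3, pass_threshold=70):
    """Call `run_once()` up to `max_attempts` times and keep the best result.

    `run_once` is a hypothetical callable that returns a dict containing a
    numeric "overall" score. Stopping early on a passing score is an
    assumption of this sketch, not documented behaviour.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    best = None
    for attempt in range(1, max_attempts + 1):
        result = run_once()
        if best is None or result["overall"] > best["overall"]:
            # Remember which attempt produced the best score so far.
            best = {**result, "attempt": attempt}
        if best["overall"] >= pass_threshold:
            break
    return best


# Usage with a stand-in evaluator that yields fixed scores on successive calls.
fake_scores = iter([40, 65, 55])
best = run_with_retries(lambda: {"overall": next(fake_scores)})
print(best)
```

In this usage example none of the three attempts passes, so all of them run and the second attempt, which has the highest score, is returned.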