How BlanEval Works

BlanEval is a systematic approach to AI evaluation and compliance documentation. It takes you from dataset definition to regulatory-ready evidence reports in six steps.

Evaluation Flow

1. Dataset → 2. Evaluators → 3. Run → 4. Analyze → 5. Calibrate → 6. Report

1. Choose Your Dataset

Use our pre-built evaluation datasets for one-click testing, or upload your own custom test cases.

  • Pre-built datasets for common use cases
  • One-click testing with curated benchmarks
  • Import custom data from CSV, JSON, or API
  • Version control for all datasets
  • Tag and categorize test cases

Pre-built Datasets

Customer Support QA: 2,500 cases
RAG Factuality: 1,800 cases
Red-Team Attacks: 850 cases
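
To make the custom import option concrete, here is a minimal sketch of turning CSV or JSON text into test cases. The `EvalCase` schema, the column names (`id`, `prompt`, `expected`, `tags`), and the `;` tag separator are assumptions made for illustration; they are not BlanEval's documented import format.

```python
import csv
import io
import json
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    # Hypothetical schema, not BlanEval's actual data model.
    case_id: str
    prompt: str
    expected: str
    tags: list[str] = field(default_factory=list)


def load_csv(text: str) -> list[EvalCase]:
    """Parse CSV with columns id, prompt, expected, tags (tags split on ';')."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        EvalCase(
            r["id"],
            r["prompt"],
            r["expected"],
            [t for t in (r.get("tags") or "").split(";") if t],
        )
        for r in rows
    ]


def load_json(text: str) -> list[EvalCase]:
    """Parse a JSON array of objects with the same keys; tags is a list."""
    return [
        EvalCase(o["id"], o["prompt"], o["expected"], list(o.get("tags", [])))
        for o in json.loads(text)
    ]


csv_text = (
    "id,prompt,expected,tags\n"
    "c1,Reset my password,Send reset link,support;auth\n"
)
json_text = (
    '[{"id": "j1", "prompt": "What is the refund window?", '
    '"expected": "30 days", "tags": ["support"]}]'
)

cases = load_csv(csv_text) + load_json(json_text)
for case in cases:
    print(case.case_id, case.tags)
```

Keeping both loaders behind one record type means tagging and versioning can work the same way regardless of where a case came from.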

2. Configure Evaluators

Select which quality dimensions to measure. Choose from built-in evaluators or create custom ones for your domain.

  • Relevance, factuality, hallucination risk
  • Custom evaluator definitions
  • Confidence thresholds
  • Pass/fail gate criteria

Relevance: threshold 0.85
Factuality: threshold 0.85
Hallucination: threshold 0.85
Robustness: threshold 0.85
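
As a rough illustration of how thresholds and a pass/fail gate might fit together, the sketch below checks one response's scores against the four thresholds shown above. It assumes every score is normalized to the range 0 to 1 with higher meaning better (so a hallucination score would be something like 1 minus the risk) and that a missing score counts as a failure. The function name `gate` is hypothetical.

```python
# Thresholds taken from the example above; the score convention is an assumption.
THRESHOLDS = {
    "relevance": 0.85,
    "factuality": 0.85,
    "hallucination": 0.85,
    "robustness": 0.85,
}


def gate(scores: dict[str, float], thresholds: dict[str, float] = THRESHOLDS):
    """Return (passed, failures), where failures maps evaluator -> failing score.

    A missing score is treated as 0.0, so it fails the gate (an assumption).
    """
    failures = {
        name: scores.get(name, 0.0)
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    }
    return (not failures, failures)


passed, failures = gate(
    {"relevance": 0.90, "factuality": 0.80, "hallucination": 0.95, "robustness": 0.85}
)
print(passed, failures)
```

A score exactly equal to its threshold passes in this sketch; a stricter gate would use `<=` in the comparison.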

3. Run Evaluation

Execute automated evaluation across your test cases. Run red-team attacks to surface risk signals.

  • Parallel execution for speed
  • Deterministic, reproducible runs
  • Real-time progress tracking
  • Automatic retry on failures

Running evaluation... 847 / 1247

Passed: 812
Failed: 35

4. Analyze Failures

Drill into failed cases to understand root causes. Compare against baselines to identify regressions.

  • Detailed failure breakdowns
  • Side-by-side comparisons
  • Regression detection
  • Statistical significance testing

Hallucination detected: -12% vs baseline
Test case #847: Fabricated citation

Relevance drop: -5% vs baseline
Test case #923: Off-topic response
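
Regression detection with significance testing can be sketched as a comparison of pass rates between a baseline run and the current run. The example below uses a two-proportion z-test, which is one common choice and not necessarily the test BlanEval applies. The current-run counts (812 of 847 passed) echo the run shown earlier; the baseline counts are hypothetical.

```python
import math


def two_proportion_z_test(pass_a, n_a, pass_b, n_b):
    """Two-sided z-test for a difference in pass rate between runs A and B.

    Returns (z, p_value); z is negative when run B passes less often than A.
    """
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # identical all-pass or all-fail runs: no evidence of change
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


def is_regression(pass_a, n_a, pass_b, n_b, alpha=0.05):
    """Flag a regression when B is significantly worse than A."""
    z, p_value = two_proportion_z_test(pass_a, n_a, pass_b, n_b)
    return z < 0 and p_value < alpha


# Baseline counts are hypothetical; current counts mirror the run above.
z, p_value = two_proportion_z_test(830, 847, 812, 847)
print(f"z={z:.2f}, p={p_value:.4f}, regression={is_regression(830, 847, 812, 847)}")
```

The z-test approximation is reasonable at these sample sizes; for small datasets an exact test would be a safer choice.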

5. Calibrate with Human Review

Validate automated scores with human judgment. Build calibration datasets to improve evaluator accuracy.

  • Human-in-the-loop review workflows
  • Calibration score tracking
  • Evaluator accuracy metrics
  • Feedback loop for improvement

Human Review Queue

Case #847: Pending
Case #923: Reviewed
Case #1102: Pending
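
An evaluator accuracy metric can be as simple as measuring how often automated verdicts agree with human reviewers. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance. Treating kappa as the calibration score is an assumption here, and the verdict lists are illustrative, not real review data.

```python
def agreement(auto, human):
    """Fraction of cases where the automated verdict matches the human verdict."""
    return sum(a == h for a, h in zip(auto, human)) / len(auto)


def cohens_kappa(auto, human):
    """Chance-corrected agreement between two raters over the same cases."""
    n = len(auto)
    observed = agreement(auto, human)
    labels = set(auto) | set(human)
    expected = sum((auto.count(l) / n) * (human.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters gave the same single label everywhere
    return (observed - expected) / (1 - expected)


# Illustrative pass/fail verdicts for eight reviewed cases.
auto_verdicts = [True, True, False, True, False, False, True, True]
human_verdicts = [True, False, False, True, False, True, True, True]

print(agreement(auto_verdicts, human_verdicts))
print(round(cohens_kappa(auto_verdicts, human_verdicts), 3))
```

Cases where the two verdicts disagree are natural candidates for the feedback loop, since they show where an evaluator's threshold or definition may need adjusting.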

6. Export Evidence Report

Generate comprehensive reports for stakeholders and compliance teams. Document findings, scores, and regulatory evidence.

  • PDF and CSV exports
  • Executive summaries
  • Detailed finding logs
  • Shareable report links
  • EU AI Act documentation
  • Compliance-ready templates

evaluation_report_v3.pdf

Generated Jan 15, 2024
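
For a sense of what a CSV finding log could contain, the sketch below serializes a list of findings with Python's standard `csv` module. The column names and the `findings_to_csv` helper are hypothetical and do not describe BlanEval's actual export schema. The two rows reuse the failed cases shown in the analysis step.

```python
import csv
import io

# Hypothetical column layout for a finding log.
FIELDS = ["case_id", "evaluator", "finding"]


def findings_to_csv(findings):
    """Serialize a list of finding dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(findings)
    return buffer.getvalue()


findings = [
    {"case_id": 847, "evaluator": "hallucination", "finding": "Fabricated citation"},
    {"case_id": 923, "evaluator": "relevance", "finding": "Off-topic response"},
]

report_csv = findings_to_csv(findings)
print(report_csv)
```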

Ready to start evaluating?

See how BlanEval can fit into your AI development workflow.