3.3 Output Quality Scoring (No LLM Judge in v1)

To avoid circular dependencies — C-SWON calling a model subnet to judge C-SWON workflows — all quality scoring in v1 uses deterministic, reference-based methods.

Scoring by Task Type

| Task Type | `output_quality_score` Method | Ground Truth Source |
| --- | --- | --- |
| Code | Automated test pass rate + PEP8 linting score | Unit tests in benchmark task JSON |
| RAG | ROUGE-L F1 against reference answer | Reference answers in benchmark dataset |
| Agent | Binary goal checklist: pass/fail per criterion; score = passed / total | Goal checklist in benchmark task JSON |
| Data transform | Schema validation + exact-match against expected output | Expected output in benchmark task JSON |
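As a sketch, the Agent and Data-transform rows reduce to simple deterministic functions. The function names and the simplified schema check below are illustrative, not taken from the C-SWON spec:

```python
def agent_goal_score(criteria_passed: list[bool]) -> float:
    """Agent tasks: binary checklist, score = passed / total criteria."""
    if not criteria_passed:
        return 0.0
    return sum(criteria_passed) / len(criteria_passed)

def data_transform_score(output: dict, expected: dict, required_keys: set[str]) -> float:
    """Data-transform tasks: schema validation, then exact match against expected output."""
    if not required_keys.issubset(output):  # simplified stand-in for full schema validation
        return 0.0
    return 1.0 if output == expected else 0.0
```

Because both functions are pure, every validator scoring the same output arrives at the same number, which is the property the table is designed around.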

Why ROUGE-L for RAG

ROUGE-L measures longest common subsequence overlap against a known reference answer. It is:

  • Fast — no model calls required
  • Deterministic — every validator produces identical scores for the same output
  • Reproducible — any participant can verify scores independently
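The LCS-based computation described above can be sketched in a few lines. This is a minimal illustrative implementation with whitespace tokenization; a production scorer would normalize casing and punctuation first:

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Classic dynamic-programming longest common subsequence over token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Running `rouge_l_f1` twice on the same pair always returns the same float, which is what makes cross-validator consensus on scores possible.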

LLM judges require calling a model subnet (creating a recursive dependency) and produce non-deterministic results, making cross-validator consensus impossible in v1.

Acknowledged Limitation

ROUGE-L penalises valid paraphrasing. This is acceptable for testnet MVP benchmarks where reference answers are tightly scoped. A semantic scoring upgrade (local BERTScore or embedding-based similarity, run by validators without external calls) is planned for v2.
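The planned v2 upgrade would compare embedding vectors rather than token subsequences. As a hedged sketch of the idea only (the actual v2 design, model choice, and scoring pipeline are unspecified), semantic similarity reduces to cosine similarity between locally computed embedding vectors:

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors; 1.0 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

In v2 the vectors would come from a local embedding model run by each validator, preserving the no-external-calls property; with a fixed model and deterministic inference, scores remain reproducible across validators.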
