3.3 Output Quality Scoring (No LLM Judge in v1)

To avoid circular dependencies — C-SWON calling a model subnet to judge C-SWON workflows — all quality scoring in v1 uses deterministic, reference-based methods.

Scoring by Task Type

| Task Type | `output_quality_score` Method | Ground Truth Source |
| --- | --- | --- |
| Code | Automated test pass rate + PEP8 linting score | Unit tests in benchmark task JSON |
| RAG | ROUGE-L F1 against reference answer | Reference answers in benchmark dataset |
| Agent | Binary goal checklist: pass/fail per criterion; score = passed / total | Goal checklist in benchmark task JSON |
| Data transform | Schema validation + exact-match against expected output | Expected output in benchmark task JSON |
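As a sketch, the Agent and Data-transform rows reduce to simple deterministic functions. The function names and the simplified schema check below are illustrative, not taken from the C-SWON spec:

```python
def agent_goal_score(criteria_passed: list[bool]) -> float:
    """Agent tasks: binary checklist, score = passed / total criteria."""
    if not criteria_passed:
        return 0.0
    return sum(criteria_passed) / len(criteria_passed)

def data_transform_score(output: dict, expected: dict, required_keys: set[str]) -> float:
    """Data-transform tasks: schema validation, then exact match against expected output."""
    if not required_keys.issubset(output):  # simplified stand-in for full schema validation
        return 0.0
    return 1.0 if output == expected else 0.0
```

Because both functions are pure, every validator scoring the same output arrives at the same number, which is the property the table is designed around.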

Why ROUGE-L for RAG

ROUGE-L measures longest common subsequence overlap against a known reference answer. It is:

  • Fast — no model calls required
  • Deterministic — every validator produces identical scores for the same output
  • Reproducible — any participant can verify scores independently
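The LCS-based computation described above can be sketched in a few lines. This is a minimal illustrative implementation with whitespace tokenization; a production scorer would normalize casing and punctuation first:

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Classic dynamic-programming longest common subsequence over token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Running `rouge_l_f1` twice on the same pair always returns the same float, which is what makes cross-validator consensus on scores possible.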

LLM judges require calling a model subnet (creating a recursive dependency) and produce non-deterministic results, making cross-validator consensus impossible in v1.

Acknowledged Limitation

ROUGE-L penalises valid paraphrasing. This is acceptable for testnet MVP benchmarks where reference answers are tightly scoped. A semantic scoring upgrade (local BERTScore or embedding-based similarity, run by validators without external calls) is planned for v2.
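The planned v2 upgrade would compare embedding vectors rather than token subsequences. As a hedged sketch of the idea only (the actual v2 design, model choice, and scoring pipeline are unspecified), semantic similarity reduces to cosine similarity between locally computed embedding vectors:

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors; 1.0 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

In v2 the vectors would come from a local embedding model run by each validator, preserving the no-external-calls property; with a fixed model and deterministic inference, scores remain reproducible across validators.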
