## 3.3 Output Quality Scoring (No LLM Judge in v1)
To avoid circular dependencies — C-SWON calling a model subnet to judge C-SWON workflows — all quality scoring in v1 uses deterministic, reference-based methods.
### Scoring by Task Type
| Task Type | `output_quality_score` Method | Ground Truth Source |
|---|---|---|
| Code | Automated test pass rate + PEP8 linting score | Unit tests in benchmark task JSON |
| RAG | ROUGE-L F1 against reference answer | Reference answers in benchmark dataset |
| Agent | Binary goal checklist: pass/fail per criterion; score = passed / total | Goal checklist in benchmark task JSON |
| Data transform | Schema validation + exact-match against expected output | Expected output in benchmark task JSON |
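The Agent and Data transform rows reduce to simple deterministic checks. A minimal sketch of both (the function names, the whole-output equality check, and the key-based schema check are illustrative assumptions, not the actual validator code):

```python
def agent_goal_score(checklist: list[bool]) -> float:
    """Agent tasks: binary goal checklist, score = passed / total."""
    if not checklist:
        return 0.0
    return sum(checklist) / len(checklist)


def data_transform_score(output: dict, expected: dict, required_keys: set[str]) -> float:
    """Data transform tasks: schema validation, then exact match.

    Schema validation is sketched here as a required-key check; a real
    validator might use a full JSON Schema instead.
    """
    if not required_keys.issubset(output):
        return 0.0  # schema failure short-circuits to zero
    return 1.0 if output == expected else 0.0
```

For example, an agent run passing 3 of 4 goal criteria scores 0.75, and a transform whose output differs from the expected JSON in any field scores 0.0.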
### Why ROUGE-L for RAG
ROUGE-L measures longest common subsequence overlap against a known reference answer. It is:
- Fast — no model calls required
- Deterministic — every validator produces identical scores for the same output
- Reproducible — any participant can verify scores independently
LLM judges require calling a model subnet (creating a recursive dependency) and produce non-deterministic results, making cross-validator consensus impossible in v1.
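ROUGE-L needs nothing beyond a longest-common-subsequence computation, which is why every validator can reproduce it exactly. A minimal sketch (whitespace tokenisation and the function name are illustrative assumptions; production scorers typically also normalise case and punctuation):

```python
def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: LCS overlap between candidate and reference tokens."""
    c, r = candidate.split(), reference.split()
    if not c or not r:
        return 0.0
    # Dynamic-programming table: dp[i][j] = LCS length of c[:i] and r[:j]
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            if ct == rt:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge_l_f1("the cat sat on the mat", "the cat is on the mat")` finds an LCS of 5 tokens over 6-token sequences, giving F1 = 5/6 ≈ 0.833, and any validator re-running it gets the identical score.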
### Acknowledged Limitation
ROUGE-L penalises valid paraphrasing. This is acceptable for testnet MVP benchmarks where reference answers are tightly scoped. A semantic scoring upgrade (local BERTScore or embedding-based similarity, run by validators without external calls) is planned for v2.
### Navigation
| ← Previous | 3.2 Scoring Formula |
|---|---|
| → Next | 3.4 Anti-Gaming Mechanisms |
| Index | Documentation Index |