## 5.2 Evaluation Pipeline
Any participant can clone this repository, run `neurons/validator.py` against the benchmark dataset in `benchmarks/v1.json`, and reproduce every score the pipeline produces. The evaluation-to-weight function is fully transparent.
### Six-Stage Pipeline

#### Stage 1: Deterministic Task Selection
```python
import hashlib

# Derive a task index that any observer can recompute from public chain state.
seed = f"{validator_hotkey}:{current_block}".encode()
h = hashlib.sha256(seed).digest()
task_index = int.from_bytes(h, 'big') % len(benchmark_tasks)
task = benchmark_tasks[task_index]
```
#### Stage 2: Miner Workflow Collection
Send the task to 5–10 randomly selected miners with a sub-block timeout (≤10 s). Validate that `dendrite.hotkey` matches the queried UID before accepting a response.
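The identity check above can be sketched as a single comparison against the metagraph's UID-to-hotkey mapping. This is a minimal illustration; the function and variable names here are assumptions, not the repository's actual API.

```python
def accept_response(metagraph_hotkeys, queried_uid, response_hotkey):
    """Accept a workflow only if the responding hotkey matches the UID
    that was actually queried (guards against response spoofing)."""
    return metagraph_hotkeys[queried_uid] == response_hotkey

# Illustrative metagraph snapshot: index = UID, value = hotkey.
hotkeys = ["hk_a", "hk_b", "hk_c"]
assert accept_response(hotkeys, 1, "hk_b")       # matching hotkey: accept
assert not accept_response(hotkeys, 1, "hk_c")   # mismatch: reject
```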
#### Stage 3: Sandboxed Execution
One Docker container per workflow. Track actual TAO spent, latency, retries, failures, and `steps_completed`.
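The metric bookkeeping for this stage might look like the sketch below, with a plain callable standing in for the per-workflow Docker container. The `ExecutionResult` fields mirror the metrics named above, but the exact schema and retry policy are assumptions.

```python
import time
from dataclasses import dataclass

@dataclass
class ExecutionResult:
    tao_spent: float
    latency_s: float
    retries: int
    failed: bool
    steps_completed: int

def run_sandboxed(workflow_fn, max_retries=2):
    """Execute a workflow (here a callable returning (tao_spent,
    steps_completed)) and record Stage 3 metrics, retrying on error."""
    start = time.monotonic()
    retries = 0
    while True:
        try:
            tao, steps = workflow_fn()
            return ExecutionResult(tao, time.monotonic() - start, retries, False, steps)
        except Exception:
            retries += 1
            if retries > max_retries:
                # Exhausted retries: record a failed run with zero credit.
                return ExecutionResult(0.0, time.monotonic() - start, retries, True, 0)
```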
#### Stage 4: Output Quality Evaluation
- Code: test pass rate + linting
- RAG: ROUGE-L F1 against reference
- Agent: goal checklist pass rate
- Data transform: schema validation + exact-match
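For the RAG dimension, ROUGE-L F1 scores a candidate against the reference via their longest common subsequence. A minimal whitespace-tokenised implementation (the repository may use a library with different tokenisation):

```python
def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: LCS length over whitespace tokens, combined into
    an F1 score from LCS-based precision and recall."""
    c, r = candidate.split(), reference.split()
    # Dynamic-programming table for longest common subsequence.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ct == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```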
#### Stage 5: Composite Scoring
Apply the four-dimensional composite formula and add the result to a rolling 100-task equal-weight window.
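The rolling equal-weight window can be sketched with a bounded deque, using the 100-task size from the cadence table below. The class name is illustrative:

```python
from collections import deque

class RollingScore:
    """Equal-weight rolling window: every retained task counts the same,
    and the oldest score drops out once the window is full."""

    def __init__(self, window: int = 100):
        self.scores = deque(maxlen=window)

    def add(self, score: float) -> None:
        self.scores.append(score)  # deque evicts the oldest automatically

    def mean(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0
```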
#### Stage 6: Weight Submission
Once per tempo: normalise scores, cap each miner at 15%, then call `set_weights()`.
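The normalise-then-cap step could look like the sketch below, which redistributes capped excess proportionally among the remaining miners; the repository's exact capping rule may differ.

```python
def normalise_and_cap(raw_scores, cap=0.15):
    """Normalise raw scores to sum to 1.0, then cap each miner at `cap`
    (15% here), redistributing the excess among uncapped miners."""
    total = sum(raw_scores)
    w = [s / total for s in raw_scores]
    for _ in range(len(w)):          # converges in at most len(w) passes
        over = [i for i, x in enumerate(w) if x > cap]
        if not over:
            break
        excess = sum(w[i] - cap for i in over)
        for i in over:
            w[i] = cap
        under = [i for i, x in enumerate(w) if x < cap]
        spare = sum(w[i] for i in under)
        if spare == 0:               # everything at the cap: nothing to redistribute
            break
        for i in under:              # proportional redistribution of the excess
            w[i] += excess * (w[i] / spare)
    return w
```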
### Evaluation Cadence
| Parameter | Value | Configurable via |
|---|---|---|
| Query frequency | Async, 1 per block; ≤10 s timeout | Not configurable |
| Score window | Rolling 100 tasks, equal weight | `validator/config.py` |
| Weight submission | Once per tempo (360 blocks) | `tempo` hyperparameter |
| Minimum tasks (N_min) for execution support | 30 tasks per tempo | `EXEC_SUPPORT_N_MIN` |
| Warm-up threshold | 20 tasks | `WARMUP_TASK_THRESHOLD` |
### Navigation

| | |
|---|---|
| ← Previous | 5.1 Hardware Requirements |
| → Next | 5.3 Weight Submission |
| Index | Documentation Index |