## 5.2 Evaluation Pipeline
Any participant can clone this repository, run `neurons/validator.py` against the benchmark dataset in `benchmarks/v1.json`, and reproduce every score the pipeline produces. The evaluation-to-weight function is fully transparent.
### Six-Stage Pipeline

#### Stage 1: Deterministic Task Selection
```python
import hashlib

# Derive a task index that any observer can recompute from public chain state.
seed = f"{validator_hotkey}:{current_block}".encode()
h = hashlib.sha256(seed).digest()
task_index = int.from_bytes(h, 'big') % len(benchmark_tasks)
task = benchmark_tasks[task_index]
```
#### Stage 2: Miner Workflow Collection
Send the task to 5–10 randomly selected miners with a sub-block timeout (≤10 s). Validate that `dendrite.hotkey` matches the queried UID before accepting a response.
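The identity check above can be sketched as a single comparison against the metagraph's UID-to-hotkey mapping. This is a minimal illustration; the function and variable names here are assumptions, not the repository's actual API.

```python
def accept_response(metagraph_hotkeys, queried_uid, response_hotkey):
    """Accept a workflow only if the responding hotkey matches the UID
    that was actually queried (guards against response spoofing)."""
    return metagraph_hotkeys[queried_uid] == response_hotkey

# Illustrative metagraph snapshot: index = UID, value = hotkey.
hotkeys = ["hk_a", "hk_b", "hk_c"]
assert accept_response(hotkeys, 1, "hk_b")       # matching hotkey: accept
assert not accept_response(hotkeys, 1, "hk_c")   # mismatch: reject
```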
#### Stage 3: Sandboxed Execution
One Docker container per workflow. Track actual TAO spent, latency, retries, failures, and `steps_completed`.
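The metric bookkeeping for this stage might look like the sketch below, with a plain callable standing in for the per-workflow Docker container. The `ExecutionResult` fields mirror the metrics named above, but the exact schema and retry policy are assumptions.

```python
import time
from dataclasses import dataclass

@dataclass
class ExecutionResult:
    tao_spent: float
    latency_s: float
    retries: int
    failed: bool
    steps_completed: int

def run_sandboxed(workflow_fn, max_retries=2):
    """Execute a workflow (here a callable returning (tao_spent,
    steps_completed)) and record Stage 3 metrics, retrying on error."""
    start = time.monotonic()
    retries = 0
    while True:
        try:
            tao, steps = workflow_fn()
            return ExecutionResult(tao, time.monotonic() - start, retries, False, steps)
        except Exception:
            retries += 1
            if retries > max_retries:
                # Exhausted retries: record a failed run with zero credit.
                return ExecutionResult(0.0, time.monotonic() - start, retries, True, 0)
```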
#### Stage 4: Output Quality Evaluation
- Code: test pass rate + linting
- RAG: ROUGE-L F1 against reference
- Agent: goal checklist pass rate
- Data transform: schema validation + exact-match
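For the RAG dimension, ROUGE-L F1 scores a candidate against the reference via their longest common subsequence. A minimal whitespace-tokenised implementation (the repository may use a library with different tokenisation):

```python
def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: LCS length over whitespace tokens, combined into
    an F1 score from LCS-based precision and recall."""
    c, r = candidate.split(), reference.split()
    # Dynamic-programming table for longest common subsequence.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ct == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```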
#### Stage 5: Composite Scoring
Apply the four-dimensional composite formula and add the result to a rolling 100-task equal-weight window.
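The rolling equal-weight window can be sketched with a bounded deque, using the 100-task size from the cadence table below. The class name is illustrative:

```python
from collections import deque

class RollingScore:
    """Equal-weight rolling window: every retained task counts the same,
    and the oldest score drops out once the window is full."""

    def __init__(self, window: int = 100):
        self.scores = deque(maxlen=window)

    def add(self, score: float) -> None:
        self.scores.append(score)  # deque evicts the oldest automatically

    def mean(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0
```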
#### Stage 6: Weight Submission
Once per tempo: normalise scores, cap each miner at 15%, then call `set_weights()`.
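The normalise-then-cap step could look like the sketch below, which redistributes capped excess proportionally among the remaining miners; the repository's exact capping rule may differ.

```python
def normalise_and_cap(raw_scores, cap=0.15):
    """Normalise raw scores to sum to 1.0, then cap each miner at `cap`
    (15% here), redistributing the excess among uncapped miners."""
    total = sum(raw_scores)
    w = [s / total for s in raw_scores]
    for _ in range(len(w)):          # converges in at most len(w) passes
        over = [i for i, x in enumerate(w) if x > cap]
        if not over:
            break
        excess = sum(w[i] - cap for i in over)
        for i in over:
            w[i] = cap
        under = [i for i, x in enumerate(w) if x < cap]
        spare = sum(w[i] for i in under)
        if spare == 0:               # everything at the cap: nothing to redistribute
            break
        for i in under:              # proportional redistribution of the excess
            w[i] += excess * (w[i] / spare)
    return w
```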
### Evaluation Cadence
| Parameter | Value | Configurable via |
|---|---|---|
| Query frequency | Async, 1 per block; ≤10 s timeout | Not configurable |
| Score window | Rolling 100 tasks, equal weight | `validator/config.py` |
| Weight submission | Once per tempo (360 blocks) | `tempo` hyperparameter |
| Minimum tasks (N_min) for execution support | 30 tasks per tempo | `EXEC_SUPPORT_N_MIN` |
| Warm-up threshold | 20 tasks | `WARMUP_TASK_THRESHOLD` |
### Navigation

| | |
|---|---|
| ← Previous | 5.1 Hardware Requirements |
| → Next | 5.3 Weight Submission |
| Index | Documentation Index |