
5.2 Evaluation Pipeline

Any participant can clone this repository, run neurons/validator.py against the benchmark dataset in benchmarks/v1.json, and reproduce every score in this pipeline. The evaluation-to-weight function is fully transparent.

Six-Stage Pipeline

Stage 1: Deterministic Task Selection

import hashlib

# Derive a per-validator, per-block seed so any observer can recompute the selection
seed = f"{validator_hotkey}:{current_block}".encode()
h = hashlib.sha256(seed).digest()
# Map the 256-bit digest uniformly onto the benchmark task list
task_index = int.from_bytes(h, 'big') % len(benchmark_tasks)
task = benchmark_tasks[task_index]

Stage 2: Miner Workflow Collection

Send the task to 5–10 randomly selected miners with a sub-block timeout (≤10 s). Validate that dendrite.hotkey matches the hotkey of the queried UID before accepting a response.
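A minimal sketch of the selection and acceptance checks above. The function names `select_miners` and `accept_response`, and the dict-shaped response, are assumptions for illustration; the actual wire format comes from the subnet's dendrite/axon calls.

```python
import random

def select_miners(miner_uids, k_min=5, k_max=10, rng=None):
    """Randomly pick 5-10 miner UIDs to query for the current task."""
    rng = rng or random.Random()
    k = rng.randint(k_min, min(k_max, len(miner_uids)))
    return rng.sample(miner_uids, k)

def accept_response(response, expected_hotkey):
    """Accept a response only if its hotkey matches the queried UID's hotkey."""
    return response.get("hotkey") == expected_hotkey
```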

Stage 3: Sandboxed Execution

Each workflow runs in its own Docker container. Track actual TAO spent, latency, retries, failures, and steps_completed.
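The metrics and isolation described above might be sketched as follows. The `ExecutionMetrics` fields mirror the list in the text; the `sandbox_command` helper, the image name, the resource limits, and the `/workflow/run.py` entrypoint are all illustrative assumptions, not the actual validator implementation.

```python
from dataclasses import dataclass

@dataclass
class ExecutionMetrics:
    """Per-workflow metrics recorded during sandboxed execution (Stage 3)."""
    tao_spent: float = 0.0
    latency_s: float = 0.0
    retries: int = 0
    failures: int = 0
    steps_completed: int = 0

def sandbox_command(image, workflow_path, entrypoint="python /workflow/run.py"):
    """Build an isolated `docker run` invocation: no network, capped resources."""
    return [
        "docker", "run", "--rm",
        "--network", "none",            # no outbound network from the sandbox
        "--memory", "512m", "--cpus", "1",
        "-v", f"{workflow_path}:/workflow:ro",
        image, *entrypoint.split(),
    ]
```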

Stage 4: Output Quality Evaluation

  • Code: test pass rate + linting
  • RAG: ROUGE-L F1 against reference
  • Agent: goal checklist pass rate
  • Data transform: schema validation + exact-match
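As one concrete example from the list above, ROUGE-L F1 for RAG outputs can be computed from the longest common subsequence of candidate and reference tokens. This is a minimal whitespace-tokenised sketch; production scorers typically use an established ROUGE library with proper tokenisation.

```python
def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    c, r = candidate.split(), reference.split()
    # Longest common subsequence length via dynamic programming
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ct == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```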

Stage 5: Composite Scoring

Apply the four-dimensional scoring formula and add the result to a rolling 100-task, equal-weight window.
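The rolling window can be sketched with a bounded deque; each of the last 100 composite scores contributes equally. The four-dimensional formula itself is defined elsewhere and is not reproduced here, so `RollingScore` only models the windowing step.

```python
from collections import deque

class RollingScore:
    """Equal-weight rolling window over the last N composite task scores."""

    def __init__(self, window: int = 100):
        self.scores = deque(maxlen=window)  # oldest score drops off automatically

    def add(self, score: float) -> None:
        self.scores.append(score)

    def mean(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0
```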

Stage 6: Weight Submission

Once per tempo: normalise scores, cap each miner at 15% of total weight, and call set_weights().
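The normalise-and-cap step might look like the sketch below. Note that clipping one miner's weight and redistributing the excess can push others over the cap, so this sketch iterates until the cap holds; the actual validator may resolve this differently.

```python
def cap_and_normalize(scores: dict, cap: float = 0.15, max_iter: int = 10) -> dict:
    """Normalise raw scores to weights summing to 1, capping each miner at `cap`."""
    total = sum(scores.values())
    if total <= 0:
        return {uid: 0.0 for uid in scores}
    w = {uid: s / total for uid, s in scores.items()}
    for _ in range(max_iter):
        over = {uid for uid, v in w.items() if v > cap}
        if not over:
            break
        excess = sum(w[uid] - cap for uid in over)
        for uid in over:
            w[uid] = cap
        # Redistribute the clipped excess proportionally among uncapped miners
        under = {uid for uid in w if uid not in over and w[uid] < cap}
        under_sum = sum(w[uid] for uid in under)
        if under_sum == 0:
            break
        for uid in under:
            w[uid] += excess * w[uid] / under_sum
    return w
```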

Evaluation Cadence

| Parameter | Value | Configurable via |
| --- | --- | --- |
| Query frequency | Async, 1 per block; ≤10 s timeout | Not configurable |
| Score window | Rolling 100 tasks, equal weight | validator/config.py |
| Weight submission | Once per tempo (360 blocks) | tempo hyperparameter |
| N_min for exec support | 30 tasks per tempo | EXEC_SUPPORT_N_MIN |
| Warmup threshold | 20 tasks | WARMUP_TASK_THRESHOLD |
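A hypothetical view of the tunables in the table above as module-level constants; the authoritative names and defaults live in validator/config.py and on-chain hyperparameters, so treat this only as a summary of the documented values.

```python
# Documented defaults (see table above); names other than EXEC_SUPPORT_N_MIN
# and WARMUP_TASK_THRESHOLD are illustrative, not the real config keys.
QUERY_TIMEOUT_S = 10          # per-query timeout, not configurable
SCORE_WINDOW = 100            # rolling window length, equal weight per task
TEMPO_BLOCKS = 360            # weight submission cadence
EXEC_SUPPORT_N_MIN = 30       # minimum tasks per tempo for exec support
WARMUP_TASK_THRESHOLD = 20    # tasks before a miner exits warmup
```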
