Trace-Native Evals

Tael doesn't bolt on a separate eval product. An eval run is a normal trace, scores are metrics (tael_eval_score), judge notes are comments, and inputs/outputs are blobs: the same primitives you already use for production debugging. Reports are just SQL queries over that telemetry.

This means the loop from "production failure" to "regression test" to "verified fix" stays inside one tool, on one data model.

Run an eval suite

tael eval run executes a command once per line of a JSONL case file, injecting eval identity and the OTLP endpoint into each child process so its spans land back in Tael tagged to the run:

tael eval run cases.jsonl --suite coding-regression \
  --cmd 'python run_case.py {case_id}' \
  --code-version "$(git rev-parse --short HEAD)"

The command template expands {case_id}, {case_index}, {run_id}, and {suite_id}. Each child sees TAEL_EVAL_SUITE_ID, TAEL_EVAL_RUN_ID, TAEL_EVAL_CASE_ID, TAEL_EVAL_CASE_INDEX, TAEL_EVAL_CASE_COUNT, TAEL_EVAL_CODE_VERSION, and OTEL_EXPORTER_OTLP_ENDPOINT.
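On the child side, a run_case.py could pick up that identity from its environment. The following is a minimal sketch, not Tael's own client code: the EvalContext class and load_eval_context function are hypothetical names, and it assumes the index and count arrive as decimal strings and that the code version may be absent when --code-version is not passed.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EvalContext:
    """Hypothetical container for the identity injected by `tael eval run`."""
    suite_id: str
    run_id: str
    case_id: str
    case_index: int
    case_count: int
    code_version: Optional[str]
    otlp_endpoint: Optional[str]


def load_eval_context(env: Mapping[str, str] = os.environ) -> EvalContext:
    """Build an EvalContext from the TAEL_EVAL_* variables listed above."""
    return EvalContext(
        suite_id=env["TAEL_EVAL_SUITE_ID"],
        run_id=env["TAEL_EVAL_RUN_ID"],
        case_id=env["TAEL_EVAL_CASE_ID"],
        # Assumption: index and count are passed as decimal strings.
        case_index=int(env["TAEL_EVAL_CASE_INDEX"]),
        case_count=int(env["TAEL_EVAL_CASE_COUNT"]),
        # Assumption: treated as optional in this sketch.
        code_version=env.get("TAEL_EVAL_CODE_VERSION"),
        otlp_endpoint=env.get("OTEL_EXPORTER_OTLP_ENDPOINT"),
    )
```

Calling load_eval_context() with no arguments reads os.environ. OpenTelemetry SDK exporters read OTEL_EXPORTER_OTLP_ENDPOINT on their own, so the sketch only surfaces it for logging or sanity checks.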

Score the run

Scores are JSONL records ingested as tael_eval_score metric points:

tael eval score run_20260528_1710 scores.jsonl

Each line requires case_id, trace_id, metric, value, and scorer; optional fields include span_id, label, threshold, rationale_sha256, and source.
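A scorer script might assemble scores.jsonl along these lines. This is a sketch under assumptions: the helper names score_line and dump_scores are hypothetical, the example metric and scorer values are made up for illustration, and value is assumed to be numeric. Only the required field names come from the list above.

```python
import json
from typing import Any, Dict, Iterable

# Field names taken from the required list above.
REQUIRED_FIELDS = ("case_id", "trace_id", "metric", "value", "scorer")


def score_line(record: Dict[str, Any]) -> str:
    """Serialize one score record as a single JSONL line.

    Raises ValueError when a required field is missing. Extra keys are
    passed through untouched, since the optional field list is open-ended.
    """
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    return json.dumps(record, sort_keys=True)


def dump_scores(records: Iterable[Dict[str, Any]]) -> str:
    """Join records into JSONL text, one record per line."""
    return "".join(score_line(record) + "\n" for record in records)


# Hypothetical example record; metric and scorer values are illustrative.
example = {
    "case_id": "refund_permission_denied_loop",
    "trace_id": "abc123def456",
    "metric": "exact_match",
    "value": 1.0,
    "scorer": "unit_test",
}
```

Writing dump_scores([example]) to scores.jsonl would give a file in the shape that tael eval score expects, as far as the field list above describes it.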

Inspect, report, compare

# List recent runs
tael eval runs

# Summary of one run: pass/fail counts, avg scores, cost
tael eval status run_20260528_1710

# Per-case breakdown and a rendered report
tael eval cases run_20260528_1710
tael eval report run_20260528_1710 --format table

# Diff a run against a baseline to catch regressions
tael eval compare run_20260528_1710 run_20260527_1710
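To make "catch regressions" concrete, here is a conceptual sketch of a per-case diff between two runs. It is not how tael eval compare is implemented: find_regressions is a hypothetical function, the inputs are assumed to be plain case_id-to-score mappings for a single metric, and higher scores are assumed to be better.

```python
from typing import Dict, Tuple


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> Dict[str, Tuple[float, float]]:
    """Return cases whose score dropped by more than `tolerance`.

    Maps case_id to (baseline_score, candidate_score). Cases present in
    only one of the two runs are skipped, since there is nothing to diff.
    """
    regressions: Dict[str, Tuple[float, float]] = {}
    for case_id, base_score in baseline.items():
        if case_id not in candidate:
            continue
        new_score = candidate[case_id]
        # Assumption: higher is better, so a drop is a regression.
        if new_score < base_score - tolerance:
            regressions[case_id] = (base_score, new_score)
    return regressions
```

A nonzero tolerance keeps small score noise, for example from a nondeterministic judge, from being reported as a regression.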

Promote production failures into golden cases

The reliability loop: a real failure you found while debugging becomes a permanent test case.

# Turn a production trace into a golden eval case
tael eval case add --from-trace abc123def456 --suite golden \
  --case-id refund_permission_denied_loop --failure-mode tool_error

# Link the case to a tracked issue
tael eval case link --case-id refund_permission_denied_loop --issue-id issue_001

# Audit a suite for hygiene: provenance, duplicates, cost risks
tael eval suite inspect golden

The floor-raising loop

production stumble   →  tael issue create        (classify the failure)
recurring issue      →  tael signal              (track it over time)
golden case          →  tael eval case add       (lock in a regression test)
fix + re-run         →  tael eval run / score    (re-test the fix)
verify               →  tael eval compare        (did the floor rise?)
roll out             →  tael experiment compare  (confirm in production)

tael eval, tael issue, tael signal, and tael experiment all write structured trace comments and metrics under the hood, so an eval failure is queryable with the exact same commands as a production incident.