Debugging with an Agent

Tael's query surface is shaped around how an investigation actually proceeds: get the lay of the land, narrow to the suspicious traces, pull the full story for one of them, correlate the other signals, and write down what you found. This page is that order of operations.

It's written for an LLM agent driving the CLI, but the sequence is the same one a human follows — the difference is that every step returns JSON an agent can parse.

1. Orient

Don't start by guessing. Ask for the aggregate picture of the window you care about:

tael summarize --last 15m

This returns trace volume, top services, top error operations, and log/metric counts. It tells you where to look before you spend a query looking.

For "what's different from normal?", compare against a baseline window:

tael anomalies --last 5m --baseline 30m

anomalies surfaces services whose error rate or p95 latency regressed past a threshold — you don't have to define the thresholds yourself.

2. Find the suspicious traces

Narrow to candidates using whatever you learned in step 1:

# Errors on the implicated service
tael query traces --service cart --status error --last 15m

# Or slow calls
tael query traces --service cart --min-duration 1s --last 15m

# Or a specific attribute you instrumented
tael query traces --attribute flag.new_pricing=true --status error

If the symptom is described in text ("rate limit", a specific error message), search the payloads directly:

tael query traces --text "rate limit" --last 1h

3. Pull the full trace

Pick a representative trace_id and fetch every span, ordered by start time:

tael get trace abc123def456

Find where it first went wrong:

tael get trace abc123def456 | jq '.spans[] | select(.status=="error") | {op:.operation, msg:.status.message}'

4. Correlate the other signals

A trace rarely tells the whole story alone. Pull spans, logs, and metrics for the trace in one shot:

tael correlate --trace abc123def456

This is the step that usually closes the case — the log line or metric spike that explains the failing span shows up alongside it, with no manual join across stores.

If you need to go wider than one trace, drop to SQL:

tael query sql "SELECT operation, COUNT(*) AS n FROM spans
                WHERE service='cart' AND status='error' AND start_time > now() - INTERVAL '1 hour'
                GROUP BY operation ORDER BY n DESC"

5. Record what you found

Close the loop by attaching the conclusion to the evidence, so the next session doesn't start over:

tael comment add abc123def456 "Root cause: cart-service exhausted its DB connection pool under the new-pricing flag" --author claude

If the failure looks recurring rather than one-off, promote it to a tracked issue:

tael issue create --from-trace abc123def456 --failure-mode tool_error --impact high \
  --summary "cart DB pool exhaustion under new-pricing flag"

6. Watch the fix

After deploying a change, watch the relevant window for the regression to clear. tael watch polls the summary on an interval and prints signed deltas per tick:

tael watch --last 1m --interval 10 --service cart

For a human at the keyboard, tael live gives the same data as an interactive feed.

The short version

summarize / anomalies   →  where is it broken, and how bad?
query traces            →  which traces are the problem?
get trace               →  what happened inside one of them?
correlate               →  what do the logs & metrics say?
comment add / issue     →  write the finding down
watch                   →  did the fix hold?