Observability¶
Every workflow step emits OpenTelemetry traces and Prometheus metrics out of the box. No external SaaS required — the full stack runs alongside your workloads.
Quick start¶
# Start the observability stack
cd deploy && docker compose up -d
# Run a workflow (metrics are emitted automatically)
agentloom run examples/01_simple_qa.yaml
# Access dashboards
open http://localhost:3000 # Grafana (admin/admin)
open http://localhost:9090 # Prometheus
open http://localhost:16686 # Jaeger
Stack architecture¶
CLI (agentloom run)
|
+-- OTel SDK --> OTel Collector (:4317 gRPC)
| |
| +-- Prometheus exporter (:8889) --> Prometheus (:9090)
| +-- OTLP exporter --> Jaeger (:16686)
|
+-- Observer --> MetricsManager --> OTel gauges / counters / histograms
Each CLI invocation is an ephemeral process that creates its own MeterProvider. The OTel Collector aggregates metrics and exposes them to Prometheus with a 30-minute expiration window (metric_expiration: 30m), so data remains visible after the CLI exits.
| Component | Port | Purpose |
|---|---|---|
| OTel Collector | 4317 (gRPC), 8889 (Prometheus) | Receives OTel data, exports to Prometheus + Jaeger |
| Prometheus | 9090 | Time-series storage, PromQL queries |
| Jaeger | 16686 | Distributed tracing UI |
| Grafana | 3000 | Dashboard visualization |
Grafana dashboard¶
The dashboard is auto-provisioned at startup. Navigate to Dashboards > AgentLoom or go directly to http://localhost:3000/d/agentloom-main.
Dashboard variables¶
Use the dropdown selectors at the top of the dashboard to filter:
| Variable | Label | Default | Description |
|---|---|---|---|
$workflow |
Workflow | .* |
Filter panels by workflow name |
$provider |
Provider | .* |
Filter panels by provider |
Row 1 — Overview¶
Seven stat panels showing high-level aggregates:
| Panel | Type | What it shows |
|---|---|---|
| Total Runs | stat | Total workflow executions across all workflows |
| Success Rate | gauge | Percentage of successful runs (green >= 95%, orange >= 80%, red < 80%) |
| Failed | stat | Count of failed workflow runs |
| Total Tokens | stat | Sum of all tokens consumed (input + output) |
| Est. Cost | stat | Estimated total cost in USD |
| Providers | stat | Number of distinct providers that have handled requests |
| Step Types | stat | Number of distinct step types executed |
Row 2 — Workflow Performance¶
| Panel | Type | What it shows |
|---|---|---|
| Workflow Runs | timeseries | Cumulative success vs failed runs over time |
| Workflow Duration (p50/p95/p99) | timeseries | Latency percentiles per workflow |
Row 3 — Step Analysis¶
| Panel | Type | What it shows |
|---|---|---|
| Step Executions | timeseries | Cumulative step success vs failure count over time |
| Step Duration (p50/p95/p99) | timeseries | Per-step-type latency percentiles |
Row 4 — Token Economics¶
| Panel | Type | What it shows |
|---|---|---|
| Tokens | timeseries | Cumulative input vs output token count per provider/model |
| Tokens by Provider | bar gauge | Horizontal bars showing total tokens per provider/model |
| Token Split | pie chart | Ratio of prompt tokens to completion tokens |
Row 5 — Provider Performance¶
| Panel | Type | What it shows |
|---|---|---|
| Provider Latency (p50/p95/p99) | timeseries | Per-provider response time percentiles |
| Provider Calls | timeseries | Cumulative call count per provider |
Row 6 — Provider Reliability¶
| Panel | Type | What it shows |
|---|---|---|
| Provider Errors | timeseries | Cumulative error count per provider |
| Provider Availability | gauge | Uptime percentage over last hour (green >= 99%, orange >= 95%, red < 95%) |
| Circuit Breaker | stat | Current circuit breaker state per provider |
Circuit breaker state values
| Value | Label | Color | Meaning |
|---|---|---|---|
| 0 | CLOSED | Green | Normal — requests pass through |
| 1 | OPEN | Red | Tripped — requests rejected immediately |
| 2 | HALF-OPEN | Yellow | Recovery probe — one test request allowed |
State transitions: CLOSED -> OPEN after 5 consecutive failures. OPEN -> HALF-OPEN after 60s timeout. HALF-OPEN -> CLOSED on success, back to OPEN on failure.
Row 7 — Detailed Breakdown¶
| Panel | Type | What it shows |
|---|---|---|
| Workflow Runs Summary | table | Per-workflow, per-status run counts |
| Provider Call Details | table | Per-provider, per-model call counts |
Row 8 — Cost Analysis¶
| Panel | Type | What it shows |
|---|---|---|
| Cumulative Cost | timeseries | Running total cost in USD |
| Token Cost | timeseries | Estimated cost based on cumulative token volume per provider/model |
| Budget Remaining | timeseries | Remaining budget per workflow (green line) |
| Avg Cost/Run | stat | Mean cost per workflow execution |
| Avg Tokens/Run | stat | Mean token consumption per workflow execution |
| Avg Duration/Run | stat | Mean wall-clock time per workflow execution |
| Token Input/Output Ratio | stat | Ratio of output tokens to input tokens |
Row 9 — Multi-modal¶
| Panel | Type | What it shows |
|---|---|---|
| Attachments | timeseries | Attachment count by type (image, pdf, audio) |
Row 10 — Streaming¶
| Panel | Type | What it shows |
|---|---|---|
| Stream Responses | timeseries | Cumulative streaming response count |
| Time to First Token (p50/p95/p99) | timeseries | TTFT latency percentiles |
Metrics reference¶
All metrics are prefixed with agentloom_.
Workflow metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
workflow_runs_total |
counter | workflow, status | Workflow execution count |
workflow_duration_seconds |
histogram | workflow | End-to-end workflow latency |
cost_usd_total |
counter | — | Estimated USD cost |
budget_remaining_usd |
gauge | workflow | Remaining budget per workflow |
Step metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
step_executions_total |
counter | step_type, status, stream | Step execution count |
step_duration_seconds |
histogram | step_type, stream | Per-step latency |
Provider metrics¶
AgentLoom-specific counters live alongside the canonical OTel GenAI client histogram. The histogram replaces the previous provider_latency_seconds — distributions are required by the spec.
| Metric | Type | Labels | Description |
|---|---|---|---|
agentloom_provider_calls_total |
counter | provider, model, stream | Provider API call count |
gen_ai.client.operation.duration |
histogram (s) | gen_ai.operation.name, gen_ai.provider.name, stream | OTel canonical operation duration |
agentloom_provider_errors_total |
counter | provider, error_type | Provider error count |
agentloom_circuit_breaker_state |
gauge | provider | Circuit breaker state (0/1/2) |
Token metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
gen_ai.client.token.usage |
histogram ({token}) |
gen_ai.operation.name, gen_ai.provider.name, gen_ai.request.model, gen_ai.token.type | OTel canonical per-call token observations. gen_ai.token.type is input / output / reasoning (reasoning is an AgentLoom extension to the spec's input/output enum) |
agentloom_attachments_total |
counter | step_type | Attachment count by step type |
Streaming metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
agentloom_stream_responses_total |
counter | provider, model | Streaming response count (no OTel equivalent — kept AgentLoom-specific) |
gen_ai.client.operation.time_to_first_chunk |
histogram (s) | gen_ai.operation.name, gen_ai.provider.name, gen_ai.request.model | OTel canonical streaming TTFT |
Span schema¶
AgentLoom emits a three-level span hierarchy: workflow → step → provider call (and tool call when a step invokes a registered tool). Every span / attribute / metric name is centralised in agentloom.observability.schema so downstream consumers (Grafana dashboards, AgentTest, Jaeger plugins) parse a stable contract.
Hierarchy¶
workflow:<workflow_name> # AgentLoom orchestration
└── step:<step_id> # AgentLoom orchestration
└── chat <model> # OTel GenAI inference span — one per fallback attempt
A failed primary provider followed by a successful fallback shows up as two sibling chat <model> spans under the same step:* parent — useful for debugging fallback latency.
Attribute conventions¶
Inference spans follow the canonical OTel GenAI registry (May 2026 spec). Workflow / step orchestration spans use AgentLoom-specific names.
| Namespace | Source | Example |
|---|---|---|
gen_ai.* |
OpenTelemetry GenAI registry (canonical names) | gen_ai.operation.name, gen_ai.provider.name, gen_ai.request.model, gen_ai.request.temperature, gen_ai.request.max_tokens, gen_ai.request.stream, gen_ai.response.model, gen_ai.response.finish_reasons (array), gen_ai.response.time_to_first_chunk, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.reasoning.output_tokens |
error.* |
OTel general semantic conventions | error.type (set on errored inference spans alongside step.error) |
workflow.* / step.* |
AgentLoom orchestration metadata | workflow.run_id, workflow.status, step.id, step.type, step.duration_ms, step.cost_usd |
tool.* |
Tool-call details | tool.name, tool.args_hash, tool.success |
agentloom.* |
AgentLoom-specific (no OTel equivalent) | agentloom.prompt.hash, agentloom.approval_gate.decision, agentloom.webhook.status, agentloom.recording.provider |
Workflow-level attributes¶
| Attribute | Description |
|---|---|
workflow.name |
Workflow identifier |
workflow.run_id |
Per-execution UUID — correlate Jaeger traces with checkpoints / external systems |
workflow.status |
Final status (success / failed / paused / budget_exceeded / timeout) |
workflow.duration_ms |
End-to-end execution time |
workflow.total_tokens, workflow.total_cost_usd |
Aggregates across all steps |
Step-level attributes¶
| Attribute | Description |
|---|---|
step.id, step.type, step.status |
Step identification |
step.duration_ms, step.cost_usd |
Per-step latency / spend |
step.stream, step.attachments |
Streaming flag, attachment count |
gen_ai.operation.name, gen_ai.provider.name, gen_ai.request.model |
Operation type (chat for llm_call), provider (e.g. openai, gcp.gemini), model |
gen_ai.usage.input_tokens, gen_ai.usage.output_tokens |
Visible token counts |
gen_ai.usage.reasoning.output_tokens |
Chain-of-thought tokens (o-series, Gemini 2.5+ thinking) — emitted only when non-zero |
gen_ai.response.finish_reasons |
Array of provider-supplied stop reasons (e.g. ["stop"]) |
gen_ai.response.time_to_first_chunk |
Streaming-only, in seconds |
agentloom.prompt.hash, agentloom.prompt.length_chars |
Prompt fingerprint for correlating failures with the prompt that caused them |
agentloom.prompt.template_id, agentloom.prompt.template_vars |
Template provenance |
Inference-level attributes (provider span)¶
The chat <model> span — emitted once per fallback attempt by the gateway — carries the full set of GenAI inference attributes. A single step:* may have multiple sibling provider spans when fallback fires.
| Attribute | Description |
|---|---|
gen_ai.operation.name |
Always chat for llm_call; future operation types follow the OTel registry (embeddings, execute_tool, invoke_agent, …) |
gen_ai.provider.name |
Canonical OTel value (e.g. openai, anthropic, gcp.gemini) translated from AgentLoom's internal provider name |
gen_ai.request.model, gen_ai.response.model |
Requested model and the model the provider actually responded with (may differ when the provider auto-resolves a version, e.g. gpt-4o-mini → gpt-4o-mini-2024-07-18) |
gen_ai.request.temperature, gen_ai.request.max_tokens |
Sampling controls passed through to the provider |
gen_ai.request.stream |
true for streaming calls — distinguishes streaming vs non-streaming inference at the inference span level |
gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.reasoning.output_tokens |
Token counts |
gen_ai.response.finish_reasons |
Array of stop reasons |
gen_ai.response.time_to_first_chunk |
Streaming-only |
error.type |
Set on errored attempts (OTel general convention, alongside the AgentLoom-specific step.error) |
agentloom.provider.attempt, agentloom.provider.attempt_outcome |
Fallback attempt index (0-indexed) and outcome (ok / error) — debugging fallback behaviour |
Capture flags¶
Full prompt content is not captured by default — size and secrets concerns. Set config.capture_prompts: true in the workflow YAML to opt in: each llm_call span then carries an agentloom.prompt.captured OTel event with the rendered prompt and system_prompt. Event payloads avoid the attribute-size cap and stay easy to filter at the OTel collector. Off by default; opt-in for debugging or trusted environments.
When a redaction policy is configured (see State redaction), the captured copy is re-rendered against the redacted state — a flagged {state.api_key} becomes a <REDACTED:sha256=...> sentinel in the span event, while the request actually sent to the provider keeps the plaintext value. Same policy, same patterns, single source of truth: declare it once in state_schema: (or AGENTLOOM_REDACT_STATE_KEYS) and every persistence boundary — checkpoint, webhook body, span event — honours it.
Provider name translation¶
AgentLoom's internal provider names map to the canonical OTel registry values: google → gcp.gemini, others (openai, anthropic, ollama) match the registry as-is. Custom values (ollama, mock) ride the spec's "vendor extension" allowance. The bundled Grafana dashboard queries Prometheus metrics (not span attributes), so dashboard panels are unaffected by attribute renames.
Quality annotations¶
Workflow spans capture latency, tokens, and cost — but not output correctness. The WorkflowResult.annotate() API attaches post-hoc quality scores so evaluators or human reviewers can correlate execution performance with output quality.
result = await engine.run()
result.annotate("answer", quality_score=4.5, source="human_feedback", rubric="helpfulness")
Each annotation produces a standalone quality:<target> OTel span carrying:
| Attribute | Meaning |
|---|---|
workflow.run_id |
The original run id — group quality spans with the run by joining here |
workflow.name |
Workflow name for filtering |
agentloom.quality.target |
The annotation target ("answer", "step:review", ...) |
agentloom.quality.score |
Numeric score |
agentloom.quality.source |
Producer ("human_feedback", "llm_judge", "regex", ...) |
agentloom.quality.metadata.<key> |
Free-form metadata, flattened so each key is queryable in Jaeger |
The workflow span is already closed by the time result.annotate() runs, so retroactive attribute attachment isn't possible — standalone spans keyed by workflow.run_id are the workaround. The engine wires its tracing context onto the result before returning, so result.annotate(...) auto-publishes the span the moment it's called — no extra plumbing required to see the annotation in Jaeger. Offline / replay scenarios that construct a WorkflowResult without a tracer fall back to data-only annotations; agentloom.observability.quality.emit_quality_annotations(result, tracing) is available for batch evaluators that need to push annotations through a tracer assembled later.
In Grafana / Jaeger, a query like workflow.run_id="<id>" AND name=~"quality:.*" lists every annotation attached to a run; agentloom.quality.score < 3 surfaces low-quality outputs across runs to diagnose regressions.
Per-run history records¶
The engine writes a JSON record to ./agentloom_runs/<run_id>.json after every workflow execution (success or failure). Records are intentionally small and self-contained so post-hoc debugging never requires replaying the workflow:
{
"_schema_version": 1,
"run_id": "abc123def456",
"timestamp": "2026-05-02T18:34:18+00:00",
"agentloom_version": "0.5.0",
"python_version": "3.12.13",
"workflow_name": "simple-qa",
"workflow_hash": "sha256:...",
"status": "success",
"providers_used": ["openai/gpt-4o-mini"],
"total_cost_usd": 0.012,
"total_tokens": 320,
"steps_executed": 5,
"duration_ms": 3200,
"error": null
}
Override the directory via the AGENTLOOM_RUNS_DIR env var or the runs_dir argument on RunHistoryWriter. Disk I/O happens in a worker thread so the write doesn't block the event loop, and any failure (broken directory, permissions) is logged at debug and swallowed — history is best-effort, never load-bearing.
Inspect records via the CLI:
agentloom history # most recent 20 runs, table format
agentloom history --workflow simple-qa # filter by workflow
agentloom history --provider openai # filter by provider prefix
agentloom history --since 2026-05-01 # date filter (UTC midnight anchor)
agentloom history --since 2026-05-01 --until 2026-05-02 # date range
agentloom history --min-cost 0.10 # cost filters
agentloom history --max-cost 1.00
agentloom history --json # machine-readable
--since / --until accept YYYY-MM-DD (anchored at UTC midnight) or full ISO 8601. --min-cost / --max-cost operate on total_cost_usd. Filters compose, so --workflow simple-qa --since 2026-05-01 --max-cost 0.10 lists every cheap run of simple-qa since May 1st. The table columns (TIMESTAMP, RUN ID, WORKFLOW, STATUS, COST USD, DUR MS) are stable — downstream grep / awk scripts can rely on the layout. agentloom history is distinct from agentloom runs: runs lists checkpointed-resumable executions from the configured checkpointer, while history lists every execution regardless of checkpointing.
Budget enforcement¶
The engine routes all workflow spend through BudgetEnforcer (agentloom.resilience.budget). The enforcer carries an anyio.Lock, so concurrent step completions in a parallel layer can't race past the limit by reading the same _spent value and both adding their cost. Two surfaces matter:
- Pre-dispatch gate (
estimate(0)) — before launching any step, the engine readsspentunder the lock and refuses to start if the budget is already exhausted. This bounds the worst-case overshoot to the in-flight set of a single layer rather than letting it compound across layers. - Post-completion charge (
charge(cost_usd)) — when a step succeeds, the engine adds the actual cost inside the lock and raisesBudgetExceededErrorif the post-charge total is over. The exception propagates through the task group to the engine's terminal classifier, which surfacesWorkflowStatus.BUDGET_EXCEEDED.
Cross-subworkflow accounting¶
When a parent workflow launches a subworkflow step:
| Parent budget | Child budget | Behaviour |
|---|---|---|
Set ($0.10) |
None | Child engine inherits the parent's enforcer. Child charges count against the parent's _spent, so the parent's gate trips at the right step. |
Set ($0.10) |
Set ($0.05) |
Child uses a fresh enforcer scoped to its own limit. Pre-0.5.0 behaviour preserved. The parent's per-step accounting still adds the rolled-up subworkflow cost to the parent counter. |
| None | Set ($0.05) |
Child enforces its own budget; parent has no limit to enforce. |
| None | None | No enforcement at either level. |
Pause-over-budget precedence¶
When an approval_gate and a budget-blowing LLM step land in the same layer, pause wins. Pre-0.5.0 budget short-circuited the pause: the LLM step completed (spending the money), the pause was dropped, and the workflow ended budget_exceeded with no resumable checkpoint at the gate. The reversed precedence preserves both options — the human can --approve or --reject, and the workflow re-evaluates budget on resume. A workflow that resumes with budget already exhausted will then surface BudgetExceededError on the next dispatch (the user explicitly chose to look at the pause first).
Troubleshooting¶
Panels show 'No data'
Metrics are ephemeral — each CLI run exports data, then the process exits. The OTel Collector retains metrics for 30 minutes (metric_expiration).
# Verify metrics reach Prometheus
curl -s 'http://localhost:9090/api/v1/query?query=agentloom_workflow_runs_total'
# Run a workflow to generate fresh data
agentloom run examples/01_simple_qa.yaml
# Run the circuit breaker demo for reliability panels
agentloom run examples/16_circuit_breaker_demo.yaml
Timeseries panels show flat lines
Each CLI invocation is a short-lived process that creates ephemeral counters. Timeseries panels show cumulative totals — run several workflows to see the values increase over time. Metrics expire from Prometheus after 30 minutes, so run workflows periodically.
Circuit Breaker shows OPEN unexpectedly
If you ran 16_circuit_breaker_demo.yaml, it intentionally trips the circuit breaker with a non-existent model. The OPEN state persists in Prometheus for 30 minutes. Run a successful workflow against the same provider to reset it, or wait for the metric to expire.