Observability

Every workflow step emits OpenTelemetry traces and Prometheus metrics out of the box. No external SaaS required — the full stack runs alongside your workloads.

Quick start

# Start the observability stack
cd deploy && docker compose up -d

# Run a workflow (metrics are emitted automatically)
agentloom run examples/01_simple_qa.yaml

# Access dashboards
open http://localhost:3000    # Grafana (admin/admin)
open http://localhost:9090    # Prometheus
open http://localhost:16686   # Jaeger

Stack architecture

CLI (agentloom run)
  |
  +-- OTel SDK --> OTel Collector (:4317 gRPC)
  |                    |
  |                    +-- Prometheus exporter (:8889) --> Prometheus (:9090)
  |                    +-- OTLP exporter --> Jaeger (:16686)
  |
  +-- Observer --> MetricsManager --> OTel gauges / counters / histograms

Each CLI invocation is an ephemeral process that creates its own MeterProvider. The OTel Collector aggregates metrics and exposes them to Prometheus with a 30-minute expiration window (metric_expiration: 30m), so data remains visible after the CLI exits.

| Component | Port | Purpose |
| --- | --- | --- |
| OTel Collector | 4317 (gRPC), 8889 (Prometheus) | Receives OTel data, exports to Prometheus + Jaeger |
| Prometheus | 9090 | Time-series storage, PromQL queries |
| Jaeger | 16686 | Distributed tracing UI |
| Grafana | 3000 | Dashboard visualization |

Grafana dashboard

The dashboard is auto-provisioned at startup. Navigate to Dashboards > AgentLoom or go directly to http://localhost:3000/d/agentloom-main.

Dashboard variables

Use the dropdown selectors at the top of the dashboard to filter:

| Variable | Label | Default | Description |
| --- | --- | --- | --- |
| $workflow | Workflow | .* | Filter panels by workflow name |
| $provider | Provider | .* | Filter panels by provider |

Row 1 — Overview

Seven panels (six stat panels and one gauge) showing high-level aggregates:

| Panel | Type | What it shows |
| --- | --- | --- |
| Total Runs | stat | Total workflow executions across all workflows |
| Success Rate | gauge | Percentage of successful runs (green >= 95%, orange >= 80%, red < 80%) |
| Failed | stat | Count of failed workflow runs |
| Total Tokens | stat | Sum of all tokens consumed (input + output) |
| Est. Cost | stat | Estimated total cost in USD |
| Providers | stat | Number of distinct providers that have handled requests |
| Step Types | stat | Number of distinct step types executed |

Row 2 — Workflow Performance

| Panel | Type | What it shows |
| --- | --- | --- |
| Workflow Runs | timeseries | Cumulative success vs failed runs over time |
| Workflow Duration (p50/p95/p99) | timeseries | Latency percentiles per workflow |

Row 3 — Step Analysis

| Panel | Type | What it shows |
| --- | --- | --- |
| Step Executions | timeseries | Cumulative step success vs failure count over time |
| Step Duration (p50/p95/p99) | timeseries | Per-step-type latency percentiles |

Row 4 — Token Economics

| Panel | Type | What it shows |
| --- | --- | --- |
| Tokens | timeseries | Cumulative input vs output token count per provider/model |
| Tokens by Provider | bar gauge | Horizontal bars showing total tokens per provider/model |
| Token Split | pie chart | Ratio of prompt tokens to completion tokens |

Row 5 — Provider Performance

| Panel | Type | What it shows |
| --- | --- | --- |
| Provider Latency (p50/p95/p99) | timeseries | Per-provider response time percentiles |
| Provider Calls | timeseries | Cumulative call count per provider |

Row 6 — Provider Reliability

| Panel | Type | What it shows |
| --- | --- | --- |
| Provider Errors | timeseries | Cumulative error count per provider |
| Provider Availability | gauge | Uptime percentage over last hour (green >= 99%, orange >= 95%, red < 95%) |
| Circuit Breaker | stat | Current circuit breaker state per provider |

Circuit breaker state values

| Value | Label | Color | Meaning |
| --- | --- | --- | --- |
| 0 | CLOSED | Green | Normal: requests pass through |
| 1 | OPEN | Red | Tripped: requests rejected immediately |
| 2 | HALF-OPEN | Yellow | Recovery probe: one test request allowed |

State transitions: CLOSED -> OPEN after 5 consecutive failures. OPEN -> HALF-OPEN after 60s timeout. HALF-OPEN -> CLOSED on success, back to OPEN on failure.
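
The transitions above can be sketched as a small state machine. This is an illustrative model only, not AgentLoom's implementation: the class name, method names, and the injectable clock are assumptions, while the numeric state values, the 5-failure threshold, and the 60s timeout come from the text.

```python
import time
from enum import IntEnum


class BreakerState(IntEnum):
    CLOSED = 0     # normal: requests pass through
    OPEN = 1       # tripped: requests rejected immediately
    HALF_OPEN = 2  # recovery probe: one test request allowed


class CircuitBreakerSketch:
    """Hypothetical model of the documented transitions."""

    def __init__(self, failure_threshold=5, reset_timeout_s=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock  # injectable so the timeout can be simulated
        self._failures = 0
        self._opened_at = None
        self._probe_in_flight = False
        self.state = BreakerState.CLOSED

    def allow_request(self):
        if self.state is BreakerState.OPEN:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                return False
            # OPEN -> HALF_OPEN once the timeout has elapsed.
            self.state = BreakerState.HALF_OPEN
            self._probe_in_flight = False
        if self.state is BreakerState.HALF_OPEN:
            if self._probe_in_flight:
                return False  # only one probe request at a time
            self._probe_in_flight = True
            return True
        return True

    def record_success(self):
        # Any success closes the breaker and clears the failure streak.
        self._failures = 0
        self.state = BreakerState.CLOSED

    def record_failure(self):
        if self.state is BreakerState.HALF_OPEN:
            self._trip()  # failed probe: straight back to OPEN
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = BreakerState.OPEN
        self._opened_at = self._clock()
        self._failures = 0
```

The integer value of `state` is what a gauge such as agentloom_circuit_breaker_state would report under this model.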

Row 7 — Detailed Breakdown

| Panel | Type | What it shows |
| --- | --- | --- |
| Workflow Runs Summary | table | Per-workflow, per-status run counts |
| Provider Call Details | table | Per-provider, per-model call counts |

Row 8 — Cost Analysis

| Panel | Type | What it shows |
| --- | --- | --- |
| Cumulative Cost | timeseries | Running total cost in USD |
| Token Cost | timeseries | Estimated cost based on cumulative token volume per provider/model |
| Budget Remaining | timeseries | Remaining budget per workflow (green line) |
| Avg Cost/Run | stat | Mean cost per workflow execution |
| Avg Tokens/Run | stat | Mean token consumption per workflow execution |
| Avg Duration/Run | stat | Mean wall-clock time per workflow execution |
| Token Input/Output Ratio | stat | Ratio of output tokens to input tokens |

Row 9 — Multi-modal

| Panel | Type | What it shows |
| --- | --- | --- |
| Attachments | timeseries | Attachment count by type (image, pdf, audio) |

Row 10 — Streaming

| Panel | Type | What it shows |
| --- | --- | --- |
| Stream Responses | timeseries | Cumulative streaming response count |
| Time to First Token (p50/p95/p99) | timeseries | TTFT latency percentiles |

Metrics reference

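AgentLoom-specific metrics are prefixed with agentloom_ (for example, workflow_runs_total is exported as agentloom_workflow_runs_total; the workflow and step tables below omit the prefix). The canonical OTel GenAI histograms keep their gen_ai.* names.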

Workflow metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| workflow_runs_total | counter | workflow, status | Workflow execution count |
| workflow_duration_seconds | histogram | workflow | End-to-end workflow latency |
| cost_usd_total | counter | | Estimated USD cost |
| budget_remaining_usd | gauge | workflow | Remaining budget per workflow |
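
As a rough illustration of how these counters can be consumed outside Grafana, the sketch below builds a Prometheus instant-query URL and derives a success percentage from the JSON response. The helper names are hypothetical, and the assumption that the status label takes the value "success" is taken from the workflow status values listed elsewhere on this page rather than verified against the exporter.

```python
import json
from urllib.parse import urlencode

PROMETHEUS_URL = "http://localhost:9090"  # Prometheus port from the stack table


def instant_query_url(promql, base_url=PROMETHEUS_URL):
    """Build a URL for the Prometheus instant-query HTTP endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def success_rate(response_body):
    """Compute the success percentage from an instant-query JSON response.

    Assumes successful runs carry status="success". Returns None when the
    response contains no runs at all.
    """
    payload = json.loads(response_body)
    total = 0.0
    succeeded = 0.0
    for series in payload["data"]["result"]:
        # Each vector sample is [unix_timestamp, "value as a string"].
        value = float(series["value"][1])
        total += value
        if series["metric"].get("status") == "success":
            succeeded += value
    return None if total == 0 else 100.0 * succeeded / total
```

Fetch the URL returned by `instant_query_url("agentloom_workflow_runs_total")` with any HTTP client and pass the response body to `success_rate`.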

Step metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| step_executions_total | counter | step_type, status, stream | Step execution count |
| step_duration_seconds | histogram | step_type, stream | Per-step latency |

Provider metrics

AgentLoom-specific counters live alongside the canonical OTel GenAI client histogram. The histogram replaces the previous provider_latency_seconds — distributions are required by the spec.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| agentloom_provider_calls_total | counter | provider, model, stream | Provider API call count |
| gen_ai.client.operation.duration | histogram (s) | gen_ai.operation.name, gen_ai.provider.name, stream | OTel canonical operation duration |
| agentloom_provider_errors_total | counter | provider, error_type | Provider error count |
| agentloom_circuit_breaker_state | gauge | provider | Circuit breaker state (0/1/2) |

Token metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| gen_ai.client.token.usage | histogram ({token}) | gen_ai.operation.name, gen_ai.provider.name, gen_ai.request.model, gen_ai.token.type | OTel canonical per-call token observations. gen_ai.token.type is input / output / reasoning (reasoning is an AgentLoom extension to the spec's input/output enum) |
| agentloom_attachments_total | counter | step_type | Attachment count by step type |

Streaming metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| agentloom_stream_responses_total | counter | provider, model | Streaming response count (no OTel equivalent, so kept AgentLoom-specific) |
| gen_ai.client.operation.time_to_first_chunk | histogram (s) | gen_ai.operation.name, gen_ai.provider.name, gen_ai.request.model | OTel canonical streaming TTFT |

Span schema

AgentLoom emits a three-level span hierarchy: workflow → step → provider call (and tool call when a step invokes a registered tool). Every span / attribute / metric name is centralised in agentloom.observability.schema so downstream consumers (Grafana dashboards, AgentTest, Jaeger plugins) parse a stable contract.

Hierarchy

workflow:<workflow_name>                   # AgentLoom orchestration
└── step:<step_id>                         # AgentLoom orchestration
    └── chat <model>                       # OTel GenAI inference span — one per fallback attempt

A failed primary provider followed by a successful fallback shows up as two sibling chat <model> spans under the same step:* parent — useful for debugging fallback latency.

Attribute conventions

Inference spans follow the canonical OTel GenAI registry (May 2026 spec). Workflow / step orchestration spans use AgentLoom-specific names.

| Namespace | Source | Example |
| --- | --- | --- |
| gen_ai.* | OpenTelemetry GenAI registry (canonical names) | gen_ai.operation.name, gen_ai.provider.name, gen_ai.request.model, gen_ai.request.temperature, gen_ai.request.max_tokens, gen_ai.request.stream, gen_ai.response.model, gen_ai.response.finish_reasons (array), gen_ai.response.time_to_first_chunk, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.reasoning.output_tokens |
| error.* | OTel general semantic conventions | error.type (set on errored inference spans alongside step.error) |
| workflow.* / step.* | AgentLoom orchestration metadata | workflow.run_id, workflow.status, step.id, step.type, step.duration_ms, step.cost_usd |
| tool.* | Tool-call details | tool.name, tool.args_hash, tool.success |
| agentloom.* | AgentLoom-specific (no OTel equivalent) | agentloom.prompt.hash, agentloom.approval_gate.decision, agentloom.webhook.status, agentloom.recording.provider |

Workflow-level attributes

| Attribute | Description |
| --- | --- |
| workflow.name | Workflow identifier |
| workflow.run_id | Per-execution UUID, used to correlate Jaeger traces with checkpoints / external systems |
| workflow.status | Final status (success / failed / paused / budget_exceeded / timeout) |
| workflow.duration_ms | End-to-end execution time |
| workflow.total_tokens, workflow.total_cost_usd | Aggregates across all steps |

Step-level attributes

| Attribute | Description |
| --- | --- |
| step.id, step.type, step.status | Step identification |
| step.duration_ms, step.cost_usd | Per-step latency / spend |
| step.stream, step.attachments | Streaming flag, attachment count |
| gen_ai.operation.name, gen_ai.provider.name, gen_ai.request.model | Operation type (chat for llm_call), provider (e.g. openai, gcp.gemini), model |
| gen_ai.usage.input_tokens, gen_ai.usage.output_tokens | Visible token counts |
| gen_ai.usage.reasoning.output_tokens | Chain-of-thought tokens (o-series, Gemini 2.5+ thinking); emitted only when non-zero |
| gen_ai.response.finish_reasons | Array of provider-supplied stop reasons (e.g. ["stop"]) |
| gen_ai.response.time_to_first_chunk | Streaming-only, in seconds |
| agentloom.prompt.hash, agentloom.prompt.length_chars | Prompt fingerprint for correlating failures with the prompt that caused them |
| agentloom.prompt.template_id, agentloom.prompt.template_vars | Template provenance |

Inference-level attributes (provider span)

The chat <model> span — emitted once per fallback attempt by the gateway — carries the full set of GenAI inference attributes. A single step:* may have multiple sibling provider spans when fallback fires.

| Attribute | Description |
| --- | --- |
| gen_ai.operation.name | Always chat for llm_call; future operation types follow the OTel registry (embeddings, execute_tool, invoke_agent, …) |
| gen_ai.provider.name | Canonical OTel value (e.g. openai, anthropic, gcp.gemini) translated from AgentLoom's internal provider name |
| gen_ai.request.model, gen_ai.response.model | Requested model and the model the provider actually responded with (may differ when the provider auto-resolves a version, e.g. gpt-4o-mini → gpt-4o-mini-2024-07-18) |
| gen_ai.request.temperature, gen_ai.request.max_tokens | Sampling controls passed through to the provider |
| gen_ai.request.stream | true for streaming calls; distinguishes streaming vs non-streaming inference at the inference span level |
| gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.reasoning.output_tokens | Token counts |
| gen_ai.response.finish_reasons | Array of stop reasons |
| gen_ai.response.time_to_first_chunk | Streaming-only |
| error.type | Set on errored attempts (OTel general convention, alongside the AgentLoom-specific step.error) |
| agentloom.provider.attempt, agentloom.provider.attempt_outcome | Fallback attempt index (0-indexed) and outcome (ok / error); useful for debugging fallback behaviour |
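
To make the attempt attributes concrete, here is a minimal sketch that summarises the sibling provider spans of one step. It assumes the spans have already been exported as plain attribute dictionaries (how you obtain them from Jaeger is up to you), and the function name is hypothetical.

```python
def summarize_attempts(provider_spans):
    """Order one step's provider spans by attempt index and report the outcome.

    Each span is a dict of attributes keyed by the names documented above.
    """
    ordered = sorted(provider_spans, key=lambda s: s["agentloom.provider.attempt"])
    # The first attempt with outcome "ok" is the one that served the step.
    winner = next(
        (s for s in ordered if s["agentloom.provider.attempt_outcome"] == "ok"),
        None,
    )
    return {
        "attempts": len(ordered),
        "failed_providers": [
            s["gen_ai.provider.name"]
            for s in ordered
            if s["agentloom.provider.attempt_outcome"] == "error"
        ],
        "served_by": winner["gen_ai.provider.name"] if winner else None,
    }
```

A step whose primary provider failed and whose fallback succeeded yields two attempts, one failed provider, and the fallback in `served_by`.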

Capture flags

Full prompt content is not captured by default because of size and secret-leakage concerns. Set config.capture_prompts: true in the workflow YAML to opt in: each llm_call span then carries an agentloom.prompt.captured OTel event with the rendered prompt and system_prompt. Event payloads avoid the attribute-size cap and stay easy to filter at the OTel Collector. Enable capture only for debugging or in trusted environments.

When a redaction policy is configured (see State redaction), the captured copy is re-rendered against the redacted state — a flagged {state.api_key} becomes a <REDACTED:sha256=...> sentinel in the span event, while the request actually sent to the provider keeps the plaintext value. Same policy, same patterns, single source of truth: declare it once in state_schema: (or AGENTLOOM_REDACT_STATE_KEYS) and every persistence boundary — checkpoint, webhook body, span event — honours it.
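
The split between the provider request and the captured copy can be illustrated as follows. This is a sketch of the idea, not AgentLoom's redaction code: the function names are hypothetical, and truncating the digest to 12 hex characters is an assumption, since the sentinel's exact digest format is not shown here.

```python
import hashlib
from types import SimpleNamespace


def redact_state(state, flagged_keys):
    """Return a copy of state with flagged values replaced by a hash sentinel."""
    redacted = {}
    for key, value in state.items():
        if key in flagged_keys:
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            # Assumption: a shortened digest; the real sentinel may differ.
            redacted[key] = f"<REDACTED:sha256={digest[:12]}>"
        else:
            redacted[key] = value
    return redacted


def render(template, state):
    """Render a '{state.<key>}' template using str.format attribute lookup."""
    return template.format(state=SimpleNamespace(**state))


def render_for_provider_and_span(template, state, flagged_keys):
    """Render twice: plaintext for the provider, redacted for the span event."""
    sent = render(template, state)
    captured = render(template, redact_state(state, flagged_keys))
    return sent, captured
```

Because both renders share one template and one list of flagged keys, the captured copy cannot drift from the request except where a value was redacted.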

Provider name translation

AgentLoom's internal provider names map to the canonical OTel registry values: google → gcp.gemini, others (openai, anthropic, ollama) match the registry as-is. Custom values (ollama, mock) ride the spec's "vendor extension" allowance. The bundled Grafana dashboard queries Prometheus metrics (not span attributes), so dashboard panels are unaffected by attribute renames.
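
A translation of this kind amounts to a lookup with a pass-through default. The sketch below is illustrative: the table and function names are hypothetical, and only the google entry is taken from the text above.

```python
# Internal name -> canonical OTel gen_ai.provider.name value.
# Only the google entry comes from the documentation; the names are made up.
_OTEL_PROVIDER_NAMES = {"google": "gcp.gemini"}


def to_otel_provider_name(internal_name):
    """Translate an internal provider name, passing unknown names through."""
    return _OTEL_PROVIDER_NAMES.get(internal_name, internal_name)
```

Passing unmapped names through unchanged is what lets custom values such as mock appear on spans without a registry entry.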

Quality annotations

Workflow spans capture latency, tokens, and cost — but not output correctness. The WorkflowResult.annotate() API attaches post-hoc quality scores so evaluators or human reviewers can correlate execution performance with output quality.

result = await engine.run()
result.annotate("answer", quality_score=4.5, source="human_feedback", rubric="helpfulness")

Each annotation produces a standalone quality:<target> OTel span carrying:

Attribute Meaning
workflow.run_id The original run id — group quality spans with the run by joining here
workflow.name Workflow name for filtering
agentloom.quality.target The annotation target ("answer", "step:review", ...)
agentloom.quality.score Numeric score
agentloom.quality.source Producer ("human_feedback", "llm_judge", "regex", ...)
agentloom.quality.metadata.<key> Free-form metadata, flattened so each key is queryable in Jaeger

The workflow span is already closed by the time result.annotate() runs, so retroactive attribute attachment isn't possible — standalone spans keyed by workflow.run_id are the workaround. The engine wires its tracing context onto the result before returning, so result.annotate(...) auto-publishes the span the moment it's called — no extra plumbing required to see the annotation in Jaeger. Offline / replay scenarios that construct a WorkflowResult without a tracer fall back to data-only annotations; agentloom.observability.quality.emit_quality_annotations(result, tracing) is available for batch evaluators that need to push annotations through a tracer assembled later.

In Grafana / Jaeger, a query like workflow.run_id="<id>" AND name=~"quality:.*" lists every annotation attached to a run; agentloom.quality.score < 3 surfaces low-quality outputs across runs to diagnose regressions.
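
The same join can be done offline on exported spans. The sketch below assumes each span is a dictionary with a name and an attributes mapping (an export shape chosen for illustration, not a Jaeger API) and reports the lowest annotation score per run for runs that fall below a threshold.

```python
def low_quality_runs(spans, threshold=3.0):
    """Map workflow.run_id to its lowest quality score, for runs below threshold."""
    worst = {}
    for span in spans:
        # Quality annotations are standalone spans named quality:<target>.
        if not span["name"].startswith("quality:"):
            continue
        attrs = span["attributes"]
        run_id = attrs["workflow.run_id"]
        score = attrs["agentloom.quality.score"]
        worst[run_id] = min(score, worst.get(run_id, score))
    return {run_id: score for run_id, score in worst.items() if score < threshold}
```

Joining on workflow.run_id is what ties each annotation back to the workflow trace it describes.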

Per-run history records

The engine writes a JSON record to ./agentloom_runs/<run_id>.json after every workflow execution (success or failure). Records are intentionally small and self-contained so post-hoc debugging never requires replaying the workflow:

{
  "_schema_version": 1,
  "run_id": "abc123def456",
  "timestamp": "2026-05-02T18:34:18+00:00",
  "agentloom_version": "0.5.0",
  "python_version": "3.12.13",
  "workflow_name": "simple-qa",
  "workflow_hash": "sha256:...",
  "status": "success",
  "providers_used": ["openai/gpt-4o-mini"],
  "total_cost_usd": 0.012,
  "total_tokens": 320,
  "steps_executed": 5,
  "duration_ms": 3200,
  "error": null
}

Override the directory via the AGENTLOOM_RUNS_DIR env var or the runs_dir argument on RunHistoryWriter. Disk I/O happens in a worker thread so the write doesn't block the event loop, and any failure (broken directory, permissions) is logged at debug and swallowed — history is best-effort, never load-bearing.
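
The best-effort write pattern described above might look roughly like this. It is a sketch under assumptions: it uses asyncio.to_thread from the standard library rather than whatever threading helper the engine actually uses, the function names are hypothetical, and the precedence of the explicit argument over the environment variable is assumed.

```python
import asyncio
import json
import logging
import os
from pathlib import Path

logger = logging.getLogger("history_sketch")


def resolve_runs_dir(runs_dir=None):
    """Pick the output directory.

    Assumed precedence: explicit argument, then AGENTLOOM_RUNS_DIR, then the
    documented default ./agentloom_runs.
    """
    return Path(runs_dir or os.environ.get("AGENTLOOM_RUNS_DIR") or "./agentloom_runs")


def _write_sync(record, directory):
    """Blocking write of one record to <directory>/<run_id>.json."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{record['run_id']}.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path


async def write_record_best_effort(record, runs_dir=None):
    """Write in a worker thread; log at debug and swallow any failure."""
    try:
        return await asyncio.to_thread(_write_sync, record, resolve_runs_dir(runs_dir))
    except Exception as exc:  # history must never fail the workflow
        logger.debug("run history write failed: %s", exc)
        return None
```

Returning None on failure keeps the caller's control flow identical whether or not the record reached disk.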

Inspect records via the CLI:

agentloom history                                  # most recent 20 runs, table format
agentloom history --workflow simple-qa             # filter by workflow
agentloom history --provider openai                # filter by provider prefix
agentloom history --since 2026-05-01               # date filter (UTC midnight anchor)
agentloom history --since 2026-05-01 --until 2026-05-02   # date range
agentloom history --min-cost 0.10                  # cost filters
agentloom history --max-cost 1.00
agentloom history --json                           # machine-readable

--since / --until accept YYYY-MM-DD (anchored at UTC midnight) or full ISO 8601. --min-cost / --max-cost operate on total_cost_usd. Filters compose, so --workflow simple-qa --since 2026-05-01 --max-cost 0.10 lists every cheap run of simple-qa since May 1st. The table columns (TIMESTAMP, RUN ID, WORKFLOW, STATUS, COST USD, DUR MS) are stable — downstream grep / awk scripts can rely on the layout. agentloom history is distinct from agentloom runs: runs lists checkpointed-resumable executions from the configured checkpointer, while history lists every execution regardless of checkpointing.
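
Because the records are plain JSON, the same filters are easy to reproduce in a script. The sketch below mirrors the documented semantics (date-only bounds anchored at UTC midnight, provider matched by prefix, filters combined with AND); the function names are hypothetical, and treating the bounds as inclusive and naive timestamps as UTC are assumptions.

```python
from datetime import datetime, timezone


def parse_bound(text):
    """Parse YYYY-MM-DD (UTC midnight) or a full ISO 8601 timestamp."""
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def filter_records(records, workflow=None, provider=None, since=None,
                   until=None, min_cost=None, max_cost=None):
    """Keep the records that satisfy every supplied filter."""
    since_dt = parse_bound(since) if since else None
    until_dt = parse_bound(until) if until else None
    kept = []
    for rec in records:
        ts = parse_bound(rec["timestamp"])
        cost = rec["total_cost_usd"]
        if workflow is not None and rec["workflow_name"] != workflow:
            continue
        if provider is not None and not any(
            p.startswith(provider) for p in rec["providers_used"]
        ):
            continue
        if since_dt is not None and ts < since_dt:
            continue
        if until_dt is not None and ts > until_dt:
            continue
        if min_cost is not None and cost < min_cost:
            continue
        if max_cost is not None and cost > max_cost:
            continue
        kept.append(rec)
    return kept
```

Load the records with `json.loads` from each file in the runs directory and pass the resulting list to `filter_records`.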


Budget enforcement

The engine routes all workflow spend through BudgetEnforcer (agentloom.resilience.budget). The enforcer carries an anyio.Lock, so concurrent step completions in a parallel layer can't race past the limit by reading the same _spent value and both adding their cost. Two surfaces matter:

  • Pre-dispatch gate (estimate(0)) — before launching any step, the engine reads spent under the lock and refuses to start if the budget is already exhausted. This bounds the worst-case overshoot to the in-flight set of a single layer rather than letting it compound across layers.
  • Post-completion charge (charge(cost_usd)) — when a step succeeds, the engine adds the actual cost inside the lock and raises BudgetExceededError if the post-charge total is over. The exception propagates through the task group to the engine's terminal classifier, which surfaces WorkflowStatus.BUDGET_EXCEEDED.
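
The two surfaces above can be sketched with a lock-guarded counter. This is not the BudgetEnforcer source: it swaps in asyncio.Lock from the standard library for the anyio.Lock mentioned above, and the class and method names are stand-ins chosen for the example.

```python
import asyncio


class BudgetExceededError(Exception):
    """Stand-in for AgentLoom's exception of the same name."""


class BudgetEnforcerSketch:
    """Hypothetical model of the gate-then-charge pattern."""

    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self._spent = 0.0
        self._lock = asyncio.Lock()

    async def can_dispatch(self):
        """Pre-dispatch gate: refuse to start once the budget is exhausted."""
        async with self._lock:
            return self._spent < self.limit_usd

    async def charge(self, cost_usd):
        """Post-completion charge: add the actual cost, then check the limit."""
        async with self._lock:
            # Read-modify-write happens under the lock, so two concurrent
            # completions cannot both observe the same stale total.
            self._spent += cost_usd
            if self._spent > self.limit_usd:
                raise BudgetExceededError(
                    f"spent {self._spent:.4f} USD of {self.limit_usd:.4f} USD"
                )
            return self._spent
```

With a $0.10 limit and two parallel steps costing $0.06 each, both pass the gate, the second charge raises, and the overshoot stays bounded by that one layer.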

Cross-subworkflow accounting

When a parent workflow launches a subworkflow step:

| Parent budget | Child budget | Behaviour |
| --- | --- | --- |
| Set ($0.10) | None | Child engine inherits the parent's enforcer. Child charges count against the parent's _spent, so the parent's gate trips at the right step. |
| Set ($0.10) | Set ($0.05) | Child uses a fresh enforcer scoped to its own limit. Pre-0.5.0 behaviour preserved. The parent's per-step accounting still adds the rolled-up subworkflow cost to the parent counter. |
| None | Set ($0.05) | Child enforces its own budget; parent has no limit to enforce. |
| None | None | No enforcement at either level. |
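
The four rows reduce to one selection rule plus a roll-up step. The sketch below models that rule with a bare dataclass; the names are hypothetical and the locking shown in the enforcer is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Enforcer:
    """Minimal stand-in: a limit and a running total, no locking."""
    limit_usd: float
    spent: float = 0.0


def enforcer_for_child(parent: Optional[Enforcer],
                       child_limit: Optional[float]) -> Optional[Enforcer]:
    """Choose the enforcer a subworkflow runs under."""
    if child_limit is not None:
        return Enforcer(child_limit)  # fresh enforcer scoped to the child
    return parent  # inherit the parent's enforcer, or None for no enforcement


def record_child_cost(parent: Optional[Enforcer],
                      child: Optional[Enforcer], cost_usd: float) -> None:
    """Charge the child, and roll the cost up to a separate parent once."""
    if child is not None:
        child.spent += cost_usd
    if parent is not None and parent is not child:
        parent.spent += cost_usd
```

When the child inherits the parent's enforcer the two references are the same object, so the cost is counted exactly once.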

Pause-over-budget precedence

When an approval_gate and a budget-blowing LLM step land in the same layer, pause wins. Pre-0.5.0 budget short-circuited the pause: the LLM step completed (spending the money), the pause was dropped, and the workflow ended budget_exceeded with no resumable checkpoint at the gate. The reversed precedence preserves both options — the human can --approve or --reject, and the workflow re-evaluates budget on resume. A workflow that resumes with budget already exhausted will then surface BudgetExceededError on the next dispatch (the user explicitly chose to look at the pause first).
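
As a sketch of the precedence rule, the function below picks a layer's terminal status from the outcomes of its steps. The function name is hypothetical, the status strings follow the workflow.status values listed on this page, and the position of failed in the ordering is an assumption; only the rule that paused outranks budget_exceeded comes from the text.

```python
def classify_layer(step_outcomes):
    """Pick a layer's terminal status; a pause outranks a budget overrun."""
    if "paused" in step_outcomes:
        return "paused"  # keep the resumable checkpoint at the gate
    if "budget_exceeded" in step_outcomes:
        return "budget_exceeded"
    if "failed" in step_outcomes:  # assumed ordering, not from the docs
        return "failed"
    return "success"
```

Under this rule a layer containing both an approval gate and an over-budget step ends as paused, and the budget is re-evaluated when the run resumes.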


Troubleshooting

Panels show 'No data'

Metrics are ephemeral — each CLI run exports data, then the process exits. The OTel Collector retains metrics for 30 minutes (metric_expiration).

# Verify metrics reach Prometheus
curl -s 'http://localhost:9090/api/v1/query?query=agentloom_workflow_runs_total'

# Run a workflow to generate fresh data
agentloom run examples/01_simple_qa.yaml

# Run the circuit breaker demo for reliability panels
agentloom run examples/16_circuit_breaker_demo.yaml

Timeseries panels show flat lines

Each CLI invocation is a short-lived process that creates ephemeral counters. Timeseries panels show cumulative totals, so run several workflows to see the values increase over time. The OTel Collector stops exposing a series 30 minutes after its last update (metric_expiration), so run workflows periodically to keep the panels populated.

Circuit Breaker shows OPEN unexpectedly

If you ran 16_circuit_breaker_demo.yaml, it intentionally trips the circuit breaker with a non-existent model. The OTel Collector keeps exposing the OPEN state to Prometheus for 30 minutes. Run a successful workflow against the same provider to reset it, or wait for the metric to expire.

Traces not appearing in Jaeger

Check that the OTel Collector is running and accepting gRPC on port 4317:

docker compose ps
docker compose logs otel-collector