
# Observability

Every workflow step emits OpenTelemetry traces and Prometheus metrics out of the box. No external SaaS required — the full stack runs alongside your workloads.

## Quick start

```bash
# Start the observability stack
cd deploy && docker compose up -d

# Run a workflow (metrics are emitted automatically)
agentloom run examples/01_simple_qa.yaml

# Access dashboards
open http://localhost:3000    # Grafana (admin/admin)
open http://localhost:9090    # Prometheus
open http://localhost:16686   # Jaeger
```

## Stack architecture

```
CLI (agentloom run)
  |
  +-- OTel SDK --> OTel Collector (:4317 gRPC)
  |                    |
  |                    +-- Prometheus exporter (:8889) --> Prometheus (:9090)
  |                    +-- OTLP exporter --> Jaeger (:16686)
  |
  +-- Observer --> MetricsManager --> OTel gauges / counters / histograms
```

Each CLI invocation is an ephemeral process that creates its own `MeterProvider`. The OTel Collector aggregates metrics and exposes them to Prometheus with a 30-minute expiration window (`metric_expiration: 30m`), so data remains visible after the CLI exits.
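A collector pipeline matching this architecture might look like the following. This is an illustrative sketch using the documented ports and the `metric_expiration` setting; the actual config file shipped in `deploy/` may differ in exporter names and details:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # CLI's OTel SDK sends here

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889       # scraped by Prometheus
    metric_expiration: 30m       # keep series visible after the CLI exits
  otlp/jaeger:
    endpoint: jaeger:4317        # Jaeger accepts OTLP natively
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]
```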

| Component | Port(s) | Purpose |
|---|---|---|
| OTel Collector | 4317 (gRPC), 8889 (Prometheus) | Receives OTel data, exports to Prometheus + Jaeger |
| Prometheus | 9090 | Time-series storage, PromQL queries |
| Jaeger | 16686 | Distributed tracing UI |
| Grafana | 3000 | Dashboard visualization |

## Grafana dashboard

The dashboard is auto-provisioned at startup. Navigate to Dashboards > AgentLoom or go directly to http://localhost:3000/d/agentloom-main.

### Dashboard variables

Use the dropdown selectors at the top of the dashboard to filter:

| Variable | Label | Default | Description |
|---|---|---|---|
| `$workflow` | Workflow | `.*` | Filter panels by workflow name |
| `$provider` | Provider | `.*` | Filter panels by provider |
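Panel queries interpolate these variables as regex label matchers, which is why the default `.*` matches everything. A typical query shape (illustrative; the exact panel queries may differ):

```promql
sum by (workflow) (
  agentloom_workflow_runs_total{workflow=~"$workflow"}
)
```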

### Row 1 — Overview

Seven panels showing high-level aggregates:

| Panel | Type | What it shows |
|---|---|---|
| Total Runs | stat | Total workflow executions across all workflows |
| Success Rate | gauge | Percentage of successful runs (green >= 95%, orange >= 80%, red < 80%) |
| Failed | stat | Count of failed workflow runs |
| Total Tokens | stat | Sum of all tokens consumed (input + output) |
| Est. Cost | stat | Estimated total cost in USD |
| Providers | stat | Number of distinct providers that have handled requests |
| Step Types | stat | Number of distinct step types executed |
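The Success Rate gauge is a ratio of two counters. A query along these lines would produce it (illustrative sketch; `status="success"` is an assumed label value, inferred from the `status` label documented in the metrics reference):

```promql
100 * sum(agentloom_workflow_runs_total{status="success"})
    / sum(agentloom_workflow_runs_total)
```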

### Row 2 — Workflow Performance

| Panel | Type | What it shows |
|---|---|---|
| Workflow Runs | timeseries | Cumulative success vs failed runs over time |
| Workflow Duration (p50/p95/p99) | timeseries | Latency percentiles per workflow |

### Row 3 — Step Analysis

| Panel | Type | What it shows |
|---|---|---|
| Step Executions | timeseries | Cumulative step success vs failure count over time |
| Step Duration (p50/p95/p99) | timeseries | Per-step-type latency percentiles |

### Row 4 — Token Economics

| Panel | Type | What it shows |
|---|---|---|
| Tokens | timeseries | Cumulative input vs output token count per provider/model |
| Tokens by Provider | bar gauge | Horizontal bars showing total tokens per provider/model |
| Token Split | pie chart | Ratio of prompt tokens to completion tokens |

### Row 5 — Provider Performance

| Panel | Type | What it shows |
|---|---|---|
| Provider Latency (p50/p95/p99) | timeseries | Per-provider response time percentiles |
| Provider Calls | timeseries | Cumulative call count per provider |

### Row 6 — Provider Reliability

| Panel | Type | What it shows |
|---|---|---|
| Provider Errors | timeseries | Cumulative error count per provider |
| Provider Availability | gauge | Uptime percentage over last hour (green >= 99%, orange >= 95%, red < 95%) |
| Circuit Breaker | stat | Current circuit breaker state per provider |

#### Circuit breaker state values

| Value | Label | Color | Meaning |
|---|---|---|---|
| 0 | CLOSED | Green | Normal — requests pass through |
| 1 | OPEN | Red | Tripped — requests rejected immediately |
| 2 | HALF-OPEN | Yellow | Recovery probe — one test request allowed |

State transitions: CLOSED -> OPEN after 5 consecutive failures. OPEN -> HALF-OPEN after 60s timeout. HALF-OPEN -> CLOSED on success, back to OPEN on failure.
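These transitions can be sketched as a small state machine. The following is an illustrative Python sketch using the documented defaults (5 consecutive failures, 60 s reset timeout), not the actual AgentLoom implementation:

```python
import time


class CircuitBreaker:
    """Minimal sketch of the state machine described above."""

    # Numeric values match what the circuit_breaker_state gauge exports.
    CLOSED, OPEN, HALF_OPEN = 0, 1, 2

    def __init__(self, failure_threshold=5, reset_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable for testing
        self.state = self.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self):
        """Return True if a request may be attempted right now."""
        if self.state == self.OPEN:
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = self.HALF_OPEN  # allow one recovery probe
                return True
            return False  # OPEN: reject immediately
        return True  # CLOSED and HALF-OPEN both let the call through

    def record_success(self):
        self.state = self.CLOSED  # HALF-OPEN probe succeeded (or normal call)
        self.failures = 0

    def record_failure(self):
        if self.state == self.HALF_OPEN:
            self.state = self.OPEN  # probe failed: trip again
            self.opened_at = self.clock()
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state = self.OPEN  # consecutive-failure threshold reached
            self.opened_at = self.clock()
```

Note that `record_success` resets the consecutive-failure count, so only an unbroken run of failures can trip the breaker.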

### Row 7 — Detailed Breakdown

| Panel | Type | What it shows |
|---|---|---|
| Workflow Runs Summary | table | Per-workflow, per-status run counts |
| Provider Call Details | table | Per-provider, per-model call counts |

### Row 8 — Cost Analysis

| Panel | Type | What it shows |
|---|---|---|
| Cumulative Cost | timeseries | Running total cost in USD |
| Token Cost | timeseries | Estimated cost based on cumulative token volume per provider/model |
| Budget Remaining | timeseries | Remaining budget per workflow (green line) |
| Avg Cost/Run | stat | Mean cost per workflow execution |
| Avg Tokens/Run | stat | Mean token consumption per workflow execution |
| Avg Duration/Run | stat | Mean wall-clock time per workflow execution |
| Token Input/Output Ratio | stat | Ratio of output tokens to input tokens |
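The per-run averages divide one aggregate by another. For example, Avg Cost/Run could be computed with a query like this (illustrative sketch, not necessarily the exact panel query):

```promql
sum(agentloom_cost_usd_total) / sum(agentloom_workflow_runs_total)
```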

### Row 9 — Multi-modal

| Panel | Type | What it shows |
|---|---|---|
| Attachments | timeseries | Attachment count by type (image, pdf, audio) |

### Row 10 — Streaming

| Panel | Type | What it shows |
|---|---|---|
| Stream Responses | timeseries | Cumulative streaming response count |
| Time to First Token (p50/p95/p99) | timeseries | TTFT latency percentiles |

## Metrics reference

All metrics are prefixed with `agentloom_`.

### Workflow metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `workflow_runs_total` | counter | `workflow`, `status` | Workflow execution count |
| `workflow_duration_seconds` | histogram | `workflow` | End-to-end workflow latency |
| `cost_usd_total` | counter | | Estimated USD cost |
| `budget_remaining_usd` | gauge | `workflow` | Remaining budget per workflow |
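Histogram metrics expose `_bucket` series that Prometheus can turn into percentiles with `histogram_quantile`. For example, a p95 workflow-latency query along the lines the dashboard uses (illustrative sketch):

```promql
histogram_quantile(
  0.95,
  sum by (le, workflow) (rate(agentloom_workflow_duration_seconds_bucket[5m]))
)
```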

### Step metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `step_executions_total` | counter | `step_type`, `status`, `stream` | Step execution count |
| `step_duration_seconds` | histogram | `step_type`, `stream` | Per-step latency |

### Provider metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `provider_calls_total` | counter | `provider`, `model`, `stream` | Provider API call count |
| `provider_latency_seconds` | histogram | `provider`, `model`, `stream` | Provider response latency |
| `provider_errors_total` | counter | `provider`, `error_type` | Provider error count |
| `circuit_breaker_state` | gauge | `provider` | Circuit breaker state (0/1/2) |

### Token metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `tokens_total` | counter | `provider`, `model`, `direction` | Token usage (input/output) |
| `attachments_total` | counter | `step_type` | Attachment count by step type |

### Streaming metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `stream_responses_total` | counter | `provider`, `model` | Streaming response count |
| `time_to_first_token_seconds` | histogram | `provider`, `model` | Time-to-first-token latency |

## Troubleshooting

### Panels show 'No data'

Metrics are ephemeral — each CLI run exports data, then the process exits. The OTel Collector retains metrics for 30 minutes (`metric_expiration`).

```bash
# Verify metrics reach Prometheus
curl -s 'http://localhost:9090/api/v1/query?query=agentloom_workflow_runs_total'

# Run a workflow to generate fresh data
agentloom run examples/01_simple_qa.yaml

# Run the circuit breaker demo for reliability panels
agentloom run examples/16_circuit_breaker_demo.yaml
```

### Timeseries panels show flat lines

Each CLI invocation is a short-lived process that creates ephemeral counters. Timeseries panels show cumulative totals — run several workflows to see the values increase over time. Metrics expire from Prometheus after 30 minutes, so run workflows periodically.

### Circuit Breaker shows OPEN unexpectedly

If you ran `16_circuit_breaker_demo.yaml`, it intentionally trips the circuit breaker with a non-existent model. The OPEN state persists in Prometheus for 30 minutes. Run a successful workflow against the same provider to reset it, or wait for the metric to expire.

### Traces not appearing in Jaeger

Check that the OTel Collector is running and accepting gRPC on port 4317:

```bash
docker compose ps
docker compose logs otel-collector
```