Architecture¶
AgentLoom is a deterministic workflow orchestrator — not an autonomous agent framework. You define the DAG, the engine executes it.
Overview¶
+-----------------------------------------------------+
| CLI / Python API |
+-----------------------------------------------------+
| Workflow Engine |
| +-----------+ +-----------+ +---------------+ |
| |DAG Parser | | Scheduler | | State Manager | |
| |& Validator| | (anyio) | | (dict+lock) | |
| +-----------+ +-----------+ +---------------+ |
+-----------------------------------------------------+
| Step Executors |
| +--------+ +---------+ +------+ +------------+ |
| |LLM Call| |Tool Exec| |Router| | Subworkflow| |
| +--------+ +---------+ +------+ +------------+ |
+-----------------------------------------------------+
| Provider Gateway |
| +-----------------------------------------------+ |
| | OpenAI | Anthropic | Google | Ollama | |
| | + Fallback | Circuit Breaker | Rate Limiter | |
| +-----------------------------------------------+ |
+-----------------------------------------------------+
| Observability (optional) |
| +------------+ +----------+ +----------+ |
| | OTel Traces| |Prometheus| | JSON Logs| |
| +------------+ +----------+ +----------+ |
+-----------------------------------------------------+
Execution flow¶
The engine processes a workflow in five phases:
1. DAG construction¶
The parser reads the YAML (or Python DSL output) and builds a directed acyclic graph from step definitions and their depends_on fields. Cycle detection runs at parse time — circular dependencies are rejected before execution starts.
2. Layer extraction¶
Topological sort groups steps into execution layers. Each layer contains steps with no inter-dependencies, meaning they can run in parallel.
graph LR
A[Layer 0: fetch, validate] --> B[Layer 1: classify]
B --> C[Layer 2: route]
C --> D[Layer 3: handle_billing, handle_technical, handle_general]
3. Parallel execution¶
For each layer, the engine:
- Filters active steps (skips those blocked by router decisions)
- Launches concurrent tasks via
anyiotask groups (up tomax_concurrent_steps) - Waits for the layer to complete before advancing
4. Router handling¶
Router steps evaluate conditions against the current state and output a target step ID. Downstream steps not activated by the router are marked as skipped — they never execute.
5. Result aggregation¶
After all layers complete, the engine collects step results and computes totals (tokens, cost, duration). The final workflow status is SUCCESS if all steps passed, FAILED if any step failed, TIMEOUT if the deadline was exceeded, or BUDGET_EXCEEDED if the cost limit was hit.
Result models¶
WorkflowResult¶
Returned by engine.run() after workflow execution completes:
| Field | Type | Description |
|---|---|---|
workflow_name |
str |
Workflow identifier |
status |
WorkflowStatus |
success, failed, timeout, or budget_exceeded |
step_results |
dict[str, StepResult] |
All step outputs and metadata |
final_state |
dict[str, Any] |
State snapshot after completion |
total_duration_ms |
float |
Total wall-clock time |
total_tokens |
int |
Sum of all tokens consumed |
total_cost_usd |
float |
Estimated total cost |
error |
str | None |
First failure message, if any |
StepResult¶
Metadata and output for each individual step:
| Field | Type | Description |
|---|---|---|
step_id |
str |
Step identifier |
status |
StepStatus |
pending, running, success, failed, skipped, or timeout |
output |
Any |
Step output value |
error |
str | None |
Failure reason |
duration_ms |
float |
Step execution time |
token_usage |
TokenUsage |
Prompt, completion, and total tokens |
cost_usd |
float |
Estimated step cost |
model |
str | None |
Model used (LLM steps) |
provider |
str | None |
Provider name |
attachment_count |
int |
Multi-modal attachment count |
time_to_first_token_ms |
float | None |
TTFT for streaming steps |
State management¶
The StateManager provides async-safe shared state with dotted-key access:
- All steps in a workflow share the same state instance
- Steps with
output: keywrite their result tostate[key] - Templates use
{state.key}or{key}(flat namespace) for interpolation - Workflow state is stored as a plain dictionary and accessed/updated through
StateManager
Provider gateway¶
The gateway routes requests based on model name and manages resilience:
| Feature | Description |
|---|---|
| Fallback | Providers are tried in priority order. If one fails, the next is attempted automatically |
| Circuit breaker | After 5 consecutive failures, a provider is taken offline for 60 seconds |
| Rate limiter | Token-bucket algorithm enforces requests/minute and tokens/minute limits |
See Providers for provider-specific details.
Design principles¶
Why not autonomous agents?
Most LLM frameworks focus on autonomous agents with self-directed reasoning and unbounded tool loops. This works for demos but breaks in production where you need predictable costs, debuggable failures, and SLA compliance.
AgentLoom takes a different approach:
- You define the DAG, not the LLM. The model generates text within a step — it does not decide what runs next.
- Observability is not optional. Every step emits traces and metrics.
- Cost is bounded. Budget limits and circuit breakers are first-class.
- Fallback is structural. Provider routing happens at the infrastructure level, not via agent "choice."