Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
Unreleased
Added
- Native tool / function calling across providers (#116). New `tools`, `tool_choice`, and `max_tool_iterations` fields on `llm_call`; the LLM step dispatches via the existing `ToolRegistry`, feeds results back, and re-prompts until the model stops asking (capped by `max_tool_iterations`, default 5). Each adapter translates the unified `ToolDefinition` list to its native shape (OpenAI / Ollama / Anthropic / Google). Parallel tool calls dispatch concurrently and preserve order; failures are reported back as text so the model can recover. Sandbox (#105), budget (#108), and per-step retry (#106) apply unchanged. `MockProvider` recordings accept a list of turns per step so offline replay drives the loop end-to-end; `examples/35_tool_calling.yaml` ships a ReAct-style example. The OpenAI-shaped parser also handles Ollama-compat responses (no `type` field, `arguments` as a decoded dict); without this, Ollama tool calling silently dropped every call.
- Per-run experiment metadata logging (#77). Every workflow execution now writes a self-contained JSON record (`run_id`, ISO timestamp, AgentLoom version, Python version, workflow `sha256` hash, list of `provider/model` pairs used, status, total cost, total tokens, step count, duration) to `./agentloom_runs/<run_id>.json`. Override the directory via the `runs_dir` constructor argument on `RunHistoryWriter` or the `AGENTLOOM_RUNS_DIR` env var. Disk I/O happens in a worker thread so the write does not block the event loop. Records carry a `_schema_version: 1` field; failures during the write are logged and swallowed so a broken history directory cannot prevent the engine from returning the result. New `agentloom history` CLI subcommand lists records most-recent-first and accepts `--workflow`, `--provider`, `--since YYYY-MM-DD`, `--until YYYY-MM-DD`, `--min-cost`, `--max-cost`, `--limit`, and `--json` filters, covering the full filter surface (date, workflow, cost, provider) called for in the original issue.
- Quality annotations attachable to
`WorkflowResult` (#59). New `WorkflowResult.annotate(target, quality_score=..., source=..., **metadata)` method appends a typed `QualityAnnotation` (`target`, `quality_score`, `source`, `metadata`) to the result so evaluators, human reviewers, or downstream scoring code can record output quality after the run completes. The annotation is auto-emitted as an OTel span the moment `annotate()` runs whenever the engine returned the result with a tracing context attached (the default for any workflow run with observability enabled): `result.annotate("answer", quality_score=4.5, source="human_feedback")` becomes immediately visible in Jaeger with no additional plumbing on the caller side. Each annotation is published as a standalone `quality:<target>` span (the workflow span has already closed, so retroactive attribute attachment is not viable). Quality spans carry `workflow.run_id` and `workflow.name` plus `agentloom.quality.score`, `agentloom.quality.source`, `agentloom.quality.target`, and free-form `agentloom.quality.metadata.*` attributes; Jaeger / Tempo can group quality spans with the original trace by `run_id`, and dashboards can filter for `agentloom.quality.score < threshold` to surface regressions. Offline / replay paths that build a `WorkflowResult` without a live tracer keep working: `annotate()` still records the data on the result, and the OTel emission just no-ops. The `agentloom.observability.quality.emit_quality_annotation` / `emit_quality_annotations` helpers remain available for callers that build annotations outside the engine flow (e.g. batch evaluators reading historical results from disk).
- OTel span and metric schema centralization with GenAI semantic conventions (#125). The schema is a clean break: no compatibility shims for pre-#125 attribute or metric names. New
`agentloom.observability.schema` module is the single source of truth for span / attribute / metric names; downstream consumers (Grafana, AgentTest, Jaeger plugins) parse a stable contract instead of grepping for ad-hoc strings. Metrics renamed and retyped to match the OTel GenAI registry: `agentloom_tokens_total` (counter) → `gen_ai.client.token.usage` (histogram, `{token}` unit) with `gen_ai.token.type` attribute (`input` / `output` / `reasoning`); `agentloom_provider_latency_seconds` (histogram) → `gen_ai.client.operation.duration` (histogram, `s`) with `gen_ai.operation.name` + `gen_ai.provider.name` attributes; `agentloom_time_to_first_token_seconds` → `gen_ai.client.operation.time_to_first_chunk`. AgentLoom-specific metrics (`agentloom_workflow_*`, `agentloom_step_*`, `agentloom_provider_calls_total`, `agentloom_cost_usd_total`, `agentloom_circuit_breaker_state`, `agentloom_budget_remaining_usd`, HITL / mock / recording counters) keep their `agentloom_` prefix because they have no OTel equivalent. The bundled Grafana dashboard is updated to query the new metric / label names. The legacy `Observer.on_provider_call` hook (which duplicated the metric emission already done by `on_provider_call_end`) is removed; the engine no longer fires it. The `tokens: int` positional argument on `on_step_end` is removed; callers now pass `prompt_tokens` / `completion_tokens` as kwargs. Span attributes follow the canonical OTel GenAI registry as of the May 2026 spec: `gen_ai.provider.name` (the deprecated `gen_ai.system` is not emitted), `gen_ai.operation.name`, `gen_ai.request.model`, `gen_ai.request.temperature`, `gen_ai.request.max_tokens`, `gen_ai.request.stream`, `gen_ai.response.model`, `gen_ai.response.finish_reasons` (array of strings, per spec), `gen_ai.response.time_to_first_chunk`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.usage.reasoning.output_tokens`. Errored inference spans also emit the OTel general-conventions attribute `error.type` alongside the AgentLoom-specific `step.error` so OTel-aware consumers (Jaeger error filters, Tempo) light up. Inference spans use the canonical name template `"{operation_name} {model}"` (e.g. `"chat gpt-4o-mini"`); workflow / step orchestration spans keep the AgentLoom-specific `workflow:*` / `step:*` names. AgentLoom-specific fields stay under `workflow.*` / `step.*` / `tool.*` / `agentloom.*` namespaces. Provider names are translated from AgentLoom internal names to OTel registry values via `to_genai_provider_name` (e.g. `google` → `gcp.gemini`). Notable additions:
- Provider-level child spans: gateway emits one
`provider:<name>` span per fallback attempt nested under the parent step span, on both `complete()` and `stream()` paths, so the latency split between LLM API time and step orchestration overhead (and across fallback attempts) is visible in Jaeger.
- Prompt metadata capture: `agentloom.prompt.hash`, `agentloom.prompt.length_chars`, `agentloom.prompt.template_id`, `agentloom.prompt.template_vars` land on `llm_call` spans by default. Full prompt content stays off; opt in via `config.capture_prompts: true` in the workflow YAML for debugging or trusted environments. Captured prompts ride as an `agentloom.prompt.captured` span event (not an attribute, to avoid blowing the attribute-size budget).
- `workflow.run_id` propagation: every workflow / step / provider span inherits the per-execution UUID, so external systems can correlate traces with checkpoints.
- Reasoning tokens span attribute: emitted under `gen_ai.usage.reasoning.output_tokens` (the canonical OTel registry name, available since the spec added it for o-series / Claude thinking / Gemini thinking). The earlier `step.reasoning_tokens` (#127) and the transitional `agentloom.gen_ai.usage.reasoning_tokens` are gone.
- `MetricName` drift detection: the schema's `MetricName` constants now match exactly what `metrics.py` emits, and a regression test (`test_metric_names_match_emissions`) scans the source for `agentloom_*` literals so any drift becomes a CI failure rather than silent dashboard breakage.
- Engine collaborator protocols and shared retry primitives (#112). New
`agentloom.core.protocols` module exposes `StateManagerProtocol`, `GatewayProtocol`, `ToolRegistryProtocol`, `ObserverProtocol`, `CheckpointerProtocol`, and `StreamCallbackProtocol` as `@runtime_checkable` `typing.Protocol` shapes. `StepContext` collaborator fields stay typed as `Any` plus a comment naming the protocol (the `checkpointer` field stays on the concrete `BaseCheckpointer` carried over from #111 to keep Pydantic v2 forward-ref resolution happy). Call sites that import the protocols still get type-checker coverage. `agentloom.resilience` re-exports `compute_backoff`, `is_retryable_exception`, and `DEFAULT_RETRYABLE_STATUS_CODES` so the engine and `retry_with_policy` share a single source of truth for retry waveforms and retryability rules.
- `RetryConfig.retryable_status_codes` (#112): workflow YAML field, default `[429, 500, 502, 503, 504]`. Step retries now bail out immediately on permanent failures (4xx not in the list) instead of consuming the retry budget; status-less exceptions (network errors, generic provider failures) are still retried as transient. Configurable per step under `retry.retryable_status_codes`.
- Reasoning / extended-thinking tracking across providers (#127).
`TokenUsage` gains a `reasoning_tokens` field plus a `billable_completion_tokens` property; `ProviderResponse` gains a `reasoning_content` field for providers that expose the chain-of-thought trace. A new `ThinkingConfig` submodel on `StepDefinition.thinking` lets workflow authors activate provider-side reasoning from YAML (`enabled`, `budget_tokens`, `level`, `capture_reasoning`); the step layer forwards the config object under a single `thinking_config` kwarg and each adapter translates it to its own request shape. Token metrics emit a third `direction="reasoning"` observation when reasoning tokens are non-zero, and the step span carries the canonical OTel `gen_ai.usage.reasoning.output_tokens` attribute (renamed from the original `step.reasoning_tokens` via the OTel GenAI alignment in #125). Per-provider coverage:
- OpenAI o-series: parses `completion_tokens_details.reasoning_tokens` from chat-completion and streaming responses; o1 / o3 / o4-mini infer reasoning from the model name (`thinking_config` is accepted for YAML uniformity but not forwarded to the wire).
- Anthropic: concatenates `type="thinking"` content blocks into `reasoning_content`; `ThinkingConfig` translates to the `thinking: {type: "enabled", budget_tokens}` payload, and an explicit raw `thinking={...}` kwarg still wins for advanced callers. Documented limitation: Anthropic rolls thinking tokens into `output_tokens` rather than exposing a separate field, so `reasoning_tokens` stays `0` for this provider. Cost is automatically correct because `output_tokens` already includes the thinking volume, but the visible-vs-thinking split is unavailable from the wire.
- Google Gemini 2.5+: parses `usageMetadata.thoughtsTokenCount` (defensive against the field's intermittent absence on `gemini-3-flash-preview`); `ThinkingConfig` translates to `generationConfig.thinkingConfig` (`thinkingBudget`, `thinkingLevel`, `includeThoughts`); thought summary parts (`thought=true`) are split into `reasoning_content`, keeping `content` clean.
- Ollama 0.9+: parses `message.thinking` into `reasoning_content`, with a regex fallback that strips legacy inline `<think>...</think>` tags from `content`. `ThinkingConfig` translates to the top-level `think` request param (`<level>` if set, else `true`). Documented limitation: Ollama does not split `eval_count` between thinking and visible tokens, so `reasoning_tokens` stays `0` for this provider; cost is unaffected (local models are free).
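
The Ollama fallback for legacy inline `<think>...</think>` tags can be illustrated with a minimal sketch. The function name `split_inline_thinking` and the exact pattern are assumptions for illustration, not the adapter's actual code; the sketch only shows the general idea of moving tagged text into a separate reasoning field.

```python
import re

# Hypothetical pattern: one inline <think>...</think> block, spanning newlines.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_inline_thinking(content: str) -> tuple[str, str]:
    """Return (visible_content, reasoning_content) for a legacy inline response."""
    # Collect the text of every tagged block, then remove the blocks from the content.
    reasoning = "\n".join(block.strip() for block in _THINK_RE.findall(content))
    visible = _THINK_RE.sub("", content).strip()
    return visible, reasoning


visible, reasoning = split_inline_thinking(
    "<think>User wants a sum.\n2 + 2 = 4.</think>\nThe answer is 4."
)
print(visible)
print(reasoning)
```

A response without tags passes through unchanged, with an empty reasoning string.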
Changed
- `DAG.topological_sort` switched from a sorted-list-as-priority-queue (O(V² log V)) to `heapq.heapify` + `heappop` / `heappush` (O((V+E) log V)) (#112). Same deterministic ordering (nodes still come out by alphabetical id at each layer), so existing workflows produce identical execution plans. Notable for runs with hundreds of steps.
- `WorkflowEngine` no longer carries dead `BudgetExceededError` / `WorkflowTimeoutError` handlers (#112). AnyIO wraps both into `ExceptionGroup`, and the engine already unwraps them via the catch-all branch, so the dedicated handlers were unreachable. Behaviour is unchanged; the resulting workflow status mapping (`BUDGET_EXCEEDED`, `TIMEOUT`) still applies.
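
A heap-backed topological sort of the kind described for `DAG.topological_sort` could look roughly like the sketch below. It is a generic Kahn's algorithm, not the project's implementation: the `deps` input shape is an assumption, and it pops the alphabetically smallest ready node first, which may differ in detail from the engine's per-layer ordering.

```python
import heapq


def topological_sort(deps: dict[str, set[str]]) -> list[str]:
    """Kahn's algorithm with a min-heap of ready nodes.

    `deps` maps each node id to the ids it depends on; every id must be a key.
    """
    indegree = {node: len(parents) for node, parents in deps.items()}
    children: dict[str, list[str]] = {node: [] for node in deps}
    for node, parents in deps.items():
        for parent in parents:
            children[parent].append(node)

    # Nodes with no unmet dependencies; the heap yields the smallest id first.
    ready = [node for node, degree in indegree.items() if degree == 0]
    heapq.heapify(ready)

    order: list[str] = []
    while ready:
        node = heapq.heappop(ready)
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                heapq.heappush(ready, child)

    if len(order) != len(deps):
        raise ValueError("cycle detected")
    return order


plan = topological_sort(
    {"fetch": set(), "clean": {"fetch"}, "audit": {"fetch"}, "report": {"clean", "audit"}}
)
print(plan)  # ['fetch', 'audit', 'clean', 'report']
```

Each node is pushed and popped once and each edge is visited once, which is where the O((V+E) log V) bound comes from.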
Breaking changes
- Recording file format bumped to v2 with a versioned envelope (#107). JSON files written by `RecordingProvider` (and consumed by `MockProvider` / `agentloom replay`) now carry a top-level `_version: 2` key alongside the captured entries (which sit at the top level themselves, keyed by `step_id` or request hash). The reader treats any top-level key starting with `_` as metadata and ignores it, so v1 recordings load without errors. However, the request-hash algorithm changed in 0.5.0 (it now mixes model + temperature + max_tokens + extra alongside messages), so v1 recordings keyed only by the legacy messages-only hash will not match under 0.5.0+ and need to be regenerated. v1 recordings keyed by `step_id` continue to replay unchanged. Streaming responses are now keyed under the same hash as `complete()` calls and persist the joined chunk content in the same entry shape (`content`, `usage`, `cost_usd`, `latency_ms`, `finish_reason`), so a recording captures both modes uniformly.
- Ollama implicit fallback is now opt-in. Pre-0.5.0 every workflow that ran auto-discovery (no explicit
`providers:` block in the config file) registered Ollama as a global catch-all fallback, even for users who didn't run Ollama at all. Every primary-provider failure then triggered a secondary call to `http://localhost:11434`, adding 250 ms – 1 s of latency per failure plus a confusing error chain ("Provider 'anthropic' failed: 404 / Provider 'ollama' failed: model not found"). Set `AGENTLOOM_OLLAMA_FALLBACK=1` (truthy values only; `0` / `false` / `no` do not opt in) to restore the pre-0.5.0 behaviour, list `ollama` in the top-level `providers:` block of the `agentloom.yaml` config file, or set `provider: ollama` as the workflow's primary; any of the three keeps Ollama registered. Workflows that already pin a different provider (`provider: openai`) need no change. The bundled Helm chart sets `AGENTLOOM_OLLAMA_FALLBACK=1` automatically when `ollama.enabled=true` so the in-cluster Ollama deployment stays reachable.
- `agentloom replay` is now strict by default. The mock provider raises `RecordingMismatchError` when a request misses the recording, or matches a step whose prompt / system prompt / model / tools spec drifted since capture. Pre-0.5.0 it silently fell through to the placeholder `"Mock response"`, so a replay could pass green while answering a prompt the recording no longer matched. Pass `agentloom replay --allow-default-fallback` to restore the lenient behaviour. `agentloom run --provider mock` and `--mock-responses` stay lenient (one-line warning on a miss). Recording files are validated against the canonical schema at load time: a malformed file, or one with `_version` below 2, is rejected with a clear re-record hint instead of loading silently. Downstream test suites that relied on the loose match must re-record their fixtures with `agentloom run --record` or opt into `--allow-default-fallback`.
- Unknown keys in
`StepDefinition` and `WorkflowConfig` are refused at parse time (`extra="forbid"`). A typo like `workflow:` for `workflow_inline:` now fails at `agentloom validate` with the offending field named, instead of being silently dropped and surfacing a cryptic run-time error. Workflows that carried comment-like extra keys under `config:` or a step must remove them. The half-supported `config.responses:` field is rejected by the same rule; mock responses are configured via `responses_file`.
- The default Docker image (`production` stage) now ships the `[observability]` extra. A container that sets `OTEL_EXPORTER_OTLP_ENDPOINT` exports traces out of the box; pre-0.5.0 the default build omitted OpenTelemetry and Prometheus, so observability was a silent no-op. The image is ~30 MB larger. Build `--target production-lite` for the smaller image without observability. The bundled `docker-compose.yml` no longer needs the `BUILD_OBSERVABILITY` build arg.
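
The opt-in check for `AGENTLOOM_OLLAMA_FALLBACK` can be pictured with the sketch below. The helper name and the set of accepted spellings are assumptions; the entry above only states that `0` / `false` / `no` do not opt in, so the real parser may accept a different set of values.

```python
import os
from typing import Mapping, Optional

# Assumed set of truthy spellings; not confirmed by the changelog.
_TRUTHY = {"1", "true", "yes", "on"}


def ollama_fallback_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when the variable is set to an explicitly truthy value."""
    env = os.environ if env is None else env
    raw = env.get("AGENTLOOM_OLLAMA_FALLBACK", "")
    return raw.strip().lower() in _TRUTHY


print(ollama_fallback_enabled({"AGENTLOOM_OLLAMA_FALLBACK": "1"}))      # True
print(ollama_fallback_enabled({"AGENTLOOM_OLLAMA_FALLBACK": "false"}))  # False
print(ollama_fallback_enabled({}))                                      # False
```

Treating an unset or unrecognised value as "off" matches the opt-in default described in the entry.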
Fixed
- Harden gateway resilience: stream cancellation no longer trips the circuit breaker, circuit-breaker check now precedes the rate limiter in `complete()`, retry backoff jitter is centralized in `_jittered_backoff`, `RateLimiter` validates `max_rpm >= 1` / `max_tpm >= 1` and fails fast when `token_count > max_tpm`, and `CircuitBreaker.state` is a pure read with the half-open transition isolated in `_maybe_transition_to_half_open()` (#106).
- Record/replay correctness: `anyio.Lock` around `_recorded` writes plus per-call flush, streaming captures persist chunks under the same key as `complete()`, `prompt_hash` now includes model/temperature/max_tokens/extra and uses `model_dump()` for Pydantic-aware hashing (#107).
- Normalize provider adapters: central `providers/_http.py` helper with `validate_extra_kwargs` + `raise_for_status`; each provider declares its own kwargs allowlist; HTTP 429 now becomes `RateLimitError` with `Retry-After` parsed; the gateway passes `RateLimitError` to `CircuitBreaker.call(exclude=...)` so rate-limit responses do not trip the breaker; pricing prefix-match runs longest-first; `OllamaProvider` honours `OLLAMA_BASE_URL`; the Google adapter warns when streaming responses lack `usageMetadata` (#109).
- Bound the gateway candidate cache with LRU eviction so long-lived workflows do not accumulate stale provider/model entries. Default 1024 entries, override via `AGENTLOOM_CANDIDATE_CACHE_MAX` env var (#109).
- DAG correctness: skip propagation closes over transitive successors via
`dag.transitive_successors`; pause requests no longer raise inside `_execute_step` and instead surface after the layer finishes; pre-dispatch budget gate rejects steps before they consume more budget; cycle detection switched to an iterative algorithm so deeply chained DAGs no longer hit `RecursionError`; `_set_nested` now reports auto-expansion of intermediate lists with a clear message (#108).
- Template hardening: opt-in strict mode for template rendering. `SafeFormatDict(strict=True)` and `DotAccessDict(strict=True)` raise `TemplateError` on missing keys; default behaviour (warn + render empty) is unchanged. `__format__` now honours `format_spec`. `ToolStep._resolve_args` renders `{state.x}` substitutions consistently with `llm_call` (#110).
- State and approval-gate cleanup: the unsafe `state_manager.{set_sync,get_sync}` accessors are renamed to `_set_sync_unsafe` / `_get_sync_unsafe` (legacy names removed in this release); approval-gate UX moved out of the step body into the CLI rendering layer for consistency (#110).
- Subworkflow observability + checkpointer propagation: `SubworkflowStep` forwards `observer`, `checkpointer`, `on_stream_chunk`, and `run_id` to the child engine; checkpoint JSON serialization is moved off the event loop into a worker thread; the `NoopObserver` now implements every hook; observer hooks accept `**kwargs` for forward-compat; webhook deliveries get a configurable deadline (default 5s) with status `timeout`; `StepContext.checkpointer` is now plumbed through (#111).
- Bound the metrics gauge dictionaries `_circuit_states` and `_budget_remaining` with LRU eviction so long-running deployments cannot grow per-provider or per-workflow cardinality without bound (#111).
- Cost calculation for reasoning models:
`calculate_cost()` now sums `completion_tokens + reasoning_tokens` against the output rate, so workflows using OpenAI o-series, Gemini 2.5+ thinking, or Anthropic extended thinking are no longer undercharged. Budget enforcement and Prometheus cost metrics inherit the fix transitively (#127).
- Pricing table refresh: `pricing.yaml` extended to ~70 entries covering OpenAI GPT-5/5.1/5.2/5.3/5.4/5.5 family (with `*-codex`, `*-mini`, `*-nano`, `*-pro` tiers), GPT-4.1 family, the full o-series including `o1-pro` / `o3-pro` / `o3-deep-research` / `o4-mini-deep-research`; Anthropic adds `claude-opus-4-7`, `claude-3-7-sonnet`, `claude-opus-3`, `claude-haiku-3`, plus undated aliases (`claude-sonnet-4-5`, `claude-opus-4-1`, `claude-sonnet-4-0`, etc.) so callers pinning to a family name resolve correctly without the date suffix; Google adds `gemini-3-pro`, `gemini-3-pro-preview`, `gemini-3.1-pro-preview`, `gemini-3.1-flash-image-preview`, `gemini-3.1-flash-lite-preview`, `gemini-2.0-flash-lite`. The longest-prefix lookup in `calculate_cost()` keeps dated entries authoritative when both alias and dated form are present (#127).
- Relax router AST validator to accept chained safe method calls. Until 0.5.0 the validator refused any
`ast.Call` whose receiver was itself a call, so the natural `state.x.strip().lower() == "value"` predicate failed AST validation with `SecurityError: Attribute calls are only allowed on names, attributes, or subscripts`. Seven official example YAMLs were affected (`05_content_moderation`, `06_lead_qualification`, `07_incident_triage`, `09_fraud_detection`, `11_insurance_claims`, `12_log_analysis`, `14_custom_tools_decorator`). A new recursive `_safe_receiver` helper accepts a chain as long as every link is an attribute on a `Name | Attribute | Subscript` base, and every link still passes through `_reject_attribute`, so the dunder/blocklist filter blocks `state.x.lower().__class__` and `''.join(['a','b'])` exactly as before. Regression suite in `tests/steps/test_router_security.py::TestRouterChainedSafeCalls` covers both the positive expansions (chained `.strip().lower()`, slice-then-method, membership-after-chain) and the negative boundary (calls on literals, dunder after safe chain, disallowed builtins).
- Cascade-skip dependents when a router or step ends in FAILED. When a router step failed for any reason (AST error, no-match-no-default, evaluator exception) every downstream branch still executed against partial state; examples were observed shipping Slack notifications, Jira tickets, and approval webhooks off runs that should have aborted. Same defect on non-router steps: a step that exhausted retries left its dependents to run with empty state. The engine layer-loop now collects FAILED steps in each layer, computes `dag.transitive_successors` of their direct children, and marks the closure as `SKIPPED` with an `error` field naming the closest failed ancestor (BFS backward through the DAG) so the operator can audit the chain. New `config.on_step_failure: "skip_downstream" | "continue"` knob: default `skip_downstream` for the new safe behaviour, `continue` for fan-outs that explicitly want the previous swallow-and-continue semantics. Regression suite at `tests/core/test_engine.py::TestFailureCascade` covers failed router → direct children, failed router → transitive descendants, no-match-no-default cascade, non-router step failure cascade, and the `continue` opt-out.
- Parse-time invariants on workflow definitions. Three footguns became hard errors: (a) duplicate step ids:
`WorkflowDefinition` gains a `@model_validator(mode="after")` that refuses two steps with the same `id` and names both source indices (`id='a' at indices [0, 1]`), closing the silent-shadowing path where one of the two steps was lost from `final_state.steps`; (b) `config.max_concurrent_steps` carries `Field(default=10, ge=1, le=1024)`, so `0` (which deadlocked `anyio.CapacityLimiter(0)`) and negatives (which surfaced as cryptic `total_tokens must be >= 0`) now raise a clean Pydantic error at parse time; (c) parallel-eligible steps writing the same `output:` key emit a `UserWarning` listing both step ids; promote to a parse error via `config.strict_outputs: true`. Sequential overwrite via `depends_on` is exempt (intentional pattern). Regression suite at `tests/core/test_parser.py::TestParseTimeInvariants`.
- Atomic
`StateManager.update` primitive and dotted-write scalar refusal. The natural `cur = await sm.get('counter'); await sm.set('counter', cur + 1)` pattern collapsed 50 parallel writers to 1 because the lock dropped between the two awaits. The new `update(key, fn)` method holds the state lock across the full callback invocation, so 50 parallel `update('counter', lambda c: c + 1)` calls produce a final value of 50 deterministically. `fn` must be synchronous and side-effect-free (it runs under the lock); for async transformations, compute outside the lock and pass a lambda that returns the already-computed value. The legacy `get` + `set` pair stays intentionally racy (over-locking would serialise every read against the write queue), as documented in the new docstring. Separately, `_set_nested` now refuses to silently overwrite a scalar intermediate: writing `output: "user.name"` when `state.user` was the string `"alice"` used to replace the string with `{"name": ...}` and lose the original value; it now raises a new `StateWriteError` naming the traversed prefix and the existing type. Auto-creation of missing intermediates is unchanged, so `set("user.name", "bob")` on an empty state still works. List intermediates with a string next-segment (e.g. `data.foo` when `state.data` is a list, i.e. a write through the list with a dict-style key) now raise `StateWriteError` too, uniformly with the scalar-refusal contract; pre-0.5.0 this surfaced as a generic `TypeError`, so callers that catch only the dedicated exception no longer miss the case. Regression suite at `tests/core/test_state.py::TestAtomicUpdate` (50 parallel writers + intentionally-racy `get` + `set` regression net) and `TestDottedScalarOverwriteRefused` (scalar, list intermediate, traversed-prefix message).
- Subworkflow state isolation and cross-boundary pause propagation. Two related boundary defects in subworkflow semantics: (a) child engines saw a full copy of parent state by default and the child's entire final state propagated back up through the parent's `output:` key, with no way to encapsulate: `parent_secret` declared on the parent leaked into the child's prompt rendering and round-tripped back in the parent's recorded output; (b) when a subworkflow contained an `approval_gate`, the `PauseRequestedError` raised by the child engine surfaced as a generic exception at the parent boundary, the parent reported `failed` instead of `paused`, and `agentloom resume <run_id> --approve` had no resume path because no parent-level pause was checkpointed. Fixes: new opt-in `StepDefinition.isolated_state: bool = False` (default keeps the pre-0.5.0 leaky behaviour for backwards compatibility); when `true`, the child receives a fresh `StateManager` seeded only from its own declared `state:` plus the new explicit `input: {...}` mapping. A new `return_keys: list[str] | None` filter drops everything except the named keys when the child's final state propagates back. `SubworkflowStep` detects `WorkflowResult.status == PAUSED` and re-raises `PauseRequestedError` with a qualified `parent.child` step id (e.g. `sub.gate`); the engine's `_execute_step` PAUSED handler now preserves the qualified path on the `StepResult.error` field so the post-layer pass re-emits it for the parent checkpoint. The resume path strips the parent-step prefix when re-entering the subworkflow so the child engine sees `_approval.<local_id>` as expected. Regression suite at `tests/steps/test_subworkflow.py::TestSubworkflowStateIsolation` (parent-key hiding, return-keys filtering, default-leaky preserved) and `TestSubworkflowPausePropagation` (pause-at-`sub.gate` checkpoint, full resume round trip).
- Narrow
`tool_step._resolve_args` placeholder trigger to real state references. Until 0.5.0 the heuristic `"{" in value` fired `str.format_map` on any string containing a literal `{`, so every raw JSON / HTML / code snippet a workflow author piped into `tool_args.content` or `tool_args.body` blew up as a permanent error (`Max string recursion exceeded` for nested braces, `Invalid format specifier` for `{"k": true}` shapes). Both surfaces shipped tool-side, so `file_write` of a JSON document and `http_request` with an inline JSON body each broke. The new `_PLACEHOLDER_RE = re.compile(r"\{(?:state(?:\.|\[)|[A-Za-z_][A-Za-z0-9_]*[\}:!])")` matches only the placeholder grammars produced by `build_template_vars` (`{state.foo}`, `{state[items][0]}`, `{name}`, `{name:.2f}`, `{name!r}`), so a string of raw JSON / HTML / CSS passes through unchanged. A per-key escape hatch, `tool_args: {body: {value: "{state.foo}", template: false}}`, covers the rare case where the value looks like a placeholder but must be passed verbatim. Regression suite at `tests/steps/test_tool_step.py::TestPlaceholderTriggerNarrowed` covers literal JSON content / body, HTML with braces, mismatched lone braces, state-dot / state-subscript / bare / format-spec / conversion-flag placeholders, the `template: false` escape hatch with both string and non-string payloads, and the plain-dict pass-through.
- OpenAI
`base_url` normalization no longer mangles custom enterprise gateway URLs. The pre-0.5.0 rule appended `/v1` whenever the URL didn't literally end in `/v1`, so `https://gw.example.com/v2` became `…/v2/v1` and `https://gw.example.com/api/v1/foo` became `…/foo/v1`, both broken request URLs. The new `_normalize_base_url` helper inspects the parsed path: only a bare host (`https://x`, `https://x/`) gets the `/v1` suffix appended; any non-root path is preserved verbatim. Workflows that point at OpenAI's API directly are unaffected. Regression suite at `tests/providers/test_openai.py::TestBaseURLNormalization` covers bare host, root path, existing `/v1`, alternative versions (`/v2`), deep paths with and without a version segment, and the empty-string passthrough.
- Permanent errors classified as non-retryable. Pre-0.5.0 the retry layer treated every status-less exception as transient and burned the full retry budget: 4 × 10 s of backoff for sandbox violations, 4 × 30 s for unreachable URLs (127 s wasted in the audit's `placehold.co` repro), 4 × 10 s for tool-not-found typos, template-error chains, and Pydantic validation errors. `is_retryable_exception` now walks the cause chain looking for an explicit `is_retryable = False` marker. `SandboxViolationError`, `TemplateError`, `ValidationError`, `SecurityError`, `BudgetExceededError`, the new `AttachmentResolutionError` (subclass of `ValueError` for backwards compat), and the new `ToolNotFoundError` (subclass of `KeyError` for backwards compat) all carry the marker. Pydantic's `ValidationError` is special-cased because it's outside the AgentLoom class hierarchy. `StepResult.error_classification` surfaces the verdict to observability: `"permanent"` for non-retryable failures, `"transient"` for failures that exhausted the retry budget, `None` on success, so dashboards can finally distinguish "we wasted 30 s retrying nothing" from "we actually retried a transient one". `tools/registry.py::get` now raises `ToolNotFoundError` (still a `KeyError` subclass), and `providers/multimodal.py::resolve_attachments` wraps deterministic `ValueError` paths (unsupported type, size limit, empty source) in `AttachmentResolutionError`; `PermissionError` from sandbox blocks and `httpx.HTTPError` from network failures are intentionally NOT wrapped so the existing rules continue to apply. Regression suite at `tests/resilience/test_retry.py::TestPermanentMarkerClassification` (sandbox, tool-not-found, attachment, template, validation, Pydantic, security, cause-chain walk, context-chain walk, status-less generic regression net), `tests/tools/test_registry.py::test_missing_tool_raises_tool_not_found_error`, `tests/providers/test_multimodal.py::TestAttachmentResolutionErrorWrapping`, and `tests/core/test_engine_resilience.py::TestErrorClassificationField`.
- `BudgetEnforcer` wired into the engine as the single source of truth. Pre-0.5.0 budget tracking was a bare `float` on `WorkflowEngine` with no lock: a parallel layer of 5 LLM calls with `budget_usd: 0.0001` ran all 5 to completion before the engine noticed the overrun; the existing `BudgetEnforcer` class was dead code.
The new wiring routes every spend through `BudgetEnforcer.charge()` under an `anyio.Lock` so concurrent step completions can no longer race past the limit; the pre-dispatch gate uses `estimate(0)` to refuse work after the budget is exhausted, bounding the worst-case overshoot to the in-flight set of a single layer rather than letting it compound across layers. The engine constructor now accepts an optional `budget_enforcer` argument so subworkflow steps can hand the parent's enforcer down to the child engine: when the child declares no `budget_usd` of its own, child charges count against the parent counter and the parent's gate trips at the right step; when the child has its own `budget_usd`, it uses a fresh enforcer (pre-0.5.0 behaviour preserved for explicit child budgets). The subworkflow step re-raises `BudgetExceededError` (now marked `is_retryable = False` to short-circuit the retry loop) when the child's overrun came from the shared enforcer, so the parent's terminal classifier surfaces `WorkflowStatus.BUDGET_EXCEEDED` directly instead of letting the retry loop spin on a subworkflow that's already proven the budget is gone. Pause-over-budget precedence reversed: when an `approval_gate` and a budget-blowing step land in the same layer, pause wins. The human can `--approve` or `--reject` and the workflow re-evaluates budget on resume, instead of silently spending the money AND dropping the pause as pre-0.5.0 did. The engine's terminal-state classifier inspects both the raised `ExceptionGroup` AND the state manager for paused step results, so a pause swallowed by `_execute_step` (to let siblings complete) still surfaces even when a sibling's budget breach cancels the task group. Regression suite at `tests/resilience/test_budget.py::TestAsyncBudgetPrimitives` (50-parallel-writer atomicity, estimate / charge / has_limit) and `tests/core/test_engine_resilience.py` (engine wired enforcer, parallel layer pre-dispatch, sequential overrun, child charges against parent, child with own budget uses own enforcer, pause-vs-budget precedence, budget-alone regression net, error_classification permanent vs transient).
- Template engine graceful fallback for deep-miss chains and format specs on non-scalars. Two related ergonomic bugs in
`agentloom.core.templates`: (a) `{state.x.y.z}` against an empty state rendered `""` only at the first dotted level, then the format machinery tried `"".y` and raised `AttributeError: 'str' object has no attribute 'y'`, which was confusing on its own and made worse by retry-by-default consuming 4× the retry budget on a permanent failure; (b) `{state.user:.20}` against a dict value raised `TypeError: unsupported format string passed to dict.__format__` from `DotAccessDict.__format__`'s `format(self._data, spec)` call. Fixes: every miss path in non-strict mode now returns a `_MissingDotAccess` sentinel whose `__getattribute__`, `__getattr__`, `__getitem__`, `__str__`, `__repr__`, `__format__`, and `__bool__` all bottom out at the empty string / `False` (so a chained `{state.missing.__class__}` cannot leak the sentinel's type name and a conversion flag `{state.missing!r}` does not surface an object repr). This applies uniformly across missing dict keys, int subscripts on dicts, out-of-range list indices, and non-integer list subscripts, so a chained reference renders empty no matter how deep the miss or whether the miss happens on a dict or list. Strict mode is unchanged: `TemplateError` still raises at the FIRST missing segment of the chain (pinned by a regression test that walks `{x: {y: {}}}` through `{state.x.y.z}` and asserts the error names `z`, not `x` or `y`). `DotAccessDict.__format__` and `DotAccessList.__format__` now catch the `TypeError` from a non-empty format spec on the underlying container and fall back to `str(...)` (a warning is logged at each occurrence), so `{state.user:.20}` against a dict prints the dict's str form instead of crashing. Scalar leaves keep their precise float / int / str formatting. Unicode normalisation footgun (F77) documented in `docs/workflow-yaml.md`: state lookup is byte-exact, NFC-vs-NFD key mismatches silently render empty. Regression suite at `tests/core/test_templates.py::TestMissingDeepKeyChain` (deep miss, deep miss with format spec, deep miss with subscript, strict-raises-at-first-miss, strict-raises-at-first-missing-segment, missing list-index chain, dict int-key chain, sentinel falsy), `TestFormatSpecOnNonScalar` (dict / list fallback, scalar still works), and `TestUnicodeAndEscapedBraces` (NFC key, NFC/NFD mismatch, `{{` escape).
- Replay determinism:
`MockProvider` and `RecordingProvider`. Recordings now carry a `request_hash` on every entry; in strict mode (`agentloom replay`) a step matched by id whose prompt drifted from its recording raises `RecordingMismatchError` instead of returning the stale answer. Recording files are schema-validated at load time, so a corrupt fixture (`{"not": "valid"}`) or invalid JSON fails fast with the file path named. A non-strict miss now emits a one-line warning rather than silently returning the placeholder. The gateway forwards `step_id` to providers that declare `accepts_step_id` (mock, recorder), so step-id keying actually works end-to-end; previously the gateway popped `step_id` and both providers fell back to hash-only keying. `RecordingProvider` no longer leaks `step_id` into the wrapped HTTP adapter's kwargs.
- `@tool` decorator hardening. A sync `@tool` function is auto-wrapped onto a worker thread via `anyio.to_thread.run_sync`; pre-0.5.0 it raised `TypeError: object ... can't be used in 'await' expression` the first time it ran. Generic type hints (`list[int]`, `dict[str, V]`, `Optional[T]`, nested combinations) now produce structurally correct JSON Schema instead of silently degrading to `{"type": "string"}`. `ToolRegistry.register` logs a warning when overwriting an existing tool name; pass `replace=True` to override deliberately. `tool_choice: required` that loops to the iteration cap with an empty final message gets the distinct `finish_reason = "max_tool_iterations_no_answer"` plus a warning. A `tool_args` reference to a `state.X` key that does not exist now raises a non-retryable `StepError` naming the missing key, instead of passing `None` to the tool and surfacing a confusing downstream `TypeError`.
- CLI polish. The pause hint printed by `agentloom run` no longer references a `--step` flag that `agentloom resume` does not accept. `--state` now decodes JSON-shaped values (`--state items=[1,2,3]`, `--state user={"name":"x"}`) into real lists / dicts / numbers while leaving plain strings and URLs untouched. `agentloom visualize --format` is a typed choice that rejects an unknown value with a clear error instead of silently rendering ASCII. Checkpoints written by a newer AgentLoom (a future `schema_version` or an unknown step-status enum) are refused at resume with a `CheckpointSchemaError` migration hint.
- Run-history writer is quiet when the default directory is not writable. A read-only container with no `agentloom_runs/` mount no longer prints a warning on every workflow run; the failure is logged at debug level. An explicitly chosen directory (`runs_dir` argument or `AGENTLOOM_RUNS_DIR`) still warns. The Helm chart ships a `values.schema.json` so a typo or wrong-typed `--set` value is rejected by `helm lint` instead of flowing through.
- `data:` URLs in multimodal attachments. A `data:image/png;base64,…` attachment is decoded inline instead of being treated as a filesystem path (which surfaced a misleading `FileNotFoundError` naming the base64 blob). Base64 and percent-encoded payloads are both supported; a malformed URL or invalid base64 raises `AttachmentResolutionError`. `data:` URLs make no network call and open no file, so they are allowed unconditionally under a strict sandbox. A sandbox rejection of a symlinked attachment path now names both the symlink path and the resolved target so the operator looks in the right place. Example workflows 19 and 22 swap the unreachable `placehold.co` placeholder host for `httpbin.org`.
- Callback server request validation. `POST /webhook` parses the body: malformed JSON or a non-object body returns `400` instead of a silent `200 received`; a missing `Content-Type` is accepted but logged. A path-traversal `run_id` on `POST /approve` returns a clean `400` instead of a `500` with a stack trace. A new opt-in `--token` flag requires a matching `X-AgentLoom-Token` header on `POST /approve` / `POST /reject`.
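The permanent-failure marker described in the retry entry above can be illustrated with a short sketch. It assumes only what that entry states: an `is_retryable = False` class attribute, a walk over the cause and context chain, and the three `error_classification` values. The stand-in exception class, the `classify` helper, and both function bodies are illustrative and are not AgentLoom's actual code.

```python
from __future__ import annotations


class SandboxViolationError(Exception):
    """Stand-in for a permanent failure; only the marker attribute matters here."""

    is_retryable = False


def is_retryable_exception(exc: BaseException) -> bool:
    """Return False if any exception in the cause/context chain opts out of retry."""
    seen: set[int] = set()
    current: BaseException | None = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))  # guard against cyclic chains
        if getattr(current, "is_retryable", True) is False:
            return False
        # Prefer the explicit cause (raise ... from ...), else the implicit context.
        if current.__cause__ is not None:
            current = current.__cause__
        else:
            current = current.__context__
    return True


def classify(exc: BaseException | None) -> str | None:
    """Hypothetical helper mapping an outcome to the three documented verdicts."""
    if exc is None:
        return None
    return "transient" if is_retryable_exception(exc) else "permanent"


# A permanent error wrapped in a generic one is still classified as permanent.
try:
    try:
        raise SandboxViolationError("path outside sandbox")
    except SandboxViolationError as inner:
        raise RuntimeError("step failed") from inner
except RuntimeError as outer:
    wrapped = outer

verdict = classify(wrapped)
```

In this sketch a retry loop would consult `is_retryable_exception` before sleeping, so a wrapped sandbox violation fails on the first attempt instead of consuming the retry budget.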
Security¶
- Router AST validator and `DotAccessDict` runtime defence (#050). `state['__class__']`, `state['_data']`, `state['_secret']` previously bypassed the attribute-only guard in `_validate_expression` and reached `DotAccessDict.__getitem__('_x')`, which delegated to `object.__getattribute__` and returned the wrapper's raw underlying dict, leaking every user-seeded `_secret` / `_token` / `_internal` key. Two re-audit passes broadened the fix surface to: (a) `_reject_subscript` applies the same `_reject_attribute` check to `ast.Subscript` slices when the slice is an `ast.Constant` of type `str`; (b) non-constant slices, including variables (`state[lookup]`), arithmetic (`state['_' + 'secret']`), conditionals, and calls (`state[str(1)]`), are refused outright; (c) integer-constant slices and `Slice` nodes with integer-constant or unary-± integer bounds remain accepted (`state['items'][::-1]`, `state['items'][-2:]`); (d) the router namespace no longer flattens state keys as bare names (`namespace.update(state_snapshot)` is dropped) so a state key called `len` cannot shadow the safe builtin, and arbitrary state keys are reachable only via the documented `state.X` prefix; (e) `DotAccessDict` / `DotAccessList` use name-mangled storage (`__data` → `_DotAccessDict__data`) AND override `__getattribute__` to refuse `__dict__` plus the mangled storage names so `{state._data}`, `{state.__dict__}`, and `{state._DotAccessDict__data}` no longer reach the raw underlying dict via `str.format_map`; (f) `DotAccessDict.__getattr__` no longer falls back to `object.__getattribute__` for any dynamic lookup; (g) the router runtime proxy wraps nested dict / list values in `_DictProxy` / `_ListProxy` that support both attribute and string-subscript access so the grammars the validator accepts (`state.user.name`, `state['user']['name']`, `state.items[0].label`) all resolve at runtime.
- `approval_gate.notify.url` passes through the workflow sandbox before the POST (#051). `webhooks/sender.send_webhook` accepts an optional `sandbox: ToolSandbox | None` and consults `ToolSandbox.validate_webhook_url` ahead of every delivery attempt; the approval-gate step builds the sandbox from `StepContext.sandbox_config` so the gate's notification surface is gated by the same allowlist as every other network operation. When the workflow declares `config.sandbox.enabled: true`, the destination must satisfy `allow_network`, `allowed_schemes`, and `allowed_domains`. When the sandbox is disabled, a built-in deny-list still blocks loopback, link-local (including cloud metadata at `169.254.169.254`), RFC 1918 / 100.64/10 CGNAT, link-local IPv6, unique-local IPv6, multicast, reserved ranges, and any non-`http` / `https` scheme. Two re-audit passes closed seven additional bypass classes: (a) IPv4-mapped IPv6 forms (`http://[::ffff:169.254.169.254]/`, `http://[::ffff:127.0.0.1]/`), normalised via `IPv6Address.ipv4_mapped` before the deny-list check; (b) the unspecified addresses `0.0.0.0` and `::`, which connect to localhost on most platforms; (c) trailing-dot hostnames (`http://127.0.0.1./`), which httpx and most resolvers accept but `ipaddress.ip_address` would otherwise reject, stripped before classification; (d) the `allow_internal_webhook_targets=true` opt-in is now split from the scheme deny, so a workflow that authorises in-cluster destinations still cannot send `file://`, `data:`, or `javascript:` webhooks; (e) percent-encoded and IDN homograph hostnames (`http://%6c%6f%63%61%6c%68%6f%73%74/`, `http://lоcalhost/`) are URL-decoded and IDNA-normalised before the literal-string check; (f) DNS resolution upgraded from `socket.gethostbyname` (single IPv4 result) to `socket.getaddrinfo` so an AAAA-only or split-horizon DNS response cannot smuggle a loopback target through the gate; (g) command-argument validation now extracts the value side of `--key=path` and `key=path` flag forms (`tee --output=/etc/passwd`, `dd of=/dev/sda`); the prior `_looks_like_path` heuristic skipped every token starting with `-`. The host-classification path uses `ipaddress`'s stdlib flags (`is_loopback`, `is_link_local`, `is_private`, `is_reserved`, `is_unspecified`, `is_multicast`) on top of the explicit network list so reserved ranges the older containment-only check missed are caught automatically. Workflows that genuinely need to notify an in-cluster service can waive only the internal-host gate via `sandbox.allow_internal_webhook_targets: true` (also a new field on `SandboxConfig`); the scheme gate stays authoritative. A blocked URL is logged with the resolved hostname + reason and is emitted to the observer as `on_webhook_delivery(step_id, workflow_name, "sandbox_blocked", 0.0)`; the workflow's pause itself is unaffected because the pause and the notification are independent.
- `ToolSandbox.validate_path` wraps `ValueError` / `OSError` / `RuntimeError` / `TypeError` (#051). Null-byte paths, oversized components, symlink loops (which raise `RuntimeError` from `Path.resolve`), and non-string callers (`None` / `int` / `bytes`, which raise `TypeError`) all surface as a single `SandboxViolationError` with the original path in the message, so callers that catch only the sandbox exception class no longer miss the case.
- State-value redaction policy for persisted artefacts (#052). New
`agentloom.core.redact` module ships `RedactionPolicy` (glob patterns, env-var merge), `redact_state(state, policy)`, and a stable `<REDACTED:sha256=...>` sentinel; a workflow author declares per-key redaction via `state_schema:` in YAML (`state_schema: {api_key: {redact: true}, "*token*": {redact: true}}`) or deployment-wide via `AGENTLOOM_REDACT_STATE_KEYS=api_key,password,*token*`, and the engine merges the two into a single policy at construction time. After two re-audit passes the policy applies at every persistence boundary uniformly: `WorkflowEngine._save_checkpoint` redacts the runtime state snapshot, the literal `state:` block inside `workflow_definition`, every `step_results[id].output` (LLM calls that return structured payloads), and any step-level config field whose key matches the policy (`notify.headers.api_key`, `tool_args.api_key`, ...); `WorkflowResult.final_state` and `step_results` are redacted before the result crosses the process boundary so `agentloom run --json` and `result.model_dump_json()` see sentinels; the webhook sender redacts `body_template` rendering; `llm_call`'s opt-in `capture_prompts` span event is re-rendered against the redacted state. Subworkflows inherit the parent's redaction policy AND sandbox config: without inheritance a parent that locked down `api_key: {redact: true}` would have written the secret in plaintext via the child's checkpoint, and a parent's `sandbox.enabled=true` would have been bypassed by a child whose own config defaults to disabled. The in-memory state stays plaintext so a step that legitimately interpolates `{state.api_key}` against `api.openai.com` keeps working; only persisted copies are masked. The sentinel is hash-stable (sha256 of the value's string form, truncated to 16 hex chars) and redaction is idempotent: a second pass over an already-redacted value preserves it byte-for-byte so diffing across resume cycles stays consistent. `WorkflowDefinition` uses `extra="forbid"` so a typo (`stat_schema:` instead of `state_schema:`) fails at parse time instead of silently shipping the secret. The re-audit also identified three robustness gaps that are now closed: (a) circular state (a self-referential dict, or a list that contains itself) no longer triggers `RecursionError`: `_walk` tracks visited container ids and substitutes a literal `"<cycle>"` marker on the second visit; (b) non-string dict keys (int / tuple, common after JSON deserialisation) are coerced to `str` before `fnmatch` so they don't crash the entire checkpoint write; (c) `WorkflowEngine.from_checkpoint` logs an explicit warning listing the redacted keys it detects on resume. The redacted values are not magically restored, so a downstream step that references one receives the sentinel literal, and the warning surfaces this contract to the operator. Resume contract is documented: a redacted checkpoint cannot be resumed with the original secret. Lists of secrets collapse to a list of sentinels (one per element) so consumers that read shape don't break; nested dicts redact element-wise.
- Harden router expression sandbox against dunder access and type bypass (GHSA-c37m-mv4j-972v, #104)
- Closes GHSA-c37m-mv4j-972v: router conditions accepted arbitrary code via `type` / `__class__` / `__subclasses__()` / `__call__` chains. All three published payloads now raise `SecurityError` at parse time.
- Reject `ast.Attribute` with `_`-prefix names; block `mro` / `format_map` / `__class__` traversal
- Reject `ast.Name` with `_`-prefix; reject `kwargs` and starred args in `Call`
- Drop `type` from safe-builtins (was usable as `type(x).__mro__[1].__subclasses__()`)
- New `SecurityError` exception raised by the AST validator
- Regression tests in `tests/steps/test_router_security.py`, including verbatim payloads from the advisory
- Harden tool sandbox against meta-executable, path, and url-scheme bypasses (#105)
- Denylist of meta-executables (`env`, `sh`, `bash`, `python`, `python3`, `xargs`, `eval`, `exec`, ...) gated behind explicit `danger_opt_in`
- Validate relative path arguments against the configured cwd (no `../` escapes)
- URL schemes restricted to `http` / `https` by default; `file://`, `gopher://`, `ftp://` rejected unless listed in `allowed_schemes`
- Shell-op regex now catches process substitution (`<(...)`, `>(...)`)
- New `SandboxConfig` fields: `allowed_schemes` (default `["http", "https"]`), `danger_opt_in` (`list[str]`, default `[]`)
- Behavior change: workflows that legitimately invoke `bash`, `python`, etc. must list each meta-executable explicitly in `danger_opt_in`, e.g. `danger_opt_in: ["bash", "python"]`. The opt-in is per-binary, not a global flag, so adding `bash` does not also enable `python`.
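As an illustration of the redaction entry (#052) above, the sketch below builds a sentinel in the stated shape (sha256 of the value's string form, truncated to 16 hex characters) and applies glob patterns key by key. The helper names `sentinel`, `mask`, and `redact` are hypothetical, and cycle detection is left out for brevity; this is not the `agentloom.core.redact` implementation.

```python
import hashlib
from fnmatch import fnmatch

PREFIX = "<REDACTED:sha256="


def sentinel(value):
    """Replace a scalar with a stable marker; already-redacted values pass through."""
    text = str(value)
    if text.startswith(PREFIX) and text.endswith(">"):
        return text  # idempotent: a second pass keeps the marker byte-for-byte
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"{PREFIX}{digest}>"


def mask(value):
    """Redact everything under a matching key, preserving container shape."""
    if isinstance(value, dict):
        return {key: mask(item) for key, item in value.items()}
    if isinstance(value, list):
        return [mask(item) for item in value]
    return sentinel(value)


def redact(value, patterns):
    """Walk a state tree and mask values whose key matches any glob pattern."""
    if isinstance(value, dict):
        return {
            key: mask(item)
            # Keys are coerced to str so int or tuple keys cannot crash the match.
            if any(fnmatch(str(key), pattern) for pattern in patterns)
            else redact(item, patterns)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item, patterns) for item in value]
    return value


patterns = ["api_key", "password", "*token*"]
state = {"api_key": "sk-123", "user": {"auth_token": "t-9", "name": "x"}, 7: "n"}
persisted = redact(state, patterns)
```

Only the copy returned by `redact` is masked; the original `state` dict is left untouched, which mirrors the stated contract that in-memory state stays plaintext.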
0.4.0 - 2026-04-15¶
Added¶
- `agentloom replay <workflow.yaml> --recording <file.json>` subcommand: re-executes a workflow against recorded responses with no API calls (#61)
- YAML-configured MockProvider: `provider: mock` with `responses_file`, `latency_model`, `latency_ms` fields on `WorkflowConfig` (#76)
- Production `MockProvider` and `RecordingProvider` for deterministic replay and offline evaluation (#76)
- `MockProvider` loads responses from a JSON file, keyed by `step_id` or SHA-256 prompt hash
- Latency models: `constant`, `normal` (gaussian with seed), `replay` (uses recorded `latency_ms`)
- `RecordingProvider` wraps any provider, captures completions to JSON, flushes per-call
- `agentloom run --mock-responses <file>` replays; `--record <file>` captures
- Webhook notifications for approval gates: outbound HTTP on pause (#42)
- `WebhookConfig` on `StepDefinition.notify` with URL, custom headers, and body template
- Async webhook sender with 3-retry exponential backoff (best-effort, never blocks pause)
- `agentloom callback-server` command: lightweight HTTP server for programmatic approve/reject
- Routes: `POST /approve/<run_id>`, `POST /reject/<run_id>`, `GET /pending`
- Shared template utilities extracted to `core/templates.py`
- `StepContext` now carries `run_id` and `workflow_name` for webhook context
- Grafana dashboard "Human-in-the-Loop" row with approval gate and webhook panels
- Prometheus metrics: `approval_gates_total`, `webhook_deliveries_total`, `webhook_latency_seconds`
- OTel span attributes: `approval_gate.decision`, `webhook.status`, `webhook.latency_s`
- Example workflow (30), validation script, and K8s smoke job
- Approval gate step type: human-in-the-loop decision point (#41)
- `StepType.APPROVAL_GATE` pauses the workflow and waits for human approval or rejection
- Decision injected via `_approval.<step_id>` state key on resume
- `--approve` / `--reject` mutually exclusive flags on `agentloom resume`
- `timeout_seconds` and `on_timeout` schema fields (consumed by webhook callback server in #42)
- Example workflow (29), validation script, and K8s smoke job
- Workflow pause mechanism: foundation for human-in-the-loop (#40)
- `PauseRequestedError` exception for step executors to signal a pause
- `StepStatus.PAUSED` and `WorkflowStatus.PAUSED` status values
- Engine catches pause requests, saves checkpoint with `status=paused` and `paused_step_id`, and returns cleanly
- Resume from paused checkpoint skips completed steps and re-runs the paused step
- CLI treats paused workflows as non-error (exit code 0)
- Functional validation script (`scripts/validate_pause_resume.py`) and K8s smoke job
- Pluggable checkpoint backends with `BaseCheckpointer` protocol and `FileCheckpointer` default (JSON-to-disk) (#78)
- `CheckpointData` Pydantic model with full workflow state serialization
- Engine integration: auto-generates `run_id`, saves checkpoint on completion/failure, graceful handling of I/O errors
- `WorkflowEngine.from_checkpoint()` classmethod to reconstruct and resume from a checkpoint, skipping completed steps
- `agentloom run --checkpoint` and `--checkpoint-dir` flags
- `agentloom resume <run_id>` CLI command to resume paused or failed workflows
- `agentloom runs` CLI command to list all checkpointed runs
- Example workflow (28) and documentation
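To make the pluggable-backend and pause/resume entries above more concrete, here is a minimal sketch of a protocol-based checkpoint backend with a JSON-to-disk default. The class names, the `save` / `load` method names, and the `completed_steps` field are assumptions made for illustration; only `status=paused` and `paused_step_id` come from the entries above, and the real `BaseCheckpointer` interface may differ.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Protocol


class Checkpointer(Protocol):
    """Hypothetical backend interface: anything with save() and load() qualifies."""

    def save(self, run_id: str, data: dict[str, Any]) -> None: ...

    def load(self, run_id: str) -> dict[str, Any]: ...


class JsonFileCheckpointer:
    """JSON-to-disk backend that writes one file per run."""

    def __init__(self, directory: Path) -> None:
        self._dir = Path(directory)
        self._dir.mkdir(parents=True, exist_ok=True)

    def save(self, run_id: str, data: dict[str, Any]) -> None:
        target = self._dir / f"{run_id}.json"
        tmp = self._dir / f"{run_id}.json.tmp"
        # Write to a temp file, then rename, so a crash cannot leave a partial file.
        tmp.write_text(json.dumps(data), encoding="utf-8")
        tmp.replace(target)

    def load(self, run_id: str) -> dict[str, Any]:
        return json.loads((self._dir / f"{run_id}.json").read_text(encoding="utf-8"))


def steps_to_run(checkpoint: dict[str, Any], all_steps: list[str]) -> list[str]:
    """Skip completed steps; the paused step is not completed, so it runs again."""
    done = set(checkpoint.get("completed_steps", []))
    return [step for step in all_steps if step not in done]


with tempfile.TemporaryDirectory() as tmpdir:
    backend: Checkpointer = JsonFileCheckpointer(Path(tmpdir))
    backend.save(
        "run-1",
        {"status": "paused", "paused_step_id": "review", "completed_steps": ["fetch"]},
    )
    restored = backend.load("run-1")
    remaining = steps_to_run(restored, ["fetch", "review", "publish"])
```

Because the engine would depend only on the protocol, a database-backed or object-store backend could be swapped in without touching the resume logic.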
0.3.0 - 2026-04-12¶
Added¶
- Documentation site with mkdocs-material — getting started, architecture, providers, workflow YAML reference, Python DSL, graph API, examples, observability, deployment, contributing, and changelog pages. Auto-deployed to GitHub Pages on push to main (#72)
- Multi-modal input support for `llm_call` steps: images, PDFs, and audio via `attachments` field (#68)
- Provider-native formatting: OpenAI (images, audio), Anthropic (images, PDFs), Google (images, PDFs, audio), Ollama (images)
- URL fetching with `fetch: local` (default) or `fetch: provider` passthrough
- SSRF protection: blocks private/reserved IP ranges (RFC 1918, loopback, link-local)
- Sandbox integration: `allowed_domains`, `allow_network`, and `readable_paths` enforced for attachments
- Attachment size limit (20 MB default)
- `attachment_count` in `StepResult`, OTel span attribute, and `agentloom_attachments_total` metric
- Grafana dashboard "Multi-modal" row with attachments panels
- Multi-modal workflow examples (19–24)
- Streaming support for LLM responses with real-time token output (#3)
- `StreamResponse` accumulator with per-provider SSE/NDJSON parsing
- All 4 providers: OpenAI (SSE), Anthropic (SSE), Google (SSE), Ollama (NDJSON)
- Gateway `stream()` with circuit breaker + rate limiter integration
- `config.stream: true` (workflow-level) and per-step `stream:` override
- CLI `--stream` flag for real-time terminal output
- `time_to_first_token_ms` in `StepResult` and OTel span attributes
- `agentloom_stream_responses_total` and `agentloom_time_to_first_token_seconds` metrics
- Grafana "Streaming" dashboard row with TTFT quantiles
- Streaming examples (25–26)
- `AGENTLOOM_*` env var prefix for all configuration overrides (#5)
- YAML-based pricing table replacing hardcoded Python dict (#6)
- Provider auto-discovery moved from CLI hack to `config.discover_providers()`
- Ollama e2e integration tests against a live Docker instance (5 smoke tests) (#71)
- CI workflow `e2e-ollama.yml`: weekly schedule, `release/**` branches, `e2e` label on PRs, manual dispatch
- Array index support in state paths (e.g., `state.items[0]`, `items[0].name`, `results[-1]`)
- `_parse_path()` helper with regex-based bracket parsing in `StateManager`
- `_resolve_key()` and `_set_nested()` handle list indexing with bounds checking
- `DotAccessList` wrapper for `str.format_map()` template rendering
- `ToolStep._resolve_args()` refactored to reuse `StateManager._resolve_key()`
- CLI, Docker, and K8s smoke tests; example workflow (27)
- First-class graph API for workflow DAG analysis and export (#75)
- `WorkflowGraph` class with `from_workflow()` and `from_dag()` factories
- `GraphNode` and `GraphEdge` frozen Pydantic models
- Path algorithms: `all_paths()`, `prime_paths()`, `critical_path()`
- Export formats: `to_dict()`, `to_dot()` (Graphviz), `to_pnml()` (Petri Net), `to_mermaid()`
- Optional `to_networkx()` via `pip install agentloom[graph]`
- Properties: `nodes`, `edges`, `roots`, `leaves`, `layers`
- Test coverage reporting via Codecov with 85% minimum threshold and README badge (#70)
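The array-index entries above mention regex-based bracket parsing and bounds-checked list indexing. A minimal sketch of that idea follows; `parse_path` and `resolve` are hypothetical names, and the grammar (dot-separated names, each optionally followed by integer subscripts) is an assumption rather than the actual `_parse_path()` behaviour.

```python
import re

# One segment is a name followed by zero or more integer subscripts, e.g. items[0][-1].
_SEGMENT = r"[^.\[\]]+(?:\[-?\d+\])*"
_PATH = re.compile(rf"{_SEGMENT}(?:\.{_SEGMENT})*")
_TOKEN = re.compile(r"([^.\[\]]+)|\[(-?\d+)\]")


def parse_path(path):
    """Split 'items[0].name' into ['items', 0, 'name']; reject malformed paths."""
    if not _PATH.fullmatch(path):
        raise ValueError(f"malformed state path: {path!r}")
    # Each match fills exactly one group: a key name or a bracketed integer index.
    return [int(index) if index else name for name, index in _TOKEN.findall(path)]


def resolve(data, path, default=None):
    """Follow the parsed path, returning default on a missing key or bad index."""
    current = data
    for segment in parse_path(path):
        if isinstance(segment, int):
            # Bounds check covers negative indices too (results[-1] is the last item).
            if not isinstance(current, list) or not -len(current) <= segment < len(current):
                return default
            current = current[segment]
        else:
            if not isinstance(current, dict) or segment not in current:
                return default
            current = current[segment]
    return current


state = {"items": [{"name": "a"}, {"name": "b"}], "results": [1, 2, 3]}
last_result = resolve(state, "results[-1]")
second_name = resolve(state, "items[1].name")
out_of_range = resolve(state, "results[5]")
```

Validating the whole path before tokenising keeps inputs such as `items[x]` from being silently reinterpreted as two key names.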
0.2.0 - 2026-03-30¶
Added¶
- Kubernetes manifests with Kustomize overlays for dev, staging, and production (#24)
- Helm chart with Job/CronJob modes and render-time input validation (#25)
- Terraform configuration for local kind cluster with full observability stack (#26)
- ArgoCD Application CRD with automated sync and Job immutability handling (#27)
- Docker CI/CD workflow for multi-arch GHCR publishing (#23)
- Infrastructure audit scripts for static and integration validation
- Infrastructure documentation (#28)
Fixed¶
- Production NetworkPolicy OTel egress restricted to observability namespace
- Read-only filesystem audit check no longer false-passes when root FS is writable
- Terraform audit phase passes KUBECONFIG to all kubectl poll commands
- Removed duplicate kubeconform invocation that hung without stdin
- Terraform secret uses `string_data` instead of `data` for plaintext values
- GitHub Actions and image versions pinned to commit SHAs
0.1.2 - 2026-03-26¶
Added¶
- Sandbox enforcement for built-in tools — command allowlist, path restrictions (read/write separation), network domain filtering, shell operator injection prevention, write size limits (#4)
- `SandboxConfig` model in workflow YAML (`config.sandbox.*`)
- `SandboxViolationError` exception
- Sandbox workflow examples (`17_sandbox_allowed`, `18_sandbox_blocked`)
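The command allowlist and shell-operator checks listed above can be sketched roughly as follows. The regular expression, the `check_command` helper, and the local stand-in for `SandboxViolationError` are assumptions for illustration; the real sandbox rules are likely more nuanced.

```python
import re
import shlex


class SandboxViolationError(Exception):
    """Local stand-in for the exception named in the entry above."""


# Conservative: any of these characters is refused, even inside quotes.
# The < and > entries also cover process substitution such as <(...) and >(...).
_SHELL_OPERATORS = re.compile(r"[;&|`<>]|\$\(")


def check_command(command, allowed_commands):
    """Return the parsed argv if the command passes both checks, else raise."""
    if _SHELL_OPERATORS.search(command):
        raise SandboxViolationError(f"shell operator in command: {command!r}")
    try:
        argv = shlex.split(command)
    except ValueError as exc:  # unbalanced quotes and similar parse failures
        raise SandboxViolationError(f"unparseable command: {command!r}") from exc
    if not argv:
        raise SandboxViolationError("empty command")
    if argv[0] not in allowed_commands:
        raise SandboxViolationError(f"command not in allowlist: {argv[0]!r}")
    return argv


argv = check_command("ls -la /tmp", {"ls", "cat"})
```

Rejecting operators before tokenising trades a few false positives (a quoted `;` in an argument) for a much simpler guarantee that nothing is chained onto an allowed binary.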
Fixed¶
- Step executors (`llm_call`, `router`, `tool_step`) now use `await get_state_snapshot()` instead of sync `.state` access (#8)
- Removed deprecated `gemini-2.0-flash` model
0.1.1 - 2026-03-22¶
Fixed¶
- Rate limiter now accounts for response tokens, not just prompt tokens (#11)
- README header image uses absolute URLs for PyPI compatibility (#2)
0.1.0 - 2026-03-19¶
First public release.
Added¶
- YAML and Python DSL workflow definitions (DAGs with sequential + parallel steps)
- Step types: `llm_call`, `tool`, `router` (conditional), `subworkflow`
- Provider gateway with automatic fallback (OpenAI, Anthropic, Google, Ollama)
- Circuit breaker, rate limiter, and retry with exponential backoff per provider
- Budget enforcement (hard stop when USD limit exceeded)
- Cost tracking per step, model, and provider
- OpenTelemetry traces + Prometheus metrics (optional, `pip install agentloom[all]`)
- CLI commands: `run`, `validate`, `visualize` (ASCII + Mermaid), `info`
- Checkpointing: save and resume workflow state to disk
- 392 tests, mypy strict, ruff clean
Known Limitations¶
- ~~No streaming support (falls back to full completion)~~ (fixed in 0.3.0)
- Router expressions use first-match-wins, no priority ordering
- ~~Rate limiter doesn't account for response tokens (only prompt tokens)~~ (fixed in 0.1.1)
- ~~Provider discovery from env vars only, should be a config file~~ (fixed in 0.3.0)
- ~~Shell command tool has no sandboxing (FIXME in code)~~ (fixed in 0.1.2)
- ~~File tools accept arbitrary paths (no path sanitization)~~ (fixed in 0.1.2)
- Router expressions use `eval()`: must be trusted input (not user-facing)
- ~~Pricing table hardcoded in Python, should be YAML config~~ (fixed in 0.3.0)
- ~~No array index support in state paths (e.g., `state.items[0]`)~~ (fixed in 0.3.0)
- ~~Sync state access in step executors bypasses async lock~~ (fixed in 0.1.2)
- Budget enforcement is post-hoc: a single expensive step can overshoot before being stopped
- `budget_remaining` metric only emitted to Prometheus, not OTel
- Checkpoint `save_checkpoint` uses blocking I/O inside async method
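The post-hoc budget limitation above, and the locked enforcer with a pre-dispatch gate described under Unreleased, can be illustrated with a small sketch. It uses `asyncio.Lock` from the standard library as a stand-in for `anyio.Lock`; the class name, the `exhausted` and `run_layer` helpers, and the exact comparison semantics are assumptions, not the real `BudgetEnforcer` API.

```python
import asyncio


class BudgetExceededError(Exception):
    """Stand-in exception; marked permanent so a retry loop would not spin on it."""

    is_retryable = False


class BudgetEnforcerSketch:
    def __init__(self, limit_usd: float) -> None:
        self.limit_usd = limit_usd
        self.spent_usd = 0.0
        self._lock = asyncio.Lock()

    async def exhausted(self) -> bool:
        """Pre-dispatch gate: refuse new work once the limit has been reached."""
        async with self._lock:
            return self.spent_usd >= self.limit_usd

    async def charge(self, cost_usd: float) -> None:
        """Record actual spend; the lock keeps the read-modify-write atomic."""
        async with self._lock:
            self.spent_usd += cost_usd
            if self.spent_usd > self.limit_usd:
                raise BudgetExceededError(
                    f"spent {self.spent_usd:.4f} of {self.limit_usd:.4f} USD"
                )


async def run_layer(enforcer: BudgetEnforcerSketch, costs: list[float]) -> int:
    """Dispatch one parallel layer; return how many charges stayed within budget."""
    if await enforcer.exhausted():
        return 0  # nothing is dispatched, so nothing more is spent
    results = await asyncio.gather(
        *(enforcer.charge(cost) for cost in costs), return_exceptions=True
    )
    return sum(1 for result in results if not isinstance(result, BaseException))


async def demo() -> tuple[int, int, float]:
    enforcer = BudgetEnforcerSketch(limit_usd=1.0)
    first = await run_layer(enforcer, [0.5] * 5)   # in-flight steps can still overshoot
    second = await run_layer(enforcer, [0.5] * 5)  # the gate blocks the next layer
    return first, second, enforcer.spent_usd


first_layer, second_layer, spent = asyncio.run(demo())
```

In this sketch the overshoot is limited to the steps already in flight in the first layer; the second layer is never dispatched, which is the bounding behaviour the Unreleased entry describes.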
Design Decisions¶
- httpx over provider SDKs — keeps dependencies minimal (~5 core). Trade-off: we maintain thin adapters instead of using official SDKs.
- anyio over raw asyncio — structured concurrency via task groups. Slightly less familiar but much safer for parallel step execution.
- str.format_map over Jinja2 — one fewer dependency; prompt templates don't need loops or conditionals. SafeFormatDict handles missing keys.
- Observability optional — core runs without opentelemetry or prometheus. NoopSpan/NoopTracer pattern gives zero overhead when not installed.
- Pydantic v2 — validation and serialization worth the Rust compilation trade-off. Could revisit for truly minimal environments.