Deployment¶

AgentLoom runs anywhere — from a single Docker container to a fully orchestrated Kubernetes cluster with GitOps and observability. The CLI processes a workflow and exits; there is no long-running server.

Why Jobs, not Deployments?

AgentLoom is a CLI that exits after processing. Kubernetes Jobs provide finite execution, automatic retries via backoffLimit, scheduled runs via CronJobs, and clean resource isolation per workflow.

Docker¶

The multi-stage Dockerfile produces a non-root, read-only-filesystem image. The default production stage ships the [observability] extra, so a container that sets OTEL_EXPORTER_OTLP_ENDPOINT exports traces and Prometheus metrics out of the box.

# Build the default image (production — observability included)
docker build -t agentloom .

# Run a workflow
docker run --rm \
  -e OPENAI_API_KEY=sk-... \
  -v ./examples:/workflows:ro \
  agentloom run /workflows/01_simple_qa.yaml

# Override provider and model
docker run --rm \
  -e OPENAI_API_KEY=sk-... \
  -v ./examples:/workflows:ro \
  agentloom run /workflows/01_simple_qa.yaml --provider openai --model gpt-4o-mini

# Smaller image WITHOUT observability (~30 MB lighter)
docker build --target production-lite -t agentloom:lite .

Target	Observability	When to use
`production` (default)	Included	OTel tracing / Prometheus metrics needed — the common case
`production-lite`	Omitted	Minimal image, no OTel export

Docker Compose¶

Runs AgentLoom alongside the full observability stack:

cd deploy

# Start observability stack
docker compose up -d

# Run a workflow (on-demand, not part of `up`)
docker compose run --rm agentloom run /workflows/01_simple_qa.yaml

# Stop everything
docker compose down

Service	Port	Purpose
Grafana	3000	Dashboard visualization (admin/admin)
Prometheus	9090	Metrics storage and PromQL queries
Jaeger	16686	Distributed tracing UI
OTel Collector	4317	Receives traces and metrics via gRPC

Kubernetes (Kustomize)¶

Plain YAML manifests organized with Kustomize overlays. Three environments provided:

Overlay	Image Tag	Memory (req/limit)	CPU Request	NetworkPolicy	Deadline
dev	`latest`	64Mi / 128Mi	50m	Disabled	—
staging	CI SHA	128Mi / 256Mi	100m	Enabled	—
production	Pinned	256Mi / 512Mi	200m	Strict (no Ollama)	600s

# Deploy dev overlay
kubectl apply -k deploy/k8s/overlays/dev
kubectl logs job/agentloom-workflow -n agentloom

# Clean up
kubectl delete -k deploy/k8s/overlays/dev

Secrets

The base secret ships with data: {} by design. Populate it before deploying:

kubectl create secret generic agentloom-provider-keys \
  --from-literal=OPENAI_API_KEY=sk-... \
  --from-literal=ANTHROPIC_API_KEY=sk-ant-... \
  -n agentloom --dry-run=client -o yaml | kubectl apply -f -

No CPU limits

No overlay sets CPU limits — this is intentional. CPU limits cause CFS throttling during LLM API calls (which are I/O-bound, not CPU-bound). CPU requests are set for scheduling guarantees.

Helm¶

Recommended for parameterized deployments. The chart validates inputs at render time — deploying without a workflow definition fails at render time, not at runtime.

One-shot JobCronJob (scheduled)

helm install agentloom deploy/helm/agentloom \
  -n agentloom --create-namespace \
  --set workflow.definition="$(cat examples/01_simple_qa.yaml)" \
  --set provider.existingSecret=my-secret

helm install agentloom deploy/helm/agentloom \
  -n agentloom --create-namespace \
  --set schedule.enabled=true \
  --set schedule.cron="0 */6 * * *" \
  --set workflow.definition="$(cat examples/01_simple_qa.yaml)"

Helm chart reference

Parameter	Default	Description
`image.repository`	`ghcr.io/cchinchilla-dev/agentloom`	Container image repository
`image.tag`	`""`	Image tag (defaults to `appVersion`)
`workflow.definition`	`""`	Inline workflow YAML
`workflow.existingConfigMap`	`""`	Reference to existing ConfigMap
`workflow.args`	`[]`	Extra CLI arguments
`schedule.enabled`	`false`	`false` = Job, `true` = CronJob
`schedule.cron`	`0 /6 * *`	CronJob schedule
`schedule.timeZone`	`UTC`	CronJob timezone
`provider.existingSecret`	`""`	Pre-created Secret with API keys
`provider.openaiApiKey`	`""`	OpenAI key (not recommended)
`observability.enabled`	`true`	Enable OTel endpoint env var
`observability.otelEndpoint`	`http://otel-collector:4317`	OTel Collector endpoint
`job.backoffLimit`	`2`	Retry attempts on failure
`job.ttlSecondsAfterFinished`	`3600`	Cleanup delay after completion
`job.activeDeadlineSeconds`	`null`	Hard timeout
`serviceAccount.create`	`true`	Create a ServiceAccount
`serviceAccount.automountServiceAccountToken`	`false`	Mount K8s API token
`namespace.create`	`false`	Create namespace with PSS labels
`networkPolicy.enabled`	`true`	Enable NetworkPolicy
`networkPolicy.allowOllama`	`true`	Allow egress to Ollama (port 11434)
`resourceQuota.enabled`	`false`	Enable namespace ResourceQuota

Validation: The chart fails at render time if neither workflow.definition nor workflow.existingConfigMap is set, or if both are set simultaneously.

Jobs and immutability: Kubernetes Jobs are immutable after creation. Use helm uninstall + helm install for changes, or use ArgoCD with Replace=true.

Terraform¶

Provisions a complete local development environment in one command: kind cluster + AgentLoom + observability stack.

cd deploy/terraform
cp terraform.tfvars.example terraform.tfvars
terraform init && terraform apply

When enable_observability = true (default), Terraform deploys the full stack:

Component	URL	Purpose
OTel Collector	gRPC :4317	Receives traces and metrics
Jaeger	http://localhost:16686	Trace storage and UI
Prometheus	http://localhost:9090	Metrics scraping
Grafana	http://localhost:3000	Dashboards (admin/admin)

Set enable_observability = false for a lightweight setup without the observability stack.

ArgoCD¶

GitOps deployment with automated sync, self-heal, and retry policies. ArgoCD watches the Helm chart in the repository and syncs changes automatically.

kubectl apply -f deploy/argocd/application.yaml

The Application CRD configures:

Automated sync with prune and self-heal
Replace=true for immutable resources (K8s Jobs cannot be patched in-place)
ignoreDifferences on Job selector/labels to prevent perpetual out-of-sync
Retry policy: 5 attempts with exponential backoff (5s -> 3m max)
CreateNamespace: the agentloom namespace is created if it does not exist

Secrets

Secrets are not managed by ArgoCD. Create them manually or use External Secrets Operator / Sealed Secrets / SOPS.

Environment variables¶

Variable	Description	Default
`OPENAI_API_KEY`	OpenAI API key	—
`ANTHROPIC_API_KEY`	Anthropic API key	—
`GOOGLE_API_KEY`	Google AI API key	—
`OLLAMA_BASE_URL`	Ollama server URL	`http://localhost:11434`
`AGENTLOOM_OLLAMA_FALLBACK`	Opt-in: register Ollama as a global fallback (`1` / `true` / `yes`)	unset
`OTEL_EXPORTER_OTLP_ENDPOINT`	OTel Collector gRPC endpoint	`http://localhost:4317`
`AGENTLOOM_RUNS_DIR`	Run-history output directory	`./agentloom_runs`

In Kubernetes, these are injected via envFrom referencing a Secret. Never bake API keys into images or commit them to source control.

Approval callback server¶

agentloom callback-server runs a lightweight HTTP server that accepts approve / reject callbacks for paused approval-gate workflows and receives webhook notifications:

agentloom callback-server --checkpoint-dir .agentloom/checkpoints --port 8642

Endpoint	Purpose
`POST /webhook`	Receive a notification. A malformed JSON body returns `400`.
`POST /approve/<run_id>`	Resume a paused workflow with an approval.
`POST /reject/<run_id>`	Resume a paused workflow with a rejection.
`GET /pending`	List paused runs.

The server binds 0.0.0.0 by default — keep it on a loopback or trusted network. For defence in depth, the opt-in --token <secret> flag requires a matching X-AgentLoom-Token header on POST /approve and POST /reject; a request without it gets 401:

agentloom callback-server --token "$CALLBACK_SECRET"
curl -X POST -H "X-AgentLoom-Token: $CALLBACK_SECRET" http://host:8642/approve/<run_id>

CI/CD pipeline¶

Workflow	Trigger	What it does
`ci.yml`	Push / PR to `main`	Lint, type check, test, Docker build, K8s validation, Helm lint
`docker.yml`	Push to `main` / tags `v*`	Build multi-arch image, push to GHCR
`release.yml`	Tags `v*`	Build wheel, GitHub release, publish to PyPI
`docs.yml`	Push to `main` (docs/mkdocs.yml)	Build and deploy documentation to GitHub Pages
`e2e-ollama.yml`	Weekly / `release/**` / `e2e` label	Ollama integration tests against live Docker instance
`pr-labeler.yml`	PR opened/updated	Auto-label PRs based on file paths
`labels.yml`	Push to `main` (labels.json)	Sync GitHub labels from config
`stale.yml`	Weekly (Monday 9am UTC)	Mark/close stale issues

Image tags¶

Event	Tag pattern	Example
Push to `main`	`main-<sha>`	`main-a1b2c3d`
Tag `v1.2.3`	`1.2.3`, `1.2`, `latest`	`1.2.3`
Pull request	`pr-<number>`	`pr-42`

Images are published to ghcr.io/cchinchilla-dev/agentloom.

Security¶

All workloads enforce the Kubernetes Pod Security Standards "restricted" profile:

Non-root container — runs as agentloom (UID 1000), enforced via Dockerfile and securityContext.runAsNonRoot
Read-only root filesystem — readOnlyRootFilesystem: true with writable /tmp via emptyDir
No privilege escalation — allowPrivilegeEscalation: false, all Linux capabilities dropped
seccomp — RuntimeDefault profile blocks dangerous syscalls
No API server access — automountServiceAccountToken: false
NetworkPolicy — restricts egress to DNS (kube-system), HTTPS (443), and OTel (4317). Production removes Ollama (11434). All ingress denied
Secrets management — API keys stored in K8s Secrets, never baked into images
Supply chain security — all GitHub Actions pinned to commit SHAs, not mutable tags
ResourceQuota — optional namespace-level limits to prevent resource exhaustion
PSS namespace labels — pod-security.kubernetes.io/enforce: restricted