Deploying AI agents with Docker and Kubernetes in 2026: a production guide
Most engineering teams working with AI agents have been through the same moment: the agent works flawlessly on the developer's laptop and falls apart in the first week of production. The issue is rarely the model. It's the infrastructure.
Deploying AI agents to production is fundamentally different from shipping a conventional REST API. Agents consume more memory, maintain context across calls, depend on external services (LLMs, vector stores, custom tools), and have less predictable usage patterns. These characteristics demand architectural decisions that start before the first docker build.
This guide focuses on deploying AI agents with Docker and Kubernetes in real production environments: how to structure the image, configure resources, scale deliberately, and maintain observability from day one.
What changes when an AI agent goes to production
Conventional REST APIs behave predictably: receive a request, process it, return a response. AI agents are different because they can chain multiple tool calls, query databases, run reasoning loops, and maintain state across interactions. This behavior creates three core infrastructure challenges.
The first is memory consumption. An agent using an LLM via API doesn't load the model locally, but it still needs to store conversation context, intermediate tool results, and streaming buffers. Depending on the use case, memory peaks can run 4 to 10 times higher than the process's idle baseline.
The second challenge is non-deterministic latency. Calls to external LLMs can range from 200ms to several seconds depending on the provider, the model, and the context size. This directly affects timeouts, liveness probe configurations in Kubernetes, and retry strategies.
The third challenge, often underestimated, is dependency management: a typical agent connects to an LLM, a vector store, external APIs, and custom tools. Each connection needs independent failure handling for the overall system to be resilient.
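The latency and failure-handling points above usually translate into a retry wrapper around each dependency call. The following is a minimal sketch, assuming transient failures surface as `TimeoutError` or `ConnectionError`; the names `call_with_retry` and `DependencyError` are hypothetical and not part of any library.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class DependencyError(Exception):
    """Raised when a dependency still fails after all retries (hypothetical name)."""


def call_with_retry(
    fn: Callable[[], T],
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    retry_on: tuple = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying transient failures with exponential backoff and jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on as exc:
            if attempt == attempts:
                raise DependencyError(f"gave up after {attempts} attempts") from exc
            # Backoff doubles per attempt and is capped at max_delay;
            # jitter keeps many pods from retrying in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))
```

Wrapping each dependency separately (one call for the LLM, another for the vector store, and so on) lets each one have its own attempt count and delay budget, so a slow tool does not consume the retry budget of the LLM call.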
Containerizing AI agents with Docker
Structuring the image for agents
A solid Docker image for agents starts with proper dependency isolation. Use a slim image as a base. Alpine images are smaller still, but they use musl libc, so some Python packages lack prebuilt wheels and must be compiled during the build. For agents consuming LLMs via API (no local model), python:3.12-slim is a reliable starting point. Multi-stage builds strip build-time dependencies from the final artifact and reduce the attack surface.
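```dockerfile
# Build stage: install dependencies into the user site (/root/.local)
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

# Final image: copy only the installed packages, not the build layers
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
COPY . .
# Make console scripts installed with --user available on PATH
ENV PATH=/root/.local/bin:$PATH
CMD ["python", "-m", "agent.server"]
```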
For agents running local models (via Ollama or llama.cpp), image size grows considerably. In those cases, keep model weights outside the image and mount them as a volume or load from external storage at startup. Embedding the weights of a 7B-parameter model (roughly 4GB with 4-bit quantization, about 14GB at 16-bit precision) into the Docker image makes pulling and caching across nodes impractical.
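When weights live on a mounted volume, it helps for the agent to check at startup that the file is actually there, so a missing mount fails fast instead of surfacing on the first request. A minimal sketch, assuming the hypothetical environment variables `MODEL_DIR` and `MODEL_FILE` and a default mount point of `/models`:

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_model_path(env: Mapping[str, str] = os.environ) -> Path:
    """Return the weights file path on the mounted volume, failing fast if absent."""
    # MODEL_DIR and MODEL_FILE are illustrative names, not a standard.
    model_dir = Path(env.get("MODEL_DIR", "/models"))
    model_file = env.get("MODEL_FILE")
    if not model_file:
        raise RuntimeError("MODEL_FILE is not set")
    path = model_dir / model_file
    if not path.is_file():
        raise FileNotFoundError(
            f"model weights not found at {path}; is the volume mounted?"
        )
    return path
```

Calling this before the server starts listening means a misconfigured pod crashes during startup, where Kubernetes reports it clearly, rather than passing health checks and failing later.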
Environment variables and secrets
LLM API keys and service tokens must never appear in the Dockerfile or source code. In Kubernetes, use Secret for credentials and ConfigMap for non-sensitive configuration. For local development, .env with python-dotenv is practical. In production, evaluate solutions like HashiCorp Vault, AWS Secrets Manager, or the equivalent in your cloud for automatic rotation and access auditing.
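On the application side, the agent only needs to read its configuration from the environment, regardless of whether the values come from a Kubernetes Secret, a ConfigMap, or a local .env file. A minimal sketch, assuming hypothetical variable names (`LLM_API_KEY`, `VECTOR_STORE_URL`, `REQUEST_TIMEOUT_S`):

```python
import os
from dataclasses import dataclass, field
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    # repr=False keeps the key out of logs and tracebacks that print the object
    llm_api_key: str = field(repr=False)
    vector_store_url: str
    request_timeout_s: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from the environment and fail fast on missing values."""
    required = ("LLM_API_KEY", "VECTOR_STORE_URL")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(
            f"missing required environment variables: {', '.join(missing)}"
        )
    return Settings(
        llm_api_key=env["LLM_API_KEY"],
        vector_store_url=env["VECTOR_STORE_URL"],
        request_timeout_s=float(env.get("REQUEST_TIMEOUT_S", "30")),
    )
```

Because the loader takes any mapping, the same code path works in tests with a plain dictionary and in the cluster with the real environment.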
Deploying AI agents with Docker and Kubernetes: orchestration and scale
Configuring resources deliberately
Kubernetes needs well-defined resource requests and limits for AI agents. Without limits in place, an agent receiving a long conversation can consume all the memory on a node and bring down other running pods.
| Component | requests (guaranteed minimum) | limits (maximum allowed) | Notes |
|---|---|---|---|
| API-based LLM agent | 256Mi RAM, 0.5 CPU | 1Gi RAM, 2 CPU | Adjust per context size |
| Local model agent (7B) | 8Gi RAM, 2 CPU | 16Gi RAM, 4 CPU | GPU node when available |
| Tool worker (tools runner) | 128Mi RAM, 0.25 CPU | 512Mi RAM, 1 CPU | Typically stateless |
| Context cache (Redis) | 512Mi RAM, 0.5 CPU | 2Gi RAM, 1 CPU | Persistence depends on use case |
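The readinessProbe and livenessProbe deserve special attention. Agents that initialize connections to external LLMs may take longer than conventional APIs to become ready. If initialDelaySeconds is too low, the probes fail while the agent is still starting: a failing livenessProbe restarts the container, and a failing readinessProbe keeps the pod out of the Service. For slow or variable startups, a startupProbe is often a better fit, since it holds off the other two probes until the application has started.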
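To make the table and the probe discussion concrete, the sketch below builds the container section of a Deployment as a Python dictionary and prints it as JSON. The resource values come from the "API-based LLM agent" row above. Everything else is an assumption for illustration: port 8080, the `/healthz` and `/readyz` paths, the probe intervals, and the use of a startupProbe (a Kubernetes mechanism that gives slow-starting containers a startup budget instead of a large initialDelaySeconds).

```python
import json


def agent_container_spec(image: str, startup_budget_s: int = 120) -> dict:
    """Build a container spec for an API-based LLM agent (illustrative values)."""
    period_s = 5
    return {
        "name": "agent",
        "image": image,
        "ports": [{"containerPort": 8080}],
        "resources": {
            "requests": {"memory": "256Mi", "cpu": "500m"},
            "limits": {"memory": "1Gi", "cpu": "2"},
        },
        # Liveness and readiness checks only begin once this probe succeeds,
        # so startup may take up to period_s * failureThreshold seconds.
        "startupProbe": {
            "httpGet": {"path": "/healthz", "port": 8080},
            "periodSeconds": period_s,
            "failureThreshold": startup_budget_s // period_s,
        },
        # Failing readiness removes the pod from Service endpoints.
        "readinessProbe": {
            "httpGet": {"path": "/readyz", "port": 8080},
            "periodSeconds": 10,
            "timeoutSeconds": 3,
        },
        # Failing liveness restarts the container.
        "livenessProbe": {
            "httpGet": {"path": "/healthz", "port": 8080},
            "periodSeconds": 20,
            "timeoutSeconds": 3,
            "failureThreshold": 3,
        },
    }


if __name__ == "__main__":
    print(json.dumps(agent_container_spec("registry.example.com/agent:1.0"), indent=2))
```

The fragment belongs under `spec.template.spec.containers` in a Deployment. Keeping the liveness endpoint independent of the LLM provider is a deliberate choice in this sketch: if liveness depended on an external call, a provider outage would restart every pod.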
Scalability strategies for agents
The Horizontal Pod Autoscaler (HPA) works well for stateless agents where each request is independent. For agents with persistent conversation context, horizontal scaling requires state to be externalized, typically in Redis or a database with TTL. Never store conversation history in process memory if you intend to scale.
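Externalized state can be sketched as a small store that keeps each conversation under a key with a TTL. The example below is written against a minimal client interface with `get(name)` and `set(name, value, ex=...)`, which matches how the redis-py client exposes those commands; the `InMemoryClient` is a test stand-in, and the class names and the `conv:` key prefix are hypothetical.

```python
import json
import time
from typing import Optional, Protocol


class KeyValueClient(Protocol):
    """The subset of a Redis-style client this sketch relies on."""

    def get(self, name: str) -> Optional[bytes]: ...
    def set(self, name: str, value: str, ex: Optional[int] = None) -> object: ...


class ConversationStore:
    def __init__(self, client: KeyValueClient, ttl_s: int = 3600) -> None:
        self._client = client
        self._ttl_s = ttl_s

    def load(self, session_id: str) -> list:
        raw = self._client.get(f"conv:{session_id}")
        return json.loads(raw) if raw else []

    def append(self, session_id: str, role: str, content: str) -> None:
        history = self.load(session_id)
        history.append({"role": role, "content": content})
        # Rewriting the key refreshes the TTL, so idle sessions expire on their own.
        self._client.set(f"conv:{session_id}", json.dumps(history), ex=self._ttl_s)


class InMemoryClient:
    """Stand-in that mimics get/set with expiry; for tests only."""

    def __init__(self) -> None:
        self._data: dict = {}

    def get(self, name: str) -> Optional[bytes]:
        value, expires_at = self._data.get(name, (None, 0.0))
        if value is None or time.monotonic() >= expires_at:
            self._data.pop(name, None)
            return None
        return value

    def set(self, name: str, value: str, ex: Optional[int] = None) -> bool:
        expires_at = time.monotonic() + ex if ex is not None else float("inf")
        self._data[name] = (value.encode(), expires_at)
        return True
```

Any pod holding a client to the same backend sees the same history, which is what makes the agent process itself stateless. Note that `append` is a read-modify-write and is not atomic; if two requests for the same session can run concurrently, a list-based layout or a lock would be needed.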
For heavier workloads, the asynchronous worker pattern is effective: the agent receives the task, queues it (RabbitMQ, Kafka, or SQS), and immediately returns a job ID. Parallel workers process tasks independently. This pattern decouples LLM latency from user experience and lets you scale workers independently from the entry service.
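The shape of the worker pattern can be shown in a single process with the standard library. In the sketch below, `queue.Queue` stands in for the broker (RabbitMQ, Kafka, or SQS) and threads stand in for worker pods; the `JobQueue` class and its method names are hypothetical.

```python
import queue
import threading
import uuid
from typing import Callable


class JobQueue:
    def __init__(self, handler: Callable[[str], str], workers: int = 2) -> None:
        self._handler = handler
        self._tasks: queue.Queue = queue.Queue()
        self._results: dict = {}
        self._lock = threading.Lock()
        for _ in range(workers):
            threading.Thread(target=self._work, daemon=True).start()

    def submit(self, prompt: str) -> str:
        """Enqueue the task and return a job ID immediately."""
        job_id = uuid.uuid4().hex
        with self._lock:
            self._results[job_id] = {"status": "queued", "result": None}
        self._tasks.put((job_id, prompt))
        return job_id

    def status(self, job_id: str) -> dict:
        with self._lock:
            return dict(self._results[job_id])

    def join(self) -> None:
        """Block until every submitted job has been processed."""
        self._tasks.join()

    def _work(self) -> None:
        while True:
            job_id, prompt = self._tasks.get()
            try:
                outcome = {"status": "done", "result": self._handler(prompt)}
            except Exception as exc:
                # A failed job is recorded rather than killing the worker.
                outcome = {"status": "failed", "result": str(exc)}
            with self._lock:
                self._results[job_id] = outcome
            self._tasks.task_done()
```

The entry service would call `submit` and return the job ID to the client, which then polls `status` (or receives a callback). With a real broker, the result store also has to live outside the process, for the same reasons conversation state does.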
For zero-downtime rollouts, RollingUpdate with maxUnavailable: 0 (and maxSurge of at least 1, so new pods can start before old ones are removed) keeps the full replica count available during updates. Combine this with well-calibrated readiness probes to prevent new pods from receiving traffic before they are fully ready.
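As a small illustration of the rollout settings, the following sketch builds the relevant part of a Deployment spec as a Python dictionary. The function names, the replica count, and the `minReadySeconds` value are assumptions; the field names are those of the Kubernetes Deployment API.

```python
def rolling_update_strategy(max_surge: int = 1) -> dict:
    """Rollout strategy that never drops below the desired replica count."""
    # Kubernetes rejects maxSurge=0 together with maxUnavailable=0,
    # because the rollout could then make no progress.
    if max_surge < 1:
        raise ValueError("maxSurge must be at least 1 when maxUnavailable is 0")
    return {
        "type": "RollingUpdate",
        "rollingUpdate": {"maxUnavailable": 0, "maxSurge": max_surge},
    }


def deployment_spec_patch(replicas: int = 3) -> dict:
    """Fragment to merge into a Deployment manifest (illustrative values)."""
    return {
        "spec": {
            "replicas": replicas,
            # A new pod must stay ready this long before it counts as available.
            "minReadySeconds": 10,
            "strategy": rolling_update_strategy(),
        }
    }
```

With these settings the rollout temporarily runs one extra pod, so the node pool needs enough headroom for the surge.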
Observability from day one
AI agents have inherently distributed behavior: a single user request can generate 10 tool calls and 3 LLM queries. Without distributed tracing, debugging production failures becomes a guessing game.
OpenTelemetry is the emerging standard for agent instrumentation. Frameworks like LangChain, LlamaIndex, and PydanticAI offer OpenTelemetry integrations, either built in or through instrumentation packages, enabling you to capture spans per tool call, LLM latency, tokens consumed, and parsing errors. Plug into Jaeger or Grafana Tempo for centralized visualization. The essential metrics to track: per-step agent latency (not just end-to-end), error rate per tool, tokens used per session, queue time when using workers, and retry rate on LLM calls.
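To show what these metrics look like at the code level, here is a standard-library sketch that records per-step latency, per-tool error counts, and tokens per session. It is deliberately not the OpenTelemetry API: the `StepRecorder` class is a hypothetical stand-in, and in a real deployment each `step` would be a span created through the OpenTelemetry SDK or a framework integration.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Iterator


class StepRecorder:
    def __init__(self) -> None:
        self.latencies_s: dict = defaultdict(list)
        self.calls: dict = defaultdict(int)
        self.errors: dict = defaultdict(int)
        self.tokens: dict = defaultdict(int)

    @contextmanager
    def step(self, name: str) -> Iterator[None]:
        """Time one agent step (tool call, LLM call) and count failures."""
        start = time.perf_counter()
        self.calls[name] += 1
        try:
            yield
        except Exception:
            self.errors[name] += 1
            raise
        finally:
            # Latency is recorded for failed steps too.
            self.latencies_s[name].append(time.perf_counter() - start)

    def add_tokens(self, session_id: str, count: int) -> None:
        self.tokens[session_id] += count

    def error_rate(self, name: str) -> float:
        calls = self.calls.get(name, 0)
        return self.errors.get(name, 0) / calls if calls else 0.0
```

Wrapping each tool call in `with recorder.step("tool.search"):` gives per-step rather than end-to-end numbers, which is the distinction the metric list above draws.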
With these practices in place, production for AI agents stops being an unknown. Getting containerization right, externalizing state, configuring resources deliberately, and instrumenting from the start are the practices that separate a working prototype from a system that handles real load. The gap between the two isn't the model — it's the infrastructure built around it.
What does it take to deploy AI agents to production with Docker and Kubernetes?
Deploying AI agents to production with Docker and Kubernetes requires: optimized Docker images (slim, multi-stage builds), explicit requests and limits for memory and CPU, external secret management, probes calibrated to the agent's startup time, and distributed tracing with OpenTelemetry. Agent state must be externalized in Redis or a database to enable horizontal scaling without losing context.
Why are AI agents harder to containerize than conventional APIs?
AI agents consume more memory, have non-deterministic latency tied to external LLMs, and frequently maintain state across calls. This requires more conservative resource configurations, higher initialDelaySeconds in health checks, and a context externalization strategy so horizontal scaling works correctly. A conventional REST API has none of these constraints.
How do you scale AI agents horizontally in Kubernetes?
Stateless agents scale normally with HPA. For agents with conversation history, state must be externalized in Redis or a database with TTL before scaling. An effective alternative is the async worker pattern: the entry service receives the request and queues it; independent workers process and return the result. This decouples LLM latency from the end-user experience.
What's the difference between deploying an agent with a local model versus an LLM API?
Agents using LLMs via API (OpenAI, Anthropic, Google Gemini) have smaller images and lower hardware requirements, but depend on network latency and token quotas. Agents with local models need larger images, more RAM (8GB or more), and ideally GPUs. The choice depends on cost, acceptable latency, data privacy requirements, and request volume. BIX Tech works with both architectures depending on each operation's context.
How do you monitor AI agents in production on Kubernetes?
Use OpenTelemetry for distributed tracing: capture spans per tool call, per-step agent latency, and tokens consumed per session. Essential metrics: end-to-end latency, error rate per tool, tokens per session, and LLM retry rate. Tools like Jaeger, Grafana Tempo, and Prometheus are standard in the Kubernetes ecosystem and integrate well with modern agent frameworks.








