Docker Multi-Service Architecture: Lessons from Running 6 Containers in Production
Published:
Running a production system with 6+ Docker containers taught me more about resource management and failure modes than any tutorial ever could. Here’s what I learned building ViraClip’s infrastructure.
The Setup
ViraClip runs as a Docker Compose stack with these services:
services:
backend: # FastAPI API server
worker: # FFmpeg processing workers
comfyui: # AI video generation
redis: # Job queue + progress tracking
postgres: # Primary database
rust-agent: # Self-healing watchdog
Lesson 1: Resource Limits Are Not Optional
Without mem_limit and cpus constraints, ComfyUI will happily consume all available RAM when generating video, killing every other container. Always set explicit limits:
comfyui:
mem_limit: 8g
cpus: '4'
Lesson 2: Health Checks Save Your Sanity
Docker’s depends_on only waits for a container to start, not to be ready. A PostgreSQL container is “running” 3 seconds before it accepts connections. Add proper health checks:
postgres:
healthcheck:
test: ["CMD-SHELL", "pg_isready -U $POSTGRES_USER"]
interval: 5s
timeout: 5s
retries: 10
Lesson 3: The Rust Watchdog Pattern
Instead of relying solely on Docker’s restart policies, I built a Rust agent that:
- Polls health endpoints every 30s
- Runs diagnostic commands on failure
- Attempts a graceful restart before forcing one
- Sends a webhook notification with full context
This gives much richer failure information than a bare restart: always.
Lesson 4: Redis as the Source of Truth for Job State
When a worker crashes mid-job, you need to know exactly where it failed. Storing granular progress in Redis (not just “running/done”) lets you resume or retry intelligently:
redis.hset(f"job:{job_id}", mapping={
"status": "processing",
"step": "ffmpeg_encode",
"progress": 67,
"started_at": timestamp
})
Lesson 5: Shared Volumes Need Clear Ownership
When multiple containers write to the same volume (e.g., the output/ directory), file permission conflicts are inevitable. Use a single writer pattern — one container writes, others read via API.
Building a robust multi-service system is mostly about designing for failure gracefully. Each of these lessons came from a real production outage — the best kind of teacher.
