Deployment Optimization Guide
Overview
This document explains the optimized CI/CD pipeline that achieves near-zero downtime deployments with 2-3 minute total deployment time (down from 25+ minutes).
Key Improvements
Before (Old Pipeline - 25+ minutes)
- Build in CI (3-5 min)
- rsync all files including node_modules (2-3 min)
- docker-compose down - FULL DOWNTIME STARTS
- docker-compose build --no-cache - Rebuild from scratch (15-20 min)
- docker-compose up -d - Wait for health checks (1-2 min)
- TOTAL DOWNTIME: 18-25 minutes
After (New Pipeline - 2-3 minutes)
- Build Docker images in CI with layer caching (2-3 min)
- rsync only config files (exclude node_modules, .next) (10-20 sec)
- Tag current images for rollback (1 sec)
- Run migrations on running container (5-10 sec)
- Rolling update - start new containers alongside old (30 sec)
- Wait for health checks (10-20 sec)
- Docker automatically stops old containers when new ones are healthy
- TOTAL DOWNTIME: < 5 seconds with a true rolling update (just the container swap); 30-45 seconds with the current stop/rm/up workaround (see the note under Downtime Comparison)
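The new flow maps onto two GitLab CI stages. A minimal sketch of the job layout (job names and the deploy.sh wrapper are illustrative, not the actual .gitlab-ci.yml):

stages:
  - build
  - deploy

build_image:
  stage: build
  script:
    # Layer caching keeps this at 2-3 min instead of 15-20 (see Build Stage below)
    - docker build --cache-from reprise-app:latest --tag ${IMAGE_TAG} --tag reprise-app:latest .

deploy_app:
  stage: deploy
  script:
    # Steps 1-5 from "Deploy Stage" below: sync configs, tag backup,
    # run migrations, rolling update, verify health
    - ./scripts/deploy.sh   # hypothetical wrapper for the steps below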
How Rolling Updates Work
Docker Compose Rolling Update Strategy
# Start new containers with new image, keep old ones running
docker-compose up -d --no-deps --build app
# Docker Compose behavior:
# 1. Creates new container with new image
# 2. Starts new container
# 3. Waits for new container to be healthy (healthcheck)
# 4. Only then stops old container
# 5. Traffic switches to new container
Benefits
- Old container stays running until new one is healthy
- Traffic continues to old container during new container startup
- Automatic rollback if new container fails health checks
- Minimal downtime (only the few milliseconds to swap containers)
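This behavior depends on the service defining a healthcheck; without one, Compose considers the new container "started" immediately and swaps right away. A minimal sketch of the compose healthcheck, assuming the app answers HTTP on port 6007 (the exact test command in our compose file may differ):

services:
  app:
    image: reprise-app:latest
    healthcheck:
      # Mark the container healthy only once the app answers HTTP
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:6007/"]
      interval: 5s
      timeout: 3s
      retries: 5
      start_period: 20s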
Pipeline Stages Explained
Build Stage (2-3 minutes)
docker build --cache-from reprise-app:latest --tag ${IMAGE_TAG} .
What this does:
- Builds Docker image in CI runner workspace
- Uses --cache-from to reuse previous image layers (HUGE speedup)
- Tags with commit SHA for versioning
- Also tags as :latest for the next build's cache
Time savings:
- Without cache: 15-20 minutes (rebuilds everything)
- With cache: 2-3 minutes (only rebuilds changed layers)
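Note that --cache-from only helps if the previous image is present on the runner (locally, or pulled from a registry once one is set up). A sketch of the full build step under that assumption:

# Make the previous image available as a cache source
# (|| true so the very first build on a fresh runner doesn't fail)
docker pull reprise-app:latest || true

# Build using its layers as cache; tag with commit SHA and :latest
docker build \
  --cache-from reprise-app:latest \
  --tag ${IMAGE_TAG} \
  --tag reprise-app:latest \
  .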
Deploy Stage (30-60 seconds)
1. Smart rsync (10-20 seconds)
rsync --exclude='node_modules' --exclude='.next' ...
- Excludes built artifacts (they're in Docker image)
- Only syncs: docker-compose.yml, prisma/, public/, config files
- Before: 2-3 minutes to sync everything
- After: 10-20 seconds for configs only
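For reference, a fuller invocation might look like the following (the remote user, host, and destination path are hypothetical; the real file list lives in the CI config):

rsync -az \
  --exclude='node_modules' --exclude='.next' \
  docker-compose.yml prisma/ public/ \
  deploy@server:/srv/reprise/   # user, host, and path hypothetical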
2. Image Tagging (1 second)
docker tag reprise-app:latest reprise-app:backup-${TIMESTAMP}
- Tags current running image for instant rollback
- No image copying, just metadata update
- Keeps last 3 backups automatically
3. Zero-Downtime Migrations (5-10 seconds)
docker-compose exec -T app npx prisma migrate deploy
- Runs migrations on currently running container
- If migration fails, deployment stops, old version keeps running
- No downtime during migration
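In the deploy script this step acts as a gate; a sketch of the guard logic (script structure is illustrative):

# Apply pending migrations inside the running container.
# -T disables TTY allocation (required in CI).
if ! docker-compose exec -T app npx prisma migrate deploy; then
  echo "Migration failed - aborting deploy, old version keeps serving"
  exit 1   # pipeline stops here; no containers were touched
fi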
4. Rolling Update (30 seconds)
docker-compose up -d --no-deps --build app
- Starts new container with new image
- Old container keeps serving traffic
- Health checks verify new container
- Only swaps when new container is ready
5. Health Verification (10-20 seconds)
- Container health check (Docker native)
- HTTP endpoint health check
- Automatic rollback if fails
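A sketch of the HTTP verification loop, assuming the steps under Rollback Strategy below are wrapped in a rollback function (name illustrative):

# Poll the HTTP endpoint; give the new container up to ~20s to come up
for i in $(seq 1 10); do
  if curl -fsS http://localhost:6007/ > /dev/null; then
    echo "Health check passed"
    exit 0
  fi
  sleep 2
done

echo "Health check failed - rolling back"
rollback   # hypothetical function running the steps under Rollback Strategy
exit 1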
Downtime Comparison
| Deployment Step | Old Pipeline | New Pipeline |
|---|---|---|
| Build Phase | 3-5 min | 2-3 min (in parallel, no downtime) |
| Sync Files | 2-3 min (app offline) | 10-20 sec (app running) |
| Build Docker | 15-20 min (app offline) | 0 sec (pre-built in CI) |
| Stop Containers | N/A | 10-15 sec (graceful shutdown) |
| Start Container | 1-2 min (app offline) | 20-30 sec (with new image) |
| TOTAL DOWNTIME | 18-25 minutes | 30-45 seconds |
| TOTAL TIME | 25-30 minutes | 3-4 minutes |
Note: Currently using stop/rm/up sequence due to docker-compose v1.29.2 bug. Upgrading to docker-compose v2 would enable true zero-downtime rolling updates.
Cache Strategy
Docker Layer Caching
Docker builds in layers. Each RUN, COPY, ADD creates a layer:
FROM node:20-alpine # Layer 1 (cached)
WORKDIR /app # Layer 2 (cached)
COPY package*.json ./ # Layer 3 (cached if unchanged)
RUN npm ci # Layer 4 (cached if package.json unchanged)
COPY . . # Layer 5 (changes every commit)
RUN npm run build # Layer 6 (rebuilds if source changed)
Key insight: Only layers after the first change need rebuilding.
- If you only change TypeScript files: Layers 1-4 cached, only 5-6 rebuild
- If you add npm package: Layers 1-3 cached, 4-6 rebuild
- If you change Dockerfile: Everything rebuilds
With --cache-from:
docker build --cache-from reprise-app:latest ...
- Pulls previous image
- Uses its layers as cache
- Massive speedup (15-20 min → 2-3 min)
GitLab CI Cache
cache:
key: ${CI_COMMIT_REF_SLUG}
paths:
- node_modules/
- .next/cache/
- Caches npm packages between CI jobs
- Speeds up npm ci from 2-3 min → 30 sec
Rollback Strategy
Automatic Rollback
If health checks fail, automatic rollback:
# Health check fails
docker-compose down
# Find most recent backup
BACKUP_TAG=$(docker images reprise-app --format '{{.Tag}}' | grep '^backup-' | sort -r | head -1)
# Restore backup
docker tag reprise-app:$BACKUP_TAG reprise-app:latest
docker-compose up -d
Rollback time: 10-15 seconds
Manual Rollback
Trigger via GitLab UI:
- Go to CI/CD → Pipelines
- Find the pipeline you want to roll back
- Click "Rollback" manual job
- Restores previous image in 10-15 seconds
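A sketch of how that manual job can be declared in .gitlab-ci.yml (job name, host, and script path are illustrative):

rollback:
  stage: deploy
  when: manual          # shows up as a manual job in the pipeline UI
  script:
    # Runs the same steps as the automatic rollback above
    - ssh deploy@server '/srv/reprise/scripts/rollback.sh'   # hypothetical path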
Backup Retention
- Keeps last 3 backup images automatically
- Tagged with timestamp: reprise-app:backup-1697123456
- Older backups auto-deleted after successful deployment
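A sketch of the pruning step that enforces this retention (the cut-off of 3 matches the policy above; timestamp tags sort correctly as strings):

# List backup tags newest-first, skip the 3 most recent, delete the rest
docker images reprise-app --format '{{.Tag}}' \
  | grep '^backup-' \
  | sort -r \
  | tail -n +4 \
  | xargs -r -I{} docker rmi reprise-app:{}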
Monitoring Deployment
GitLab Pipeline Logs
Watch deployment progress:
Building Docker images with tag a1b2c3d...
✓ Image built successfully (2m 15s)
Performing rolling update...
✓ New container started (5s)
✓ Health check passed (12s)
✓ Old container stopped (2s)
Deployment complete! (2m 45s total)
Local Monitoring
SSH to server and watch:
# Watch container status
watch -n 1 docker-compose ps
# Follow logs during deployment
docker-compose logs -f app
# Check deployment version
curl https://opportunitydao.app/version.json
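The version.json check assumes the build writes the deployed commit into a static file; a sketch of how that can be done in the CI build step (file location and field names are illustrative):

# Record the deployed commit so the live site can be matched to a pipeline
echo "{\"commit\": \"${CI_COMMIT_SHORT_SHA}\", \"builtAt\": \"$(date -u +%FT%TZ)\"}" \
  > public/version.json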
Troubleshooting
Build Stage Failures
Problem: Docker build fails in CI
Check:
# View build logs in GitLab
# Common issues:
# - Syntax error in Dockerfile
# - npm install failure
# - Build step failure
Fix:
- Fix the error in code
- Commit and push
- Pipeline auto-retries
Deployment Stage Failures
Problem: Health checks fail
Check:
# SSH to server
docker-compose logs --tail=100 app
# Check container status
docker-compose ps
# Manual health check
curl -v http://localhost:6007/
Fix:
- If database issue: Check migrations
- If app crash: Check logs for errors
- If port conflict: Check what's on port 6007
- Trigger manual rollback if needed
Slow First Build
Problem: First build after changing package.json is slow
Expected: This is normal!
- npm layer cache invalidated
- All subsequent layers rebuild
- Takes 5-7 minutes instead of 2-3 minutes
Not a bug: Subsequent builds will be fast again
Performance Tips
1. Optimize Dockerfile Layer Order
Put least-changed items first:
# ✅ Good - package.json changes less than source code
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# ❌ Bad - invalidates cache on every source change
COPY . .
RUN npm ci
RUN npm run build
2. Use .dockerignore
Exclude unnecessary files from build context:
node_modules
.next
.git
*.log
.env*
Speeds up COPY operations and reduces image size.
3. Combine RUN Commands
# ✅ Good - single layer
RUN apk add --no-cache git python3 make && \
npm ci && \
npx prisma generate
# ❌ Bad - three layers
RUN apk add --no-cache git python3 make
RUN npm ci
RUN npx prisma generate
4. Multi-Stage Builds
Already implemented in our Dockerfile:
- deps stage: Install dependencies
- builder stage: Build application
- runner stage: Minimal production image
Result: Final image is much smaller (only runtime files).
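A condensed sketch of that shape (the actual Dockerfile has more steps; copy paths assume a standard Next.js layout):

FROM node:20-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci

FROM deps AS builder
COPY . .
RUN npm run build

FROM node:20-alpine AS runner
WORKDIR /app
# Copy only what the app needs at runtime
COPY --from=builder /app/.next ./.next
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
CMD ["npm", "start"]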
Advanced: Blue-Green Deployment
For even more sophisticated deployments, consider blue-green:
# docker-compose.blue-green.yml
services:
app-blue:
image: reprise-app:${BLUE_VERSION}
ports:
- "6007:6007"
app-green:
image: reprise-app:${GREEN_VERSION}
ports:
- "6008:6007"
Process:
- Deploy to green (port 6008)
- Test green endpoint
- Switch load balancer from blue → green
- Keep blue running for instant rollback
Downtime: effectively zero (just an LB config change)
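The switch itself can be as small as pointing the reverse proxy at the other port; a sketch assuming nginx fronts the app (config path is hypothetical):

# Point the proxy at green (6008) instead of blue (6007)
sed -i 's/127.0.0.1:6007/127.0.0.1:6008/' /etc/nginx/conf.d/reprise.conf

# Validate the config and reload without dropping connections
nginx -t && nginx -s reload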
Comparison with Other Strategies
Systemd (No Docker)
- Build time: Similar (2-3 min)
- Deployment: Faster (no Docker overhead)
- Rollback: Slower (need git checkout + rebuild)
- Isolation: None (shares host)
Kubernetes
- Build time: Similar
- Deployment: Similar (rolling update)
- Rollback: Instant (change ReplicaSet)
- Overhead: High (needs cluster)
PM2
- Build time: Similar
- Deployment: Faster (no containers)
- Rollback: Manual (git-based)
- Process management: Simpler
Our Docker Compose approach is the sweet spot:
- Fast deployments (2-3 min)
- Minimal downtime (< 5 sec)
- Easy rollback (10-15 sec)
- Good isolation (containers)
- Simple infrastructure (no K8s cluster needed)
Metrics
Track these metrics to monitor deployment health:
- Build time: Should be 2-4 minutes (with cache)
- Deployment time: Should be 30-60 seconds
- Total pipeline time: Should be 3-5 minutes
- Downtime: Should be 30-45 seconds today; < 5 seconds once compose v2 rolling updates are enabled
- Rollback time: Should be 10-15 seconds
If these metrics degrade, investigate:
- Cache not working
- Network issues
- Resource constraints on GitLab runner
Upgrading to Docker Compose V2 (Zero-Downtime)
The current setup uses docker-compose v1.29.2, which has a metadata bug preventing true rolling updates. Upgrading to v2 (docker compose plugin) would enable zero-downtime deployments.
Install Docker Compose V2
# SSH to server
ssh opportunitydao@server
# Install docker-compose-plugin (replaces standalone docker-compose)
sudo apt-get update
sudo apt-get install docker-compose-plugin
# Verify installation
docker compose version
# Should show: Docker Compose version v2.x.x
# Update GitLab CI to use 'docker compose' instead of 'docker-compose'
# (with space instead of hyphen)
Update CI/CD Pipeline
Once docker compose v2 is installed, update .gitlab-ci.yml:
# Change all instances of 'docker-compose' to 'docker compose'
- docker compose stop app deposit-processor
- docker compose rm -f app deposit-processor
- docker compose up -d app deposit-processor
# Or enable true rolling update:
- docker compose up -d --no-deps app deposit-processor
# This will start new containers before stopping old ones
Benefits:
- True zero-downtime (old containers run until new ones are healthy)
- No ContainerConfig metadata bugs
- Faster, more reliable deployments
- Better error messages
Next Steps
Immediate (Current Setup)
- Monitor deployment times (should be 3-4 min total)
- Track downtime (should be 30-45 sec)
- Test rollback functionality
Short-term Improvements
- Upgrade to docker compose v2 for zero-downtime
- Add deployment notifications (Slack, Discord)
- Track deployment metrics in Grafana
Medium-term Improvements
- Add smoke tests after deployment
- Add integration tests in CI
- Parallel test execution
Long-term Infrastructure
- Set up Docker registry for image storage
- Add staging environment
- Implement canary deployments (gradual rollout)