
Deployment Optimization Guide

Overview

This document explains the optimized CI/CD pipeline that achieves near-zero downtime deployments with 2-3 minute total deployment time (down from 25+ minutes).

Key Improvements

Before (Old Pipeline - 25+ minutes)

  1. Build in CI (3-5 min)
  2. rsync all files including node_modules (2-3 min)
  3. docker-compose down - FULL DOWNTIME STARTS
  4. docker-compose build --no-cache - Rebuild from scratch (15-20 min)
  5. docker-compose up -d
  6. Wait for health checks (1-2 min)
  7. TOTAL DOWNTIME: 18-25 minutes

After (New Pipeline - 2-3 minutes)

  1. Build Docker images in CI with layer caching (2-3 min)
  2. rsync only config files (exclude node_modules, .next) (10-20 sec)
  3. Tag current images for rollback (1 sec)
  4. Run migrations on running container (5-10 sec)
  5. Rolling update - start new containers alongside old (30 sec)
  6. Wait for health checks (10-20 sec)
  7. Docker automatically stops old containers when new ones are healthy
  8. TOTAL DOWNTIME: < 5 seconds with a true rolling update (currently 30-45 seconds with the stop/rm/up workaround; see the docker-compose v1 note below)

How Rolling Updates Work

Docker Compose Rolling Update Strategy

# Start new containers with new image, keep old ones running
docker-compose up -d --no-deps --build app

# Docker Compose behavior:
# 1. Creates new container with new image
# 2. Starts new container
# 3. Waits for new container to be healthy (healthcheck)
# 4. Only then stops old container
# 5. Traffic switches to new container
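Step 3 above only works if the service defines a healthcheck. A minimal compose sketch, assuming the app answers HTTP on port 6007 (the endpoint, timings, and use of busybox wget are assumptions, not the project's actual config):

```yaml
services:
  app:
    image: reprise-app:latest
    healthcheck:
      # Succeeds when the app responds to a plain GET
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:6007/"]
      interval: 10s      # how often to probe
      timeout: 5s        # per-probe limit
      retries: 3         # consecutive failures before "unhealthy"
      start_period: 20s  # grace period for app startup
```

Without a healthcheck, Compose considers the new container "ready" as soon as the process starts, which defeats the safe swap described above.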

Benefits

  • Old container stays running until new one is healthy
  • Traffic continues to old container during new container startup
  • Automatic rollback if new container fails health checks
  • Minimal downtime (only the brief container swap)

Pipeline Stages Explained

Build Stage (2-3 minutes)

docker build --cache-from reprise-app:latest --tag ${IMAGE_TAG} .

What this does:

  • Builds Docker image in CI runner workspace
  • Uses --cache-from to reuse previous image layers (HUGE speedup)
  • Tags with commit SHA for versioning
  • Also tags as :latest for next build's cache

Time savings:

  • Without cache: 15-20 minutes (rebuilds everything)
  • With cache: 2-3 minutes (only rebuilds changed layers)

Deploy Stage (30-60 seconds)

1. Smart rsync (10-20 seconds)

rsync --exclude='node_modules' --exclude='.next' ...
  • Excludes built artifacts (they're in Docker image)
  • Only syncs: docker-compose.yml, prisma/, public/, config files
  • Before: 2-3 minutes to sync everything
  • After: 10-20 seconds for configs only

2. Image Tagging (1 second)

docker tag reprise-app:latest reprise-app:backup-${TIMESTAMP}
  • Tags current running image for instant rollback
  • No image copying, just metadata update
  • Keeps last 3 backups automatically

3. Zero-Downtime Migrations (5-10 seconds)

docker-compose exec -T app npx prisma migrate deploy
  • Runs migrations on currently running container
  • If migration fails, deployment stops, old version keeps running
  • No downtime during migration

4. Rolling Update (30 seconds)

docker-compose up -d --no-deps --build app
  • Starts new container with new image
  • Old container keeps serving traffic
  • Health checks verify new container
  • Only swaps when new container is ready

5. Health Verification (10-20 seconds)

  • Container health check (Docker native)
  • HTTP endpoint health check
  • Automatic rollback if fails
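The verification step can be sketched as a small retry gate. A hypothetical helper (the wrapped command, attempt count, and delay are assumptions, not the pipeline's actual script):

```shell
# Retry a health command; succeed as soon as it passes,
# fail after the given number of attempts.
wait_healthy() {
  cmd=$1; tries=$2; delay=$3
  i=0
  while [ "$i" -lt "$tries" ]; do
    if $cmd; then return 0; fi
    i=$((i+1))
    sleep "$delay"
  done
  return 1
}

# In the pipeline this would wrap the real endpoint check, e.g.:
# wait_healthy "curl -fsS http://localhost:6007/" 10 2 || run_rollback
```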

Downtime Comparison

| Deployment Step  | Old Pipeline            | New Pipeline                       |
|------------------|-------------------------|------------------------------------|
| Build Phase      | 3-5 min                 | 2-3 min (in parallel, no downtime) |
| Sync Files       | 2-3 min (app offline)   | 10-20 sec (app running)            |
| Build Docker     | 15-20 min (app offline) | 0 sec (pre-built in CI)            |
| Stop Containers  | N/A                     | 10-15 sec (graceful shutdown)      |
| Start Container  | 1-2 min (app offline)   | 20-30 sec (with new image)         |
| TOTAL DOWNTIME   | 18-25 minutes           | 30-45 seconds                      |
| TOTAL TIME       | 25-30 minutes           | 3-4 minutes                        |

Note: Currently using stop/rm/up sequence due to docker-compose v1.29.2 bug. Upgrading to docker-compose v2 would enable true zero-downtime rolling updates.

Cache Strategy

Docker Layer Caching

Docker builds in layers. Each RUN, COPY, ADD creates a layer:

FROM node:20-alpine      # Layer 1 (cached)
WORKDIR /app             # Layer 2 (cached)
COPY package*.json ./    # Layer 3 (cached if unchanged)
RUN npm ci               # Layer 4 (cached if package.json unchanged)
COPY . .                 # Layer 5 (changes every commit)
RUN npm run build        # Layer 6 (rebuilds if source changed)

Key insight: Only layers after the first change need rebuilding.

  • If you only change TypeScript files: Layers 1-4 cached, only 5-6 rebuild
  • If you add npm package: Layers 1-3 cached, 4-6 rebuild
  • If you change Dockerfile: Everything rebuilds

With --cache-from:

docker build --cache-from reprise-app:latest ...
  • Pulls previous image
  • Uses its layers as cache
  • Massive speedup (15-20 min → 2-3 min)

GitLab CI Cache

cache:
  key: ${CI_COMMIT_REF_SLUG}
  paths:
    - node_modules/
    - .next/cache/
  • Caches npm packages between CI jobs
  • Speeds up npm ci from 2-3 min → 30 sec

Rollback Strategy

Automatic Rollback

If health checks fail, automatic rollback:

# Health check fails
docker-compose down

# Find most recent backup tag (docker images prints repository and tag
# in separate columns, so list the tags directly instead of grepping
# for a combined "repo:tag" string, which would never match)
BACKUP_TAG=$(docker images --format '{{.Tag}}' reprise-app | grep '^backup-' | sort -r | head -n 1)

# Restore backup
docker tag reprise-app:$BACKUP_TAG reprise-app:latest
docker-compose up -d

Rollback time: 10-15 seconds

Manual Rollback

Trigger via GitLab UI:

  1. Go to CI/CD → Pipelines
  2. Find the pipeline you want to roll back
  3. Click the "Rollback" manual job
  4. Restores previous image in 10-15 seconds

Backup Retention

  • Keeps last 3 backup images automatically
  • Tagged with timestamp: reprise-app:backup-1697123456
  • Older backups auto-deleted after successful deployment
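The retention rule above (keep the newest 3, delete older) can be sketched in shell. Here TAGS stands in for the output of `docker images --format '{{.Tag}}' reprise-app`, so the list itself is illustrative:

```shell
# Hypothetical pruning sketch: keep the 3 newest backup tags.
TAGS="backup-1697123456
backup-1697120000
backup-1697110000
backup-1697100000"

# Timestamps are fixed-width, so lexicographic sort matches age order.
OLD=$(printf '%s\n' "$TAGS" | sort -r | tail -n +4)
echo "$OLD"

# In the real pipeline each tag in $OLD would then be removed with:
# docker rmi "reprise-app:$TAG"
```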

Monitoring Deployment

GitLab Pipeline Logs

Watch deployment progress:

Building Docker images with tag a1b2c3d...
✓ Image built successfully (2m 15s)

Performing rolling update...
✓ New container started (5s)
✓ Health check passed (12s)
✓ Old container stopped (2s)

Deployment complete! (2m 45s total)

Local Monitoring

SSH to server and watch:

# Watch container status
watch -n 1 docker-compose ps

# Follow logs during deployment
docker-compose logs -f app

# Check deployment version
curl https://opportunitydao.app/version.json

Troubleshooting

Build Stage Failures

Problem: Docker build fails in CI

Check:

# View build logs in GitLab
# Common issues:
# - Syntax error in Dockerfile
# - npm install failure
# - Build step failure

Fix:

  • Fix the error in code
  • Commit and push
  • Pipeline auto-retries

Deployment Stage Failures

Problem: Health checks fail

Check:

# SSH to server
docker-compose logs --tail=100 app

# Check container status
docker-compose ps

# Manual health check
curl -v http://localhost:6007/

Fix:

  • If database issue: Check migrations
  • If app crash: Check logs for errors
  • If port conflict: Check what's on port 6007
  • Trigger manual rollback if needed

Slow First Build

Problem: First build after changing package.json is slow

Expected: This is normal!

  • npm layer cache invalidated
  • All subsequent layers rebuild
  • Takes 5-7 minutes instead of 2-3 minutes

Not a bug: Subsequent builds will be fast again

Performance Tips

1. Optimize Dockerfile Layer Order

Put least-changed items first:

# ✅ Good - package.json changes less than source code
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# ❌ Bad - invalidates cache on every source change
COPY . .
RUN npm ci
RUN npm run build

2. Use .dockerignore

Exclude unnecessary files from build context:

node_modules
.next
.git
*.log
.env*

Speeds up COPY operations and reduces image size.

3. Combine RUN Commands

# ✅ Good - single layer
RUN apk add --no-cache git python3 make && \
    npm ci && \
    npx prisma generate

# ❌ Bad - three layers
RUN apk add --no-cache git python3 make
RUN npm ci
RUN npx prisma generate

4. Multi-Stage Builds

Already implemented in our Dockerfile:

  • deps stage: Install dependencies
  • builder stage: Build application
  • runner stage: Minimal production image

Result: Final image is much smaller (only runtime files).
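A condensed sketch of that three-stage layout (the stage names come from the list above; the specific copy paths are assumptions about a typical Next.js setup, not the project's actual Dockerfile):

```dockerfile
# deps: install dependencies once, cached by the lockfile
FROM node:20-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci

# builder: compile the app on top of the cached deps layer
FROM node:20-alpine AS builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN npm run build

# runner: copy only runtime artifacts into a minimal image
FROM node:20-alpine AS runner
WORKDIR /app
COPY --from=builder /app/.next ./.next
COPY --from=builder /app/package*.json ./
COPY --from=deps /app/node_modules ./node_modules
EXPOSE 6007
CMD ["npm", "start"]
```

Because the final `FROM` starts fresh, build tools and source files from the earlier stages never reach production.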

Advanced: Blue-Green Deployment

For even more sophisticated deployments, consider blue-green:

# docker-compose.blue-green.yml
services:
  app-blue:
    image: reprise-app:${BLUE_VERSION}
    ports:
      - "6007:6007"

  app-green:
    image: reprise-app:${GREEN_VERSION}
    ports:
      - "6008:6007"

Process:

  1. Deploy to green (port 6008)
  2. Test green endpoint
  3. Switch load balancer from blue → green
  4. Keep blue running for instant rollback

Downtime: effectively zero (just a load balancer config change)
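The load balancer switch in step 3 is typically a one-line change. A hypothetical nginx upstream sketch (nginx itself is an assumption; the ports come from the compose file above):

```nginx
upstream reprise {
    # Traffic goes to blue; to cut over, swap which line is commented
    # and reload nginx ("nginx -s reload") - existing connections drain.
    server 127.0.0.1:6007;   # blue
    # server 127.0.0.1:6008; # green
}
```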

Comparison with Other Strategies

Systemd (No Docker)

  • Build time: Similar (2-3 min)
  • Deployment: Faster (no Docker overhead)
  • Rollback: Slower (need git checkout + rebuild)
  • Isolation: None (shares host)

Kubernetes

  • Build time: Similar
  • Deployment: Similar (rolling update)
  • Rollback: Instant (change ReplicaSet)
  • Overhead: High (needs cluster)

PM2

  • Build time: Similar
  • Deployment: Faster (no containers)
  • Rollback: Manual (git-based)
  • Process management: Simpler

Our Docker Compose approach is the sweet spot:

  • Fast deployments (2-3 min)
  • Minimal downtime (< 5 sec)
  • Easy rollback (10-15 sec)
  • Good isolation (containers)
  • Simple infrastructure (no K8s cluster needed)

Metrics

Track these metrics to monitor deployment health:

  • Build time: Should be 2-4 minutes (with cache)
  • Deployment time: Should be 30-60 seconds
  • Total pipeline time: Should be 3-5 minutes
  • Downtime: Should be < 5 seconds
  • Rollback time: Should be 10-15 seconds

If these metrics degrade, investigate:

  • Cache not working
  • Network issues
  • Resource constraints on GitLab runner

Upgrading to Docker Compose V2 (Zero-Downtime)

The current setup uses docker-compose v1.29.2, which has a metadata bug preventing true rolling updates. Upgrading to v2 (docker compose plugin) would enable zero-downtime deployments.

Install Docker Compose V2

# SSH to server
ssh opportunitydao@server

# Install docker-compose-plugin (replaces standalone docker-compose)
sudo apt-get update
sudo apt-get install docker-compose-plugin

# Verify installation
docker compose version
# Should show: Docker Compose version v2.x.x

# Update GitLab CI to use 'docker compose' instead of 'docker-compose'
# (with space instead of hyphen)

Update CI/CD Pipeline

Once docker compose v2 is installed, update .gitlab-ci.yml:

# Change all instances of 'docker-compose' to 'docker compose'
- docker compose stop app deposit-processor
- docker compose rm -f app deposit-processor
- docker compose up -d app deposit-processor

# Or enable true rolling update:
- docker compose up -d --no-deps app deposit-processor
# This will start new containers before stopping old ones

Benefits:

  • True zero-downtime (old containers run until new ones are healthy)
  • No ContainerConfig metadata bugs
  • Faster, more reliable deployments
  • Better error messages

Next Steps

Immediate (Current Setup)

  • Monitor deployment times (should be 3-4 min total)
  • Track downtime (should be 30-45 sec)
  • Test rollback functionality

Short-term Improvements

  • Upgrade to docker compose v2 for zero-downtime
  • Add deployment notifications (Slack, Discord)
  • Track deployment metrics in Grafana

Medium-term Improvements

  • Add smoke tests after deployment
  • Add integration tests in CI
  • Parallel test execution

Long-term Infrastructure

  • Set up Docker registry for image storage
  • Add staging environment
  • Implement canary deployments (gradual rollout)