I’d been out of DevOps for a bit. Not like “took a sabbatical” out. Like actually out. And when I started looking at jobs again, the resume wasn’t doing the work.

So I built something: a multi-AZ ECS setup on AWS, Terraform all the way down. Something I could actually deploy, demo, and explain under pressure in an interview. Not "Hello World with extra steps," and not infrastructure that only works on my laptop.

Here’s how it went, including seven debugging issues that taught me more than any tutorial did.

Architecture: Multi-AZ ECS with Terraform

I decided to build a REST API on AWS ECS Fargate, deployed across three availability zones, with complete Infrastructure as Code automation.

The stack:

The application: A simple TODO REST API with full CRUD operations. Not because TODOs are exciting, but because they’re familiar enough that interviewers can focus on the infrastructure, not the app logic.

Here’s what the final architecture looks like:

┌─────────────────────────────────────────────────────┐
│                     AWS Account                     │
│  ┌───────────────────────────────────────────────┐  │
│  │ VPC (10.0.0.0/16)                             │  │
│  │                                               │  │
│  │  ┌───────────┐  ┌───────────┐  ┌───────────┐  │  │
│  │  │Public AZ-1│  │Public AZ-2│  │Public AZ-3│  │  │
│  │  │ALB + NAT  │  │ALB + NAT  │  │    NAT    │  │  │
│  │  └───────────┘  └───────────┘  └───────────┘  │  │
│  │        │              │              │        │  │
│  │  ┌───────────┐  ┌───────────┐  ┌───────────┐  │  │
│  │  │Private AZ1│  │Private AZ2│  │Private AZ3│  │  │
│  │  │ECS Tasks  │  │ECS Tasks  │  │    RDS    │  │  │
│  │  └───────────┘  └───────────┘  └───────────┘  │  │
│  └───────────────────────────────────────────────┘  │
│                                                     │
│    Internet → ALB → ECS Fargate → RDS PostgreSQL    │
└─────────────────────────────────────────────────────┘

Key decisions:

What Actually Works

The thing deploys. One terraform apply and 10 minutes later you’ve got a live API. Multi-AZ, auto-scaling, CloudWatch metrics and logs, the whole thing. Response times are 200-300ms consistently. I can hand someone the repo and they can run it.

That’s what I wanted going in.

Fun Stuff I Ran Into During The Build

Issue #1: ECR authentication fails

ECS tasks failing with ResourceInitializationError: unable to pull secrets or registry auth. The ECS task execution role needs explicit ECR permissions even though the docs say AmazonECSTaskExecutionRolePolicy covers it. The managed policy wasn’t enough.

Added explicit permissions to the task execution role:

resource "aws_iam_role_policy" "ecs_task_execution_ecr" {
  name = "${var.project_name}-${var.environment}-ecs-ecr-policy"
  role = aws_iam_role.ecs_task_execution.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ]
      Resource = "*"
    }]
  })
}

Issue #2: RDS password fun

Terraform apply failed: The parameter MasterUserPassword is not a valid password. Only printable ASCII characters besides '/', '@', '"', ' ' may be used. RDS has password constraints that aren’t obvious anywhere in the docs, and random_password generates whatever it wants by default.

Constrained the character set:

resource "random_password" "db_password" {
  length  = 32
  special = true
  override_special = "!#$%&*()-_=+[]{}<>:?"
}
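The error message itself spells out the rule: any printable ASCII character except '/', '@', '"', and space. A quick sanity check, sketched in Python (not part of the project, just a way to verify the character set above passes):

```python
import string

# RDS MasterUserPassword rule, straight from the error message:
# printable ASCII only, excluding '/', '@', '"', and space.
FORBIDDEN = set('/@" ')

def is_valid_rds_password(pw: str) -> bool:
    return all(
        c in string.printable
        and c not in FORBIDDEN
        and c not in "\t\n\r\x0b\x0c"  # reject remaining whitespace too
        for c in pw
    )

# Every character in the override_special set above is allowed:
print(all(is_valid_rds_password(c) for c in "!#$%&*()-_=+[]{}<>:?"))  # True
print(is_valid_rds_password("pa ss/word@1"))  # False
```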

Issue #3: ARM vs x86 container builds

Tasks starting but immediately exiting: exit code 255, gunicorn: exec format error. Built the image on an M3 Mac. ARM64 binary, x86_64 Fargate. Classic.

Force x86_64 builds in multiple places:

# Dockerfile
FROM --platform=linux/amd64 python:3.11-slim

# Build command
docker build --platform linux/amd64 -t ecs-todo-api:latest .

# ECS task definition (Terraform)
runtime_platform {
  operating_system_family = "LINUX"
  cpu_architecture        = "X86_64"
}

Issue #4: It ain’t a real project unless DNS fails somewhere lol

Health checks passing, API calls failing: could not translate host name "xxx.rds.amazonaws.com:5432" to address. The RDS endpoint Terraform output includes the port. Pass that as the hostname and DNS breaks because it’s trying to resolve hostname:5432 as a single hostname.

Use db_address instead of db_endpoint:

environment_vars = {
  DB_HOST = module.rds.db_address  # Hostname only
  DB_PORT = "5432"                 # Port separate
  # ...
}
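If you can’t change what gets passed in, the same fix works app-side. A sketch (the endpoint value here is made up) that strips the port suffix before handing the hostname to the driver:

```python
# An RDS "endpoint" output bundles host and port into one string:
# "mydb.abc123.us-east-1.rds.amazonaws.com:5432"
def split_endpoint(endpoint: str, default_port: int = 5432):
    host, sep, port = endpoint.rpartition(":")
    if sep and port.isdigit():
        return host, int(port)
    return endpoint, default_port  # no port suffix present

host, port = split_endpoint("mydb.abc123.us-east-1.rds.amazonaws.com:5432")
print(host)  # mydb.abc123.us-east-1.rds.amazonaws.com
print(port)  # 5432
```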

Issue #5: Code oopsies

API returning relation "todos" does not exist even though init_db() was in the code. Code inside if __name__ == '__main__' doesn’t run when Gunicorn imports the module. Only runs when you execute the script directly. Oops.

Move initialization to module load:

from flask import Flask
from flask_cors import CORS
import logging

logger = logging.getLogger(__name__)

app = Flask(__name__)
CORS(app)

# Initialize database on startup (runs when module is imported)
try:
    init_db()
    logger.info("Database initialized successfully")
except Exception as e:
    logger.error(f"Failed to initialize database: {str(e)}")
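To see why the guard never fires, here’s a standalone sketch that mimics what Gunicorn does: it imports the module instead of executing it, so `__name__` is the module name, not `"__main__"`:

```python
import types

# A stand-in for app.py, with init gated behind the __main__ guard.
source = """
ran_at_import = True
ran_in_main_guard = False
if __name__ == "__main__":
    ran_in_main_guard = True
"""

# Simulate `gunicorn app:app`: the module is *imported*, so its
# __name__ is "app", not "__main__".
mod = types.ModuleType("app")
exec(compile(source, "app.py", "exec"), mod.__dict__)

print(mod.ran_at_import)      # True  -- module-level code runs on import
print(mod.ran_in_main_guard)  # False -- the guard body is skipped
```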

Issue #6: Where’s my Docker image?

Terraform created all the infrastructure. Tasks wouldn’t start. No Docker image in ECR. Right. I’d been so focused on the infra that I forgot to actually build and push the image. Terraform creates infrastructure, not applications.

Fixed by integrating the Docker provider directly into Terraform:

resource "docker_image" "app" {
  name = "${aws_ecr_repository.main.repository_url}:latest"
  
  build {
    context    = "${path.root}/../app"
    dockerfile = "Dockerfile"
    platform   = "linux/amd64"
  }

  triggers = {
    dockerfile_hash = filemd5("${path.root}/../app/Dockerfile")
    app_hash        = filemd5("${path.root}/../app/app.py")
  }
}

resource "docker_registry_image" "app" {
  name = docker_image.app.name
}

Issue #7: Target group health check timing

Tasks starting fine, ALB returning 503 for a few minutes every deploy. Default health check: 30s interval, 5 checks to go healthy. That’s 2.5 minutes before a target is considered healthy even if the app is ready in 10 seconds.

Tuned health check settings:

health_check {
  enabled             = true
  path                = "/health"
  healthy_threshold   = 2  # Down from 5
  unhealthy_threshold = 3
  timeout             = 5
  interval            = 15  # Down from 30
  matcher             = "200"
}

Healthy targets in 30 seconds instead of 150. Much better.
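The math behind those numbers is just interval × threshold (a back-of-envelope sketch; it ignores the offset before the first check fires):

```python
def time_to_healthy(interval_s: int, healthy_threshold: int) -> int:
    # The ALB needs `healthy_threshold` consecutive passing checks,
    # spaced `interval_s` seconds apart, before routing traffic.
    return interval_s * healthy_threshold

print(time_to_healthy(30, 5))  # 150 -> the default 2.5-minute wait
print(time_to_healthy(15, 2))  # 30  -> the tuned settings
```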

Wins: What Works

After all the debugging, here’s what I ended up with:

Complete automation:

terraform apply -var-file=environments/dev.tfvars
# 10 minutes later: fully deployed, working API

Full CRUD operations:

# Health check
$ curl $ALB_URL/health
{
  "status": "healthy",
  "environment": "dev",
  "version": "1.0.0"
}

# Create TODO
$ curl -X POST $ALB_URL/api/todos \
  -H "Content-Type: application/json" \
  -d '{"title":"Deploy to production","completed":false}'

# Response time: 200-300ms consistently

Auto-scaling:

Monitoring:

The Numbers: How much is this gonna cost?

One thing tutorials and blogs almost never talk about: cost. Here’s the breakdown:

Monthly costs (dev environment):

Cost optimization pro tips:

For a production environment with 2+ tasks and db.t4g.small, budget ~$250/month.

Stuff Worth Knowing

Tutorials teach you syntax. Debugging production issues teaches you engineering. Not a profound insight, but it’s true every time.

Managed policies don’t always work. Default settings aren’t always right. The ARM vs x86 thing, the WSGI module import thing, the managed policy thing: the docs said it should work, it didn’t. Verify everything.

The Docker-in-Terraform integration took extra time up front but saved it on every deploy after. If you’re going to run the same thing 20 times, automate it on run two.

If You’re Building Your Portfolio Too

Build something you can actually deploy and demo, not a local-only thing. Document the stuff that breaks. Understand what it costs to run.

That’s it. The rest you figure out by doing it.

Future enhancements

Potential improvements for production readiness:

Short term:

Medium term:

Long term:

Final Thoughts

I went from “I’ve been out of the field for a bit” to “here’s a live URL, here’s the repo, and here are seven production issues I debugged along the way.”

That conversation is a lot better than handing someone a resume.


Resources

GitHub Repository
Live Demo: Available upon request (may be scaled down to save costs)
Tech Stack: AWS (ECS, RDS, VPC, ALB, ECR), Terraform, Docker, Python/Flask, PostgreSQL

Connect with me:


Built this project? Have questions about the debugging process? Found a better approach? I’d love to hear from you. Shoot me an email or reach out on LinkedIn.