I’d been out of DevOps for a bit. Not like “took a sabbatical” out. Like actually out. And when I started looking at jobs again, the resume wasn’t doing the work.
So I built something: a multi-AZ ECS setup on AWS, Terraform all the way down. Something I could actually deploy, demo, and explain under pressure in an interview. Not “Hello World with extra steps,” and not infrastructure that only works on my laptop. After some thinking, I landed on a REST API on ECS Fargate.
Here’s how it went, including 7 debugging issues that taught me more than any tutorial did.
Architecture: Multi-AZ ECS with Terraform
I decided to build a REST API on AWS ECS Fargate, deployed across three availability zones, with complete Infrastructure as Code automation.
The stack:
- Compute: ECS Fargate with auto-scaling
- Database: RDS PostgreSQL with automated backups
- Load Balancing: Application Load Balancer with health checks
- Networking: VPC with public/private subnets across 3 AZs
- Container Registry: ECR with automated image scanning
- Monitoring: CloudWatch logs, metrics, and dashboards
- IaC: Terraform with modular design
The application: A simple TODO REST API with full CRUD operations. Not because TODOs are exciting, but because they’re familiar enough that interviewers can focus on the infrastructure, not the app logic.
Here’s what the final architecture looks like:
┌─────────────────────────────────────────────────────┐
│ AWS Account │
│ ┌──────────────────────────────────────────────┐ │
│ │ VPC (10.0.0.0/16) │ │
│ │ │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │Public AZ-1│ │Public AZ-2│ │Public AZ-3│ │ │
│ │ │ALB + NAT │ │ALB + NAT │ │ NAT │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ │ │
│ │ │ │ │ │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │Private AZ1│ │Private AZ2│ │Private AZ3│ │ │
│ │ │ECS Tasks │ │ECS Tasks │ │ RDS │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ │ │
│ └──────────────────────────────────────────────┘ │
│ │
│ Internet → ALB → ECS Fargate → RDS PostgreSQL │
└─────────────────────────────────────────────────────┘
Key decisions:
- ECS Fargate over EC2: No server management, pay per task
- Multi-AZ deployment: Actual high availability, not just a checkbox
- Terraform modules: Reusable components for networking, compute, database
- Docker provider in Terraform: Automated image builds (this was crucial)
What Actually Works
The thing deploys. One terraform apply and 10 minutes later you’ve got a live API. Multi-AZ, auto-scaling, CloudWatch metrics and logs, the whole thing. Response times are 200-300ms consistently. I can hand someone the repo and they can run it.
That’s what I wanted going in.
Fun Stuff I Ran Into During The Build
Issue #1: ECR authentication fails
ECS tasks failing with ResourceInitializationError: unable to pull secrets or registry auth. The ECS task execution role needs explicit ECR permissions even though the docs say AmazonECSTaskExecutionRolePolicy covers it. The managed policy wasn’t enough.
Added explicit permissions to the task execution role:
resource "aws_iam_role_policy" "ecs_task_execution_ecr" {
  name   = "${var.project_name}-${var.environment}-ecs-ecr-policy"
  role   = aws_iam_role.ecs_task_execution.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ]
      Resource = "*"
    }]
  })
}
Issue #2: RDS password fun
Terraform apply failed: The parameter MasterUserPassword is not a valid password. Only printable ASCII characters besides '/', '@', '"', ' ' may be used. RDS has password constraints that aren’t obvious anywhere in the docs, and random_password generates whatever it wants by default.
Constrained the character set:
resource "random_password" "db_password" {
  length           = 32
  special          = true
  override_special = "!#$%&*()-_=+[]{}<>:?"
}
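If you want to sanity-check a generated secret before an apply, the RDS rule is easy to encode: printable ASCII only, minus `/`, `@`, `"`, and space. A quick hedged check (function name is mine, not from the repo):

```python
def is_valid_rds_password(pw: str) -> bool:
    # RDS master passwords: printable ASCII (0x21-0x7E after excluding space),
    # and never '/', '@', or '"'
    return len(pw) > 0 and all(0x21 <= ord(c) <= 0x7E and c not in '/@"' for c in pw)

print(is_valid_rds_password("Str0ng!#Pass"))   # True
print(is_valid_rds_password("bad/pass"))       # False: contains '/'
print(is_valid_rds_password("has space"))      # False: contains ' '
```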
Issue #3: Container build architecture mismatch
Tasks starting but immediately exiting: exit code 255, gunicorn: exec format error. Built the image on an M3 Mac. ARM64 binary, x86_64 Fargate. Classic.
Force x86_64 builds in multiple places:
# Dockerfile
FROM --platform=linux/amd64 python:3.11-slim

# Local build
docker build --platform linux/amd64 -t ecs-todo-api:latest .

# In ECS task definition
runtime_platform {
  operating_system_family = "LINUX"
  cpu_architecture        = "X86_64"
}
Issue #4: It ain’t a real project unless DNS fails somewhere lol
Health checks passing, API calls failing: could not translate host name "xxx.rds.amazonaws.com:5432" to address. The RDS endpoint Terraform output includes the port. Pass that as the hostname and DNS breaks because it’s trying to resolve hostname:5432 as a single hostname.
Use db_address instead of db_endpoint:
environment_vars = {
  DB_HOST = module.rds.db_address  # Hostname only
  DB_PORT = "5432"                 # Port separate
  # ...
}
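If you're ever stuck with only the combined endpoint string, splitting it defensively is trivial (function name is mine; it assumes the `host:port` format RDS emits):

```python
def split_endpoint(endpoint: str, default_port: str = "5432"):
    # RDS endpoints look like "mydb.xxx.us-east-1.rds.amazonaws.com:5432";
    # rpartition handles the bare-hostname case too
    host, _, port = endpoint.rpartition(":")
    if not host:                     # no ":" present: endpoint is already just a hostname
        return endpoint, default_port
    return host, port

print(split_endpoint("mydb.xxx.us-east-1.rds.amazonaws.com:5432"))
```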
Issue #5: Code oopsies
API returning relation "todos" does not exist even though init_db() was in the code. Code inside if __name__ == '__main__' doesn’t run when Gunicorn imports the module. Only runs when you execute the script directly. Oops.
Move initialization to module load:
app = Flask(__name__)
CORS(app)

# Initialize database on startup (runs when module is imported)
try:
    init_db()
    logger.info("Database initialized successfully")
except Exception as e:
    logger.error(f"Failed to initialize database: {str(e)}")
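The underlying Python behavior is easy to demonstrate without Gunicorn at all. This sketch (module source is a stand-in, not the repo's code) simulates both entry points:

```python
import types

module_source = """
initialized = False

def init_db():
    global initialized
    initialized = True

if __name__ == "__main__":
    init_db()
"""

# Simulates `gunicorn app:app`: the module is *imported*, so __name__ == "app"
mod = types.ModuleType("app")
exec(compile(module_source, "app.py", "exec"), mod.__dict__)
print(mod.initialized)    # False: the guarded init_db() never ran

# Simulates `python app.py`: __name__ == "__main__", so the guard fires
ns = {"__name__": "__main__"}
exec(compile(module_source, "app.py", "exec"), ns)
print(ns["initialized"])  # True
```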
Issue #6: Where’s my Docker image?
Terraform created all the infrastructure. Tasks wouldn’t start. No Docker image in ECR. Right. I’d been so focused on the infra that I forgot to actually build and push the image. Terraform creates infrastructure, not applications.
Fixed by integrating the Docker provider directly into Terraform:
resource "docker_image" "app" {
  name = "${aws_ecr_repository.main.repository_url}:latest"

  build {
    context    = "${path.root}/../app"
    dockerfile = "Dockerfile"
    platform   = "linux/amd64"
  }

  triggers = {
    dockerfile_hash = filemd5("${path.root}/../app/Dockerfile")
    app_hash        = filemd5("${path.root}/../app/app.py")
  }
}

resource "docker_registry_image" "app" {
  name = docker_image.app.name
}
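One caveat with hashing just those two files: a change to anything else in app/ (requirements.txt, say) won't trigger a rebuild. A hedged alternative, sketched with Terraform's built-in fileset and filesha1, fingerprints the whole build context:

```hcl
  triggers = {
    # Rebuild when any file in the build context changes (adjust the glob as needed)
    dir_sha1 = sha1(join("", [
      for f in fileset("${path.root}/../app", "**") :
      filesha1("${path.root}/../app/${f}")
    ]))
  }
```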
Issue #7: Target group health check timing
Tasks starting fine, ALB returning 503 for a few minutes every deploy. Default health check: 30s interval, 5 checks to go healthy. That’s 2.5 minutes before a target is considered healthy even if the app is ready in 10 seconds.
Tuned health check settings:
health_check {
  enabled             = true
  path                = "/health"
  healthy_threshold   = 2   # Down from 5
  unhealthy_threshold = 3
  timeout             = 5
  interval            = 15  # Down from 30
  matcher             = "200"
}
Healthy targets in 30 seconds instead of 150. Much better.
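The time-to-healthy math is just interval times healthy threshold; a quick sanity check:

```python
def time_to_healthy(interval_s: int, healthy_threshold: int) -> int:
    # The ALB marks a target healthy after `healthy_threshold` consecutive passing checks
    return interval_s * healthy_threshold

print(time_to_healthy(30, 5))   # 150 seconds with the defaults
print(time_to_healthy(15, 2))   # 30 seconds with the tuned settings
```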
Wins: What Works
After all the debugging, here’s what I ended up with:
Complete automation:
terraform apply -var-file=environments/dev.tfvars
# 10 minutes later: fully deployed, working API
Full CRUD operations:
# Health check
$ curl $ALB_URL/health
{
  "status": "healthy",
  "environment": "dev",
  "version": "1.0.0"
}
# Create TODO
$ curl -X POST $ALB_URL/api/todos \
-H "Content-Type: application/json" \
-d '{"title":"Deploy to production","completed":false}'
# Response time: 200-300ms consistently
Auto-scaling:
- CPU-based scaling: Target 70% utilization
- Memory-based scaling: Target 80% utilization
- Scale from 1 to 10 tasks automatically
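For reference, CPU target tracking on an ECS service looks roughly like this in Terraform (resource names are illustrative, not the repo's):

```hcl
resource "aws_appautoscaling_target" "ecs" {
  service_namespace  = "ecs"
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.app.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 1
  max_capacity       = 10
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value = 70
  }
}
```

The memory policy is the same shape with `ECSServiceAverageMemoryUtilization` and a target of 80.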
Monitoring:
- CloudWatch dashboard with ECS, ALB, and RDS metrics
- Alarms for CPU, memory, and database connections
- Centralized logging with structured output
The Numbers: How much is this gonna cost?
One thing tutorials and blogs almost never talk about: cost. Here’s the breakdown:
Monthly costs (dev environment):
- ECS Fargate (1 task, 0.25 vCPU, 0.5GB): ~$12
- RDS db.t4g.micro, 20GB: ~$15
- Application Load Balancer: ~$20
- NAT Gateways (3 for HA): ~$105
- Data transfer & CloudWatch: ~$8
- Total: ~$160/month
Cost optimization pro tips:
- NAT Gateways dominate the bill; consider running one instead of three for dev
- Fargate pricing is predictable - no surprise costs from scaling
- RDS automated backups add minimal cost
- Scaling to 0 tasks saves ~$12/month when not demoing
For a production environment with 2+ tasks and db.t4g.small, budget ~$250/month.
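The arithmetic behind those numbers, for anyone who wants to poke at it (values are my observed estimates, not official pricing):

```python
# Rough monthly cost model for the dev environment (estimates, not quotes)
dev_monthly = {
    "ecs_fargate_1_task": 12,
    "rds_t4g_micro": 15,
    "alb": 20,
    "nat_gateways_x3": 105,        # ~$35 each; the single biggest line item
    "data_transfer_cloudwatch": 8,
}
total = sum(dev_monthly.values())
print(total)                        # 160

# Dropping to a single NAT gateway for dev saves roughly $70/month
single_nat_total = total - 2 * (dev_monthly["nat_gateways_x3"] // 3)
print(single_nat_total)             # 90
```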
Stuff Worth Knowing
Tutorials teach you syntax. Debugging production issues teaches you engineering. Not a profound insight, but it’s true every time.
Managed policies don’t always work. Default settings aren’t always right. The ARM vs x86 thing, the WSGI module import thing, the managed policy thing: the docs said it should work, it didn’t. Verify everything.
The Docker-in-Terraform integration took extra time up front but saved it on every deploy after. If you’re going to run the same thing 20 times, automate it on run two.
If You’re Building Your Portfolio Too
Build something you can actually deploy and demo, not a local-only thing. Document the stuff that breaks. Understand what it costs to run.
That’s it. The rest you figure out by doing it.
Future enhancements
Potential improvements for production readiness:
Short term:
- HTTPS with ACM certificate
- Custom domain with Route53
- Blue/green deployments with CodeDeploy
Medium term:
- WAF for additional security
- Database backup/restore procedures
- Integration tests in CI/CD
Long term:
- Multi-region deployment
- Disaster recovery testing
- Performance optimization and load testing
Final Thoughts
I went from “I’ve been out of the field for a bit” to “here’s a live URL, here’s the repo, and here are seven production issues I debugged along the way.”
That conversation is a lot better than handing someone a resume.
Resources
GitHub Repository
Live Demo: Available upon request (may be scaled down to save costs)
Tech Stack: AWS (ECS, RDS, VPC, ALB, ECR), Terraform, Docker, Python/Flask, PostgreSQL
Connect with me:
Built this project? Have questions about the debugging process? Found a better approach? I’d love to hear from you. Shoot me an email or reach out on LinkedIn.
