The Idle EC2 Reaper: Automating Cloud Resource Shutdowns in Non-Production Environments

Your staging and development environments run 24/7 but your engineers work 8 hours a day. Learn how to build a Terraform-managed Lambda function that automatically shuts down idle non-prod resources at night, saving 60% of your non-production compute bill instantly.

TL;DR

Your non-production AWS environments sit idle roughly 16 hours out of every 24. That is about two-thirds of weekday compute wasted, before weekends are even counted.

  • The Stack: Terraform (infrastructure as code), AWS Lambda (serverless execution), EventBridge (cron scheduling), Python (Lambda runtime).
  • The Verdict: A single afternoon of engineering work can save your company $2,000 to $5,000 per month by shutting down dev/staging resources outside business hours. The ROI is immediate and permanent.

Want this deployed to your AWS account today? I will set up the Reaper, configure the schedules around your team's working hours and time zones, and hand you the Terraform code.

Book a Free Consultation — one afternoon of work, months of savings.


The $3,000 Night Shift

Open your AWS Cost Explorer right now. Filter by environment tag — staging, dev, qa, whatever your team uses. Now look at the 24-hour usage graph.

You will see a flat line. No dips at night. No gaps on weekends. Your staging environment runs exactly the same at 3 AM on a Sunday as it does at 10 AM on a Monday.

Your engineers work roughly 8 hours a day, 5 days a week. That is 40 out of 168 hours per week. Your non-production infrastructure is idle for 76% of every week.

If your non-prod compute bill is $5,000/month, you are burning $3,800 every month on servers that are doing absolutely nothing.

The Solution: The EC2 Reaper

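The Reaper is a Lambda function triggered by EventBridge cron rules on the following schedule:

  1. 7 PM (local time), Monday to Friday: Stops all EC2 instances, scales ECS services down to zero, and stops RDS instances that carry both tags: Environment (dev, staging, or qa) and Reaper: enabled.
  2. 8 AM (local time), Monday to Friday: Starts everything back up before engineers arrive.
  3. Friday 7 PM: Shuts everything down for the weekend. Nothing starts again on Saturday or Sunday because the start rule only runs Monday to Friday.
  4. Monday 8 AM: Wakes everything up for the new work week.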
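EventBridge rules evaluate cron expressions in UTC, so the local times in the schedule above have to be translated before they go into a schedule_expression. Below is a minimal sketch of that conversion, assuming the standard-library zoneinfo database is available. The function name utc_cron and the example zones are illustrative and not part of the original Reaper.

```python
from datetime import date, datetime, time, timezone
from zoneinfo import ZoneInfo

DAYS = ["MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN"]


def utc_cron(local_hour: int, tz_name: str, on: date) -> str:
    """Build an EventBridge cron for Mon-Fri at local_hour in tz_name.

    The result is only valid for the UTC offset in effect on `on`,
    so it needs regenerating when daylight saving time changes.
    """
    local = datetime.combine(on, time(local_hour), tzinfo=ZoneInfo(tz_name))
    utc = local.astimezone(timezone.utc)
    # The UTC calendar day can differ from the local one by a day.
    shift = (utc.date() - on).days
    days = ",".join(DAYS[(i + shift) % 7] for i in range(5))
    return f"cron({utc.minute} {utc.hour} ? * {days} *)"


# 7 PM in Paris during winter (UTC+1) is 18:00 UTC on the same day.
print(utc_cron(19, "Europe/Paris", date(2024, 1, 15)))
# 7 PM in Los Angeles lands on the next UTC day, so the weekdays shift.
print(utc_cron(19, "America/Los_Angeles", date(2024, 1, 15)))
```

The day shift matters for teams west of UTC: a Friday evening stop in California is a Saturday 03:00 rule in UTC.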

The Lambda Function

reaper.py

import boto3
import logging
 
logger = logging.getLogger()
logger.setLevel(logging.INFO)
 
ec2 = boto3.client("ec2")
rds = boto3.client("rds")
ecs = boto3.client("ecs")
 
 
def get_tagged_instances(action: str) -> list[str]:
    """Find EC2 instances tagged for the Reaper."""
    filters = [
        {"Name": "tag:Environment", "Values": ["dev", "staging", "qa"]},
        {"Name": "tag:Reaper", "Values": ["enabled"]},
    ]
 
    if action == "stop":
        filters.append(
            {"Name": "instance-state-name", "Values": ["running"]}
        )
    elif action == "start":
        filters.append(
            {"Name": "instance-state-name", "Values": ["stopped"]}
        )
 
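    # Paginate so accounts with many instances are not silently truncated.
    paginator = ec2.get_paginator("describe_instances")
    instance_ids = []
    for page in paginator.paginate(Filters=filters):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                instance_ids.append(instance["InstanceId"])
 
    return instance_ids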
 
 
def stop_rds_instances():
    """Stop all non-prod RDS instances."""
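    # Paginate: a single describe_db_instances call returns a limited page.
    paginator = rds.get_paginator("describe_db_instances")
    dbs = [db for page in paginator.paginate() for db in page["DBInstances"]]
    for db in dbs: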
        tags = rds.list_tags_for_resource(
            ResourceName=db["DBInstanceArn"]
        )["TagList"]
        tag_map = {t["Key"]: t["Value"] for t in tags}
 
        if (
            tag_map.get("Environment") in ("dev", "staging", "qa")
            and tag_map.get("Reaper") == "enabled"
            and db["DBInstanceStatus"] == "available"
        ):
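            try:
                rds.stop_db_instance(
                    DBInstanceIdentifier=db["DBInstanceIdentifier"]
                )
                logger.info(f"Stopped RDS: {db['DBInstanceIdentifier']}")
            except rds.exceptions.InvalidDBInstanceStateFault as exc:
                # The instance is in a state that cannot be stopped right now;
                # log and continue so one database does not abort the run.
                logger.warning(
                    f"Skipped RDS {db['DBInstanceIdentifier']}: {exc}"
                )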
 
 
def start_rds_instances():
    """Start all non-prod RDS instances."""
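    # Paginate: a single describe_db_instances call returns a limited page.
    paginator = rds.get_paginator("describe_db_instances")
    dbs = [db for page in paginator.paginate() for db in page["DBInstances"]]
    for db in dbs: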
        tags = rds.list_tags_for_resource(
            ResourceName=db["DBInstanceArn"]
        )["TagList"]
        tag_map = {t["Key"]: t["Value"] for t in tags}
 
        if (
            tag_map.get("Environment") in ("dev", "staging", "qa")
            and tag_map.get("Reaper") == "enabled"
            and db["DBInstanceStatus"] == "stopped"
        ):
            rds.start_db_instance(
                DBInstanceIdentifier=db["DBInstanceIdentifier"]
            )
            logger.info(f"Started RDS: {db['DBInstanceIdentifier']}")
 
 
def scale_ecs_services(desired_count: int):
    """Scale non-prod ECS services to the desired count."""
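    clusters = [
        arn
        for page in ecs.get_paginator("list_clusters").paginate()
        for arn in page["clusterArns"]
    ]
 
    for cluster_arn in clusters:
        # list_services returns only a small page per call, so paginate.
        services = [
            arn
            for page in ecs.get_paginator("list_services").paginate(
                cluster=cluster_arn
            )
            for arn in page["serviceArns"]
        ]
 
        for service_arn in services:
            # Tags are only returned when explicitly requested via include.
            service_detail = ecs.describe_services(
                cluster=cluster_arn,
                services=[service_arn],
                include=["TAGS"],
            )["services"][0]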
 
            tags = {
                t["key"]: t["value"]
                for t in service_detail.get("tags", [])
            }
 
            if (
                tags.get("Environment") in ("dev", "staging", "qa")
                and tags.get("Reaper") == "enabled"
            ):
                ecs.update_service(
                    cluster=cluster_arn,
                    service=service_arn,
                    desiredCount=desired_count,
                )
                logger.info(
                    f"Scaled ECS {service_detail['serviceName']} "
                    f"to {desired_count}"
                )
 
 
def handler(event, context):
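    action = event.get("action", "stop")
    if action not in ("stop", "start"):
        # Fail loudly instead of silently returning None on a typo.
        raise ValueError(f"Unknown action: {action!r}")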
 
    if action == "stop":
        # Stop EC2
        instance_ids = get_tagged_instances("stop")
        if instance_ids:
            ec2.stop_instances(InstanceIds=instance_ids)
            logger.info(f"Stopped EC2 instances: {instance_ids}")
 
        # Stop RDS
        stop_rds_instances()
 
        # Scale ECS to 0
        scale_ecs_services(desired_count=0)
 
        return {
            "action": "stop",
            "ec2_stopped": len(instance_ids),
            "message": "Non-prod environments shut down for the night.",
        }
 
    elif action == "start":
        # Start EC2
        instance_ids = get_tagged_instances("start")
        if instance_ids:
            ec2.start_instances(InstanceIds=instance_ids)
            logger.info(f"Started EC2 instances: {instance_ids}")
 
        # Start RDS
        start_rds_instances()
 
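        # Scale ECS back up. Every managed service comes back with exactly
        # one task, whatever its desired count was before the shutdown.
        scale_ecs_services(desired_count=1)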
 
        return {
            "action": "start",
            "ec2_started": len(instance_ids),
            "message": "Non-prod environments are waking up.",
        }
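
The handler above brings every managed ECS service back at a desired count of 1, which undersizes any service that ran more tasks before the shutdown. One possible approach, sketched here as a pure function, is to record the pre-shutdown count in a tag on stop and read it back on start. The tag key ReaperDesiredCount and the function plan_scale are hypothetical names, not part of the original code.

```python
SAVED_COUNT_TAG = "ReaperDesiredCount"  # hypothetical tag key


def plan_scale(
    action: str, current_count: int, tags: dict[str, str]
) -> tuple[int, dict[str, str]]:
    """Return (new desired count, tags to write) for one ECS service."""
    if action == "stop":
        if current_count == 0:
            # Already scaled down: do not overwrite a previously saved count.
            return 0, {}
        return 0, {SAVED_COUNT_TAG: str(current_count)}
    if action == "start":
        saved = tags.get(SAVED_COUNT_TAG, "1")
        # Fall back to 1 when the tag is missing or is not a number.
        return (int(saved) if saved.isdigit() else 1), {}
    raise ValueError(f"Unknown action: {action!r}")


print(plan_scale("stop", 3, {}))
print(plan_scale("start", 0, {"ReaperDesiredCount": "3"}))
```

Inside scale_ecs_services, the current count would come from service_detail["desiredCount"], and the returned tags would be written with ecs.tag_resource, which requires granting the role the ecs:TagResource permission.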

The Terraform Module

Deploy the Reaper as infrastructure-as-code so it is version controlled, reviewable, and reproducible:

reaper.tf

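# Package reaper.py into the zip that the function below references.
# The paths assume reaper.py sits next to this file.
data "archive_file" "reaper_zip" {
  type        = "zip"
  source_file = "${path.module}/reaper.py"
  output_path = "${path.module}/reaper.zip"
}
 
resource "aws_lambda_function" "reaper" {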
  function_name = "ec2-reaper"
  runtime       = "python3.12"
  handler       = "reaper.handler"
  timeout       = 300
  memory_size   = 256
 
  filename         = data.archive_file.reaper_zip.output_path
  source_code_hash = data.archive_file.reaper_zip.output_base64sha256
 
  role = aws_iam_role.reaper_role.arn
 
  environment {
    variables = {
      SLACK_WEBHOOK_URL = var.slack_webhook_url
    }
  }
}
 
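# Stop everything at 19:00 UTC on weekdays.
# EventBridge rule cron expressions are evaluated in UTC, so shift the hour
# to match 7 PM in your team's time zone.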
resource "aws_cloudwatch_event_rule" "stop_weekday" {
  name                = "reaper-stop-weekday"
  schedule_expression = "cron(0 19 ? * MON-FRI *)"
}
 
resource "aws_cloudwatch_event_target" "stop_weekday" {
  rule = aws_cloudwatch_event_rule.stop_weekday.name
  arn  = aws_lambda_function.reaper.arn
  input = jsonencode({ action = "stop" })
}
 
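# Start everything at 08:00 UTC on weekdays (UTC again, not local time).
# No start rule fires on Saturday or Sunday, so resources stay down all weekend.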
resource "aws_cloudwatch_event_rule" "start_weekday" {
  name                = "reaper-start-weekday"
  schedule_expression = "cron(0 8 ? * MON-FRI *)"
}
 
resource "aws_cloudwatch_event_target" "start_weekday" {
  rule = aws_cloudwatch_event_rule.start_weekday.name
  arn  = aws_lambda_function.reaper.arn
  input = jsonencode({ action = "start" })
}
 
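# Stop everything Friday night for the entire weekend.
# On Fridays this fires at the same moment as reaper-stop-weekday, so it
# duplicates that rule; keep it only if the weekend stop should be
# configurable separately.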
resource "aws_cloudwatch_event_rule" "stop_weekend" {
  name                = "reaper-stop-weekend"
  schedule_expression = "cron(0 19 ? * FRI *)"
}
 
resource "aws_cloudwatch_event_target" "stop_weekend" {
  rule = aws_cloudwatch_event_rule.stop_weekend.name
  arn  = aws_lambda_function.reaper.arn
  input = jsonencode({ action = "stop" })
}
 
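# Allow each EventBridge rule to invoke the function. Without this
# permission the rules fire but the Lambda is never executed.
resource "aws_lambda_permission" "allow_eventbridge" {
  for_each = {
    stop_weekday  = aws_cloudwatch_event_rule.stop_weekday.arn
    start_weekday = aws_cloudwatch_event_rule.start_weekday.arn
    stop_weekend  = aws_cloudwatch_event_rule.stop_weekend.arn
  }
 
  statement_id  = "AllowExecutionFromEventBridge-${each.key}"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.reaper.function_name
  principal     = "events.amazonaws.com"
  source_arn    = each.value
}
 
# IAM Role with least-privilege permissions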
resource "aws_iam_role" "reaper_role" {
  name = "ec2-reaper-role"
 
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}
 
resource "aws_iam_role_policy" "reaper_policy" {
  name = "ec2-reaper-policy"
  role = aws_iam_role.reaper_role.id
 
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
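      # DescribeInstances does not support resource-level permissions or
      # resource-tag conditions, so it needs its own unconditional statement.
      {
        Effect   = "Allow"
        Action   = ["ec2:DescribeInstances"]
        Resource = "*"
      },
      {
        Effect = "Allow"
        Action = [
          "ec2:StartInstances",
          "ec2:StopInstances",
        ]
        Resource = "*"
        Condition = {
          StringEquals = {
            "ec2:ResourceTag/Reaper" = "enabled"
          }
        }
      },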
      {
        Effect = "Allow"
        Action = [
          "rds:DescribeDBInstances",
          "rds:ListTagsForResource",
          "rds:StartDBInstance",
          "rds:StopDBInstance",
        ]
        Resource = "*"
      },
      {
        Effect = "Allow"
        Action = [
          "ecs:ListClusters",
          "ecs:ListServices",
          "ecs:DescribeServices",
          "ecs:UpdateService",
        ]
        Resource = "*"
      },
      {
        Effect = "Allow"
        Action = [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents",
        ]
        Resource = "arn:aws:logs:*:*:*"
      },
    ]
  })
}
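
The Lambda configuration above passes SLACK_WEBHOOK_URL into the function, but reaper.py as shown never reads it. Here is a sketch of how the handler's result might be posted to that webhook using only the standard library. The helpers build_slack_payload and notify_slack are hypothetical, and the message wording is an assumption.

```python
import json
import os
import urllib.request


def build_slack_payload(result: dict) -> dict:
    """Turn the handler's return value into a Slack incoming-webhook body."""
    count = result.get("ec2_stopped", result.get("ec2_started", 0))
    return {
        "text": f"Reaper {result['action']}: {count} EC2 instance(s). "
        f"{result['message']}"
    }


def notify_slack(result: dict) -> bool:
    """POST the payload to SLACK_WEBHOOK_URL; return False when it is unset."""
    url = os.environ.get("SLACK_WEBHOOK_URL")
    if not url:
        return False
    request = urllib.request.Request(
        url,
        data=json.dumps(build_slack_payload(result)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status == 200
```

The handler could call notify_slack on the dictionary it is about to return, so a skipped or failed notification never blocks the shutdown itself.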

Tagging Your Resources

The Reaper uses two tags to decide what to manage:

| Tag Key | Tag Value | Purpose |
| --- | --- | --- |
| Environment | dev, staging, or qa | Identifies non-production resources |
| Reaper | enabled | Opt-in flag for automatic shutdown |

Any resource without both tags is completely ignored. This gives teams fine-grained control — if a specific staging database must stay running overnight for a nightly integration test, just remove the Reaper: enabled tag.
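
The opt-in rule reduces to a single predicate over a resource's tags. The sketch below restates the check that reaper.py repeats for RDS and ECS. The name is_reaper_managed is illustrative and does not appear in the original code.

```python
NON_PROD_ENVIRONMENTS = {"dev", "staging", "qa"}


def is_reaper_managed(tags: dict[str, str]) -> bool:
    """True only when both the Environment and Reaper tags opt the resource in."""
    return (
        tags.get("Environment") in NON_PROD_ENVIRONMENTS
        and tags.get("Reaper") == "enabled"
    )


# A staging database that must stay up overnight: Reaper tag removed.
print(is_reaper_managed({"Environment": "staging"}))
# A dev instance that has opted in.
print(is_reaper_managed({"Environment": "dev", "Reaper": "enabled"}))
```

The comparison is case-sensitive, so a tag value of "Enabled" or "Staging" would leave the resource unmanaged.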

The Operational Reality

  • RDS cold-start delay. Starting a stopped RDS instance takes 5 to 10 minutes. If your engineers arrive at 8 AM sharp and immediately hit the database, they will see connection errors. Set the "start" cron 15 minutes earlier than the earliest engineer's workday.
  • ECS task definition drift. When you scale an ECS service to 0 and back to 1, it uses the current task definition. If someone deployed a broken version at 6:55 PM, the Reaper will dutifully bring up the broken version at 8 AM. Always verify your latest deployment is healthy before end-of-day.
  • Spot interruption overlap. If your non-prod instances are Spot-backed (which they should be for additional savings), AWS may reclaim them before the Reaper stops them. This is harmless — the Reaper will simply find no running instances to stop.
  • Multi-timezone teams. If your engineers span US, Europe, and Asia, set the "start" time to the earliest timezone and the "stop" time to the latest. Or deploy separate Reaper instances per region.
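
For the multi-timezone case, the combined window can be computed before committing to a single schedule. This is a rough sketch, assuming every team works the same 8 AM to 7 PM local hours and that a 15-minute lead covers the RDS start-up delay. The function team_window and its parameters are illustrative.

```python
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo


def team_window(
    zones: list[str],
    on: date,
    start_hour: int = 8,
    stop_hour: int = 19,
    lead_minutes: int = 15,
) -> tuple[datetime, datetime]:
    """Return (start, stop) in UTC covering every zone's working day on `on`."""
    starts = [
        datetime.combine(on, time(start_hour), tzinfo=ZoneInfo(z)).astimezone(timezone.utc)
        for z in zones
    ]
    stops = [
        datetime.combine(on, time(stop_hour), tzinfo=ZoneInfo(z)).astimezone(timezone.utc)
        for z in zones
    ]
    # Start early so RDS has finished booting before the first engineer arrives.
    return min(starts) - timedelta(minutes=lead_minutes), max(stops)


start, stop = team_window(
    ["Asia/Tokyo", "Europe/Paris", "America/New_York"], date(2024, 1, 15)
)
print(start, stop, stop - start)
```

When the resulting window is longer than 24 hours, as it is for this Tokyo, Paris, and New York example, a single shared schedule never shuts anything down on weekdays, which is the case where separate Reaper deployments per region make more sense.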

The Math

| Metric | Before Reaper | After Reaper |
| --- | --- | --- |
| Non-prod EC2 hours/week | 168 | 50 |
| Non-prod RDS hours/week | 168 | 55 |
| Weekly compute utilization | 100% | ~30% |
| Monthly non-prod savings | | ~$2,000 to $5,000 |
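
The schedule in reaper.tf (start at 08:00, stop at 19:00, Monday to Friday) can be checked against the table with a few lines of arithmetic. This is a sketch of the calculation only. The table's figures are the author's estimates, and real savings also depend on charges such as storage that continue while instances are stopped.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def weekly_running_hours(start_hour: int = 8, stop_hour: int = 19, days: int = 5) -> int:
    """Hours per week that resources stay up under a daily start/stop schedule."""
    return (stop_hour - start_hour) * days


running = weekly_running_hours()
idle_share = 1 - running / HOURS_PER_WEEK
print(running, f"{idle_share:.0%}")  # 55 67%
```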

The Reaper pays for itself in the first night it runs.


Your dev and staging environments are burning money while your engineers sleep. A single Terraform module can cut your non-production compute bill by 60% starting tonight.

I will deploy the Reaper to your AWS account and configure it for your team's schedule.

Stop paying Amazon to heat empty servers. Book a Free Consultation.

Mohamed ARKID

DevOps Engineer & Cloud Consultant | FinOps, GitOps & Kubernetes Expert

I build systems that run reliably, scale efficiently, and deploy intelligently. See how I can help your team.