The "I Forgot to Delete It" Problem

Someone forgot to delete the Kendra instance. The bill, meanwhile, doesn’t forget anything.

“I forgot” isn’t something you can solve through wise words. “Don’t forget next time” isn’t governance. But in trying to solve the problem, organizations too often overcompensate and destroy developer velocity in the process.

Why Does This Keep Happening?

The forgotten Kendra instance ran for three months, chewing through $12k in total. It doesn’t matter who did it or why. AWS Cost Anomaly Detection never complained to anyone. It’s designed for sharp spikes against a service’s own history, not for a new mid-sized resource that blends into an account’s existing usage.

I found out when I was poking around the team’s bill, including the spend racked up by other subteams. Besides telling the dev to shut it down, I got consensus to add a line to the team’s monthly operational metrics review: month-over-month cost delta. Every leader, manager, and staff member would see that number. And if it was too high, it was someone’s responsibility to explain it.

Besides being a crude detection mechanism, this change had positive cultural effects. Developers got more careful about cleaning up after themselves. The metric became the goal, and average MoM cost deltas dropped by roughly a third. Design documents started including “Cost Considerations” as a first-class concern.

I was moved to another team. A while later, I wondered how my old team’s costs were holding up.

In the span of one month, costs doubled, then climbed another 7% a month, on average, for the five months after that. I checked the team’s Slack channel and that month’s operational metrics report. No one had said anything about any of it.

What I Should’ve Done

What happened was cultural decay. Because I pushed for it while I was there, the team believed costs mattered. After I left, that belief weakened. Someone convinced the team that doubled costs were perfectly acceptable. Maybe they were right. Perhaps a new feature justified the cost. Our org’s director didn’t use top-down pressure to keep costs in line, and engineering resources were scarce.

But just because cost isn’t a priority today doesn’t mean you should forget about increases. The problem with my approach was that it depended on a person. When the person left, the process left with them.

Rather than relying on human detection and enforcement, build a Lambda that queries the Cost Explorer API, compares last month’s actual spend to this month’s projected spend, and creates a ticket if the delta exceeds a threshold you’re comfortable with. Run it daily: the forecast updates as new usage lands, so a forgotten resource shows up the day after it starts. 3% is a reasonable starting threshold, but let your team’s normal growth rate guide the number - bursty workloads push the forecast around, so tune for that volatility or you’ll train people to ignore the ticket.
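
Here’s a minimal sketch of that Lambda in Python with boto3. The SNS topic ARN is a stand-in for however your team actually files tickets (more on that below), and the projection logic - month-to-date actuals plus the Cost Explorer forecast for the remaining days - is one reasonable way to get a full-month number to compare against last month:

```python
import datetime
import json

import boto3

DELTA_THRESHOLD = 0.03  # start around 3%, then tune to your team's volatility

ce = boto3.client("ce")
sns = boto3.client("sns")

# Hypothetical: an SNS topic your ticketing integration already listens to.
TICKET_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:cost-anomaly"


def lambda_handler(event, context):
    today = datetime.date.today()
    month_start = today.replace(day=1)
    prev_month_start = (month_start - datetime.timedelta(days=1)).replace(day=1)
    next_month_start = (month_start + datetime.timedelta(days=32)).replace(day=1)

    if today == month_start:
        return  # no usage recorded for the new month yet; nothing to project

    # Last month's actual spend. Cost Explorer End dates are exclusive.
    actual = ce.get_cost_and_usage(
        TimePeriod={"Start": prev_month_start.isoformat(),
                    "End": month_start.isoformat()},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
    )
    last_month = float(actual["ResultsByTime"][0]["Total"]["UnblendedCost"]["Amount"])

    # Projected total for this month: month-to-date actuals plus AWS's forecast
    # for the remaining days. The forecast refreshes daily, which is what
    # surfaces a forgotten resource the day after it starts.
    mtd = ce.get_cost_and_usage(
        TimePeriod={"Start": month_start.isoformat(), "End": today.isoformat()},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
    )
    mtd_spend = float(mtd["ResultsByTime"][0]["Total"]["UnblendedCost"]["Amount"])

    forecast = ce.get_cost_forecast(
        TimePeriod={"Start": today.isoformat(), "End": next_month_start.isoformat()},
        Metric="UNBLENDED_COST",
        Granularity="MONTHLY",
    )
    projected = mtd_spend + float(forecast["Total"]["Amount"])

    delta = (projected - last_month) / last_month
    if delta > DELTA_THRESHOLD:
        sns.publish(
            TopicArn=TICKET_TOPIC_ARN,
            Subject=f"Projected monthly cost up {delta:.1%} vs last month",
            Message=json.dumps({"last_month": round(last_month, 2),
                                "projected": round(projected, 2),
                                "delta": round(delta, 4)}),
        )
```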

The specific threshold matters less than where the ticket lands. Put it in your team’s existing ticket queue, the same one leadership already reviews for engineering debt. I’ve found that large ticket queues receive disproportionate attention from leaders, because a backlog of 100 tickets represents far more organizational cost than $10k/month in infrastructure. Leaders already value small ticket queues. By making cost anomalies show up there, they inherit that priority. You don’t need to convince anyone that infrastructure cost matters. You just need to attach it to something they already care about.

Detection Is Better than Prevention

Many solutions to the “I forgot” problem focus on prevention. But that’s like reducing car accidents by banning everyone but professional drivers from the road. It’s absolutely going to work. It’s also going to make everyone’s lives more miserable.

Service Control Policies

Some organizations simply forbid manual changes in all AWS accounts. Preventing developers from modifying shared pre-prod and prod environments is critical for environment parity. That’s reasonable.

But if developers don’t have their own AWS accounts to experiment in, this is crippling. A developer wants to try different EC2 instance types for profiling. They want to manually trigger a CloudWatch Alarm to test a downstream process. A blanket “no manual changes” policy makes all of this a multi-day ordeal instead of a 20-minute experiment.

Infrastructure-as-Code (IaC) Only

Similarly, if a developer needs to write a Terraform module before they can test whether a service even does what they need, you’ve turned a quick experiment into a half-day project. IaC is the right answer for production. It’s the wrong answer for “I want to see what this button does.”

Tags

Tag key: owner. Tag value: jsmith. Reassuring to see on a resource. If you’re not sure whether it should still exist, now you know who to ask.

This is the wrong solution to the attribution problem. AWS accounts are free. AWS Organizations can standardize policies across hundreds of them. Give every developer their own account. The account is the tag. There’s no question about owner, team, or environment when the account belongs to one person.
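
If you’re provisioning accounts at that scale, the Organizations API makes it scriptable. A sketch, assuming a plus-addressed email convention and an existing OU for dev accounts - both are conventions you’d pick, not requirements:

```python
import time

import boto3

orgs = boto3.client("organizations")


def provision_dev_account(username: str, dev_ou_id: str) -> str:
    """Create a personal dev account and park it under the dev OU,
    where your guardrail SCPs (region locks, service denylists) apply."""
    status = orgs.create_account(
        Email=f"aws-dev+{username}@example.com",  # hypothetical convention
        AccountName=f"dev-{username}",
    )["CreateAccountStatus"]

    # Account creation is asynchronous; poll until it settles.
    while status["State"] == "IN_PROGRESS":
        time.sleep(5)
        status = orgs.describe_create_account_status(
            CreateAccountRequestId=status["Id"]
        )["CreateAccountStatus"]
    if status["State"] != "SUCCEEDED":
        raise RuntimeError(status.get("FailureReason", "account creation failed"))

    # New accounts land under the org root; move the account into the dev OU.
    root_id = orgs.list_roots()["Roots"][0]["Id"]
    orgs.move_account(
        AccountId=status["AccountId"],
        SourceParentId=root_id,
        DestinationParentId=dev_ou_id,
    )
    return status["AccountId"]
```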

Mandatory tagging makes sense when individual accounts are impractical. The application might be too expensive or too entangled to deploy per-developer. But that’s the exception, not the default. And even when you enforce tagging, tags rot. owner: jsmith stops being useful when jsmith leaves the company eight months later and nobody updates the tag.

TTL / Regular Auto-nuke

Tools like aws-nuke on a schedule sound clean: wipe every dev account weekly, catch anything left behind. For a single-service environment, this might work.

But for a microservice architecture, deploying the full application stack can take an entire day. If you’re nuking weekly, you’re asking developers to spend 20% of their working time redeploying an environment they already had.

How Much Does Forgetting Cost, Anyway?

The strongest argument for prevention over detection is an environment genuinely so expensive that even a single day of forgotten resources is catastrophic. That’s a rare case.

A developer spins up an r6g.2xlarge, a serious instance, and forgets about it. At $0.4032/hour, that’s about $9.70 per day. If a daily cost monitor like the one above catches it within 24 hours, you’ve lost less than $10. Scale this up 10x to something meatier - a forgotten OpenSearch cluster, a SageMaker endpoint, another Kendra instance - and the math still favors detection.

Now consider the cost of prevention. A mid-level engineer’s fully loaded cost is roughly $150,000–$200,000 a year; at about 250 working days, that’s $600–$800 per working day. One day spent writing an SCP, debugging why a deploy was blocked by a missing tag, or redeploying an environment after a scheduled nuke costs more than the forgotten instance running for a full month.
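
The back-of-envelope, using only the numbers above:

```python
# Back-of-envelope comparison using the figures from the text.
instance_per_hour = 0.4032                  # r6g.2xlarge, on-demand
instance_per_day = instance_per_hour * 24   # ~$9.68/day
instance_per_month = instance_per_day * 30  # ~$290/month

engineer_per_year = (150_000, 200_000)      # fully loaded
working_days = 250
engineer_per_day = [c / working_days for c in engineer_per_year]  # $600-$800

# One day of prevention work outweighs a month of the forgotten instance.
print(f"forgotten instance: ${instance_per_day:.2f}/day, ${instance_per_month:.0f}/month")
print(f"engineer: ${engineer_per_day[0]:.0f}-${engineer_per_day[1]:.0f}/day")
```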

The math almost always favors detection. The cure is more expensive than the disease.

Conclusion: Forget and Detect

Your developers are going to forget. Accept it. The question is whether you find out in a day or in three months.