Why Straight Lines (Probably) Mean AWS Waste

In most cases, healthy AWS workloads have a tell: they’re noisy. Jittery, even. An internal tool sees usage spike during work hours. A retail site costs more in December. A food delivery app has barely any users at 3am.

The key trait is variability. Real usage is uneven because real demand is uneven.

The inverse is also true. When you see a cost line that’s too smooth, steady, and predictable, something is probably wrong. That single observation led me to catch more waste than AWS Trusted Advisor could ever flag.

Pattern 1: Cost Plateaus

A cost line that holds flat, or drifts only when someone manually intervenes, is worth investigating.

Consider what has to be true for costs to stay flat: regular canary traffic is the only traffic reaching the service; an ETL job loads exactly the same volume of data every run; an S3 bucket is holding objects nobody reads.

Now, the obvious objection: if you’re running EC2 instances and RDS databases, your costs should be roughly constant. You’re renting virtual servers by the hour, and the meter runs whether anyone uses them or not.

That’s why, in these cases, the cost line alone isn’t enough. Flat costs on a server you’re actively using look identical to flat costs on a server nobody has touched in a year. You need utilization metrics alongside cost data. An EC2 instance running at 2% CPU for twelve months tells a very different story than one bouncing between 20% and 60% over the course of a day. Both cost the same.
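As a rough illustration, here’s a minimal sketch (boto3 against CloudWatch’s CPUUtilization metric) of the kind of check that pairs with the cost line. The 90-day window, the 5% ceiling, and the instance ID are assumptions for the example, not values from any account described here.

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

def looks_idle(instance_id: str, days: int = 90, ceiling: float = 5.0) -> bool:
    """True if the instance's CPU never climbed above `ceiling` percent in the window."""
    end = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(days=days),
        EndTime=end,
        Period=86_400,            # one datapoint per day
        Statistics=["Maximum"],
    )
    datapoints = stats["Datapoints"]
    # No datapoints at all is its own red flag: the instance may be stopped
    # but still paying for its EBS volumes.
    return bool(datapoints) and max(dp["Maximum"] for dp in datapoints) < ceiling

print(looks_idle("i-0123456789abcdef0"))  # hypothetical instance ID
```

Using the daily maximum rather than the average avoids mistaking a bursty-but-mostly-idle instance for a dead one: the question isn’t how busy it gets at peak, it’s whether it ever gets busy at all.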

The $15,000/mo Plateau

I was reviewing cost data across my organization’s AWS accounts. One account stood out: expensive, but stable. Its monthly cost had barely moved in twelve months.

The account turned out to be a shared sandbox. Before our org had personal developer accounts, all consultants shared a single AWS account to test solutions they were working on for clients. Eventually the org launched personal accounts, and everyone migrated. The shared account stayed running.

The costs weren’t perfectly flat, though. Some months, the total dropped to a new, slightly lower plateau, then held there. I dug into it. The org’s security lead had been asked by the director to deprecate the account. She made progress, deleting some resources where she could. But she wasn’t deeply familiar with the technical side of AWS, and she was cautious. She didn’t want to break a live workload if one existed. So she deleted what felt safe and stopped.

This happens too often. Someone identifies potential waste, but they don’t have the technical context to confirm it’s safe to remove. So it stays. The organization knows it might be a problem, but knowing and resolving are two different things. You need someone to trace the dependencies, confirm nothing is live, and actually pull the trigger.

I followed up with everyone who had access to the account. Nobody was running anything on it; they had all migrated out two years before.

Two years at $15,000 a month: the org had spent $360,000 on an account that no one was using.

What was left: 30 unused RDS instances, a dozen EC2 instances still running, snapshots quietly accruing storage charges, and DynamoDB tables with provisioned capacity that no one was querying. I deleted the remaining resources and the account’s cost dropped to zero.
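For what it’s worth, the inventory itself is the easy part. Something like the rough boto3 sketch below lists the resource types that were left behind; it’s illustrative rather than the exact script I ran, and it only surfaces candidates. Confirming that nothing depends on them is still the hard, human step.

```python
import boto3

# All of these APIs paginate and are per-region; a real script would follow
# the pagination tokens and loop over every enabled region.
rds = boto3.client("rds")
ec2 = boto3.client("ec2")
dynamodb = boto3.client("dynamodb")

# RDS instances still up.
for db in rds.describe_db_instances()["DBInstances"]:
    print("RDS:", db["DBInstanceIdentifier"], db["DBInstanceClass"], db["DBInstanceStatus"])

# EC2 instances still running.
running = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)
for reservation in running["Reservations"]:
    for instance in reservation["Instances"]:
        print("EC2:", instance["InstanceId"], instance["InstanceType"], instance["LaunchTime"])

# Snapshots owned by this account.
for snapshot in ec2.describe_snapshots(OwnerIds=["self"])["Snapshots"]:
    print("Snapshot:", snapshot["SnapshotId"], snapshot["VolumeSize"], "GiB", snapshot["StartTime"])

# DynamoDB tables that still carry provisioned capacity.
for name in dynamodb.list_tables()["TableNames"]:
    throughput = dynamodb.describe_table(TableName=name)["Table"].get("ProvisionedThroughput", {})
    if throughput.get("ReadCapacityUnits", 0) or throughput.get("WriteCapacityUnits", 0):
        print("DynamoDB:", name, throughput["ReadCapacityUnits"], "RCU /",
              throughput["WriteCapacityUnits"], "WCU")
```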

Pattern 2: Cost Slopes

The flat-line pattern is the easy one. The harder case is a cost line that’s growing, which makes it look alive, like a healthy system under increasing load.

But mind the shape. If the slope is too consistent, you might be looking at a wasteful system.

Some examples I’ve seen: regular canaries creating a mountain of objects in S3, and a Lambda whose cost rises linearly as its underlying database grows.

A common thread between the cost plateau and the cost slope: there should almost always be a fall. A dip in daily cost, in hourly utilization, in the number of new items in the database. If you don’t see a fall, even a momentary one, you’re looking at a red flag.
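One way to make that check concrete is to pull daily costs per service from Cost Explorer and flag any series that never falls meaningfully below its own peak. This is a sketch of the idea rather than the exact report I used; the 60-day window, the $1/day floor, and the 90%-of-peak threshold are arbitrary choices for illustration, and a real script would also follow the pagination token.

```python
from datetime import date, timedelta

import boto3

ce = boto3.client("ce", region_name="us-east-1")  # Cost Explorer has a single endpoint

end = date.today()
start = end - timedelta(days=60)

response = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

daily_costs = {}  # service name -> list of daily costs
for day in response["ResultsByTime"]:
    for group in day["Groups"]:
        service = group["Keys"][0]
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        daily_costs.setdefault(service, []).append(amount)

for service, costs in daily_costs.items():
    peak = max(costs)
    # A series that never drops meaningfully below its own peak has no "fall".
    if peak > 1.0 and min(costs) > 0.9 * peak:
        print(f"{service}: ~${peak:,.2f}/day with no dip across {len(costs)} days")
```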

When the Audit Logs Cost More Than the ETL Job

We had a weekly ETL job that loaded data from S3 into Redshift. The job ran fine. Redshift looked healthy. The QuickSight dashboard it fed into looked perfect. No one had any complaints.

But I noticed that on Sundays, when the Glue ETL job ran, its cost spiked, and so did CloudTrail’s. Both kept climbing week after week, by roughly the same delta each time, on a suspiciously consistent trajectory. Eventually CloudTrail’s weekly cost even outpaced Glue’s.

The issue was in how the ETL job worked. It wasn’t loading just the last week of new data. Every week, it scanned the entire S3 bucket: all historical data plus whatever had landed in the last week. The bucket grew as new data arrived. So each run touched more objects than the run before: more ListObjects calls, more GetObject calls. Every one of those API calls got logged by CloudTrail. More objects per run meant more CloudTrail events per run, which meant CloudTrail costs rose in lockstep with the bucket’s growth.
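To make the mechanism concrete, the sketch below contrasts the full-bucket scan with a listing scoped to the last week. It assumes a hypothetical date-partitioned key layout (raw/YYYY/MM/DD/...) and bucket name, which may not match the real bucket; the point is that the more objects a run touches, the more List and Get calls it makes, and per the story above each of those calls was landing in CloudTrail.

```python
from datetime import date, timedelta

import boto3

s3 = boto3.client("s3")
BUCKET = "example-etl-landing"  # hypothetical bucket name

def keys_full_scan():
    """What the job was doing: walk every object in the bucket, every single week."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix="raw/"):
        for obj in page.get("Contents", []):
            yield obj["Key"]

def keys_last_week():
    """The cheaper shape: list only the prefixes for the last seven days."""
    paginator = s3.get_paginator("list_objects_v2")
    for offset in range(7):
        day = date.today() - timedelta(days=offset)
        for page in paginator.paginate(Bucket=BUCKET, Prefix=f"raw/{day:%Y/%m/%d}/"):
            for obj in page.get("Contents", []):
                yield obj["Key"]
```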

The primary services (S3 and Redshift) looked fine. The job completed, the data loaded, nobody noticed anything wrong on QuickSight. The cost signal exposing the problem lived in a completely different service that nobody was monitoring for that purpose. That’s what makes this pattern dangerous: look for the straight lines, but don’t assume they’ll be where you expect them.

The Takeaway

Straight lines should ring alarm bells.

Real workloads are noisy. If yours isn’t, find out why.