Saving Money on Bedrock is Hard

Anthropic publishes clear, seemingly indisputable token costs. For Opus, it’s $5 / million input, $25 / million output. Bedrock inherits this pricing (mostly - see Cross-region inference), along with the same cost levers Anthropic offers on its own platform: prompt caching and batch inference.

You’re used to a dozen cost levers on EC2. Bedrock gives you about three that matter. They’re harder to implement than rightsizing an instance, but they’re necessary, given how expensive AI workloads are.

Downgrade Your Model

This is the most important way to save money. It’s 2026. Sonnet 4.6 practically matches Opus 4.5, a model released only four months earlier.

I’d argue there are only two justifications for using the smartest model available. The first is genuinely needing deep knowledge and intelligence. Research, financial modeling, simulations, complex system design. A simpler model might give you a great answer, but if you require the best answer, use the best model.

The other case is when the model makes genuinely consequential decisions. Fraud detection, loan judgments, contract reviews. But even then, an optimized prompt and careful thinking on the model’s part are far more impactful than a simple model upgrade.

I’ll be clearer. If you are using Bedrock for a chatbot, academic essay writing, marketing, general education, text summarization, or transcription, you should downgrade your model. You don’t need Opus. Enjoy your 40% discount.
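To make the discount concrete, here’s a back-of-the-envelope comparison using the Opus prices above. The workload volumes are invented, and the downgraded prices simply assume the 40% discount:

```python
# Rough monthly cost comparison between a frontier model and a cheaper one.
# Prices are per million tokens; the Opus figures come from the article
# ($5 in / $25 out), and the downgraded figures assume 40% off.

def monthly_cost(input_tokens, output_tokens, in_price, out_price):
    """Dollar cost for a month of traffic, with prices per million tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Hypothetical workload: 2B input tokens, 200M output tokens per month.
opus = monthly_cost(2_000_000_000, 200_000_000, 5.00, 25.00)
cheaper = monthly_cost(2_000_000_000, 200_000_000, 3.00, 15.00)

print(f"Frontier:   ${opus:,.0f}")     # $15,000
print(f"Downgraded: ${cheaper:,.0f}")  # $9,000
```

At any volume, the ratio is the same: the cheaper model costs 60 cents on the dollar.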

Cross-region Inference

Geographic cross-region inference (like the profile us.anthropic.claude-sonnet-4-6) costs 10% more than global inference. That’s right: the lower rates on the Bedrock pricing page are the global rates, and both geographic and in-region inference carry a 10% premium. If you want to restrict traffic to a specific geographic region, perhaps for latency or compliance, you’re paying for that restriction. Use global inference if you can tolerate it.
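A sketch of what that choice looks like in practice. The "us." profile ID comes from the example above; the "global." variant and the base price are my assumptions, so check the console for the real profile IDs and rates:

```python
# Choosing an inference profile: geographic/in-region routing carries a
# 10% premium over the listed (global) token prices.

GEO_SURCHARGE = 1.10  # premium for pinning traffic to a geography

def profile_and_price(base_price_per_mtok, restrict_to_us=False):
    """Return (inference profile ID, effective input price per million tokens)."""
    if restrict_to_us:
        return ("us.anthropic.claude-sonnet-4-6",
                round(base_price_per_mtok * GEO_SURCHARGE, 2))
    return ("global.anthropic.claude-sonnet-4-6", base_price_per_mtok)

print(profile_and_price(3.00))        # global routing at the listed rate
print(profile_and_price(3.00, True))  # same tokens, 10% more
```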

Prompt Caching

When I heard about this feature, I was quite confused. “Why would I want to send the same prompt to Bedrock multiple times?”

That’s not what it does, thankfully. You set a cachePoint after your unchanging system prompt, and Bedrock caches everything above it. Writing to the cache carries a premium: 1.25x for the default 5-minute TTL, 2x for the 1-hour TTL. But cache hits mean you’re paying 90% less for your system prompt. In a world where input tokens very often outnumber output tokens, this adds up to enormous savings.
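Here’s a minimal sketch of where the cachePoint goes in a Converse-API-style request body. The system prompt text is invented, and you should verify the exact cachePoint block shape against the current Bedrock docs before relying on it:

```python
# Sketch: a cachePoint placed after the static system prompt. Everything
# above the cachePoint is cached; only the user turn below it is billed
# at the full input rate once the cache is warm.

SYSTEM_PROMPT = "You are a support assistant for ExampleCorp..."  # static

def build_request(user_message: str) -> dict:
    return {
        "system": [
            {"text": SYSTEM_PROMPT},
            {"cachePoint": {"type": "default"}},  # cache everything above
        ],
        "messages": [
            {"role": "user", "content": [{"text": user_message}]},
        ],
    }

req = build_request("How do I reset my password?")
```

With the default 5-minute TTL, the write premium is recovered immediately: two requests cost 1.25x + 0.10x = 1.35x of one uncached prompt, versus 2x without caching.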

Prompt Routers

This isn’t a good option anymore. AWS previously hyped these up, positioning prompt routers as a way to use the right model for the job while lowering cost. However, prompt routers can be configured only with older models - Sonnet / Haiku 3.5, Nova Lite / Pro v1, Llama 3.3. I wouldn’t be surprised if AWS deprecates this feature.

Provisioned Throughput

This is your Savings Plan for Bedrock. You commit to a certain amount of model throughput for either 1 month or 6 months. The selection of models is not impressive.

This is a hard sell. Unless you have a stable document processing pipeline or a global user base that invokes Bedrock around the clock, you’ll be paying for capacity you don’t use.

It’s worth noting that the committed throughput is per-region. Say you commit to $10/hr of Sonnet 4.6 in us-east-1. You’d likely give up cross-region inference to concentrate requests in the region where you provisioned throughput. That means paying the 10% geographic surcharge, losing the resilience of multi-region routing, and betting your uptime on a single region of a service that is not known for its stability.
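A rough break-even sketch for the scenario above. The $10/hr commitment is the example figure; the on-demand prices and token volumes are assumptions, and the 10% surcharge reflects pinning traffic to the provisioned region:

```python
# Back-of-the-envelope break-even for a provisioned-throughput commitment.
# The commitment bills every hour, used or not; on-demand only bills for
# tokens, but here carries the 10% geographic surcharge from single-region
# routing.

HOURS_PER_MONTH = 730

def provisioned_monthly(dollars_per_hour: float) -> float:
    """Monthly cost of the commitment, regardless of utilization."""
    return dollars_per_hour * HOURS_PER_MONTH

def on_demand_monthly(mtok_in: float, mtok_out: float,
                      in_price=3.00, out_price=15.00,
                      geo_surcharge=1.10) -> float:
    """Monthly on-demand spend for millions of tokens in/out per month."""
    return (mtok_in * in_price + mtok_out * out_price) * geo_surcharge

committed = provisioned_monthly(10.00)  # $7,300/month, used or not
steady = on_demand_monthly(1500, 150)   # heavy, round-the-clock traffic
print(committed, steady)                # the commitment barely breaks even
```

Even at 1.5 billion input tokens a month, the commitment only just edges out on-demand. Any dip in traffic and you’re underwater.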

I can’t recommend this, unless you’re operating at a scale where you can commit to provisioned throughput in multiple regions.

Batch Inference

A 50% discount, but results arrive asynchronously, within 24 hours. That’s slow. Not many use cases can accommodate it. Perhaps an internal document-processing tool that runs on-demand on weekdays and batches on weekends. Or a resume/application reviewer.
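A sketch of that weekday/weekend split. The document volume and per-document cost are invented; only the 50% discount comes from the pricing above:

```python
# Weekly cost of a document pipeline that runs on-demand on weekdays and
# as half-price batch jobs on weekends.

def weekly_cost(docs_per_day, cost_per_doc, batch_days=2, discount=0.50):
    """7-day cost if `batch_days` worth of work runs as batch jobs."""
    on_demand = (7 - batch_days) * docs_per_day * cost_per_doc
    batched = batch_days * docs_per_day * cost_per_doc * (1 - discount)
    return on_demand + batched

print(weekly_cost(10_000, 0.02))  # vs 7 * 10,000 * $0.02 = $1,400 all on-demand
```

Batching just two of seven days shaves about 14% off the weekly bill; the more of your traffic that can wait, the closer you get to the full 50%.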

Conclusion

Unlike with more mature services, cost-saving measures in Bedrock are still underdeveloped. There aren’t many practical levers for the majority of workloads. But the few that exist - especially model selection and prompt caching - can yield savings exceeding 50%.

Most of the savings levers discussed here (model selection, prompt caching, batch inference) apply equally to the Anthropic API. The Bedrock-specific levers are cross-region inference, provisioned throughput, and prompt routers, none of which I’d call compelling.