We were brought in on a payments platform with a problem the on-call rotation could recite from memory: duplicate charges, doubled-up ledger entries, occasional missing transfers. The kind of bugs that cost real money and real trust.
The team had three suspect subsystems and no consensus on which one was at fault. We spent two weeks reading code and tracing requests before proposing anything.
The diagnosis was not exciting. It was everywhere.
The Pattern Beneath The Bugs
Every duplicate-charge incident had the same shape, regardless of which service was implicated:
1. A client (or another internal service) made a write request
2. The request succeeded server-side, but the response was lost — connection drop, timeout, gateway 502, mobile network glitch
3. The client retried
4. The server happily processed the second request as a new write
Every service in the platform did some version of this. Some had partial mitigations. None had a uniform contract.
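The failure mode is easy to reproduce in a few lines. A minimal sketch — the `chargeCard` handler and in-memory ledger are hypothetical stand-ins, not the platform's actual code:

```typescript
// Naive write handler: every request creates a new ledger entry.
const ledger: { chargeId: number; amountCents: number }[] = [];
let nextId = 0;

function chargeCard(amountCents: number): { chargeId: number } {
  const entry = { chargeId: nextId++, amountCents };
  ledger.push(entry);
  return entry;
}

// The client makes a request, the response is "lost", the client retries.
chargeCard(5000); // succeeds server-side, but the response never arrives
chargeCard(5000); // retry: the server treats it as a brand-new write

console.log(ledger.length); // 2 — the customer was charged twice
```

Nothing in the handler can tell the second request apart from a genuinely new charge, which is exactly the gap the idempotency contract closes.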
The Fix Was An Architecture Decision, Not A Code Fix
We did not patch the eight services that exhibited the bug. We added one platform primitive:
Every write endpoint, internal or external, must accept and honor an Idempotency-Key header.
That is the rule. It applies to:
The public API (already had partial support)
The internal RPC mesh (had nothing)
The async job processors (had nothing)
The webhook outbound delivery system (had nothing)
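From a caller's side, the contract is simply: generate one key per logical operation, and reuse that key on every retry of the same operation. A sketch — the endpoint path, payload shape, and `buildTransferRequest` helper are illustrative, not the platform's client library:

```typescript
import { randomUUID } from "node:crypto";

// One key per logical operation, reused on every retry of that operation.
function buildTransferRequest(amountCents: number, idempotencyKey: string) {
  return {
    method: "POST",
    url: "https://api.example.com/v1/transfers", // illustrative endpoint
    headers: {
      "Content-Type": "application/json",
      "Idempotency-Key": idempotencyKey, // same value on first try and on retries
    },
    body: JSON.stringify({ amountCents }),
  };
}

const key = randomUUID();
const firstTry = buildTransferRequest(5000, key);
const retry = buildTransferRequest(5000, key);

// Both attempts carry the same key, so the server can deduplicate them.
console.log(firstTry.headers["Idempotency-Key"] === retry.headers["Idempotency-Key"]); // true
```

The important discipline is generating the key *before* the first attempt, not per attempt — a fresh key on retry defeats the whole mechanism.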
We built a small middleware that:
1. Reads the Idempotency-Key from the request
2. Hashes it with the route + caller identity
3. Looks up the result in a fast key-value store
4. If found within the TTL → return the cached response, do not re-execute
5. If not → execute, cache the response, return
The KV store is Cloudflare KV with a 24-hour TTL for external clients, 30 minutes for internal RPC.
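The five steps fit in a small wrapper. A sketch of the idea, with an in-memory Map standing in for the KV store (the real deployment uses Cloudflare KV with the TTLs above) and a hypothetical `Handler` signature:

```typescript
import { createHash } from "node:crypto";

type Handler = (body: string) => string;

// In-memory stand-in for the KV store; the real deployment uses
// Cloudflare KV with per-caller TTLs (24h external, 30min internal RPC).
const cache = new Map<string, { response: string; expiresAt: number }>();

function withIdempotency(route: string, handler: Handler, ttlMs: number) {
  return (callerId: string, idempotencyKey: string, body: string): string => {
    // Step 2: hash the key with route + caller identity so keys cannot
    // collide across endpoints or across callers.
    const cacheKey = createHash("sha256")
      .update(`${callerId}:${route}:${idempotencyKey}`)
      .digest("hex");

    // Steps 3-4: if a fresh cached result exists, replay it — do not re-execute.
    const hit = cache.get(cacheKey);
    if (hit && hit.expiresAt > Date.now()) return hit.response;

    // Step 5: execute, cache the response, return it.
    const response = handler(body);
    cache.set(cacheKey, { response, expiresAt: Date.now() + ttlMs });
    return response;
  };
}

// Usage: a write handler that would otherwise double-charge on retry.
let charges = 0;
const charge = withIdempotency("/v1/charges", () => `charge-${++charges}`, 60_000);

const first = charge("svc-billing", "key-123", "{}");
const retry = charge("svc-billing", "key-123", "{}"); // replayed, not re-executed
console.log(first === retry, charges); // true 1
```

One design note: including the caller identity in the hash means two services can accidentally pick the same key string without ever seeing each other's cached responses.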
What Changed
Within the first week of deployment, recurring duplicate-charge incidents dropped to zero. The on-call burden dropped roughly 60% in the first month — most of the remaining alerts were stale and got cleaned up shortly after.
The economic impact was easier to measure than we expected: the finance team had been writing off roughly $40k/month in unrecoverable duplicate transfers. That number went to zero.
The Engineering Lessons
Cross-cutting concerns belong at the platform layer, not in eight different service codebases. If you find yourself fixing the same class of bug in multiple places, you are fixing the wrong layer.
Idempotency keys are cheap to add and almost impossible to retrofit later. If you are starting a new service today, build this in before the first endpoint ships. We now treat it as table stakes alongside auth and rate limiting.
Naming a contract is more important than implementing it. Once "every write endpoint accepts Idempotency-Key" was a written rule, the team enforced it in PR review without further coordination from us.
The boring fix wins. The team had been considering distributed transactions, saga frameworks, two-phase commits. None of those would have helped. A 24-hour cache of recent request hashes did.
What Did Not Make It Into The Final Architecture
We considered and rejected:
Database-level uniqueness constraints — too coupled to specific writes, doesn't help with non-DB side effects (webhooks, external API calls).
A central transaction coordinator — too much new infrastructure, too brittle.
Eventual consistency reconciliation jobs — band-aids that ship the problem to a different team.
If your platform has the same bug in eight places, the bug is not in the code. It is in the contract.
If your team is debating a saga framework, there might be a simpler answer.


