100 TPS Is Not the Hard Part
We recently ran a sustained load test on MasonXPay and held 100 transactions per second through the payment core — auth, capture, idempotency checks, ledger writes, the full path. That’s our average throughput target; payment traffic isn’t flat, and peak hours can spike to 1,000 TPS or higher. For a payment orchestration platform targeting cross-border merchants in Southeast Asia, a 100 TPS average is a meaningful production baseline.
Then I started doing the math on what these numbers actually mean operationally, and the milestone started to feel like the easy part.
The Complexity Is Not at the Entry Point
When engineers think about scaling a payment system, they instinctively think about the API layer: request throughput, database write latency, connection pooling, and queue depth. Those are real problems with well-understood solutions — horizontal scaling, read replicas, async processing. Hard, but tractable.
The part that keeps payment engineers awake is everything that happens after the transaction is accepted. The business logic is stable. The API is fast. And money still disappears.
Here’s where it actually goes wrong.
Reconciliation: The Batch Job That Owns Your Integrity
At 100 TPS, you’re processing roughly 8.6 million transactions per day. Every one of those needs to be reconciled against the provider’s settlement report — a batch file that arrives once a day, sometimes with a 24–48 hour lag, in a format that varies by provider and by their mood.
Reconciliation is where you find out that the transaction you recorded as CAPTURED at 14:03:17 UTC appears in the settlement file as SETTLED at a different amount (after a fee adjustment you weren’t notified about), under a different reference ID (because the provider’s system re-keyed it during a failover), on a different date (because their settlement window closed in a different timezone).
At low volume, this is manageable. At 8.6M transactions a day, a 0.01% mismatch rate is 860 transactions with a discrepancy, every single day. Each one requires human judgment or an automated resolution path you had to build, test, and maintain.
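To make the mismatch categories concrete, here is a minimal sketch of the matching step. It assumes both the ledger and the settlement file can be reduced to records keyed by a shared reference. The field names (`ref`, `amount_minor`, `settle_date`) and the bucket labels are hypothetical, not any provider's actual format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    ref: str            # reference ID shared by both sides (assumed to exist)
    amount_minor: int   # amount in minor units (e.g. cents) to avoid float error
    settle_date: date   # date the transaction is booked on that side


def reconcile(ledger: list[Record], settlement: list[Record]) -> dict[str, list[str]]:
    """Classify every reference as matched or into a specific mismatch bucket."""
    settled = {r.ref: r for r in settlement}
    result: dict[str, list[str]] = {
        "matched": [],
        "amount_mismatch": [],
        "date_mismatch": [],
        "missing_in_settlement": [],
        "unexpected_in_settlement": [],
    }
    for rec in ledger:
        other = settled.pop(rec.ref, None)
        if other is None:
            # Possibly a re-keyed reference or a late settlement; needs follow-up.
            result["missing_in_settlement"].append(rec.ref)
        elif other.amount_minor != rec.amount_minor:
            result["amount_mismatch"].append(rec.ref)
        elif other.settle_date != rec.settle_date:
            result["date_mismatch"].append(rec.ref)
        else:
            result["matched"].append(rec.ref)
    # Anything left was settled by the provider but never recorded by us.
    result["unexpected_in_settlement"].extend(settled)
    return result
```

Every bucket other than `matched` is an open item that needs either an automated resolution rule or a human. A re-keyed reference shows up here as one `missing_in_settlement` plus one `unexpected_in_settlement`, which is why real pipelines usually add a fuzzy second pass on amount and timestamp.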
Provider Instability: The Unknown Transaction State
Payment providers are not as reliable as databases are. They time out. They return 200 OK with an error in the body. They return a network error after they’ve already processed the charge. They go into maintenance windows at peak hours with 20 minutes’ notice.
The dangerous case is the unknown state: you sent a charge request, the connection dropped before you got a response, and you genuinely do not know whether the money moved. Your idempotency key protects you from double-charging on retry — if the provider honors it correctly, which not all of them do consistently across failover events.
At 100 TPS, even a 30-second provider brownout means 3,000 transactions in an ambiguous state. Each one needs to be resolved by querying the provider’s status API. But what if that query also fails, or returns “transaction not found” for two hours while their systems recover?
The correct pattern is: query with exponential backoff capped at 30 seconds (1s → 2s → 4s → 8s → 16s → 30s), and after N failed queries, force the transaction into a manual review queue. Do not auto-retry the charge. Only mark a transaction FAILED after the provider confirms it in writing or the settlement file shows its absence. The distinction between unknown and confirmed failed is where most gateways lose money: collapsing unknown into failed causes double charges on retry; leaving unknown open forever causes float leakage.
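The resolution loop can be sketched roughly as follows. This is an illustration under assumptions, not MasonXPay's implementation: `query_status` is a hypothetical callable standing in for a provider's status API, the state names are made up for the example, and the delay is assumed to double until it hits a 30-second cap.

```python
import time
from enum import Enum
from typing import Callable, Optional


class TxState(Enum):
    CAPTURED = "captured"
    FAILED = "failed"
    MANUAL_REVIEW = "manual_review"


def backoff_delays(max_attempts: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Doubling delays capped at `cap`: 1, 2, 4, 8, 16, 30, 30, ..."""
    return [min(base * 2 ** i, cap) for i in range(max_attempts)]


def resolve_unknown(
    query_status: Callable[[], Optional[TxState]],
    max_attempts: int = 6,
    sleep: Callable[[float], None] = time.sleep,
) -> TxState:
    """Poll until the provider gives a definitive answer or attempts run out.

    `query_status` (hypothetical) returns CAPTURED or FAILED once the provider
    confirms, and returns None or raises while it cannot say.
    """
    delays = backoff_delays(max_attempts)
    for attempt, delay in enumerate(delays):
        try:
            state = query_status()
        except Exception:
            # A failed status query says nothing about the charge itself.
            state = None
        if state in (TxState.CAPTURED, TxState.FAILED):
            return state
        if attempt < max_attempts - 1:
            sleep(delay)
    # Never collapse "unknown" into FAILED, and never re-send the charge here.
    return TxState.MANUAL_REVIEW
```

The important design choice is the last line: exhausting the attempts produces a review item, not a failure and not a retry of the charge.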
Idempotency key TTL mismatch compounds this further. Your system may honor idempotency keys for 30 days; many providers expire them after 24 hours. A client retrying on day 2 with the same key still looks like a safe replay to your system, but if that retry is forwarded to the provider (for example, because the original attempt never reached a final state), the provider treats it as a new request and charges again. Normalize your idempotency key TTL to the minimum of your system’s and each provider’s window.
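The normalization rule is small enough to show directly. In this sketch the provider names and windows are made-up examples; real values would come from each provider's documentation.

```python
from datetime import timedelta


def effective_idempotency_ttl(
    system_ttl: timedelta,
    provider_ttls: dict[str, timedelta],
) -> dict[str, timedelta]:
    """Per provider, honor a key only for as long as both sides would."""
    return {name: min(system_ttl, ttl) for name, ttl in provider_ttls.items()}


# Hypothetical configuration for illustration only.
ttls = effective_idempotency_ttl(
    system_ttl=timedelta(days=30),
    provider_ttls={
        "provider_a": timedelta(hours=24),
        "provider_b": timedelta(days=45),
    },
)
```

With these inputs, keys routed to `provider_a` are treated as expired after 24 hours even though the system could keep them for 30 days.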
This is not a systems design problem — it’s an operational playbook problem, and it has to be built before the incident, not during it.
Fee Changes Without Notification
Payment providers change their fee structures. Sometimes they send an email three weeks in advance. Sometimes you find out when your margin report looks wrong.
Interchange fees, cross-border surcharges, currency conversion spreads, and monthly minimums — these all feed into the unit economics of every transaction. When a fee table changes and your system hasn’t been updated, you’re either undercharging merchants (funds loss on your side) or overcharging them (regulatory and trust exposure).
At scale, a 0.05% fee shift on 8.6M daily transactions is a material daily loss before anyone notices. Building a fee configuration system that can be updated without a deployment, with audit trails and effective-date logic, is not glamorous work. It’s also not optional.
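As a rough sketch of what effective-date logic with an audit trail can look like, the example below keeps an append-only list of rules and looks up the rate in force on a given day. The class and field names are assumptions for illustration; a production version would live in a database and be editable without a deployment.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class FeeRule:
    effective_from: date   # first day this rate applies
    rate: Decimal          # fraction of the amount, e.g. Decimal("0.0290")
    changed_by: str        # audit trail: who entered the change


class FeeSchedule:
    """Rules for one provider and fee type, kept sorted by effective date."""

    def __init__(self) -> None:
        self._rules: list[FeeRule] = []

    def add(self, rule: FeeRule) -> None:
        # Rules are only ever added, never edited, so history stays auditable.
        self._rules.append(rule)
        self._rules.sort(key=lambda r: r.effective_from)

    def rate_on(self, day: date) -> Decimal:
        """Return the rate of the latest rule that started on or before `day`."""
        dates = [r.effective_from for r in self._rules]
        idx = bisect_right(dates, day)
        if idx == 0:
            raise LookupError(f"no fee rule effective on {day}")
        return self._rules[idx - 1].rate
```

Because lookups are by transaction date, a rule entered late still prices historical transactions correctly when reconciliation reruns.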
Exchange Rate Exposure
Multi-currency payments introduce FX risk that compounds with volume. The exchange rate at authorization time is not the rate at capture time. The rate at capture is not the rate when the provider settles. If you’re holding a float in one currency and paying out in another, every hour of settlement lag is FX exposure.
The worst case isn’t the 24-hour settlement lag — it’s the auth-to-capture time gap. Hotel pre-authorizations, car rental holds, and marketplace escrow can sit open for days or weeks. EUR/USD moved 8% in a single quarter in 2022. A hotel pre-authorizing $5,000 for a two-week stay and capturing on checkout carries real FX exposure across that window, not just overnight.
At 100 TPS across a mix of currencies, you are running a small, unintentional FX desk. You need to know your net exposure by currency pair at any given moment, what your hedge threshold is, and what happens to your P&L when a currency moves 2% intraday. Most of this risk is invisible until it materializes in the reconciliation report.
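A minimal sketch of those three questions, assuming open positions are available as (held currency, payout currency, amount) tuples. The function names, the position layout, and the thresholds are hypothetical, and the shock calculation is a simple linear approximation.

```python
from collections import defaultdict
from decimal import Decimal

Pair = tuple[str, str]


def net_exposure(positions: list[tuple[str, str, Decimal]]) -> dict[Pair, Decimal]:
    """Sum open amounts per (held, payout) pair, in the held currency."""
    totals: dict[Pair, Decimal] = defaultdict(Decimal)
    for held, payout, amount in positions:
        totals[(held, payout)] += amount
    return dict(totals)


def pairs_over_threshold(
    exposure: dict[Pair, Decimal],
    thresholds: dict[Pair, Decimal],
) -> list[Pair]:
    """Pairs whose absolute net exposure exceeds their hedge threshold."""
    return sorted(
        pair for pair, amount in exposure.items()
        if pair in thresholds and abs(amount) > thresholds[pair]
    )


def shock_pnl(exposure: dict[Pair, Decimal], move: Decimal) -> dict[Pair, Decimal]:
    """Approximate P&L change per pair if its rate moves by `move` (0.02 = 2%)."""
    return {pair: amount * move for pair, amount in exposure.items()}
```

Running this continuously rather than at settlement time is what turns the risk from something discovered in the reconciliation report into something visible on a dashboard.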
Chargebacks: The Problem With a Deadline
Reconciliation mismatches are financial discrepancies. Chargebacks are compliance obligations with hard deadlines — and they’re harder to manage at scale than anything above.
At 8.6M transactions/day, a 0.1% dispute rate (conservative for card-not-present) is 8,600 chargebacks per day. Each requires:
- An evidence package submitted within 30 days (varies by card network and issuer)
- Merchant notification within your contractual SLA
- Provisional credit issuance or reversal
- Win/loss tracking that feeds back into your fraud model
Miss the response window, and you lose automatically — regardless of whether the transaction was legitimate. At volume, chargeback management becomes its own operational subdomain: intake pipeline, evidence assembly, merchant communication, and network-specific formatting rules. It does not fit into a generic “operations” backlog.
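The deadline pressure is easiest to see as a triage queue. The sketch below assumes a flat 30-day response window as the default, which is a simplification since the real window varies by card network and issuer; the `Dispute` fields and bucket names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Dispute:
    dispute_id: str
    opened_on: date
    response_window_days: int = 30  # assumption: varies by network and issuer

    @property
    def due_on(self) -> date:
        return self.opened_on + timedelta(days=self.response_window_days)


def triage(disputes: list[Dispute], today: date, warn_days: int = 5) -> dict[str, list[str]]:
    """Bucket disputes by time left to respond, earliest deadline first."""
    buckets: dict[str, list[str]] = {"expired": [], "urgent": [], "on_track": []}
    for d in sorted(disputes, key=lambda d: d.due_on):
        days_left = (d.due_on - today).days
        if days_left < 0:
            buckets["expired"].append(d.dispute_id)   # already lost by default
        elif days_left <= warn_days:
            buckets["urgent"].append(d.dispute_id)
        else:
            buckets["on_track"].append(d.dispute_id)
    return buckets
```

Anything landing in `expired` is an automatic loss, so the useful metric is how many items enter `urgent` without an evidence package attached.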
The Math on Five Nines
Here is the number that reframed this for me.
A 99.999% transaction success rate sounds bulletproof. But do the math first: a 0.001% failure rate on 8.64M daily transactions is about 86 transactions/day with unresolved outcomes, or roughly 2,600 per month. That’s the floor, at average load, before any of the above problems occur.
Now apply volume. The failure rate is only one input; the other is traffic mode.
Baseline assumptions for the calculations below:
- Average transaction value: $50
- Average TPS: 100 → 8.64M transactions/day → $432M daily volume
- Peak TPS: 1,000 (10× spike, realistic during business hours in high-traffic markets)
- Peak window: 2 hours/day → 7.2M transactions in that window alone
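At peak, the daily blended volume (22 hours at 100 TPS plus the 2-hour peak window at 1,000 TPS) climbs to roughly **15.1M transactions** (about $756M at $50).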
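The baseline figures can be checked with a few lines of arithmetic. This sketch assumes the blended day is simply the off-peak hours at average TPS plus the peak window at peak TPS; the function name is illustrative.

```python
SECONDS_PER_HOUR = 3600
AVG_TICKET_USD = 50


def daily_transactions(avg_tps: int, peak_tps: int = 0, peak_hours: int = 0) -> int:
    """Transactions per day: off-peak hours at avg_tps, peak hours at peak_tps."""
    off_peak_hours = 24 - peak_hours
    return (avg_tps * off_peak_hours + peak_tps * peak_hours) * SECONDS_PER_HOUR


flat_day = daily_transactions(100)                                  # 8,640,000
peak_window = 1_000 * 2 * SECONDS_PER_HOUR                          # 7,200,000
blended_day = daily_transactions(100, peak_tps=1_000, peak_hours=2) # 15,120,000

flat_volume_usd = flat_day * AVG_TICKET_USD        # 432,000,000
blended_volume_usd = blended_day * AVG_TICKET_USD  # 756,000,000
```

The blended day is 1.75× the flat day, which is the multiplier behind every "blended day" column in the tables that follow.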
Provider Brownout: 30-Second Outage
A 30-second brownout is not a disaster — it’s a Tuesday.
| Metric | Average (100 TPS) | Peak (1,000 TPS) |
|---|---|---|
| Ambiguous transactions | 3,000 | 30,000 |
| Funds in unknown state | $150,000 | $1,500,000 |
The difference between average and peak is a 10× jump in ambiguous funds — in the same 30 seconds.
Reconciliation Mismatch: 0.01% Error Rate
This is a conservative rate. Real-world reconciliation against multi-provider settlement files can run higher.
| Metric | Average (100 TPS) | Peak (1,000 TPS) blended day |
|---|---|---|
| Daily transactions | 8.64M | ~15.1M |
| Mismatched transactions/day | 864 | 1,512 |
| Funds with discrepancy/day | $43,200 | $75,600 |
| Monthly accumulation | $1.3M | $2.3M |
These aren’t losses yet — they’re open items. Unresolved items age into actual write-offs.
Silent Fee Change: 0.05% Undetected Shift
A provider changes their cross-border surcharge. You will find out in two weeks.
| Metric | Average (100 TPS) | Peak (1,000 TPS) blended day |
|---|---|---|
| Daily volume | $432M | $756M |
| Daily funds lost | $216,000 | $378,000 |
| Two-week exposure | $3.0M | $5.3M |
There is no webhook for “your margin just changed.”
FX Exposure: 2% Intraday Move on Multi-Currency Float
Assume 30% of volume is multi-currency, with a 24-hour settlement lag.
| Metric | Average (100 TPS) | Peak (1,000 TPS) blended day |
|---|---|---|
| FX daily volume | $129.6M | $226.8M |
| Worst-case 2% move | $2.6M | $4.5M |
This is tail risk, not expected loss. But it’s not hypothetical — emerging market currencies move 2% intraday on macro events more often than engineers expect.
The Full Picture
On an ordinary day at average load, the operational surface area of 100 TPS represents:
- ~$150K in brownout-ambiguous funds per incident
- ~$43K in daily reconciliation discrepancies
- ~$216K/day quietly bleeding from an undetected fee change
- Up to $2.6M in FX tail exposure
- Thousands of chargebacks with hard response deadlines
At peak load (1,000 TPS), every number is 1.75–10× larger. The system didn’t fail. The business layer held 99.999%. The funds still need to be found.
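The dollar figures in this summary can be recomputed from the baseline assumptions. The sketch below hard-codes the rates used in the scenarios (0.01% mismatch, 0.05% fee shift, 30% multi-currency share, 2% FX move) and a $50 average ticket; the function name and dictionary keys are illustrative.

```python
from decimal import Decimal

MISMATCH_RATE = Decimal("0.0001")   # 0.01% reconciliation mismatch
FEE_SHIFT = Decimal("0.0005")       # 0.05% undetected fee change
FX_SHARE = Decimal("0.30")          # share of volume that is multi-currency
FX_MOVE = Decimal("0.02")           # 2% intraday move


def scenario_exposure(daily_volume_usd: int, tps: int, avg_ticket_usd: int = 50) -> dict[str, Decimal]:
    """Dollar exposure per scenario for a given daily volume and instantaneous TPS."""
    return {
        "brownout_30s": Decimal(30 * tps * avg_ticket_usd),
        "recon_mismatch_daily": daily_volume_usd * MISMATCH_RATE,
        "fee_shift_daily": daily_volume_usd * FEE_SHIFT,
        "fx_tail": daily_volume_usd * FX_SHARE * FX_MOVE,
    }


average = scenario_exposure(432_000_000, tps=100)
peak = scenario_exposure(756_000_000, tps=1_000)
```

Only the brownout figure scales with instantaneous TPS, which is why it grows 10× at peak while the volume-driven figures grow 1.75×.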
What This Changes About How We Build
The lesson isn’t that 100 TPS is an illusion — it’s a real and necessary milestone. The lesson is that throughput is table stakes, not the destination.
The architecture that handles 100 TPS cleanly is not the same architecture that handles the operational surface area of 100 TPS cleanly. Reconciliation pipelines, provider abstraction layers with state resolution logic, fee configuration systems, FX exposure monitoring, chargeback intake pipelines — these are not scaling problems in the traditional sense. They don’t show up on a load test. They show up in the middle of the night, in a batch job, as a number that doesn’t match.
The payment API is the door. Reconciliation, state resolution, fee monitoring, FX exposure, chargebacks — that’s the building. Most engineers design the door and call it the building.