At 2:13 a.m., a colleague from a nearby fintech team sent the kind of message nobody wants during a launch:
"Checkout latency is exploding. Kafka lag is climbing every second."
An influencer campaign had generated roughly five times the expected traffic. Within minutes, the payment path became the bottleneck.
This was not a story about Kafka being bad. Kafka is excellent when it is operated well and matched to the workload.
This was a story about a self-managed streaming cluster sitting directly on the payment-critical path without enough safety margin.
What Broke
The first visible symptom was consumer lag. Messages were arriving faster than payment workers could clear them, and the backlog passed one million events.
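The arithmetic behind a growing backlog is simple enough to sketch. The helpers below are a minimal illustration under the assumption of constant rates, not the team's actual tooling, and the function names are hypothetical.

```python
def backlog_after(seconds: float, produce_rate: float, consume_rate: float,
                  initial_backlog: float = 0.0) -> float:
    """Backlog size after `seconds`, assuming constant rates in events/second."""
    growth = (produce_rate - consume_rate) * seconds
    # A backlog cannot go below zero even if consumers are faster.
    return max(0.0, initial_backlog + growth)


def seconds_to_drain(backlog: float, produce_rate: float, consume_rate: float) -> float:
    """Time to clear `backlog`; infinite if consumers are not faster than producers."""
    surplus = consume_rate - produce_rate
    if surplus <= 0:
        return float("inf")
    return backlog / surplus
```

With illustrative rates (not figures from the incident) of 5,000 events per second arriving and 1,000 per second being processed, `backlog_after(250, 5000, 1000)` reaches one million events in just over four minutes. The second helper shows the other half of the problem: a backlog only drains once consume rate exceeds produce rate.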
Then partition leadership started moving between brokers. Leader elections stalled producers, and consumer group rebalances paused consumers, at exactly the wrong time.
Finally, several brokers hit disk pressure. Retention was configured by time, not by volume, so the log grew faster than cleanup could free space. The affected brokers stopped accepting new writes and behaved as if they were read-only.
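Why time-only retention fails under a spike can be shown with a rough estimate. This sketch assumes a constant per-broker ingest rate that already includes replica traffic, and it ignores compression and segment rollover details. All names are hypothetical.

```python
def steady_state_bytes(ingest_bytes_per_sec: float, retention_hours: float) -> float:
    """Data held on disk once age-based retention reaches equilibrium."""
    return ingest_bytes_per_sec * retention_hours * 3600


def hours_until_full(free_bytes: float, ingest_bytes_per_sec: float) -> float:
    """Hours until free space runs out if nothing is deleted in the meantime."""
    if ingest_bytes_per_sec <= 0:
        return float("inf")
    return free_bytes / ingest_bytes_per_sec / 3600


def fills_before_cleanup(free_bytes: float, ingest_bytes_per_sec: float,
                         retention_hours: float) -> bool:
    """True if the disk fills before the first spike-era data is old enough to expire."""
    return hours_until_full(free_bytes, ingest_bytes_per_sec) < retention_hours
```

If the answer from `fills_before_cleanup` is true at spike-level ingest, age-based retention alone cannot save the broker, because none of the new data becomes eligible for deletion in time.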
Checkout endpoints started returning 5xx. Marketing paused the campaign. Support lines lit up. The business estimated that the first three hours cost roughly EUR 600,000 in lost revenue.
Why the Failure Multiplied
Three issues landed together.
First, a single external load balancer funneled most traffic into one hot topic. That topic's partitions were unevenly loaded, so only a few brokers took most of the load.
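One way to quantify "uneven" is the ratio between the busiest partition and the average. The sketch below assumes you can already export per-partition message counts from your own metrics. The function names and the 1.5 threshold are illustrative assumptions.

```python
def skew_ratio(messages_per_partition: dict[int, int]) -> float:
    """Busiest partition's load divided by the mean load (1.0 means perfectly even)."""
    if not messages_per_partition:
        raise ValueError("no partitions given")
    counts = list(messages_per_partition.values())
    mean = sum(counts) / len(counts)
    if mean == 0:
        return 1.0  # an idle topic is trivially balanced
    return max(counts) / mean


def hot_partitions(messages_per_partition: dict[int, int],
                   threshold: float = 1.5) -> list[int]:
    """Partition ids whose load exceeds `threshold` times the mean, in sorted order."""
    counts = list(messages_per_partition.values())
    mean = sum(counts) / len(counts) if counts else 0
    return sorted(p for p, c in messages_per_partition.items()
                  if mean > 0 and c > threshold * mean)
```

A ratio close to 1.0 suggests an even spread. A ratio of 2 or 3 means one partition, and therefore one leader broker, is doing several times its share.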
Second, retention was tuned for normal traffic. During the spike, disk filled before cleanup could provide breathing room.
Third, consumers were scaling automatically. That sounded helpful, but every consumer that joined or left the group triggered a rebalance, and the membership churn created a rebalance storm right when the system needed stability.
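A common way to dampen this kind of churn is to put a cooldown between scaling actions. The class below is a generic sketch of that idea, not a Kafka API and not what the team ran. The class name, method name, and the 300-second value in the usage note are assumptions.

```python
from typing import Optional


class ScaleCooldown:
    """Allow a consumer-count change only if the previous change is old enough."""

    def __init__(self, cooldown_seconds: float):
        self.cooldown_seconds = cooldown_seconds
        self._last_change: Optional[float] = None

    def should_apply(self, current: int, desired: int, now: float) -> bool:
        """Return True and record the change time if scaling may proceed."""
        if desired == current:
            return False  # nothing to do, so no membership change
        recent = (self._last_change is not None
                  and now - self._last_change < self.cooldown_seconds)
        if recent:
            return False  # still cooling down from the previous change
        self._last_change = now
        return True
```

With `ScaleCooldown(300)`, a scale-up at time 0 is allowed, a second request at 60 seconds is refused, and one at 300 seconds goes through. Fewer membership changes means fewer rebalances, at the cost of reacting more slowly to load.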
Any one of these problems might have been survivable. Together, they turned a large traffic spike into an outage.
The Recovery Decision
The postmortem reached a blunt conclusion: self-managed Kafka should not stay on the payment hot path.
The team moved the money flow to Google Cloud Pub/Sub because it removed several operational duties from the critical path:
- managed scaling,
- regional replication,
- no broker disk management for the application team,
- fewer rebalance decisions during spikes,
- clear service-level expectations.
Kafka was not deleted. It moved into an offline replay and analytics role, where it was less dangerous if operations became noisy.
Migration in Four Steps
The team started with mirror mode. A connector copied live Kafka traffic into staging Pub/Sub topics. Historical logs, roughly 1.2 billion events, were replayed to check delivery behavior and downstream compatibility.
Next came a feature-flag rollout. Five percent of producers switched first. Metrics stayed green, so the flag moved gradually to 100 percent.
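A percentage rollout like this is often implemented by hashing a stable identifier into a bucket. The sketch below shows one such scheme. The flag name `pubsub-cutover` and the function name are hypothetical, and the source does not say which flag system the team used.

```python
import hashlib


def in_rollout(producer_id: str, percent: int,
               flag_name: str = "pubsub-cutover") -> bool:
    """Deterministically decide whether a producer is inside the rollout percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    key = f"{flag_name}:{producer_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Map the first 8 bytes of the hash to a stable bucket in [0, 100).
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent
```

Because the bucket depends only on the flag name and the producer id, a producer that is enabled at 5 percent stays enabled as the percentage rises, so no producer flips back and forth between the two systems during the ramp.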
Billing guardrails were added before full cutover. Pub/Sub is pay-per-use, so alerting on message volume and forecast drift was required from day one.
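A guardrail of this kind can be reduced to two checks: an absolute cap on message volume and a relative deviation from the forecast. This is a minimal sketch with hypothetical names and a 20 percent default threshold chosen purely for illustration. It does not call any Pub/Sub or billing API.

```python
def forecast_drift(observed: float, forecast: float) -> float:
    """Relative deviation of observed volume from forecast (0.25 means 25% over)."""
    if forecast <= 0:
        raise ValueError("forecast must be positive")
    return (observed - forecast) / forecast


def volume_alerts(observed: float, forecast: float, hard_cap: float,
                  drift_threshold: float = 0.2) -> list[str]:
    """Return the names of the guardrails that the observed volume violates."""
    alerts = []
    if observed > hard_cap:
        alerts.append("volume_over_cap")
    if abs(forecast_drift(observed, forecast)) > drift_threshold:
        alerts.append("forecast_drift")
    return alerts
```

Checking drift in both directions is deliberate: a volume far below forecast can signal lost traffic just as a volume far above it signals an unexpected bill.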
Finally, Kafka remained as a cold backup archive. It no longer decided whether checkout survived a spike.
Three Months Later
Black Friday produced a 10x surge. Pub/Sub absorbed the spike without emergency SRE intervention.
p95 latency dropped from about 400 ms to about 130 ms.
The overnight Kafka on-call rotation disappeared. Engineers stopped spending launch nights restarting brokers and chasing disk pressure.
Most importantly, the payment platform now had a managed messaging layer with a clear uptime target and fewer custom failover paths to maintain.
The CTO summarized the trade-off simply:
"We traded sleepless nights for a predictable Pub/Sub bill."
Five Health Checks If You Still Run Kafka
Check partition balance before traffic grows. A hot topic can crush one broker long before cluster-wide CPU looks alarming.
Set retention with disk headroom in mind. Age-based retention is not enough when traffic can multiply overnight.
Avoid a single load-balancer dependency. Replicated brokers do not help if traffic enters through one fragile door.
Keep canary consumers on an alternate system. A shadow subscription can make failover a controlled IAM and routing change instead of a desperate rewrite.
Price the on-call cost honestly. Managed messaging is not always cheaper on the bill, but night-shift SRE hours are real cost too.
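Of the five checks, the retention one lends itself most directly to a calculation. Kafka's `retention.bytes` setting is enforced per partition, so a size cap can be derived from the disk budget and the number of partition replicas a broker hosts. The helper below is a rough sketch under that assumption. Its name and parameters are hypothetical, and it assumes partitions of similar size.

```python
def retention_bytes_per_partition(disk_bytes: int, headroom_fraction: float,
                                  partition_replicas_on_broker: int) -> int:
    """Per-partition size cap that keeps `headroom_fraction` of the disk free."""
    if not 0 <= headroom_fraction < 1:
        raise ValueError("headroom_fraction must be in [0, 1)")
    if partition_replicas_on_broker <= 0:
        raise ValueError("broker must host at least one partition replica")
    usable = disk_bytes * (1 - headroom_fraction)
    # Split the usable space evenly across the replicas hosted on this broker.
    return int(usable // partition_replicas_on_broker)
```

For a 1 TB disk, 25 percent headroom, and 50 partition replicas on the broker, this gives a cap of 15 GB per partition. Retention only removes closed log segments, so real usage can overshoot the cap by roughly one segment per partition, which is one more reason to keep the headroom generous.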
Takeaway for 2026
Kafka is a powerful system. Pub/Sub is a powerful managed service. The right choice depends on ownership, latency needs, compliance, operational maturity, and traffic volatility.
For this fintech payment path, the deciding factor was not feature richness. It was blast radius.
When a queue sits between a customer and a completed payment, the operational burden is part of the architecture. If the team cannot carry that burden during a launch spike, managed messaging may be the safer business decision.


