Episode 19

The Live Ops Disaster

"The test passed. The load didn't."
20 min read
Week of June 8 - 12, 2026

Pixel Spree's summer event launches with no realistic staging environment and no load testing. Hassan's infrastructure cracks under traffic three times the model predicted. Mariana pushes emergency fixes while Stefan confronts Katja about months of avoiding hard questions on capacity planning. By Friday, the team has learned that code correctness is meaningless without operational realism — but they've also burned through two weeks of headway from TDD adoption in a single weekend.

Previously: "When Tests Become Religion" — Stefan introduced test-driven development to the backend team, and they spent five days proving that writing the test first changes more than code: it changes who trusts the system. By Friday, the backend team was writing smaller code with clearer intent, Daniel was quietly realizing automated tests might replace his approval theater, and the DORA metrics showed pull requests staying under one hundred lines for four straight days.

Monday, 08:47 — The Launch That Shouldn’t Happen

Katja in a glass conference room with Stefan and Lukas as the Navigator display highlights repeated warnings about staging parity, load testing, and event traffic assumptions.
"We don't have staging parity. We have configuration files and hope."

Katja knew before she opened her laptop that Monday was going to be a fight.

She’d spent Sunday reading Hassan’s logs from Saturday night: three failed load tests, staging environment unable to mirror production scaling behavior, no way to know if the summer event infrastructure would hold at 3x traffic without breaking something expensive. She’d read them twice and then called Stefan before she’d had coffee.

He’d asked one question: “What’s your risk tolerance?”

She’d answered honestly: “Zero.”

That was why she was standing in Lukas’s office at 08:47 with her laptop open, the Navigator synthesis surfacing the same warnings that had kept her awake until 02:30.

“I’m not signing off on this launch,” she said.

Lukas didn’t look up from his phone. “Hassan says we’re ready.”

“Hassan says he can monitor and react,” Katja corrected. “There’s a difference between monitoring a fire and having sprinklers that actually work.”

Stefan was leaning against the wall with his arms folded, watching Lukas’s face change in slow motion. He’d been here three days when Katja called him Sunday night. He’d read the logs. He’d asked to see the staging environment. He’d seen what she’d seen: a mirror that broke under load because it wasn’t actually mirroring anything except the surface configuration files.

“We don’t have time for another round of load testing,” Lukas said finally, not looking up from his phone. “Marketing has already sent the email blast. The community managers are posting on social media at noon. We’re committed.”

“Then you’re committing to a disaster,” Katja said. “That’s all I’m saying. We don’t have operational realism. We have configuration files and hope.”

Lukas finally looked up. His face was the kind of tired that doesn’t improve with sleep. “What do you need?”

“A week to build proper staging parity. Two weeks for realistic load testing. A rollback plan that doesn’t involve manually editing production databases at 03:00.”

“And if we launch today and something breaks?” Lukas asked.

“Then we break it in front of players instead of breaking the player experience permanently,” Katja said. “We have a hotfix pipeline. We can patch. But if we ship this without knowing where the infrastructure cracks, we’re not launching an event. We’re launching a stress test on our team’s patience.”

Lukas turned to Stefan. “You’re the outsider here. What do you think?”

Stefan pushed off the wall and walked toward the dashboard. The summer event traffic projections were displayed in bright green lines climbing over time: expected players peaking at 180,000 concurrent users on Saturday morning. A number that felt wrong to him because it came from a model built on last quarter’s data without accounting for social media amplification or the influencer partnerships they’d just signed.

“We’ve been having this conversation for months,” Stefan said quietly. “Every time we say ‘we need load testing,’ someone says ‘we don’t have time.’ Every time we say ‘staging needs parity,’ someone says ‘we can fix it later.’ The pattern is clear.”

Lukas’s jaw tightened. “And what’s the conclusion you’ve reached?”

“That code correctness doesn’t matter if operational assumptions are wrong,” Stefan said. “I saw Mariana write a test on Friday that proved her payment logic was bulletproof. It passed every unit test. Every integration test. The problem is nobody has tested whether the infrastructure can handle 180,000 players trying to make payments simultaneously.”

He turned back to Lukas. “You’re asking me if we should launch today. My answer is no. Not because I’m opposed to shipping. Because I’ve seen this movie before and it ends with three developers working Sunday night while the rest of us pretend we didn’t know it was coming.”

Lukas looked at Katja again. His face had changed. The kind of change that comes from realizing you’re about to make a choice you’ll regret in six months.

“Okay,” he said finally. “What’s the minimum viable safety check?”

Katja exhaled so hard she felt it in her knees. “We run a controlled canary launch. 5% of players get the event on Monday. We monitor for two hours. If nothing breaks catastrophically, we expand to 20%. Then 50%. Then full.”

Lukas nodded once. “Do it.”

He turned and walked out before either of them could say anything else.

Stefan looked at Katja. “That went better than I expected.”

“I told him the truth,” she said. “Sometimes that’s all people need to hear.”

She closed her laptop. “Now we have two hours before marketing launches the announcement. We should probably tell Hassan and Mariana what we’re doing.”
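Sidebar: the story never says how Pixel Spree picks which players land in the canary group, so what follows is only a minimal sketch under stated assumptions. It uses one common approach, hashing each player ID into a stable bucket from 0 to 99, so that widening the rollout from 5% to 20% to 50% to 100% only ever adds players and never removes anyone who already has the event. The names `player_bucket`, `in_canary`, and the `summer-event` salt are hypothetical.

```python
import hashlib

# Rollout stages from Katja's plan: 5%, then 20%, then 50%, then everyone.
ROLLOUT_STAGES = [5, 20, 50, 100]


def player_bucket(player_id: str, salt: str = "summer-event") -> int:
    """Map a player ID to a stable bucket in the range 0-99.

    The salt is a hypothetical per-event string so that different events
    do not always select the same players for the canary group.
    """
    digest = hashlib.sha256(f"{salt}:{player_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_canary(player_id: str, rollout_percent: int) -> bool:
    """Return True if this player should see the event at the given stage."""
    return player_bucket(player_id) < rollout_percent


if __name__ == "__main__":
    players = [f"player-{n}" for n in range(10_000)]
    for percent in ROLLOUT_STAGES:
        enabled = sum(in_canary(p, percent) for p in players)
        print(f"{percent:>3}% stage: {enabled} of {len(players)} players enabled")
```

Because the bucket depends only on the player ID and the salt, a player who was in the 5% group is still in the 20% group, which keeps the monitoring window comparable from one stage to the next.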

Monday, 14:32 — The Canary Launch

Hassan at his desk with three monitors showing server metrics while Mariana points at a spike in the graph and Stefan watches through the glass wall.
"At seven percent of players, the graph was already screaming."

The canary launch went live at 14:00 UTC.

At 14:32, Hassan’s second monitor started flashing red.

“Concurrent connections hitting hard limit,” he said without looking up. “We’re at 85% of capacity with only 7% of players connected.”

Mariana was already pulling up her terminal. “What’s the limit?”

“Default load balancer setting from six months ago. Not tuned for this event.”

She started typing. “Can we increase it without restarting?”

Hassan watched the numbers climb. “Not safely. If we hit the connection ceiling hard, players start getting dropped packets. The game feels laggy. Then they think it’s their internet and leave reviews.”

“Then we restart,” Mariana said.

“That causes a brief disconnect for everyone currently connected to that node,” Hassan said. “The canary group is still small. Maybe two thousand players max right now.”

“Do it.”

Hassan executed the command. The graph spiked briefly as connections dropped and re-established, then settled at a lower number. He’d increased the limit by 40%.

“We bought ourselves time,” he said. “But this isn’t sustainable. We need proper load balancing configuration before Saturday.”

Mariana was already looking at the payment logs. “We have another problem.”

Hassan turned to her screen. Payment transactions were climbing, but so was the error rate. Not catastrophic yet. Just enough to notice.

“Payment gateway timeout,” Mariana said. “The external API has a rate limit we didn’t account for. Players are trying to purchase boosters during peak event moments and the gateway is rejecting them.”

“We already added caching,” Hassan said. “That should have reduced load.”

“It did reduce load,” Mariana agreed. “But now we’re hitting cache invalidation bottlenecks instead of database queries. The system didn’t break. It just broke somewhere else.”

Stefan appeared at the end of the desk row without anyone noticing him arrive. He’d been watching the metrics for ten minutes and hadn’t said anything until he saw Mariana’s face change when she noticed the cache issues.

“What did you expect?” Stefan asked quietly.

Mariana turned to him, eyes bright with that dangerous energy she got when she was about to be honest in a way that made executives uncomfortable. “I expected the payment system to work because we wrote tests for it.”

“You wrote tests for the payment logic,” Stefan corrected. “You didn’t write tests for whether the external API can handle 180,000 players trying to purchase boosters simultaneously during a weekend event.”

“Because that’s not in scope,” Mariana said. “That’s operations.”

“It is now,” Stefan said. “Code correctness isn’t enough if your operational assumptions are wrong. You can have perfect unit tests and still ship something that breaks under real load.”

Hassan was already typing commands to add circuit breakers around the payment gateway calls. “We need exponential backoff on retries. The current configuration is hammering the API when it starts failing, which makes things worse.”

“Done,” Mariana said. “Added three lines of code that might save us from a cascade failure.”

Stefan watched her type. She was working fast but not frantically. There was a rhythm to her movements now that he hadn’t seen before TDD. The kind of confidence that comes from knowing your tests will catch regressions while you’re fixing the immediate problem.

“How many players are affected?” Stefan asked.

“About 12% of canary group,” Hassan said. “Error rate climbing but not catastrophic yet.”

Mariana looked at him. “What happens if we don’t fix this before full launch on Saturday?”

Hassan didn’t answer immediately. He was watching the graphs and thinking about what would happen when that 12% error rate applied to 180,000 players instead of two thousand.

“We’ll have angry players,” he said finally. “We’ll have support tickets. We might lose revenue for the event. But we won’t lose data or corrupt player progress.”

“That’s the difference between a bad launch and a catastrophic one,” Stefan said.

Mariana nodded once. “Then we fix this before Saturday.”

“We already are,” Hassan said, not looking up from his screen.

Tuesday, 16:21 — The Load Test That Failed

Katja, Stefan, and a small group of developers watch load test results on a wall display as the infrastructure graph collapses.
"The model said stable. The graph said otherwise."

The load test was supposed to happen at 16:00.

It started at 16:03 because Hassan needed three more minutes to configure the monitoring properly. What followed was a failure the team would still be talking about months later.

They’d built the test on assumptions that felt reasonable until they failed catastrophically. They’d modeled player behavior based on last quarter’s data. They’d assumed linear scaling of payment requests with concurrent users. They’d assumed the load balancer configuration would hold at 3x traffic because it had held at 2x in testing.

The test started successfully. The system scaled as expected up to 50,000 concurrent connections. Then it hit a wall that nobody had predicted.

“Database connection pool exhausted,” Hassan said from the conference room corner where he’d been monitoring metrics. “We’re hitting the max pool size before we hit the load balancer limit.”

Katja turned to him. “What’s the max pool size?”

“500 connections,” Hassan said. “Configured six months ago when our peak concurrent users were 40,000. We didn’t adjust it for this event because we thought we’d have time to review capacity planning in Q3.”

Stefan was looking at the graph that showed database CPU climbing toward 100% as connection requests piled up. “How many players does 500 connections actually support?”

“Depends on query complexity,” Hassan said. “With this event’s reward system and inventory updates, maybe 8,000 concurrent users before we start seeing timeouts.”

Mariana appeared in the conference room doorway with her laptop open and the same dangerous brightness in her eyes she’d had on Monday. “We have another problem.”

She pulled up a chair and rotated the screen toward the group. The payment gateway error rate had climbed from 12% during the canary launch to 34% under load test conditions.

“The caching layer is invalidating too aggressively,” she said. “Every time a player purchases something, we’re invalidating the entire cache for that player’s inventory. At scale, that means thousands of cache misses per second. The database is getting hammered twice: once from connection pool exhaustion and once from cache stampede.”

Stefan looked at her. “You wrote tests for this scenario?”

Mariana shook her head. “I wrote tests for the payment logic. I didn’t write tests for whether our caching strategy would hold under load because that felt like operations work, not development work.”

“That’s the mistake,” Stefan said quietly. “We keep separating code correctness from operational realism. But they’re the same thing. A system isn’t correct if it works at 10 users and fails at 10,000.”

Katja was looking at Hassan. “What do we need to fix this before Saturday?”

Hassan was already typing commands on his terminal. “We need to increase the connection pool size. We need to implement query result caching at the application level so we’re not invalidating entire inventories for single purchases. And we need to add read replicas to handle the increased query load.”

“How long?” Katja asked.

“Two days minimum,” Hassan said. “But that’s assuming we don’t break something else while making these changes.”

Mariana looked at him. “We can do it with tests first. I’ll write tests for the cache invalidation logic. You handle the connection pool and read replicas.”

Hassan nodded once. “Agreed.”

Katja was already looking at her phone. “I’m telling Lukas we need to delay the full launch by two days.”

Stefan watched her type. “Will that be enough?”

“It won’t solve everything,” Katja admitted. “But it buys us time and reduces player exposure if something still breaks.”

She looked up at Stefan. “This is what you meant about operational realism, isn’t it? We can have perfect code and still ship something broken because we didn’t test the right things.”

Stefan nodded. “You can pass every unit test in the world and still fail if your assumptions about scale are wrong.”

Mariana was already back at her desk typing tests. The kind of work she’d started doing on Monday that made Stefan smile when he thought nobody was looking. She wasn’t just fixing problems anymore. She was writing tests to prevent them from happening again.
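Sidebar: a minimal sketch of the difference Mariana describes between dropping a player's whole cached inventory on every purchase and dropping only the item that changed. The story does not show Pixel Spree's real caching layer, so everything here is an assumption: `InventoryCache` is a toy in-memory cache, and `load_item` is a hypothetical stand-in for the database query. The sketch does not address the stampede itself (many requests missing at once), only how much of the cache each purchase throws away.

```python
class InventoryCache:
    """Toy in-memory cache keyed by (player_id, item_id).

    `load_item` is a hypothetical callable standing in for the database query.
    """

    def __init__(self, load_item):
        self._load_item = load_item
        self._entries = {}
        self.misses = 0  # counts how often we had to go to the database

    def get(self, player_id, item_id):
        key = (player_id, item_id)
        if key not in self._entries:
            self.misses += 1
            self._entries[key] = self._load_item(player_id, item_id)
        return self._entries[key]

    def invalidate_player(self, player_id):
        """Coarse approach: drop every cached item for the player."""
        for key in [k for k in self._entries if k[0] == player_id]:
            del self._entries[key]

    def invalidate_item(self, player_id, item_id):
        """Targeted approach: drop only the item the purchase changed."""
        self._entries.pop((player_id, item_id), None)
```

With ten cached items for one player, a purchase followed by a full inventory read costs one database query under `invalidate_item` and ten under `invalidate_player`. Multiplied across thousands of purchases per second, that gap is the kind of load the Tuesday test exposed.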

Wednesday, 19:58 — The Near Miss

Hassan and Mariana work late at their desks while Stefan watches through the glass wall, the office mostly dark except for the lit development floor.
"Nothing had failed yet. That was what made it dangerous."

The near miss happened at 19:58 on Wednesday because of course it happened at 19:58.

Hassan had just merged the connection pool fix when the monitoring system started flagging unusual patterns in the payment gateway responses. Not failures yet. Just latency climbing in ways that suggested something was about to break.

“We’re seeing response times spike on payment requests,” he said, pulling up the metrics dashboard. “The gateway is still responding but at 800ms instead of 150ms.”

Mariana was already looking at the logs. “Is it the rate limiting kicking in?”

“No,” Hassan said. “It’s something else. The gateway is queuing requests and processing them slower than usual.”

“Are we getting throttled?” Mariana asked.

“Not yet,” Hassan said. “But if response times climb another 200ms, we’ll start seeing timeouts on the client side. Players will think the game is broken and start leaving reviews.”

Stefan appeared at their desk row without anyone noticing him arrive. He’d been working from his own desk but had watched the metrics climb for ten minutes before coming over.

“What’s the root cause?” he asked.

Hassan turned to him. “We’re not the bottleneck. The gateway is. But we’re triggering it because our retry logic is hammering them when they start slowing down.”

Mariana was already typing. “So we need exponential backoff on retries with jitter.”

“Exactly,” Hassan said. “The current configuration retries immediately on timeout, which means if the gateway is slow, we send five requests in one second instead of one. That makes the problem worse.”

Stefan watched her type. The rhythm he’d noticed on Monday was still there: fast but not frantic, the pace of someone who trusts her tests to catch regressions while she fixes the immediate problem.

“How many players are affected right now?” Stefan asked.

“About 5% of active sessions,” Hassan said. “Not catastrophic yet but climbing.”

Mariana finished typing and pushed the commit. “That should help reduce load on the gateway during slow periods.”

Hassan watched the metrics for two minutes. Response times started dropping back toward normal levels. The retry hammer had stopped.

“We bought ourselves time,” he said. “But we need to talk to the payment gateway team about capacity planning for this event.”

Mariana looked at him. “We can’t talk to them on a Wednesday night.”

“No,” Hassan agreed. “But we should send an email first thing Thursday morning and ask for emergency capacity review.”

Stefan was looking at both of them with something that might have been pride if he wasn’t too tired to feel it properly. “You’re doing the right work,” he said quietly. “Not just fixing problems but preventing them from happening again.”

Mariana looked up at him. “We should write tests for this scenario though. So we don’t forget next time.”

“That’s exactly what I was thinking,” Stefan said.

Hassan nodded. “I’ll draft the test cases. You handle the implementation?”

“Deal,” Mariana said.

They both turned back to their screens and went back to work. The office was quiet except for keyboard clicks and the hum of servers running in the background. Outside, Berlin was dark and mostly still. Inside, three developers were working on a problem that would keep them awake until 02:30 if they let it.
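Sidebar: a minimal sketch of the fix Mariana and Hassan settle on, exponential backoff on retries with jitter. The story does not show the actual commit, so this is an illustration under stated assumptions rather than their code. `GatewayTimeout`, `backoff_delay`, and `call_with_backoff` are hypothetical names, and the base delay and cap are made-up values. The delay ceiling doubles on each attempt, and the real wait is a random fraction of that ceiling so that thousands of clients do not all retry at the same instant.

```python
import random
import time


class GatewayTimeout(Exception):
    """Hypothetical error raised when the payment gateway is too slow."""


def backoff_delay(attempt, base=0.2, cap=5.0, rng=random.random):
    """Delay in seconds before retry number `attempt` (0-based).

    The ceiling doubles with each attempt up to `cap`; the actual delay is a
    random fraction of that ceiling ("full jitter") so clients spread out.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def call_with_backoff(request, max_attempts=5, sleep=time.sleep):
    """Call `request()` and retry on GatewayTimeout with growing, jittered waits."""
    for attempt in range(max_attempts):
        try:
            return request()
        except GatewayTimeout:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(backoff_delay(attempt))
```

Compared with retrying immediately, a slow gateway now sees fewer requests the longer it struggles instead of more, which is the behavior Hassan wanted when he described the retry hammer.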

Friday, 17:44 — What the Week Proved

Katja and Stefan review the Navigator synthesis in the glass conference room while Mariana writes tests and Hassan works at his desk beyond the glass wall.
"The code had held. The assumptions had not."

By Friday evening, the office felt different again.

Not transformed.

That would have been too easy.

But different in the way a room feels after somebody finally opens the right window.

Katja had pulled Stefan into the conference room at 16:50 with the same look she’d had Monday two weeks earlier when the DORA metrics first landed on her screen. Except this time it wasn’t dread. It was concentration with a trace of satisfaction she didn’t trust enough to name out loud yet.

Navigator filled the wall display.

The backend repository panel showed what a week of changed behavior looked like when translated into numbers. Pull request sizes had stayed below one hundred lines for four straight days. Review comments were shorter but more specific. Commit frequency had lost its old burst pattern and spread itself across the workday like people were actually integrating instead of hoarding.

But the most interesting metric was the one that made Katja smile when she saw it: zero production incidents during the summer event launch despite starting without proper staging parity or load testing.

“The canary worked,” Stefan said before she could ask.

“Yes,” Katja agreed. “But only because the canary surfaced the payment gateway issues on Monday and the load test exposed the cache invalidation problems on Tuesday.”

She zoomed in on the backend graph. Mariana’s activity stood out in all the right ways: more commits, smaller diffs, tests added before behavior changes. One reverted experiment that had been rolled back in twelve minutes because the failing test told her exactly what she had misunderstood. Failure, contained early enough to be cheap. That was progress too, even if nobody put it in slide decks.

“The Unity side still looks unchanged,” Katja said. “Anton hasn’t touched TDD.”

“Probably not this week,” Stefan agreed. “He’s still fighting with us from inside the conversation instead of outside it. That’s progress in Anton-language.”

Katja looked past him to the QA area, and Stefan followed her gaze. Daniel’s Moleskine was closed again, which registered as a weather event after months of constant note-taking. He was talking to one of his testers while pointing at a screen, and from this distance Stefan couldn’t read the code, but he could read the posture: curious, not defensive.

“He asked me this morning whether Navigator could correlate test additions with escaped defects over time,” Katja said. “Then he asked whether QA should start writing executable acceptance checks for the most expensive regressions instead of waiting for development to hand them broken tickets in three formats.”

Stefan looked back at the wall display. “There it is,” he said.

“What?”

“The moment process people discover they don’t actually want more gates. They want better evidence.”

Katja folded her arms. “You’re pleased with yourself.”

“A little.”

“Disgusting.”

She smiled anyway, even though she didn’t trust herself to admit it out loud yet.

On the screen, Hassan’s activity told a similar story: more infrastructure commits, better documentation of deployment scripts, tests added for the load scenarios that had failed on Tuesday. He’d also sent an email to the payment gateway team at 08:15 Thursday morning asking for emergency capacity review, and they’d responded within two hours with a temporary rate limit increase.

“Read Sofia’s log note,” Katja said.

She clicked into the daily log synthesis.

Sofia had written three lines Thursday night:

I thought TDD was about proving I was smart enough to think ahead. It isn’t. It’s about making the problem small enough that I stop pretending I understand more than I do.

Stefan read it twice. “That’s the whole curriculum,” he said.

Katja’s mouth curved. Not a smile exactly. Approval with better bone structure.

“Mariana’s is filthier,” she said.

She opened the next note.

Tests aren’t religion. They’re witness statements. Code lies. Tests make it lie less.

Stefan laughed hard enough that Katja had to wait to keep talking. “She’s not wrong.”

“No. She rarely is when she’s angry enough.” Katja glanced at the backend floor again. “And Anton?”

She opened his note.

It was one sentence.

I still hate the sermon, but I hate invisible requirements more.

That nearly did Stefan in.

Katja let the silence sit after that. Not sentimental. Just observant.

Outside the conference room, Friday was unwinding the way healthy offices do. Not by collapsing. By tapering. Chairs rolling back. Music low. Bags appearing. Hassan already gone at 17:03. Second Thursday in two weeks he’d left on time, and now a Friday to match. Nobody panicking. Nobody hovering over the pipeline waiting for a surprise knife.

“You know what changed first?” Katja said.

“Tell me.”

“Not the metrics. The posture.” She nodded toward the floor. “People look less hunted. Mariana isn’t bracing before every merge. Sofia isn’t apologizing before she asks questions. Daniel is still Daniel, but now he’s curious. Even Anton is fighting with us from inside the conversation instead of outside it.”

Stefan watched the room through the glass.

She was right.

The change wasn’t dramatic enough for executives. Not yet. No giant before-and-after. No hero shot. Just the almost invisible shift from anticipatory dread to situational attention. Teams that stop expecting the software to ambush them think differently. They even stand differently.

“What’s next?” Katja asked.

That was the real question. Not whether this week had worked. Whether the next crisis would let it keep working.

Stefan looked at the live ops calendar pinned beside the screen. Q3 planning starts Monday. Bigger event launch in August. Same infrastructure gaps Daniel had been nagging about for months. No proper staging mirror. Load testing still theoretical.

“Next,” he said, “reality checks whether the learning happened in time.”

Katja followed his gaze to the calendar.

Her face changed.

Not fear exactly.

Recognition.

“The event,” she said.

“Yes.”

They stood in silence for a moment, both of them looking at the same problem from different altitudes.

Then Katja closed the laptop.

“Fine,” she said. “At least now when it breaks, we’ll know more quickly why.”

“There speaks the soul of a CTO.”

“Fuck off,” she said, already opening the door.

He followed her back onto the floor.

Mariana looked up immediately. “Do we get a verdict?” she asked.

Katja grabbed her bag from the back of the chair. “Yes,” she said. “You’re all slightly less chaotic than Monday. Don’t get sentimental about it.”

Sofia laughed.

Anton lifted one hand in a mock blessing. “Go in peace, my children of the failing test.”

“You’re still an asshole,” Mariana said.

“Consistent character design,” Anton said.

Daniel, from across the floor, actually smiled.

Small. Brief. Real.

That might have been the strangest thing anyone saw all week.

Navigator — Katja Müller — 12 June 2026, 22:37

Summer event launch week. Canary approach saved us from catastrophe on Monday when we discovered payment gateway rate limits and cache invalidation bottlenecks under load.

Stefan’s point about operational realism landed harder than any lecture I’ve given in years. We can have perfect unit tests and still ship something broken if our assumptions about scale are wrong. That’s what happened on Monday: payment logic was bulletproof according to every test, but the infrastructure couldn’t handle real traffic patterns.

Mariana carried her TDD habit straight into this gap. She wrote tests for cache invalidation scenarios by Tuesday afternoon. Hassan implemented circuit breakers and exponential backoff on retries. By Wednesday night we had a near miss that could have been catastrophic without the monitoring systems they’d improved over the weekend.

Anton remains skeptical about TDD but admitted in his log that he hates invisible requirements more than “the sermon.” Daniel is asking whether QA should write executable acceptance checks instead of interpreting vague tickets after development is done. He may be realizing that automated tests can replace part of the approval theater he built to create safety.

Navigator signals from this week:

  • Zero production incidents during summer event launch despite starting without proper staging parity or load testing.
  • Backend pull request sizes stayed under 100 lines all week.
  • Test files were touched in 83% of backend commits since Monday.
  • Review comments got shorter and more specific.
  • Commit activity spread across the workday instead of clustering in end-of-sprint panic.
  • Unity repository unchanged so far. Skepticism visible in both code and logs.

The adoption curve is uneven. Good. Real change is uneven.

Next week is Q3 planning. We still have no proper staging environment that mirrors production scaling behavior. Load testing remains theoretical. Daniel has been right about this for months. If the August event blows up, it will not be because TDD failed. It will be because we are still learning one practice while carrying old infrastructure debt into a new traffic spike.

But the team is thinking earlier now. Smaller. Clearer. That’s not enough to save us from everything. It might be enough to show us exactly where the next fracture is.

Next Episode: "The Q3 Planning Crisis"

Leadership demands ambitious goals for Q3 without addressing the infrastructure gaps that nearly caused disaster during the summer event. Katja and Stefan must convince Lukas to invest in operational realism while development continues adapting to TDD practices.