For years, recovery planning followed a familiar pattern. Build the plan, document the steps, and assume it will work when needed. For a long time, that approach held up. Hardware failures, isolated outages, even natural disasters – these were scenarios organizations could anticipate and plan for with some level of confidence.
But the equation has changed.
In this episode of STRIVE, I sat down with Commvault’s Jason Cray, Principal Product Experience, to explore a reality we continue to see across organizations of all sizes: Most don’t fail because they lack a recovery plan. They fail because they’ve never proven that plan will hold up under real pressure.
Watch the full episode.
Key Takeaways: Why Recovery Plans Break Down
- A documented plan isn’t the same as a proven one. If it hasn’t been tested in realistic conditions, it’s still an assumption.
- Recovery is a team sport. Security, infrastructure, and operations must align – or recovery slows down.
- Most investment still happens “left of boom.” Prevention matters, but recovery readiness often gets overlooked.
- Testing exposes gaps and builds confidence. Without it, organizations default to hope.
- Resilience is an operational discipline. It requires iteration, communication, and continuous improvement.
The Problem With ‘It Should Work’
On paper, recovery looks straightforward. You define the point in time to recover to, what needs to come back, and where it should be restored. The process appears logical, structured, and manageable.
But as Jason points out, that simplicity rarely survives real-world conditions.
Plans are written in controlled environments, but they’re executed in chaos. When an incident hits, teams aren’t calmly stepping through documentation – they’re reacting, troubleshooting, and trying to align in real time. That’s where the gap emerges. Not between tools and technology, but between expectation and execution.
Sneak Peek: Why Plans Fail Under Pressure
In this moment from the conversation, Jason and I break down why having a plan isn’t enough – and what it actually takes to know a plan will work when it matters.
We’ve Seen This Before
What’s interesting is that this isn’t a new problem; it’s a familiar one, just in a different context.
If you go back to the early days of disaster recovery, organizations followed a similar pattern. Plans existed, but testing was inconsistent at best. Jason shared an example of spending an entire night helping a client pass a disaster recovery test they thought they were ready for. The plan looked solid. The execution told a different story.
Over time, organizations adapted. They tested more frequently, introduced failover exercises, and in some cases even ran production from secondary environments to prove readiness. That shift from assumption to validation is exactly what cyber resilience now requires.
The First Breakdown: Communication
If there’s one issue that consistently surfaces, it’s communication.
In many organizations, responsibilities are clearly defined – security handles prevention, infrastructure manages systems, and operations owns recovery. Individually, each team may be doing exactly what they’re supposed to do.
But recovery doesn’t happen in isolation. It depends on how well those teams work together when something goes wrong.
As Jason describes, too often it becomes a handoff model: “We’ve done our part, now it’s someone else’s turn.” That approach introduces delays, confusion, and ultimately risk. During a cyber event, coordination matters more than ownership.
The ‘Left of Boom’ Problem
Another pattern we continue to see is the imbalance in where organizations focus their efforts.
There’s significant investment in prevention – security tools, detection platforms, and defensive strategies designed to stop an attack before it happens. That investment is necessary, and it plays a critical role.
But far less attention is given to what happens after the event.
The assumption is that if enough effort is spent on prevention, recovery becomes a secondary concern. In reality, the opposite is true. At some point, something gets through. And when it does, recovery becomes the defining factor in how an organization responds.
From Hope to Evidence
This is where the mindset needs to shift.
It’s not about adding more tools or rewriting documentation. It’s about moving from a model based on hope to one grounded in evidence.
Jason highlights a key observation: The organizations that handle disruption well aren’t the ones that avoid incidents – they’re the ones that experience less impact when those incidents occur. They’ve tested their processes. They’ve validated their assumptions. They understand where their gaps are.
Most importantly, they’ve built confidence – not by believing the plan will work, but by proving it.
Start Small, Build Momentum
For many teams, the challenge isn’t understanding the problem – it’s knowing where to begin.
The answer isn’t to overhaul everything at once. It’s to start small and build from there.
Focus on one or two critical services. Understand what’s required to recover them. Bring together the teams responsible for those systems and test the process end-to-end. From there, expand the scope and continue refining.
This approach does more than improve recovery – it builds alignment, reinforces communication, and creates the foundation for broader resilience.
The Reality: No Plan Survives First Contact
One of the most honest moments in our discussion was this: Even the best plan won’t work exactly as written.
That’s not a failure – it’s expected.
Jason puts it simply: If you don’t have a plan, you will fail. But even if you do have one, it won’t unfold perfectly in the moment.
What matters is how prepared your teams are to adapt. Testing creates that adaptability. It builds the muscle memory needed to respond effectively when conditions don’t match expectations.
Watch the Full Episode
We cover much more in this STRIVE conversation, including:
- Why recovery plans often fail despite being well-documented.
- What differentiates organizations that recover effectively.
- How communication gaps impact execution.
- Where to start when improving recovery readiness.
- Why testing is the foundation of resilience.
If you’ve ever questioned whether your recovery plan would actually work, this is a conversation worth your time.
FAQs
Q: Why isn’t having a recovery plan enough?
A: Because most plans are never validated under real-world conditions. Without testing, they remain assumptions rather than proven strategies.
Q: What causes recovery plans to fail?
A: The most common issues in recovery plans are lack of testing, poor cross-team communication, and gaps between documented processes and real execution.
Q: What does “left of boom” mean?
A: “Left of boom” refers to the period before an incident occurs, when the focus is on prevention. Many organizations invest heavily here but underinvest in recovery capabilities.
Q: How often should recovery plans be tested?
A: Recovery plans should be tested regularly and under varied conditions. Testing should simulate realistic scenarios, not just controlled exercises.
Q: Where should organizations start?
A: Start with a small set of critical services, align the responsible teams, and test recovery end-to-end before expanding.
Q: What is the key mindset shift?
A: Moving from hope-based planning to evidence-based validation.
Chris Mierzwa is Senior Director, Portfolio Marketing, at Commvault.