When Prevention Can't Keep Up: The New Math of Cyber Recovery

Key Takeaways

Frontier AI is collapsing vulnerability remediation windows, so prevention alone can no longer guarantee security.
The question that boards, regulators, and insurers are now asking is not “Do we have backups?” but “Can we prove we can recover cleanly?”
Backups are not recovery: a copy tells you data exists, not whether it is clean or restorable.
Mean Time to Clean Recovery (MTCR) must become a board-level, continuously measured number – not a theoretical estimate.
An Isolated Recovery Environment – air-gapped, immutable, hardened, and identity-isolated – is the baseline, not an advanced capability.
What counts as “clean” will keep changing as AI models grow more capable of finding compromises humans cannot anticipate.

I have spent a large part of my career running production systems. I know backup environments from the inside, the ones customers actually trust. I know recovery plans as the things that only reveal their weaknesses when something has already gone wrong. That experience changes how you think about cyber resilience.

From a distance, backup and recovery sounds manageable. Protect the data, store copies, document the runbook, test when you can, restore when you need to. But anyone who has run these environments at scale knows the harder truth: Recovery is where assumptions go to be tested. And right now, too many organizations are operating on assumptions that no longer fit.

For years, security operated on a familiar sequence: Find the vulnerability, patch it, harden the environment, monitor for activity. That model still matters. But the window it depends on is collapsing.

Frontier AI has changed the velocity of vulnerability discovery, attack path chaining, and exploit generation. Models like Claude Mythos and GPT-5.5-Cyber have already demonstrated what this looks like. So far that has been in controlled, early-access testing that still relied on human expertise and carried meaningful false-positive rates, but the trajectory is unmistakable. As access widens, the same capability moves into attackers’ hands.

In a single month, Palo Alto Networks disclosed 26 CVEs, representing 75 underlying issues, after adopting frontier AI models for code scanning, compared with its typical volume of fewer than five CVEs per month.

Researchers are also warning that AI-assisted discovery is collapsing remediation windows, with some exploits now emerging within minutes of disclosure. When the patch window disappears, the remediation math stops working. Prevention cannot carry the full weight of readiness.

Prevention still matters, but it no longer defines readiness. The customers I talk to are not asking whether they need more controls. They already know they do. They are asking whether their business can recover cleanly when those controls fail, when attackers move faster than remediation cycles, or when compromise has been present longer than anyone realized.

That question is now what boards, regulators, and insurers are forcing. They have moved past “Do we have backups?” and toward something more consequential: “Can we prove we can recover cleanly?”

That proof starts with one distinction most organizations still get wrong: Backups are not recovery.

A backup tells you a copy exists. It does not tell you whether the data is clean, whether application dependencies are intact, whether identity services can be safely restored, or whether the recovery sequence still reflects the current environment.

I have reviewed plans that looked complete until someone tried to execute them. The runbook was there, but outdated. The restore worked but took three times longer than the estimate. The system came back, but downstream applications could not connect. None of that is unusual. It is exactly what real testing is supposed to surface. The problem is that most organizations discover these gaps during an actual incident.

The metric that matters most when something goes wrong is how quickly you can return to a known-good state. That is why Mean Time to Clean Recovery (MTCR) needs to become a board-level number: not a theoretical estimate in a plan, but a measured, validated time.

The Moving Target: What’s Clean Today May Not Be Clean Tomorrow

With Frontier AI models, the honest answer is this: you cannot guarantee that every vulnerability will be found and remediated in time. Attackers leveraging the same models are discovering and chaining exploits faster than any remediation program can realistically keep pace with. That is not a failure of your security team. It is the new physics of the threat landscape.

What you can control is your ability to recover. That means an Isolated Recovery Environment (IRE) – backups air-gapped from the internet, unreachable from the production network, and protected from the lateral movement that defines a sophisticated breach. It means immutability and compliance lock, so no credential, however privileged, can shorten retention or delete data outside an authorized process. And it means ResOps in practice: not just backing up data, but continuously testing recovery, automating integrity validation, and measuring your MTCR – the validated time to return to a known-good state.

But here is the part most organizations are not yet accounting for: what counts as “clean” is not a fixed line. As AI models grow more capable, they will increasingly find vulnerabilities that the human mind simply cannot anticipate: novel attack paths, dormant implants, and subtle corruptions embedded long before detection. A recovery point that is clean by today’s standards may carry compromise that tomorrow’s AI-assisted forensics will surface. That means your definition of clean must evolve continuously. MTCR is not a number you set once. It is a discipline you maintain: revisiting what clean means, updating your validation criteria, and treating resilience as a living standard rather than a certification you pass once.

So what is a good MTCR? Based on what I have seen work in practice, the target for your entire minimum viable company (MVC) – the smallest set of systems that lets you keep operating, which I define precisely below – should be under six hours. Six hours is achievable with the right architecture: an IRE ready to run, a pre-validated recovery sequence, and runbooks that are executable rather than readable. If your current MTCR is measured in days, the gap is almost always one of those three.
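One way to keep MTCR a measured number rather than an estimate is to compute it from recovery drill records. The sketch below is illustrative only: the `RecoveryDrill` record, its field names, and the sample timestamps are assumptions, not output from any particular backup platform. Drills that never reached a verified-clean state are left out of the mean and counted separately, so failures stay visible instead of vanishing from the average.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Target from the text: back to a known-good state in under six hours.
MTCR_TARGET = timedelta(hours=6)


@dataclass(frozen=True)
class RecoveryDrill:
    """One recovery test (hypothetical record layout)."""
    started: datetime
    verified_clean: Optional[datetime]  # None if the drill never validated as clean


def mean_time_to_clean_recovery(drills):
    """Return (mean duration of successful drills or None, number of failed drills)."""
    durations = [
        d.verified_clean - d.started
        for d in drills
        if d.verified_clean is not None
    ]
    failed = len(drills) - len(durations)
    if not durations:
        return None, failed
    return sum(durations, timedelta()) / len(durations), failed


# Sample data for illustration only.
drills = [
    RecoveryDrill(datetime(2025, 1, 6, 8, 0), datetime(2025, 1, 6, 12, 0)),
    RecoveryDrill(datetime(2025, 1, 7, 8, 0), datetime(2025, 1, 7, 14, 0)),
    RecoveryDrill(datetime(2025, 1, 8, 8, 0), None),  # never reached a clean state
]
mtcr, failed = mean_time_to_clean_recovery(drills)
within_target = mtcr is not None and mtcr < MTCR_TARGET
print(f"MTCR: {mtcr}, failed drills: {failed}, within target: {within_target}")
```

Reporting the failed-drill count next to the mean is a deliberate choice: a low average built only from the drills that worked would overstate readiness.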

Four Steps to Stay Resilient in the Frontier AI Era

Accepting that prevention alone is not enough is the starting point. From there, the work gets specific. Here is where I tell organizations to focus.

1. Evaluate your actual recovery risks.

Most recovery risk assessments ask the wrong questions. “Do backups exist?” is not the same as “Can we recover cleanly?” The harder questions are: Can critical systems be restored without reintroducing the threat? Are recovery environments isolated from compromised production systems? Are recovery plans mapped to current dependencies – not the architecture from two years ago?

In a fast-moving vulnerability environment, the gap between “we have backups” and “we can recover” is where organizations get hurt. Assessing that gap honestly, before an incident forces the issue, is where resilience planning must start. That assessment needs to include a business impact analysis: which systems have a recovery window measured in minutes, which in hours, and which can wait a day. Without that tiering, every system looks equally urgent during an incident, and nothing gets restored fast enough.
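The tiering described above can be captured in a few lines. This is a minimal sketch: the tier boundaries, system names, and recovery windows are hypothetical, and a real business impact analysis would set its own.

```python
from datetime import timedelta

# Hypothetical tier boundaries (upper bounds, exclusive).
TIERS = [
    ("minutes", timedelta(hours=1)),
    ("hours", timedelta(hours=24)),
]


def tier_for(window: timedelta) -> str:
    """Map a system's tolerable recovery window to a recovery tier."""
    for name, upper_bound in TIERS:
        if window < upper_bound:
            return name
    return "day"  # anything that can wait a day or longer


# Illustrative systems and windows, not taken from any real environment.
windows = {
    "identity": timedelta(minutes=30),
    "billing": timedelta(hours=4),
    "reporting": timedelta(hours=24),
}
tiers = {system: tier_for(window) for system, window in windows.items()}
print(tiers)
```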

2. Make isolated recovery and air gapping the baseline – not the exception.

If you are still treating air-gapped, immutable copies as an advanced capability rather than a standard requirement, that assumption no longer holds. When exploitation timelines compress to minutes, you need fallback options that are structurally separated from production identity, network, and management planes – logically or physically isolated, immutable, and with no live path back to production that an attacker can follow.

The goal is not just protection from the current threat, but maintaining clean recovery options when a vulnerability you have not patched yet gets exploited. That happens now. Plan for it.

Isolation only holds if the infrastructure around it is hardened. That means backup infrastructure on hardened operating systems, not generic images, and ideally on physical servers that survive a hypervisor-layer attack. It means encryption keys stored outside the backup platform, in an external vault with just-in-time access and no dependency on production Active Directory. And it means treating your backup domain as a separate identity boundary: no trust to production AD, mandatory MFA, and multi-person authorization for destructive operations. None of this is exotic. It is the baseline for your environment to recover into an uncompromised space.

Equally important is the question of what you are recovering from. Industry incident-response data consistently puts median breach dwell time in the range of weeks, not days. That means your recovery copies need to reach back far enough to find a genuinely clean point, not just yesterday’s backup. Critical systems warrant multiple geographically separated copies, including at least one immutable copy and one that is fully offline. Retention policy is not a storage cost decision. It is a security decision.

3. Know which systems the business cannot operate without – and recover those first.

Most organizations discover their recovery sequence during an incident. That’s why the first 24–48 hours aren’t spent restoring systems; they’re spent deciding what matters.

Organizations know they have to recover identity platforms, billing systems, operational databases, and core infrastructure. What they often have not mapped is the order, the dependencies between those systems, and the downstream applications that cannot function until specific services are back.

This gets more complex as AI becomes embedded in business operations. Data pipelines, model repositories, vector databases, agentic workflows – these are now operational dependencies, not just technical infrastructure. If your recovery sequencing does not account for them, your recovery time estimates are probably wrong.
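Recovery order follows mechanically from a dependency map once that map exists. The sketch below uses Python's standard-library `graphlib` to group systems into waves, where everything in a wave can be restored in parallel once the previous wave is back. The system names and dependencies are hypothetical; the point is that AI components such as a vector database sit in the same graph as everything else.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each system lists what must be restored first.
depends_on = {
    "identity": [],
    "key-vault": [],
    "operational-db": ["identity", "key-vault"],
    "vector-db": ["operational-db"],
    "billing": ["operational-db"],
    "agent-workflows": ["vector-db", "billing"],
}


def recovery_waves(graph):
    """Group systems into ordered waves; members of one wave have no mutual dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(recovery_waves(depends_on), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

A circular dependency makes `prepare()` raise, which is useful in itself: it exposes a sequence that cannot be executed as written.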

Defining what it means to operate as a minimum viable company (the smallest set of systems required to keep the business running) and building recovery around that definition is not a theoretical exercise. It is the practical answer to the question every executive team will ask during an incident: What do we bring back first?

In my experience helping customers through active incidents, the first 12 hours answer that question whether you have planned for it or not – what gets recovered in that window becomes your MVC by default. The organizations that come through fastest decided in advance: they knew exactly which systems had to be back within 12 hours and had validated they could do it. If your MVC does not fit in 12 hours, it is not your MVC; it is a wish list. The work is to keep trimming until what remains can realistically be restored in that window, then test it until you can prove it.
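Whether a candidate MVC fits the window is arithmetic once restore times and dependencies are known: a system cannot finish earlier than its own restore time plus the latest finish among its prerequisites. The sketch below is a simplified model, not a sizing tool. The systems and restore times are hypothetical, and it assumes independent restores can run fully in parallel, which real infrastructure may not allow.

```python
from graphlib import TopologicalSorter

MVC_WINDOW_HOURS = 12  # the window named in the text

# Hypothetical systems: (measured restore hours, prerequisites).
mvc = {
    "identity": (2.0, []),
    "operational-db": (5.0, ["identity"]),
    "billing": (3.0, ["operational-db"]),
    "reporting": (4.0, ["billing"]),
}


def finish_times(systems):
    """Earliest finish hour per system, assuming independent restores run in parallel."""
    graph = {name: deps for name, (_, deps) in systems.items()}
    finish = {}
    for name in TopologicalSorter(graph).static_order():
        hours, deps = systems[name]
        finish[name] = hours + max((finish[d] for d in deps), default=0.0)
    return finish


finish = finish_times(mvc)
too_late = sorted(name for name, hour in finish.items() if hour > MVC_WINDOW_HOURS)
print("finish hours:", finish)
print("does not fit the window:", too_late)
```

In this made-up example, `reporting` lands outside the window, so it is the candidate to trim from the MVC or to speed up.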

4. Automate resilience and test continuously – not on a calendar schedule.

A recovery plan that lives in a document and is reviewed annually is not a recovery capability. It is a hypothesis that has never been tested against reality.

The problem with calendar-based testing is what it misses between cycles. Environments change constantly: new workloads, updated dependencies, infrastructure that has drifted from what the runbook describes. By the time the annual test runs, it is validating a snapshot of an environment that no longer exists. In a threat landscape where exploitation can happen within minutes of disclosure, that lag is not acceptable.

Threat scanning, clean recovery point identification, dependency-aware restoration, and recovery orchestration all need to be automated and running continuously. Not because automation is a best practice, but because the manual alternative cannot keep pace with how fast things now move.

Continuous testing also depends on continuous detection. You cannot select a clean recovery point if you do not know when the compromise began. That is why threat detection, anomaly scanning of backup data, and recovery-point analysis have to feed each other: detection tells you which copies predate the intrusion, and that determination drives which point you actually recover from. Without that link, you are restoring to a date you hope is clean rather than one you have verified, and in a Frontier AI threat landscape, hope is not a recovery strategy.
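The link between detection and recovery-point selection can be made explicit. In this sketch, a copy qualifies only if it predates the estimated start of the compromise and also passed its own scan. The `RecoveryPoint` record and its fields are assumptions for illustration, and the estimated start time is itself an input that forensics may later revise.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class RecoveryPoint:
    """A backup copy plus its scan verdict (hypothetical record layout)."""
    taken_at: datetime
    scan_passed: bool  # result of threat and anomaly scanning on the copy


def latest_clean_point(points, compromise_began: datetime) -> Optional[RecoveryPoint]:
    """Newest copy that predates the compromise and passed scanning, else None."""
    candidates = [
        p for p in points
        if p.taken_at < compromise_began and p.scan_passed
    ]
    return max(candidates, key=lambda p: p.taken_at, default=None)


# Sample data for illustration only.
points = [
    RecoveryPoint(datetime(2025, 3, 1), True),
    RecoveryPoint(datetime(2025, 3, 8), True),
    RecoveryPoint(datetime(2025, 3, 15), False),  # predates the estimate but failed its scan
    RecoveryPoint(datetime(2025, 3, 22), True),   # taken after the compromise began
]
compromise_began = datetime(2025, 3, 18)
chosen = latest_clean_point(points, compromise_began)
print(chosen)
```

Returning `None` when nothing qualifies is intentional: it signals that retention does not reach back far enough, rather than silently falling back to the newest copy.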

What continuous testing surfaces is different from what annual tests find. Calendar tests tend to confirm the plan works under controlled conditions. Continuous testing finds the dependency that changed last month, the recovery sequence that breaks when a specific workload is added, the identity service that takes twice as long to restore as the estimate assumed.

Those are the gaps that matter during a real event, and the only way to find them before an incident does is to be testing all the time.

Testing also needs to happen in the right environment. A recovery test that runs against production infrastructure does not tell you whether you can recover when production is compromised. Cleanroom testing – validating restoration in a fully isolated environment with no connectivity back to production – is how you confirm your backup copies are genuinely usable under incident conditions. That includes recovering identity services, external key management, and Tier 0 applications in isolation, with dedicated break-glass accounts that exist outside your normal directory.

What makes daily testing viable is validate restore, a recovery type that exercises the full restore path for every critical asset without touching production. Your backup platform needs to support this natively; if it cannot run an automated, non-disruptive recoverability test across your MVC every day, you do not actually know whether your backups work. In Commvault, this restores against your critical-asset groups, with automated reporting on the recovery status of every protected system.

The same applies to your runbooks. A runbook that lives in a Word document or PDF is a reference manual, not an operational tool – it assumes someone has the time, clarity, and access to read it under pressure. Real runbooks are digital scripts that execute the recovery sequence and validate each step, confirming the application actually works before moving on: not “the service started” but “the application responded correctly to a synthetic transaction.” Commvault’s Cleanroom Runbooks are built for this – executable workflows that drive an end-to-end recovery in an isolated environment without a human interpreting a document at every step.
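The difference between a readable and an executable runbook is easiest to see in code. The sketch below is a generic illustration and does not represent Commvault's Cleanroom Runbooks or any vendor API: every name in it is hypothetical, and the restore and validation functions are stubs standing in for real restore jobs and synthetic transactions. Each step restores, then validates, and the run halts at the first step whose validation fails.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One runbook step: restore something, then prove it works."""
    name: str
    restore: Callable[[], None]
    validate: Callable[[], bool]  # e.g. a synthetic transaction, not "service started"


def run_runbook(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order; return (validated step names, name of failing step or None)."""
    completed = []
    for step in steps:
        step.restore()
        if not step.validate():
            return completed, step.name  # stop before building on a broken layer
        completed.append(step.name)
    return completed, None


# Stubs standing in for real restore jobs and application checks.
state = {"db": False, "app": False}

def restore_db() -> None:
    state["db"] = True

def restore_app() -> None:
    state["app"] = True

def db_answers_query() -> bool:
    return state["db"]

def app_completes_synthetic_order() -> bool:
    return state["db"] and state["app"]

steps = [
    Step("operational-db", restore_db, db_answers_query),
    Step("billing-app", restore_app, app_completes_synthetic_order),
]
completed, failed_at = run_runbook(steps)
print("validated:", completed, "failed at:", failed_at)
```

Halting on the first failed validation keeps later steps from being restored on top of a dependency that is not actually working.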

One final point that rarely makes it into recovery plans until it is too late: during a serious incident, your corporate communications infrastructure may itself be compromised or unavailable. Email, Teams, and Slack run on the same infrastructure attackers target. Know in advance which out-of-band channels your team will use to coordinate, and make sure those channels are tested alongside your technical recovery procedures.

Resilience Is an Operating Discipline, Not a Project

The organizations that will hold up under frontier AI-accelerated threats are the ones that treat resilience as an operating discipline: measured MTCR, continuous validation, and a recovery capability they have proven, not assumed.

The problem isn’t that attacks are getting faster. It’s that recovery hasn’t caught up, and until it does, the math doesn’t work.

FAQs

Q: What is Mean Time to Clean Recovery (MTCR) and why does it matter?
A: MTCR measures how quickly an organization can return to a verified, known-good state after a cyberattack – not just restore data, but confirm it is clean and that application dependencies are intact. It should be a board-level metric with a measured, validated time, not a theoretical estimate buried in a recovery plan. The target for a well-architected MVC – covering all identity systems, critical applications, and isolated environment readiness – is under six hours.

Q: What is an Isolated Recovery Environment and how is it different from a standard backup?
A: An Isolated Recovery Environment is a fully air-gapped, immutable copy of critical data that is structurally separated from production networks, identity systems, and management planes. A standard backup tells you a copy exists. An IRE tells you that copy is protected from the same attack that hit your production environment.

Q: How do we know whether we can actually recover today?
A: The only honest answer comes from testing, not documentation. If you cannot point to a recent, validated recovery of your minimum viable company – ideally a daily automated test – then you do not know; you are assuming. A defensible answer to the board is a measured MTCR backed by continuous validation, not a recovery plan that looks complete on paper.

Q: What do regulators and cyber insurers now expect?
A: The bar has moved from “Do you have backups?” to “Can you prove you can recover cleanly, and how fast?” Regulators increasingly expect demonstrable recovery capability and tested resilience; insurers increasingly price coverage – and pay claims – based on evidence of isolated, immutable backups and validated recovery times. A measured MTCR and a documented testing cadence are becoming table stakes for both.

Rajiv Kottomtharayil is Chief Product Officer at Commvault.
