RC RANDOM CHAOS

Z3R0DAY: splitting IR and BC teams is wrong

A senior operator's position on ransomware: identity boundary collapse, backup drift, and why incident response and business continuity are one discipline.


1. Opening Position

Ransomware is not a malware problem. It is a business continuity problem with a malware trigger. Organisations that treat it as the former lose. Organisations that treat it as the latter survive with degraded operations and recover. The distinction is not academic. It determines whether the decision on day one is about decryption keys or about which systems come back online in what order, under what authentication, with which data set as the source of truth.

In the engagements I have run, the failure mode is consistent. Leadership treats the incident as a technical event handled by the security team. The security team treats it as a containment event. No one is running the business continuity decision tree because that tree was either never built, never tested, or was built against a different threat model. The result is that the first 72 hours are spent reconstructing what should have already existed: an inventory of critical processes, the systems they depend on, the recovery time each can tolerate, and the order of restoration that does not reintroduce the threat actor into the rebuilt environment.
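The inventory described above (critical systems, their dependencies, the recovery time each can tolerate, and a safe restoration order) can be held as a dependency graph. The following is a minimal sketch, assuming a hypothetical inventory in which each system lists what it depends on and an RTO in hours. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) so that no system is restored before its dependencies, and among systems that are ready it picks the tightest RTO first. The system names and RTO values are illustrative only.

```python
import heapq
from graphlib import TopologicalSorter

# Hypothetical inventory: system -> (systems it depends on, RTO in hours).
# The recovery identity provider comes first so nothing is rebuilt on the suspect plane.
INVENTORY = {
    "recovery-idp": (set(), 4),
    "dns": ({"recovery-idp"}, 4),
    "erp-db": ({"dns", "recovery-idp"}, 12),
    "erp-app": ({"erp-db"}, 12),
    "email": ({"dns", "recovery-idp"}, 24),
}

def restoration_order(inventory):
    """Dependency-respecting restore order; among ready systems, tightest RTO first."""
    sorter = TopologicalSorter({name: deps for name, (deps, _) in inventory.items()})
    sorter.prepare()  # raises graphlib.CycleError if dependencies are circular
    ready, order = [], []
    while sorter.is_active():
        # Queue every system whose dependencies have all been restored.
        for name in sorter.get_ready():
            heapq.heappush(ready, (inventory[name][1], name))
        _, name = heapq.heappop(ready)
        order.append(name)
        sorter.done(name)  # may unlock further systems on the next pass
    return order

print(restoration_order(INVENTORY))
# ['recovery-idp', 'dns', 'erp-db', 'erp-app', 'email']
```

Restoring one system at a time, rather than in batches, lets a newly unblocked system with a tight RTO (here `erp-app`) jump ahead of a looser one that was ready earlier.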

The operator position is that incident response and business continuity are not adjacent disciplines. For ransomware specifically, they are the same discipline executed under time pressure with degraded infrastructure and adversarial conditions. If your IR plan and your BC plan are owned by different functions, written in different formats, and tested on different cycles, they will not function together when needed. The attacker does not respect your org chart. The response cannot either.

2. What Actually Failed

The pattern across ransomware engagements is not failure of detection. By the time the encryption event is visible, detection is no longer the relevant control. What failed earlier is the boundary that allowed lateral movement to reach systems holding recovery state. What fails during the response is the assumption that those recovery systems are intact, isolated, and authenticated against a trust domain the attacker does not also hold.

Backups fail in a specific and observable way. They exist. They are listed in the asset register. They are reported as healthy by the backup product. When they are called on, one of three conditions is present: the backup catalog is encrypted, the backup storage is reachable from the production domain and was therefore in scope of the same identity compromise, or the restore has never been executed end to end against a real recovery time objective. The control was documented. The control was not enforced. A backup that has not been restored is a file, not a recovery capability.
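The three conditions can be checked as evidence rather than taken from the backup product's health report. This is a minimal sketch, assuming a hypothetical `BackupTarget` record whose fields an organisation would populate from its own estate. The first condition (an encrypted catalog) is approximated by asking whether a catalog copy exists outside production reach, and the 90-day restore-test window is an illustrative assumption, not a standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class BackupTarget:
    # Hypothetical fields; none of these come from a specific backup product.
    name: str
    catalog_copy_outside_production: bool  # can the catalog be rebuilt if the live one is encrypted?
    reachable_from_production: bool        # can production-domain credentials reach the storage?
    last_full_restore: Optional[datetime]  # last end-to-end restore test, if any
    last_restore_met_rto: bool             # did that test finish inside the RTO?

def failure_conditions(t, now, max_test_age=timedelta(days=90)):
    """Reasons this backup is a file, not a recovery capability. Empty list means none found."""
    reasons = []
    if not t.catalog_copy_outside_production:
        reasons.append("catalog has no copy outside production reach")
    if t.reachable_from_production:
        reasons.append("storage in scope of production identity compromise")
    if (t.last_full_restore is None
            or now - t.last_full_restore > max_test_age
            or not t.last_restore_met_rto):
        reasons.append("no recent end-to-end restore within RTO")
    return reasons

target = BackupTarget("file-share", catalog_copy_outside_production=False,
                      reachable_from_production=True, last_full_restore=None,
                      last_restore_met_rto=False)
print(failure_conditions(target, datetime(2024, 6, 1)))
```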

Communication paths fail in parallel. The IR runbook references a collaboration platform that is part of the affected environment. The contact list for executive leadership is stored in the directory service that is offline. The conference bridge is provisioned through an SSO tenant that cannot issue tokens. The recovery effort then spends hours rebuilding the means to coordinate before any actual restoration work can begin. This is not a tooling failure. It is a dependency mapping failure. The response plan was designed on the assumption that the response infrastructure would be available. In a ransomware scenario, that assumption is unsupported by definition.
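A dependency map for the response tooling itself makes the failure visible before the incident. This is a minimal sketch, assuming a hypothetical map of what each response component needs in order to work; it walks the map and reports every component that transitively depends on something the scenario takes offline.

```python
from collections import deque

# Hypothetical dependency map for response tooling: component -> what it needs to work.
DEPENDS_ON = {
    "ir-runbook-wiki": ["corp-sso"],
    "exec-contact-list": ["corp-directory"],
    "conference-bridge": ["corp-sso"],
    "corp-sso": ["corp-directory"],
    "out-of-band-phone-tree": [],
    "corp-directory": [],
}

def lost_in_scenario(component, offline):
    """True if the component depends, directly or transitively, on anything offline."""
    seen, queue = set(), deque([component])
    while queue:
        node = queue.popleft()
        if node in offline:
            return True
        for dep in DEPENDS_ON.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return False

offline = {"corp-directory"}  # ransomware scenario: the directory service is unavailable
for tool in ("ir-runbook-wiki", "exec-contact-list", "conference-bridge", "out-of-band-phone-tree"):
    print(tool, "LOST" if lost_in_scenario(tool, offline) else "available")
```

In this illustrative map, only the out-of-band phone tree survives, because it is the only component with no path back to the directory.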

Decision authority also fails. The plan names an incident commander. It does not specify the threshold at which that commander can authorise a system rebuild without further approval, the criteria for declaring a system unrecoverable, or the conditions under which negotiation with the threat actor is or is not on the table. In the absence of pre-authorised decision boundaries, every decision escalates. Escalation under time pressure produces either paralysis or unilateral action by whoever is in the room. Both outcomes degrade the response.
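Pre-authorised decision boundaries can be written down precisely enough that whether a decision is inside them is a lookup, not a debate. This is a minimal sketch, assuming a hypothetical `Authority` record agreed before the incident; the limits, tier numbers, and action names are illustrative. Anything the record does not cover escalates by default.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Authority:
    # Hypothetical limits for the incident commander, agreed before any incident.
    max_spend: int                # spend that can be committed without further approval
    may_rebuild_tiers: frozenset  # system tiers the commander may rebuild unilaterally
    negotiation_in_scope: bool    # decided in advance, not by whoever is in the room

def decide(authority, action, **params):
    """'execute' inside pre-authorised bounds, 'refuse' if pre-decided against, else 'escalate'."""
    if action == "spend":
        return "execute" if params["amount"] <= authority.max_spend else "escalate"
    if action == "rebuild":
        return "execute" if params["tier"] in authority.may_rebuild_tiers else "escalate"
    if action == "negotiate":
        return "execute" if authority.negotiation_in_scope else "refuse"
    return "escalate"  # anything not decided in advance escalates

commander = Authority(max_spend=250_000, may_rebuild_tiers=frozenset({1, 2}),
                      negotiation_in_scope=False)
print(decide(commander, "rebuild", tier=2))
```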

3. Why It Failed

The underlying mechanism is identity boundary collapse. Production systems, backup systems, monitoring systems, and recovery tooling are frequently authenticated against the same directory service, the same privileged access management tier, or the same federation root. When the attacker reaches that root, every system that trusts it is in scope. The backup product does not need to be exploited. It honours valid credentials. The hypervisor does not need to be exploited. It honours valid credentials. The recovery environment, if it exists on the same identity plane, is not a recovery environment. It is a second copy of the production environment with the same compromise.
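The scope of an identity compromise is a reachability question over trust, not over vulnerabilities. This is a minimal sketch, assuming a hypothetical map from each identity source to the systems that honour credentials it issues or holds; every edge is a valid credential, so no exploit appears anywhere in the traversal.

```python
# Hypothetical trust edges: identity source -> systems that honour its credentials.
TRUSTS = {
    "corp-ad": ["pam-vault", "hypervisor", "backup-server", "file-servers"],
    "pam-vault": ["recovery-env"],        # vault holds credentials for the recovery environment
    "backup-server": ["backup-repo"],
    "recovery-idp": ["recovery-env-isolated"],
}

def blast_radius(compromised_root):
    """Every system reachable through trust from the compromised identity root."""
    in_scope, stack = set(), [compromised_root]
    while stack:
        for system in TRUSTS.get(stack.pop(), []):
            if system not in in_scope:
                in_scope.add(system)
                stack.append(system)
    return in_scope

print(sorted(blast_radius("corp-ad")))
```

In this illustrative map, `recovery-env` is in scope through the vault, while `recovery-env-isolated`, which trusts only a separate identity provider, is not.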

This condition is not the result of negligence. It is the result of operational convenience accumulating over years. Single sign-on is deployed because the alternative is unmanageable. Backup systems are joined to the domain because agentless backup requires service accounts with broad rights. Recovery tooling is centralised because distributed tooling does not scale. Each decision is defensible in isolation. The aggregate produces a topology in which one identity compromise reaches every system the response will depend on. The control that should exist is a hard identity boundary between production and recovery, with separate authentication, separate administrators, and no transitive trust. Where this boundary is not enforced, recovery is not a designed capability. It is an improvisation under fire.
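The hard boundary described above has three testable properties: separate authentication, separate administrators, and no trust of the production root. This is a minimal sketch, assuming a hypothetical `Plane` record for each identity plane. It checks direct trust only; a complete check would walk the trust graph transitively.

```python
from dataclasses import dataclass, field

@dataclass
class Plane:
    # Hypothetical description of one identity plane.
    name: str
    auth_root: str                              # directory or IdP that issues credentials
    admins: set = field(default_factory=set)    # identities with administrative rights
    trusts: set = field(default_factory=set)    # other auth roots whose credentials it accepts

def boundary_violations(production, recovery):
    """Ways in which the recovery plane is not separated from production."""
    violations = []
    if recovery.auth_root == production.auth_root:
        violations.append("shared authentication root")
    shared = production.admins & recovery.admins
    if shared:
        violations.append(f"shared administrators: {sorted(shared)}")
    if production.auth_root in recovery.trusts:
        violations.append("recovery plane trusts production root")
    return violations

prod = Plane("production", "corp-ad", {"alice", "svc-backup"})
rec = Plane("recovery", "recovery-idp", {"carol"})
print(boundary_violations(prod, rec))  # an empty list means no violations were found
```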

The second mechanism is the gap between documented controls and enforced controls. A control that is written in a policy, configured in a tool, and never exercised against an adversarial test is not confirmed to function. Backup integrity checks that run against the same catalog the attacker can modify do not detect tampering. Network segmentation that is defined in firewall rules but bypassed by management protocols does not contain lateral movement. Immutable storage that is immutable only by configuration flag, where the flag is administrable from the production domain, is not immutable. In each case the control exists on paper. The enforcement point either does not exist or is reachable from the threat surface. The response then proceeds against an environment whose actual control state is unknown, which is operationally identical to having no controls at all.
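The distinction between a control on paper and an enforced control can be made explicit by asking who can switch the control off. This is a minimal sketch, assuming a hypothetical `Control` record and a set of identities assumed to be in the threat surface; the three labels are illustrative, not a standard taxonomy.

```python
from dataclasses import dataclass

@dataclass
class Control:
    # Hypothetical record pairing a control with its enforcement point.
    name: str
    enforcement_admins: set    # identities able to change or disable the control
    adversarially_tested: bool # exercised against a realistic attack, not just configured

def control_state(control, threat_surface):
    """'paper' if the threat surface can disable it, 'untested' if unproven, else 'enforced'."""
    if control.enforcement_admins & threat_surface:
        return "paper"     # the attacker can turn it off with valid credentials
    if not control.adversarially_tested:
        return "untested"  # may work; not confirmed to
    return "enforced"

threat_surface = {"domain-admins", "svc-backup"}  # identities assumed compromised
flag = Control("repository immutability flag", {"domain-admins"}, adversarially_tested=True)
print(control_state(flag, threat_surface))  # paper: administrable from the production domain
```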

4. Mechanism of Failure or Drift

The drift mechanism is the divergence between the environment as documented and the environment as authenticated. Asset registers, network diagrams, and identity catalogs describe a system that existed at the point of last review. Production drifts continuously. New service accounts are provisioned for integrations that were never reviewed. Backup agents are deployed with elevated rights to resolve a ticket. A domain trust is added during an acquisition and not removed. Each change is small and operationally justified. The aggregate produces an attack surface whose actual shape is not known to the function responsible for defending it. The plan references the documented state. The attacker operates against the actual state.
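The gap between the documented and the authenticated environment can be measured directly as a set difference. This is a minimal sketch, assuming two hypothetical inputs: service accounts listed in the asset register, and service accounts that actually authenticated during a review window (for example, exported from directory sign-in or logon logs).

```python
# Hypothetical inputs for illustration.
documented = {"svc-backup", "svc-monitoring", "svc-erp"}
authenticated = {"svc-backup", "svc-erp", "svc-crm-integration", "svc-acq-sync"}

def drift_report(documented, authenticated):
    """Compare what the plan knows about with what is actually in use."""
    return {
        # Active identities the plan does not describe: attack surface outside the plan.
        "undocumented_but_active": sorted(authenticated - documented),
        # Registered identities with no activity in the window: entries that may be stale.
        "documented_but_inactive": sorted(documented - authenticated),
    }

print(drift_report(documented, authenticated))
# {'undocumented_but_active': ['svc-acq-sync', 'svc-crm-integration'], 'documented_but_inactive': ['svc-monitoring']}
```

The first list is the one that matters for defence: those accounts are part of the actual state the attacker operates against.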

Drift also occurs inside the recovery plane specifically. A backup repository that was originally isolated becomes reachable when a monitoring agent is installed to satisfy a reporting requirement. An offline recovery site is brought online when a replication job is enabled to reduce restore time. A read-only restore account becomes read-write when an operator needs to test a procedure and the change is not reverted. Each modification erodes the boundary that the recovery capability depends on. The runbook continues to describe the original design. The system no longer matches it. When the runbook is executed against the drifted system, the steps either fail or, more often, succeed in a way that reintroduces the threat actor into the rebuilt environment, because the isolation the runbook assumed is gone.
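The runbook's isolation assumptions can be stated as expected values and compared against the live system on a schedule, rather than discovered during the incident. This is a minimal sketch, assuming hypothetical property names that mirror the three examples above; how each observed value is collected is left to the environment.

```python
# Hypothetical isolation properties the runbook relies on.
RUNBOOK_ASSUMES = {
    "backup_repo_agents": 0,             # no third-party agents on the repository
    "recovery_site_replication": False,  # recovery site stays offline
    "restore_account_mode": "read-only",
}

def isolation_drift(observed):
    """Each runbook assumption the live system no longer meets: key -> (expected, actual)."""
    return {
        key: (expected, observed.get(key))
        for key, expected in RUNBOOK_ASSUMES.items()
        if observed.get(key) != expected
    }

observed = {"backup_repo_agents": 1, "recovery_site_replication": True,
            "restore_account_mode": "read-only"}
for key, (expected, actual) in isolation_drift(observed).items():
    print(f"{key}: runbook assumes {expected!r}, system shows {actual!r}")
```

A property that is missing from the observed data is reported as drift, since an assumption that cannot be verified should not be treated as holding.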

The third drift surface is trust delegation. Initial deployments establish tiering: tier zero identities for the directory service, tier one for servers, tier two for endpoints. Over time, helpdesk roles acquire rights to reset privileged passwords. Automation accounts acquire rights to modify group policy. Backup service accounts acquire rights to read directory secrets to support bare-metal recovery. Each delegation is granted to solve a real operational problem. None is revoked when the problem is solved. The result is a privilege graph in which tier zero is reachable from tier two through a chain of authenticated hops, none of which requires exploitation. The attacker does not need a zero-day. The attacker needs a phished session and a path through the graph that the defender has not mapped.
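The chain of authenticated hops can be found with an ordinary breadth-first search over delegations. This is a minimal sketch, assuming a hypothetical graph in which an edge from A to B means that whoever holds A can reset, impersonate, or otherwise control B; the edges mirror the delegations described above.

```python
from collections import deque

# Hypothetical delegation edges: holder of A can control B.
CAN_CONTROL = {
    "phished-user": ["helpdesk-role"],            # tier two session
    "helpdesk-role": ["backup-svc"],              # can reset its password
    "backup-svc": ["directory-secrets"],          # granted for bare-metal recovery
    "automation-acct": ["group-policy"],
    "group-policy": ["tier0-domain-admin"],       # GPO applied to tier zero hosts
    "directory-secrets": ["tier0-domain-admin"],
}

def path_to_tier0(start, target="tier0-domain-admin"):
    """Shortest chain of authenticated hops from start to target, or None if unreachable."""
    parents, queue = {start: None}, deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back through the parents to rebuild the chain
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in CAN_CONTROL.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None

print(" -> ".join(path_to_tier0("phished-user")))
```

Every hop in the printed chain is a legitimate, granted right; this is the path the defender needs mapped before the attacker finds it.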

5. Expansion Into Parallel Patterns

The same mechanism appears in cloud tenants under different terminology. The directory service is replaced by the identity provider. The domain controller is replaced by the tenant root. The backup repository is replaced by the object storage bucket with versioning enabled. The pattern is identical. If the identity provider is compromised, every resource that federates against it is in scope. If the tenant root identity is held by the attacker, the bucket versioning is administrable and the protection is removable. Immutability that is administrable from the same identity plane as the data is not immutability. It is a configuration flag. A configuration flag administered by a compromised identity is not a control. It is a setting the attacker can change.
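On AWS S3, as one concrete case, the difference is visible in configuration. With versioning alone, object versions can be deleted by any identity holding delete-version rights. With Object Lock in governance mode, principals granted `s3:BypassGovernanceRetention` can override retention. With Object Lock in compliance mode, no user, including the account root user, can shorten or remove retention until it expires. This is a minimal sketch, assuming the lock configuration is a plain dict shaped like the `ObjectLockConfiguration` element returned by S3 `GetObjectLockConfiguration`; it considers only the bucket's default retention and ignores per-object retention settings. The function and its labels are hypothetical.

```python
def immutability_verdict(versioning_enabled, object_lock_config, bypass_holders, compromised):
    """Classify a bucket's protection against an attacker holding the `compromised` identities.

    object_lock_config: dict shaped like S3's ObjectLockConfiguration, or None if not enabled.
    bypass_holders: identities granted s3:BypassGovernanceRetention (assumed known).
    """
    retention = (object_lock_config or {}).get("Rule", {}).get("DefaultRetention", {})
    mode = retention.get("Mode")
    if mode == "COMPLIANCE":
        return "immutable for the retention period"
    if mode == "GOVERNANCE":
        if bypass_holders & compromised:
            return "removable: a compromised identity can bypass governance retention"
        return "holds unless a bypass-capable identity is compromised"
    if versioning_enabled:
        return "removable: versions are deletable with delete-version rights"
    return "no protection"

governance = {"ObjectLockEnabled": "Enabled",
              "Rule": {"DefaultRetention": {"Mode": "GOVERNANCE", "Days": 30}}}
print(immutability_verdict(True, governance, {"backup-admin"}, {"backup-admin"}))
```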

The pattern extends to SaaS dependencies that are not classified as infrastructure. The collaboration platform, the ticketing system, the secrets manager, the source repository, and the CI pipeline are typically federated against the same identity provider as the production environment. When that provider is compromised, the response loses access to its own tooling at the moment it needs it most. The IR team cannot read the runbook because the wiki requires SSO. The engineering team cannot rebuild from source because the repository requires SSO. The bootstrap secrets are held in a vault that requires SSO. The federation that simplified operations becomes the single point of failure for the response. The control plane for recovery and the control plane for production are the same plane. There is no second plane to fall back to.

The same shape appears in operational technology environments. Industrial control systems are bridged to corporate networks through jump hosts, historians, and remote access gateways. Each bridge is justified by an operational need. Each bridge is a path from the corporate identity domain into the OT environment. When ransomware reaches the corporate domain, the OT environment is reachable through paths designed for convenience, not isolation. The control assumed to exist, an enforced separation between business systems and process systems, was replaced over years by managed connections that were never reclassified as part of the trust boundary. The diagram still shows separation. The packets do not respect the diagram.
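The gap between the diagram and the packets can be narrowed by listing every managed connection into the OT environment and checking each against the trust-boundary register. This is a minimal sketch, assuming a hypothetical `Bridge` record per connection; the zone names and bridge entries are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Bridge:
    # Hypothetical record of a managed connection between network zones.
    name: str
    from_zone: str
    to_zone: str
    in_trust_boundary_register: bool  # reviewed and classified as part of the IT/OT boundary?

BRIDGES = [
    Bridge("engineering-jump-host", "corporate", "ot", True),
    Bridge("historian-replication", "ot", "corporate", True),
    Bridge("vendor-remote-access", "corporate", "ot", False),
    Bridge("historian-query-api", "corporate", "ot", False),
]

def unclassified_paths_into_ot(bridges):
    """Connections that let the corporate domain reach OT but were never treated as boundary."""
    return [b.name for b in bridges
            if b.from_zone == "corporate" and b.to_zone == "ot"
            and not b.in_trust_boundary_register]

print(unclassified_paths_into_ot(BRIDGES))
```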

6. Hard Closing Truth

Recovery is a designed capability or it does not exist. There is no intermediate state. An environment in which backups are taken, stored, and not restored end to end against a real recovery time objective does not have a backup capability. It has a backup process. The two are not interchangeable. The test of a recovery capability is whether the business process the backup supports can be returned to operational state within the time the business has committed to, against the data set the business has committed to, under conditions in which the production environment is hostile and the identity plane is suspect. If that test has not been executed, the capability is not confirmed. Reported backup success is not evidence of recoverability. It is evidence that a job completed.
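The test described above has two parts: the business process must be back within the committed time, and the restored data must be recent enough. This is a minimal sketch, assuming the data commitment is expressed as a recovery point objective (RPO) and that the exercise records when the outage was declared, when the process was usable again, and the point in time of the restored data. The timestamps are hypothetical.

```python
from datetime import datetime, timedelta

def restore_test_passes(declared_at, service_restored_at, restored_data_as_of, rto, rpo):
    """Pass only if the process is back within RTO and the data is no older than RPO allows."""
    within_rto = service_restored_at - declared_at <= rto
    within_rpo = declared_at - restored_data_as_of <= rpo
    return within_rto and within_rpo

# Hypothetical exercise: outage declared 09:00, ERP usable again at 21:30,
# restored from a backup taken at 02:00 the same day.
declared = datetime(2024, 3, 4, 9, 0)
print(restore_test_passes(declared, datetime(2024, 3, 4, 21, 30), datetime(2024, 3, 4, 2, 0),
                          rto=timedelta(hours=12), rpo=timedelta(hours=24)))
# False
```

Here the backup job succeeded and the data was only seven hours old, but the process took twelve and a half hours to return against a twelve-hour RTO. The capability is not confirmed.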

Identity is the boundary. This is not rhetoric. It is the operational reality of every ransomware engagement. The systems the attacker reaches are the systems that trust the identities the attacker holds. The systems the attacker does not reach are the systems on a separate identity plane with no transitive trust into the compromised plane. If the recovery environment authenticates against the same root as the production environment, the recovery environment is not a recovery environment. It is a second copy of production with the same exposure. The remediation is not a tooling purchase. It is an architectural decision about where identity boundaries are drawn and how they are enforced. That decision is made before the incident or it is not made at all.

Incident response and business continuity for ransomware are exercises in pre-authorisation. Every decision that must be made under time pressure with degraded infrastructure should have been made before the response began. The threshold at which a system is declared unrecoverable. The conditions under which negotiation with the threat actor is or is not in scope. The order of restoration that does not reintroduce the compromise. The spend the incident commander is authorised to commit without further approval. The criteria for engaging external counsel, law enforcement, regulators, and insurers. Plans that defer these decisions to the moment of crisis fail at the moment of crisis. Plans that resolve them in advance reduce the response to execution. Execution under adversarial conditions is hard. It is achievable. Decision-making under adversarial conditions without prior authorisation is not.

Controls that are not enforced are not controls. Plans that are not exercised are not plans. If a system allows it, it will happen, and the response you have on day zero is the only response you have.
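The pre-authorisation list above can be turned into a check that a plan either passes or fails before it is needed. This is a minimal sketch, assuming a hypothetical plan stored as a dict keyed by decision name; the key names are illustrative labels for the five decisions the text lists.

```python
# The decisions a plan must resolve before an incident, per the list above.
REQUIRED_DECISIONS = (
    "unrecoverable_threshold",
    "negotiation_stance",
    "restoration_order",
    "commander_spend_limit",
    "external_engagement_criteria",  # counsel, law enforcement, regulators, insurers
)

def is_unresolved(value):
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip().upper() in ("", "TBD")
    if isinstance(value, (list, tuple, dict, set)):
        return len(value) == 0
    return False  # numbers count as decisions, including a spend limit of 0

def unresolved(plan):
    """Decisions the plan leaves to the moment of crisis."""
    return [d for d in REQUIRED_DECISIONS if is_unresolved(plan.get(d))]

plan = {
    "unrecoverable_threshold": "no clean restore point within 72 hours",
    "negotiation_stance": "TBD",
    "restoration_order": ["recovery-idp", "dns", "erp-db"],
    "commander_spend_limit": 0,
}
print(unresolved(plan))
```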
