Why Most Companies Fail at Incident Response
Most incident response plans are untested fantasies. Here's why companies fail at IR and the specific fixes that actually work.
Incident Response Is a Capability. Most Companies Have a Document.
IBM’s 2023 Cost of a Data Breach report puts the average time to identify and contain a breach at 277 days. Average cost: $4.45 million. Those numbers are not outliers. They are the industry baseline. They measure, directly, how incident response is failing at scale.
The cause is not a tooling gap. It is a structural one. IR is treated as a compliance artifact instead of an operational capability. The difference has a price.
The Plan That Has Not Been Tested
Most organizations have an incident response plan. It is typically a multi-page document that legal reviewed at some point. It references tools that may no longer be in use, names roles that may no longer exist, and describes a command structure that has not been validated under any pressure.
A plan that has not been tested in the last 90 days is not an operational plan. It is a description of what someone hoped the response would look like when the document was written.
Tabletop exercises where teams discuss what they “would probably do” test familiarity with the document. Real pressure testing means presenting a realistic scenario with no advance notice, observing what the team actually does, and measuring where communication breaks down, who does not know what they are responsible for, and which tools the team cannot operate correctly under time pressure.
Run that exercise quarterly. Record the gaps. Assign ownership of fixes. The incident will surface the same gaps. You are choosing when you encounter them.
Detection Volume Is Not Detection Capability
Organizations routinely invest in detection infrastructure - SIEMs, log aggregation, correlation rules - while staffing response teams that cannot work the alert volume those systems produce. The result is a queue that cannot be cleared. Real signals are present. They are buried under noise from rules that have not been tuned since deployment.
A rule that generates more than 10 alerts per day with no confirmed true positives is not a detection asset. It is noise. Remove it.
The constraint is not detection speed. It is investigation throughput. If a potential compromise is flagged in 10 minutes and the team takes 6 hours to determine its validity, detection speed is not the bottleneck. Alert volume must be tuned to match what the team can actually investigate. Measure Mean Time to Investigate alongside Mean Time to Detect. Alert volume that exceeds investigation capacity converts detection infrastructure into a liability - it buries the signals worth acting on.
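Here is a minimal sketch of that audit in Python. The `Alert` record, its field names, and the 420-minute analyst day are assumptions - substitute whatever your SIEM and ticketing system actually export. The 10-alerts-per-day threshold is the one from above.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Alert:
    rule_id: str
    investigate_minutes: float   # analyst time from pickup to verdict
    true_positive: bool

def audit_rules(alerts: list[Alert], days: int, volume_threshold: float = 10.0) -> None:
    """Flag rules whose daily volume exceeds the threshold with zero
    confirmed true positives, and report per-rule investigation load."""
    by_rule: dict[str, list[Alert]] = defaultdict(list)
    for a in alerts:
        by_rule[a.rule_id].append(a)
    for rule_id, hits in sorted(by_rule.items()):
        per_day = len(hits) / days
        true_positives = sum(a.true_positive for a in hits)
        mtti = sum(a.investigate_minutes for a in hits) / len(hits)
        noisy = per_day > volume_threshold and true_positives == 0
        verdict = "REMOVE OR REWRITE" if noisy else "keep"
        print(f"{rule_id}: {per_day:.1f}/day, {true_positives} TP, "
              f"MTTI {mtti:.0f} min -> {verdict}")

def daily_capacity_gap(alerts: list[Alert], days: int, analysts: int,
                       minutes_per_analyst_day: float = 420.0) -> float:
    """Positive result means the queue grows every day, regardless of MTTD."""
    demand = sum(a.investigate_minutes for a in alerts) / days
    return demand - analysts * minutes_per_analyst_day
```

The second function is the throughput constraint in one line: if daily investigation demand exceeds analyst capacity, no amount of detection speed helps.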
Communication Structure Is a Technical Control
Technical response is a fraction of incident response. Communication structure - who has authority, who talks to whom, what thresholds trigger notification - determines whether the technical response functions or stalls.
The failure mode is consistent: no pre-defined authority, no pre-defined thresholds, decisions about disclosure and notification made under pressure by people who have not decided in advance how to make them. Forensics teams get pulled into communications meetings. Containment stalls while scope debates continue.
Define this before an incident:
- Who has authority to formally declare an incident. Name the person and two backups.
- Who communicates to the board, to customers, and to the press. These are different roles with different preparation requirements.
- What thresholds trigger regulatory notification. GDPR requires notification within 72 hours of becoming aware of a qualifying breach. SEC rules require disclosure within 4 business days for material incidents. State breach notification laws vary. Know your obligations by jurisdiction and data type before an incident, not during one.
- Pre-drafted notification templates for your primary scenarios. Starting from a template at 3 AM is a different operational state than starting from nothing.
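What "define this before an incident" can look like in practice: a version-controlled structure the on-call incident commander can read at 3 AM. Every name and role below is a placeholder; the GDPR and SEC clocks are the ones cited above.

```python
INCIDENT_AUTHORITY = {
    # Who may formally declare an incident: one primary, two named backups.
    "declare": ["j.rivera", "a.chen", "s.okafor"],     # placeholder names
    # Different audiences are different roles with different preparation.
    "communicate": {
        "board":     "CISO",
        "customers": "VP Support",
        "press":     "Head of Communications",
    },
}

NOTIFICATION_TRIGGERS = [
    # (condition the IC evaluates, obligation, clock)
    ("personal data of EU residents affected", "GDPR Art. 33",           "72 hours from awareness"),
    ("incident determined material",           "SEC Form 8-K Item 1.05", "4 business days"),
    ("state-resident PII exposed",             "state breach statutes",  "varies by jurisdiction"),
]

TEMPLATES = "ir/templates/"   # pre-drafted notifications, one per primary scenario
```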
Legal review and forensic investigation run in parallel. Sequencing them - waiting for full scope before any action - is a direct contributor to extended containment timelines.
Live-Fire Testing Exposes What Tabletops Cannot
An organization with skilled analysts, good tooling, and a well-written IR plan will still fail if the team has never executed response procedures under real time pressure on real systems. Knowing how a tool works in a calm environment is not the same as operating it correctly during an active incident.
Live-fire exercises - simulated attacks against production-equivalent environments, with no advance notice of timing - expose gaps that tabletops do not. Measure: time to detect, time to correctly identify the attack vector, time to actually isolate (not decide to isolate), time to confirm eradication, accuracy of the post-incident reconstruction.
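Captured ad hoc, those numbers are not comparable across exercises. A sketch of a scorecard that forces the same measurements every time; field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class LiveFireScorecard:
    injected_at: datetime               # when the simulated attack began
    detected_at: datetime
    vector_identified_at: datetime      # correct identification, not first guess
    isolated_at: datetime               # isolation executed, not merely decided
    eradication_confirmed_at: datetime
    reconstruction_accurate: bool       # post-exercise timeline matched ground truth

    def _minutes(self, start: datetime, end: datetime) -> float:
        return (end - start).total_seconds() / 60

    @property
    def time_to_detect(self) -> float:
        return self._minutes(self.injected_at, self.detected_at)

    @property
    def time_to_isolate(self) -> float:
        return self._minutes(self.detected_at, self.isolated_at)
```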
Run this at least twice a year. Debrief on process failures, not individual performance. Every gap identified in an exercise is a gap the next actual incident would otherwise have exploited.
First Responders Destroy Evidence
When a system is identified as compromised, the instinct to reboot, reimage, or clean it destroys volatile evidence. Running processes, active network connections, cached credentials, and in-memory attacker artifacts are gone at reboot. Techniques that leave no persistent disk artifacts leave nothing recoverable after a reimage.
Without evidence of how the attacker got in, you cannot confirm the access path is closed. If it is not closed, the same path remains available.
First responders need explicit, trained procedures: isolate the network connection, do not reboot, do not reimage, do not delete files. Contain the blast radius without destroying state. A memory acquisition tool, a disk imaging tool, and chain-of-custody documentation are pre-staged and accessible - not assembled after an incident starts.
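One way to make "contain without destroying state" trainable is to write the first-responder sequence down in order of volatility. A sketch; the specific tooling is deliberately left generic, since the right tools depend on your stack.

```python
# Ordered by volatility: each step collects evidence that the steps
# below it cannot recover, and that a reboot or reimage destroys.
FIRST_RESPONDER_SEQUENCE = [
    "Isolate at the network layer (EDR containment or switch port) - do NOT reboot",
    "Acquire a memory image with the pre-staged acquisition tool",
    "Record running processes, active connections, logged-in sessions",
    "Take a disk image; all subsequent analysis happens on the copy",
    "Open the chain-of-custody log: who touched what, when, with which tool",
    "Hand off to forensics; the first responder takes no further action on the host",
]

# Actions that destroy exactly the evidence the sequence preserves.
FORBIDDEN = ["reboot", "reimage", "delete files", "run cleanup tooling"]
```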
Define a hard handoff between first responder (network isolation, blast radius containment) and forensic investigator (evidence preservation and analysis). These functions require different procedures and, routinely, different people.
The Retainer Is Not Pre-Staged
IR retainers exist to cover the capability gap that opens when internal capacity is exceeded. Most organizations that hold a retainer have never activated it - they have paid for a response capability they have not validated and have not integrated into their operational environment.
Activating a retainer under incident conditions means the external firm needs access to your environment. If that access - VPN credentials, service accounts, network architecture documentation, identification of critical systems - has not been pre-staged and validated, the first hours of the engagement are spent on access provisioning instead of response.
At retainer signing: provide the IR firm a technical access package. Store credentials in escrow. Rotate them quarterly. Run a joint exercise within the first 30 days to validate that the two teams can actually operate together. Define the specific conditions that trigger activation - ransomware confirmed, PII exfiltration suspected, internal capacity exceeded - and write them down. Activation decisions made under incident pressure, without pre-defined triggers, are slow and inconsistent.
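Both the activation triggers and the access package are artifacts you can write down at signing and re-validate quarterly. A sketch - every entry below is a placeholder, not a recommendation.

```python
ACTIVATION_TRIGGERS = [
    "ransomware confirmed on any production system",
    "exfiltration of PII or regulated data suspected",
    "internal response capacity exceeded (incident commander's call, criteria documented)",
]

ACCESS_PACKAGE = {
    "vpn_profile":      "escrowed; rotated quarterly",
    "service_accounts": "escrowed; least privilege; rotated quarterly",
    "network_docs":     "current as of last architecture review",
    "critical_systems": ["billing-db", "identity-provider"],   # hypothetical names
}

# Validate within 30 days of signing: joint exercise, both teams, real access.
```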
Test the contact channel quarterly. Know the firm’s SLA for onsite response and whether your internal team’s default instinct to reimage will void it.
Generic Playbooks Fail at 3 AM
A playbook that instructs an analyst to “identify affected systems” and “contain the threat” is a table of contents. It is not operational guidance. A stressed analyst at 3 AM needs numbered steps, specific tools, specific commands, in a specific order, for the specific scenario they are in.
Ransomware response and business email compromise require different tools, different containment actions, different evidence preservation priorities, and different communication urgency. Generic steps apply to neither.
Scenario-specific playbooks for the baseline threat set:
- Ransomware - isolate affected network segments before any other action; verify backup integrity before assuming backups are clean; identify initial access point; determine whether data was exfiltrated prior to encryption before characterizing this as encryption-only.
- Business Email Compromise - lock the affected account; review mail rules for forwarding configuration; check for OAuth application grants on the account; trace financial transactions initiated from the account and identify authorization chain.
- Insider Threat - preserve the user’s system and access logs before any action that could tip off the subject; coordinate with HR and legal before technical action; determine full scope of data accessed before any confrontation.
- Web Application Compromise - identify the exploited vulnerability; check for webshells in web root and upload directories; review recently modified files in the application path; check for new cron entries, service modifications, or SUID changes; examine outbound connection logs from the application server for C2 indicators; determine whether internal pivot occurred before scoping containment.
- Supply Chain / Third-Party Compromise - identify the affected vendor and the full scope of their access; review service account activity in logs for the access window; coordinate disclosure with the vendor; treat all data accessible via that vendor’s credential set as potentially compromised until confirmed otherwise.
Each playbook: two to three pages maximum, numbered steps, tool-specific where possible. “Isolate the host: CrowdStrike RTR → network contain” is actionable. “Contain the threat” is not.
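What that looks like when playbooks are kept as structured data rather than prose: a sketch compressing the ransomware steps above. Tool references are examples, not prescriptions.

```python
RANSOMWARE_PLAYBOOK = [
    # (step number, action, tool/command hint, constraint)
    (1, "Isolate affected network segments", "EDR network contain (e.g. CrowdStrike RTR)",
        "before ANY other action"),
    (2, "Verify backup integrity", "test restore to an isolated host",
        "assume backups are touched until proven clean"),
    (3, "Identify the initial access point", "EDR timeline, VPN and mail logs",
        "preserve source logs before querying them"),
    (4, "Check for pre-encryption exfiltration", "egress and cloud-storage audit logs",
        "this determines notification obligations"),
]

def render(playbook: list[tuple]) -> None:
    """Print the playbook as the numbered steps an analyst follows at 3 AM."""
    for number, action, tool, constraint in playbook:
        print(f"{number}. {action} [{tool}] - {constraint}")
```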
Postmortems Without Assigned Action Items Are Venting Sessions
The incident ends. A postmortem is written. It correctly identifies what failed. Recommendations are vague. No owner is assigned. No deadline is set. The same gap survives to the next incident.
A postmortem that produces vague recommendations with no ownership and no deadline documents that something happened. That is its entire operational value.
Every postmortem produces a maximum of five remediation items. Each item is specific, assigned to a named owner, time-bound, and validated by a defined method. “Improve patching processes” is not a remediation item. “Patch all internet-facing systems within 72 hours of critical CVE publication - owned by [name], implemented by [date], validated by confirmed patch state in asset inventory” is.
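That format maps directly onto a trackable record. A minimal sketch, with hypothetical names and dates; the operational point is that no field is optional.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class RemediationItem:
    description: str                      # specific and testable, never "improve X"
    owner: str                            # a named person, not a team
    due: date
    validation: str                       # how "done" is independently confirmed
    risk_accepted_by: str | None = None   # named person with authority, if deprioritized

item = RemediationItem(
    description="Patch internet-facing systems within 72h of critical CVE publication",
    owner="j.rivera",                     # placeholder name
    due=date(2026, 3, 31),                # placeholder date
    validation="patch state confirmed against asset inventory",
)
```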
Track these items in the same system where other operational work is tracked. If they are consistently deprioritized, that is a risk acceptance decision. Document it. Ensure the person accepting the risk has the authority to do so and is named.
The postmortem requires an audience with budget authority. A postmortem that reaches only the SOC team documents what went wrong. One that reaches decision-makers with dollar figures attached to the gaps assigns accountability where decisions can actually be made.
The Condition
Incident response fails at most organizations because a document was written and a checkbox was checked. The $4.45M average and 277-day containment timeline are not anomalies - they are the cost of that checkbox.
A capability requires testing, measurement, assigned ownership, and investment calibrated to actual exposure. The gap between document and capability is not technical. It is an accountability gap: no one owns the outcome, so no one is measured against it.
The IR plan is either a tested operational capability with named owners, measurable performance, and closed postmortem items - or it is a document that provides cover until the next incident makes the distinction visible. There is no third condition.