The Excalibur benchmark in February 2026 marked an inflection point. A research team published results for an LLM-based penetration-testing agent that compromised four of five hosts in a realistic Active Directory engagement, at a total cost of $28.50 in LLM API fees. Independent surveys place success rates for automated approaches at 69.5%, versus 47.6% for manual efforts.
The numbers are imperfect, the benchmarks are not yet universally accepted, and the results vary widely by environment complexity. But the trajectory is clear: end-to-end autonomous security assessment is no longer experimental. It is operational, it is cheap, and it is reshaping how mature security programmes test their defences.
What "agentic" actually means in security assessment
Three generations of security assessment
The word "agentic" has been used loosely. In the context of security assessment it has a specific technical meaning that distinguishes it from earlier approaches.
- Automated security tools execute pre-scripted procedures faster than humans can. They scan, fuzz, and report. They follow rules.
- AI-assisted tools use machine learning to make automated tools smarter — better mutation strategies for fuzzing, smarter prioritisation of findings, more relevant report content. The human still drives the engagement.
- Agentic tools reason about their environment, plan their own next actions, and adapt based on what they observe. The agent decides whether to escalate privileges, pivot laterally, or extract data — in the same way a human red teamer would.
The third category is the qualitative shift. The question is no longer whether AI is in the loop, but how much autonomous decision-making the agent has and how the boundaries are set.
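The observe–plan–act loop that distinguishes the third category can be sketched in a few lines. This is a toy illustration, not any real tool's architecture: the planner function stands in for the LLM reasoning step, and all names (`AgentState`, `plan_next_action`, `step`) are hypothetical. The one structural point it makes is that the agent, not a script, chooses the next action from its observations, while a scope check bounds every step.

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    host: str
    finding: str

@dataclass
class AgentState:
    scope: set[str]                      # hosts the engagement authorises
    observations: list[Observation] = field(default_factory=list)

def plan_next_action(state: AgentState) -> str:
    """Toy planner: choose the next step from what has been observed so far.
    In a real agent this is where the LLM reasoning loop sits."""
    if not state.observations:
        return "scan"
    if any("credential" in o.finding for o in state.observations):
        return "pivot"                   # observed credentials -> move laterally
    return "report"

def step(state: AgentState, target: str) -> str:
    # The boundary check runs before every action, not once at kickoff.
    if target not in state.scope:
        return "refused: out of scope"
    return plan_next_action(state)

state = AgentState(scope={"10.0.0.5", "10.0.0.6"})
print(step(state, "10.0.0.5"))   # -> scan (nothing observed yet)
print(step(state, "10.0.0.99"))  # -> refused: out of scope
```

The contrast with the first two generations is in `plan_next_action`: an automated tool would execute a fixed script regardless of `state`, whereas here the action depends on what the agent has seen.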
What agentic assessment actually delivers
From the engagements we have observed and conducted in 2025–2026:
Speed. Where a manual penetration test of a mid-sized environment took 2–4 weeks of operator time, agentic assessment of the same scope completes in 4–24 hours. The economic implication: continuous assessment is now feasible. The traditional "annual pentest" becomes a quarterly or even continuous exercise.
Coverage breadth. Agents do not get tired. They explore attack paths systematically, including paths a human would deprioritise as "probably not worth the time." This catches a category of vulnerabilities that manual testing routinely misses.
Cost. The Excalibur benchmark's $28.50 LLM cost for a successful AD compromise is not the full picture — there is preparation, scope definition, reporting, and human review required around the agent. But the order of magnitude is real. The cost of a credible end-to-end assessment has fallen 10×–100× compared to 2023.
Reproducibility. Agent runs can be re-executed deterministically (with the same prompts and target state), which makes regression testing meaningful. After remediation, you can re-run exactly the assessment that found the original issue.
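One way to make "re-run exactly the assessment" concrete is a run manifest: capture every input that determines the run (prompts, model, sampling settings, a snapshot ID for the target state) and derive the run ID from their hash, so identical inputs provably produce the same replayable configuration. The sketch below assumes this approach; the field names and `run_manifest` function are illustrative, not a real tool's schema, and note that LLM sampling is only near-deterministic even at temperature 0.

```python
import hashlib
import json

def run_manifest(prompts: list[str], target_state_id: str, model: str,
                 temperature: float = 0.0) -> dict:
    """Capture everything needed to replay an agent run, keyed by a
    content hash so two runs with identical inputs share a run ID."""
    payload = {
        "prompts": prompts,
        "target_state": target_state_id,   # e.g. lab-environment snapshot ID
        "model": model,
        "temperature": temperature,        # 0.0 for near-deterministic output
    }
    digest = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()
    return {**payload, "run_id": digest}

before = run_manifest(["enumerate AD users"], "snap-042", "model-x")
after  = run_manifest(["enumerate AD users"], "snap-042", "model-x")
assert before["run_id"] == after["run_id"]  # same inputs -> same run ID
```

After remediation, re-running the manifest that produced the original finding gives a regression test: if the finding no longer reproduces against the patched snapshot, the fix is evidenced rather than asserted.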
Where agents are still weak
The honest version is that agentic assessment has clear limits:
- Novel exploitation. Agents excel at known attack patterns, but novel custom exploits — bug-hunting in proprietary protocols, complex chained logic flaws — still require human creativity.
- Social engineering. The judgement required for high-quality social engineering is not yet matched by autonomous agents. The agent can draft phishing content, but the targeting and timing strategy still benefits materially from human judgement.
- Business logic vulnerabilities. Vulnerabilities that exploit the specific business semantics of an application — race conditions in payment flows, authorisation logic edge cases — are harder for general-purpose agents to discover than human testers familiar with the domain.
- Strategic prioritisation. Deciding which findings to escalate, which are noise, and which represent the real systemic risks remains a human judgement task.
The continuous assessment model
The 2026 mature posture is hybrid and continuous, not periodic and manual. The model:
- Continuous agent-based assessment runs across the production environment, with daily or weekly runs depending on change velocity.
- Quarterly human-led red team engagements focus on novel attack scenarios, business logic flaws, and the strategic gaps the agents do not catch.
- Findings from both feed a single risk register, with prioritisation by potential impact.
- Remediation cycles are tracked with target SLAs, and re-assessment confirms closure.
This is a meaningfully different operating model from the annual pentest engagement most organisations still run. The cost is not necessarily higher — the agent layer absorbs much of the volume — but the cadence and the integration with development workflows are fundamentally different.
Governance and accountability
Autonomous agents conducting security assessment raise governance questions that traditional pentesting did not. Specifically:
- Scope discipline. An autonomous agent must respect engagement scope without human intervention at each step. The boundaries must be specified upfront, and the agent must refuse to step outside them.
- Auditability. Every action the agent took, every decision it made, and every output it produced must be reconstructable for review. Without this, the assessment is not defensible.
- Damage potential. Agents executing autonomously can cause unintended damage if they take actions in production systems. Pre-defined safe actions, dry-run modes, and reversibility checks are essential.
- Regulatory alignment. Under BNM RMiT and PDPA, the assessment activities themselves must respect data handling and reporting requirements. An agent that exfiltrates personal data during testing has created a compliance issue, even when the testing was authorised.
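The first two controls, scope discipline and auditability, compose naturally: enforce an allowlist before every action, and write every decision (including refusals) to a hash-chained log so the sequence can be reconstructed and tamper-evidently reviewed. The sketch below is one possible shape, not a prescribed design; `AuditedAgent` and its methods are illustrative names.

```python
import hashlib
import ipaddress
import json
from datetime import datetime, timezone

class AuditedAgent:
    """Sketch of two governance controls: a CIDR scope allowlist checked
    per action, and a hash-chained log making every decision reviewable."""

    def __init__(self, scope_cidrs: list[str]):
        self.scope = [ipaddress.ip_network(c) for c in scope_cidrs]
        self.log: list[dict] = []
        self._prev_hash = "0" * 64  # genesis value for the hash chain

    def _record(self, entry: dict) -> None:
        # Each entry commits to the previous one, so deleting or editing
        # any record breaks the chain on verification.
        entry["prev"] = self._prev_hash
        entry["ts"] = datetime.now(timezone.utc).isoformat()
        self._prev_hash = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()
        ).hexdigest()
        self.log.append(entry)

    def act(self, action: str, target_ip: str) -> bool:
        in_scope = any(
            ipaddress.ip_address(target_ip) in net for net in self.scope
        )
        # Refused actions are logged too: the audit trail must show what
        # the agent declined to do, not only what it did.
        self._record({"action": action, "target": target_ip,
                      "allowed": in_scope})
        return in_scope

agent = AuditedAgent(["10.0.0.0/24"])
agent.act("scan", "10.0.0.17")    # in scope: proceeds, logged
agent.act("scan", "192.168.1.5")  # out of scope: refused, still logged
```

Dry-run modes and reversibility checks would hook into the same `act` gate: classify each action as safe, reversible, or destructive before it executes, and require elevated approval for the last category.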
What this means for security teams
If you are running or commissioning security assessment in 2026, three operational shifts follow:
- Move toward continuous assessment with agent-based capabilities. Annual pentests are increasingly insufficient.
- Reshape your security testing budget. The dollars previously spent on volume can be redirected to depth — fewer human-hours but on the most strategic, novel attack scenarios.
- Invest in agent governance. Scope discipline, auditability, damage prevention, and regulatory alignment are not optional.
For Malaysian organisations building this capability with regulatory alignment to BNM RMiT, ISO 27001, and PDPA, our AI Agentic Security programme covers the architecture, the governance, and the hands-on tooling. HRDC SBL-KHAS claimable for eligible employers.