Autonomous & Agentic Security Assessment in 2026

From periodic pentests to continuous autonomous assessment. The Excalibur benchmark, the agentic red team economy, and what it means for security programmes.

By Shah Mijanur · 2026-04-14 · 10 min read

The Excalibur benchmark, released in February 2026, marked an inflection point. A research team published results for an LLM-based penetration testing agent that compromised four out of five hosts in a realistic Active Directory engagement at a total cost of $28.50 in LLM API fees. Independent surveys place automated approach success rates at 69.5%, versus 47.6% for manual efforts.

The numbers are imperfect, the benchmarks are not yet universally accepted, and the results vary widely by environment complexity. But the trajectory is clear: end-to-end autonomous security assessment is no longer experimental. It is operational, it is cheap, and it is reshaping how mature security programmes test their defences.

What "agentic" actually means in security assessment

Three generations of security assessment

The word agentic has been used loosely. In the context of security assessment, it has a specific technical meaning that distinguishes it from earlier approaches.

  • Automated security tools execute pre-scripted procedures faster than humans can. They scan, fuzz, and report. They follow rules.
  • AI-assisted tools use machine learning to make automated tools smarter — better mutation strategies for fuzzing, smarter prioritisation of findings, more relevant report content. The human still drives the engagement.
  • Agentic tools reason about their environment, plan their own next actions, and adapt based on what they observe. The agent decides whether to escalate privileges, pivot laterally, or extract data — in the same way a human red teamer would.

The third category is the qualitative shift. The question is no longer whether AI is in the loop, but how much autonomous decision-making the agent has and how the boundaries are set.

What agentic assessment actually delivers

From the engagements we have observed and conducted in 2025–2026:

Speed. Where a manual penetration test of a mid-sized environment took 2–4 weeks of operator time, agentic assessment of the same scope completes in 4–24 hours. The economic implication: continuous assessment is now feasible. The traditional "annual pentest" becomes a quarterly or even continuous exercise.

Coverage breadth. Agents do not get tired. They explore attack paths systematically, including paths a human would deprioritise as "probably not worth the time." This catches a category of vulnerabilities that manual testing routinely misses.

Cost. The Excalibur benchmark's $28.50 LLM cost for a successful AD compromise is not the full picture — there is preparation, scope definition, reporting, and human review required around the agent. But the order of magnitude is real. The cost of a credible end-to-end assessment has fallen 10×–100× compared to 2023.

Reproducibility. Agent runs can be re-executed deterministically (with the same prompts and target state), which makes regression testing meaningful. After remediation, you can re-run exactly the assessment that found the original issue.
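Reproducibility depends on pinning everything that influences the run: the prompts, the model and decoding settings, and the target state. A minimal sketch of the idea in Python, where `run_manifest`, the model name, and the snapshot identifier are all hypothetical illustrations rather than any specific tool's API:

```python
import hashlib
import json

def run_manifest(prompt: str, model: str, seed: int, target_snapshot: str) -> dict:
    """Capture everything needed to re-execute an agent run deterministically."""
    manifest = {
        "model": model,
        "seed": seed,                        # fixed seed / temperature-0 decoding
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "target_snapshot": target_snapshot,  # e.g. a VM snapshot or IaC commit id
    }
    # A stable id over the sorted manifest lets you verify, before a re-run,
    # that you are repeating exactly the assessment that found the issue.
    manifest["manifest_id"] = hashlib.sha256(
        json.dumps(manifest, sort_keys=True).encode()
    ).hexdigest()[:16]
    return manifest

# Identical inputs produce an identical manifest id, so the post-remediation
# re-run can be checked against the original run before it starts.
original = run_manifest("enumerate AD, then escalate", "some-model", 7, "snap-2026-02-01")
rerun = run_manifest("enumerate AD, then escalate", "some-model", 7, "snap-2026-02-01")
assert original["manifest_id"] == rerun["manifest_id"]
```

Any change to the prompt, seed, or target snapshot changes the id, which is exactly the signal you want when deciding whether a re-assessment is a valid regression test.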

Where agents are still weak

The honest version is that agentic assessment has clear limits:

  • Novel exploitation. Agents excel at known attack patterns, but novel custom exploits — bug-hunting in proprietary protocols, complex chained logic flaws — still require human creativity.
  • Social engineering. The judgement required for high-quality social engineering is not yet matched by autonomous agents. The agent can draft phishing content, but the targeting and timing strategy still benefits materially from human judgement.
  • Business logic vulnerabilities. Vulnerabilities that exploit the specific business semantics of an application — race conditions in payment flows, authorisation logic edge cases — are harder for general-purpose agents to discover than human testers familiar with the domain.
  • Strategic prioritisation. Deciding which findings to escalate, which are noise, and which represent the real systemic risks remains a human judgement task.

The continuous assessment model

The 2026 mature posture is hybrid and continuous, not periodic and manual. The model:

  • Continuous agent-based assessment runs across the production environment, with daily or weekly runs depending on change velocity.
  • Quarterly human-led red team engagements focus on novel attack scenarios, business logic flaws, and the strategic gaps the agents do not catch.
  • Findings from both feed a single risk register, with prioritisation by potential impact.
  • Remediation cycles are tracked with target SLAs, and re-assessment confirms closure.
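The single risk register in the model above can be sketched as a small data structure. The `Finding` class, its fields, and the impact-times-likelihood scoring are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    source: str       # "agent" or "human"
    title: str
    impact: int       # 1 (low) .. 5 (critical)
    likelihood: int   # 1 .. 5
    remediated: bool = False

    @property
    def risk_score(self) -> int:
        return self.impact * self.likelihood

def risk_register(findings: list) -> list:
    """Merge agent and human findings into one register, highest risk first."""
    open_findings = [f for f in findings if not f.remediated]
    return sorted(open_findings, key=lambda f: f.risk_score, reverse=True)

findings = [
    Finding("agent", "SMB signing disabled", impact=3, likelihood=4),
    Finding("human", "payment race condition", impact=5, likelihood=3),
    Finding("agent", "stale DNS record", impact=2, likelihood=2, remediated=True),
]
# Remediated items drop out; the highest-risk open finding comes first,
# regardless of whether an agent or a human red teamer reported it.
top = risk_register(findings)[0]
```

The point of the sketch is the merge: agent-found and human-found issues compete on the same prioritisation axis instead of living in separate reports.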

This is a meaningfully different operating model from the annual pentest engagement most organisations still run. The cost is not necessarily higher — the agent layer absorbs much of the volume — but the cadence and the integration with development workflows are fundamentally different.

Governance and accountability

Autonomous agents conducting security assessment raise governance questions that traditional pentesting did not. Specifically:

  • Scope discipline. An autonomous agent must respect engagement scope without human intervention at each step. The boundaries must be specified upfront, and the agent must refuse to step outside them.
  • Auditability. Every action the agent took, every decision it made, and every output it produced must be reconstructable for review. Without this, the assessment is not defensible.
  • Damage potential. Agents executing autonomously can cause unintended damage if they take actions in production systems. Pre-defined safe actions, dry-run modes, and reversibility checks are essential.
  • Regulatory alignment. Under BNM RMiT and PDPA, the assessment activities themselves must respect data handling and reporting requirements. An agent that exfiltrates personal data during testing has created a compliance issue, even when the testing was authorised.
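Scope discipline and auditability can both be enforced at the same choke point: a gate that every agent action must pass through. A minimal Python sketch, with the network range, function names, and log format all hypothetical:

```python
import ipaddress
import json
from datetime import datetime, timezone

# The authorised scope comes from the engagement letter, not from the agent.
AUTHORISED_SCOPE = [ipaddress.ip_network("10.20.0.0/16")]

def in_scope(target: str) -> bool:
    addr = ipaddress.ip_address(target)
    return any(addr in net for net in AUTHORISED_SCOPE)

def gated_action(target: str, action: str, audit_log: list) -> bool:
    """Refuse out-of-scope actions and record every decision for later review."""
    allowed = in_scope(target)
    audit_log.append(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "target": target,
        "action": action,
        "allowed": allowed,
    }))
    return allowed

log: list = []
assert gated_action("10.20.5.9", "port-scan", log) is True
assert gated_action("8.8.8.8", "port-scan", log) is False  # outside scope: refused
```

Both the refusal and the allowed action land in the same append-only log, which is what makes the assessment reconstructable and therefore defensible after the fact.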

What this means for security teams

If you are running or commissioning security assessment in 2026, three operational shifts:

  • Move toward continuous assessment with agent-based capabilities. Annual pentests are increasingly insufficient.
  • Reshape your security testing budget. The dollars previously spent on volume can be redirected to depth — fewer human-hours but on the most strategic, novel attack scenarios.
  • Invest in agent governance. Scope discipline, auditability, damage prevention, and regulatory alignment are not optional.

For Malaysian organisations building this capability with regulatory alignment to BNM RMiT, ISO 27001, and PDPA, our AI Agentic Security programme covers the architecture, the governance, and the hands-on tooling. HRDC SBL-KHAS claimable for eligible employers.

About the author

Shah Mijanur

CISSP · Offensive Security · 12+ yrs Fintech & Banking · BNM RMiT

Shah is a cybersecurity practitioner with credentials including CISSP and offensive-security certifications, and 12+ years securing fintech, banking, and SaaS environments across APAC. He specialises in agentic security: prompt-injection defence, secrets management for AI workflows, RAG pipeline hardening, and aligning AI deployments with BNM RMiT, ISO 27001, and PDPA.

Frequently Asked Questions

What is the difference between automated, AI-assisted, and agentic security tools?
Automated tools execute pre-scripted procedures. AI-assisted tools use machine learning to make automation smarter. Agentic tools reason about the environment, plan their own actions, and adapt based on observation — the agent decides whether to escalate privileges, pivot laterally, or extract data, in the same way a human red teamer would. The third category is the qualitative shift that defines the 2026 generation of security assessment tooling.

Will agentic tools replace human red teamers?
Not in the foreseeable future. Agents excel at known attack patterns, systematic coverage, and high-volume work. They are weaker on novel exploitation, social engineering judgement, business logic vulnerabilities, and strategic prioritisation. The 2026 mature posture is hybrid: continuous agent-based assessment plus quarterly human-led engagements focused on novel scenarios.

How has agentic assessment changed the cost of security testing?
The cost per assessment has fallen by an order of magnitude or more. The Excalibur benchmark's $28.50 LLM cost is illustrative — though full operational cost including preparation, scope, governance, and human review is higher. The economics now support continuous assessment as the default, where annual pentests were the previous standard.

What governance controls do autonomous assessment agents require?
Four areas: scope discipline (the agent must respect engagement boundaries without per-step human intervention), auditability (every action and decision must be reconstructable), damage prevention (pre-defined safe actions, dry-run modes, reversibility checks), and regulatory alignment (BNM RMiT, PDPA, ISO 27001 requirements apply to the assessment itself, not just the assessed systems).

Is autonomous security assessment legal?
Yes, with the same authorisation framework as traditional pentesting — explicit written authorisation from the system owner, defined scope, and lawful conduct. The autonomous nature does not change the legal basis, but it does raise the bar for governance and audit trail quality. Engagements must be designed so that the agent's actions remain within the authorised scope at all times.

Want to apply this in your organisation?

AITraining2U runs HRDC-claimable corporate AI training for Malaysian organisations — from leadership awareness to hands-on builder workshops. Talk to us about a programme tailored to your team.