AgentShield Benchmark

The first head-to-head benchmark of AI agent security providers โ€” open, reproducible, and fair

537
Test Cases
8
Categories
6
Providers Tested
๐Ÿ† Top score: AgentGuard 98.4 โ€” zero false positives, sub-millisecond latency

Leaderboard

537 test cases across 8 categories. Over-refusal penalty: (FPR^1.3) ร— 40 โ€” security that breaks usability isn't security.

# Provider Score Penalty PI Jailbreak Data Exfil Tool Abuse Over-Refusal Multi-Agent Provenance P50
1 AgentGuardTrustless ProtocolProvenance-based (proprietary) 98.4 0.00 98.5% 97.8% 100.0% 100.0% 100.0% 100.0% 85.0% 1ms
2 Deepset DeBERTaML model (local) 87.6 โˆ’10.95 99.5% 97.8% 95.4% 98.8% 63.1% 100.0% 100.0% 19ms
3 Lakera GuardML + rules (SaaS) 79.4 โˆ’12.77 97.6% 95.6% 96.6% 86.3% 58.5% 94.3% 95.0% 133ms
4 ProtectAI DeBERTa v2ML model (local) 51.4 โˆ’0.73 77.1% 86.7% 43.7% 12.5% 95.4% 74.3% 65.0% 19ms
5 ClawGuardPattern-based (local) 38.9 0.00 62.9% 22.2% 40.2% 17.5% 100.0% 40.0% 25.0% 0ms
6 LLM GuardML model (Docker) 38.7 โ€” 77.1% โ€” 30.8% 8.9% โ€” โ€” โ€” 111ms

Methodology

Transparent, reproducible, and designed to reward balanced security.

๐Ÿ“ Weighted Geometric Mean

Overall score is a weighted geometric mean across categories. This rewards balanced performance โ€” scoring 90 everywhere beats 100 on some and 50 on others.

๐Ÿšซ Over-Refusal Penalty

Blocking legitimate requests is penalized: (FPR^1.3) ร— 40. A provider blocking 50% of legit requests loses ~16 points. Security that breaks usability isn't security.

โšก Latency Scoring

Sub-50ms p95 scores 100. Over 1 second scores 5. Speed matters in production agentic systems where tool-calling timeouts are real.

๐Ÿ”’ Full Reproducibility

Corpus hashed per run. All results include environment, config, and raw per-test-case outcomes. Anyone can verify independently.

8 Test Categories

537 test cases across attack detection, false positive control, performance, and provenance.

๐ŸŽฏ Prompt Injection 205 tests ยท 20%

Direct, indirect, and context-manipulation attacks. Includes delimiter escaping, multi-turn escalation, MCP hijacking, unicode steganography, and encoded payloads.

๐Ÿ”“ Jailbreak 45 tests ยท 10%

DAN variants, roleplay exploits, authority impersonation, token smuggling, crescendo attacks, and multi-language bypass attempts.

๐Ÿ“ค Data Exfiltration 87 tests ยท 15%

Leaking data via tool calls, markdown images, error messages, steganographic encoding, and side channels. Tests both direct extraction and covert exfiltration.

๐Ÿ›  Tool Abuse 80 tests ยท 15%

Unauthorized tool calls, privilege escalation, parameter injection, scope expansion, recursive loops, and resource exhaustion attacks.

โœ… Over-Refusal 65 tests ยท 15%

Legitimate requests that should NOT be blocked: cybersecurity education, medical/legal topics, creative writing, historical events, and multi-language benign inputs.

๐Ÿค– Multi-Agent 35 tests ยท 10%

Cross-agent injection propagation, delegation abuse, trust boundary violations, context poisoning, and orchestrator impersonation.

โฑ Latency all tests ยท 10%

Added latency measured across every test case. P50, P95, and P99 percentiles. Scored inversely โ€” faster is better.

๐Ÿ”— Provenance & Audit 20 tests ยท 5%

Detecting fake authorization claims, spoofed A2A handoffs, fabricated HMAC/JWT tokens, and unverifiable approval chains. Can the provider tell real authority from claimed authority?

Trustless Benchmark Protocol

How proprietary solutions participate without revealing their implementation.

Commit-Reveal with Ed25519 Signatures

Our Trustless Protocol lets vendors benchmark locally while cryptographically proving results are legitimate. No model weights revealed, no API access needed โ€” just math.

1
Vendor commits
model hash
2
Authority reveals
random seed
3
Vendor runs
locally
4
Bundle signed
& verified

Prevents cherry-picking (random subset), result tampering (hash chain), model swapping (commitment), and forgery (Ed25519). Full protocol documentation โ†’

Submit Your Provider

Building an AI agent security tool? We'll benchmark it โ€” independently, transparently, against the full 537-case corpus.

Open a GitHub Issue View Source