AgentShield Benchmark

Leaderboard

537 test cases across 8 categories. Over-refusal penalty: (FPR^1.3) × 40 — security that breaks usability isn't security.

#	Provider	Score	Penalty	PI	Jailbreak	Data Exfil	Tool Abuse	Over-Refusal	Multi-Agent	Provenance	P50
1	AgentGuardTrustless ProtocolProvenance-based (proprietary)	98.4	0.00	98.5%	97.8%	100.0%	100.0%	100.0%	100.0%	85.0%	1ms
2	Deepset DeBERTaML model (local)	87.6	−10.95	99.5%	97.8%	95.4%	98.8%	63.1%	100.0%	100.0%	19ms
3	Lakera GuardML + rules (SaaS)	79.4	−12.77	97.6%	95.6%	96.6%	86.3%	58.5%	94.3%	95.0%	133ms
4	ProtectAI DeBERTa v2ML model (local)	51.4	−0.73	77.1%	86.7%	43.7%	12.5%	95.4%	74.3%	65.0%	19ms
5	ClawGuardPattern-based (local)	38.9	0.00	62.9%	22.2%	40.2%	17.5%	100.0%	40.0%	25.0%	0ms
6	LLM GuardML model (Docker)	38.7	—	77.1%	—	30.8%	8.9%	—	—	—	111ms

Methodology

Transparent, reproducible, and designed to reward balanced security.

📐 Weighted Geometric Mean

Overall score is a weighted geometric mean across categories. This rewards balanced performance — scoring 90 everywhere beats 100 on some and 50 on others.

🚫 Over-Refusal Penalty

Blocking legitimate requests is penalized: (FPR^1.3) × 40. A provider blocking 50% of legit requests loses ~16 points. Security that breaks usability isn't security.

⚡ Latency Scoring

Sub-50ms p95 scores 100. Over 1 second scores 5. Speed matters in production agentic systems where tool-calling timeouts are real.

🔒 Full Reproducibility

Corpus hashed per run. All results include environment, config, and raw per-test-case outcomes. Anyone can verify independently.

8 Test Categories

537 test cases across attack detection, false positive control, performance, and provenance.

🎯 Prompt Injection 205 tests · 20%

Direct, indirect, and context-manipulation attacks. Includes delimiter escaping, multi-turn escalation, MCP hijacking, unicode steganography, and encoded payloads.

🔓 Jailbreak 45 tests · 10%

DAN variants, roleplay exploits, authority impersonation, token smuggling, crescendo attacks, and multi-language bypass attempts.

📤 Data Exfiltration 87 tests · 15%

Leaking data via tool calls, markdown images, error messages, steganographic encoding, and side channels. Tests both direct extraction and covert exfiltration.

🛠 Tool Abuse 80 tests · 15%

Unauthorized tool calls, privilege escalation, parameter injection, scope expansion, recursive loops, and resource exhaustion attacks.

✅ Over-Refusal 65 tests · 15%

Legitimate requests that should NOT be blocked: cybersecurity education, medical/legal topics, creative writing, historical events, and multi-language benign inputs.

🤖 Multi-Agent 35 tests · 10%

Cross-agent injection propagation, delegation abuse, trust boundary violations, context poisoning, and orchestrator impersonation.

⏱ Latency all tests · 10%

Added latency measured across every test case. P50, P95, and P99 percentiles. Scored inversely — faster is better.

🔗 Provenance & Audit 20 tests · 5%

Detecting fake authorization claims, spoofed A2A handoffs, fabricated HMAC/JWT tokens, and unverifiable approval chains. Can the provider tell real authority from claimed authority?

Trustless Benchmark Protocol

How proprietary solutions participate without revealing their implementation.

Commit-Reveal with Ed25519 Signatures

Our Trustless Protocol lets vendors benchmark locally while cryptographically proving results are legitimate. No model weights revealed, no API access needed — just math.

Vendor commits
model hash

Authority reveals
random seed

Vendor runs
locally

Bundle signed
& verified

Prevents cherry-picking (random subset), result tampering (hash chain), model swapping (commitment), and forgery (Ed25519). Full protocol documentation →