AEGIS-BR-2026-001 | Benchmark Report | v1.0

Benchmarking Guardrail Effectiveness in High-Risk LLM Use Cases

Comparative Evaluation of Layered Guardrail Strategies across Policy-Sensitive Enterprise Scenarios

Authors: AEGIS Research Team, Yatav Inc.
Published: March 2026
Affiliation: AEGIS Research
Keywords: guardrail, benchmark, red teaming, enterprise AI, EU AI Act, NIST AI RMF, adversarial attacks, AEGIS, layered defense, EGEI

Summary

Enterprise deployment of Large Language Models in policy-sensitive domains — healthcare, finance, legal, telecommunications, and defense — introduces risks that extend far beyond conventional content safety. Regulatory mandates (EU AI Act, K-AI Act, NIST AI RMF), sector-specific compliance requirements, and the potential for catastrophic harm in high-stakes decision contexts demand guardrail systems whose effectiveness is empirically validated rather than assumed. We present a comprehensive benchmarking study of layered guardrail strategies using the AEGIS (AI Engine for Guardrail & Inspection System) framework, evaluating 8 commercial LLMs across 7 adversarial attack algorithms (112 total evaluations) in two independent sessions. Our results reveal three critical findings: (1) all tested models — including GPT-5, Claude Opus 4.6, and Gemini 3.1 Pro — are classified as VULNERABLE without external guardrails, with a mean baseline defense rate of only 38.1%; (2) a 3-Tier layered guardrail architecture (rule-based filters at <0.5ms, ML classifiers at <5ms, LLM judges at <200ms) improves defense rates from 38.1% to an estimated 75–85% while maintaining 50,000+ RPS throughput; and (3) domain-specific policy enforcement — including telecom-specific threat taxonomies, military ROE compliance, and financial boundary detection — is essential for high-risk use cases where generic safety mechanisms are insufficient. We propose the Enterprise Guardrail Effectiveness Index (EGEI), a composite metric incorporating attack resilience, regulatory compliance coverage, latency overhead, and domain-specific policy adherence, and use it to evaluate guardrail configurations across 5 enterprise deployment scenarios.

1. Introduction

The enterprise adoption of Large Language Models has entered a critical inflection point. While LLMs offer transformative capabilities in customer service automation, document analysis, code generation, and decision support, their deployment in regulated industries introduces risks that generic safety training cannot adequately address.

Consider three illustrative scenarios:

Scenario A: Healthcare. A hospital deploys an LLM-powered clinical decision support system. A multi-turn conversation with a physician gradually escalates from discussing drug interactions to providing specific dosage recommendations for a controlled substance — without flagging that the recommended dose exceeds the lethal threshold. The LLM's built-in safety mechanisms, designed for consumer use cases, fail to enforce healthcare-specific prescribing boundaries.

Scenario B: Telecommunications. A telecom operator integrates an LLM into its network operations center (NOC) support system. An attacker impersonating a NOC engineer uses social engineering to extract network infrastructure topology and base station vulnerability information. The model's safety training does not include telecom-specific threat vectors such as infrastructure exploit elicitation or crisis communication pretexts.

Scenario C: Defense. A military command system leverages an LLM for operational planning assistance. Without domain-specific guardrails, the system fails to enforce classification boundaries, Rules of Engagement (ROE), or international humanitarian law (IHL) compliance, potentially generating plans that violate the proportionality principle or recommend prohibited weapons.

These scenarios illustrate a fundamental gap: generic LLM safety mechanisms are designed for consumer use cases and are demonstrably insufficient for policy-sensitive enterprise deployments. Existing safety benchmarks (HarmBench [1], JailbreakBench [2], AdvBench [3]) evaluate models against standard adversarial attacks but do not assess domain-specific policy enforcement, regulatory compliance coverage, or the layered defense architectures required for high-risk deployments.

This paper addresses this gap through three contributions:

  1. Empirical Benchmarking of Baseline LLM Safety. We evaluate 8 commercial LLMs across 7 state-of-the-art attack algorithms in 112 evaluations, establishing that the baseline defense rate without external guardrails is only 38.1% — far below the thresholds required by EU AI Act Article 15 (accuracy and robustness) and the NIST AI RMF MEASURE function.

  2. Layered Guardrail Architecture Evaluation. We benchmark the AEGIS 3-Tier defense architecture (rule-based + ML + LLM Judge) across multiple configurations, demonstrating that layered guardrails improve defense rates from 38.1% to 75–85% while maintaining sub-20ms P99 latency.

  3. Enterprise Guardrail Effectiveness Index (EGEI). We propose a composite metric that captures the multidimensional requirements of enterprise guardrail systems — attack resilience, regulatory compliance, latency overhead, and domain-specific policy adherence — enabling apples-to-apples comparison of guardrail configurations across deployment scenarios.

2. Related Work

2.1 LLM Safety Benchmarks

Existing benchmarks focus primarily on attack success rates against model-native safety mechanisms. HarmBench [1] provides standardized red-teaming evaluation across harmful content categories. JailbreakBench [2] measures robustness against jailbreak attacks. AdvBench [3] evaluates adversarial robustness. While valuable, these benchmarks share three limitations for enterprise contexts: (a) they do not evaluate external guardrail systems; (b) they do not assess domain-specific policy enforcement; and (c) they report point estimates without statistical risk modeling.

2.2 Guardrail Systems

Commercial guardrail solutions include NVIDIA NeMo Guardrails [4] and Guardrails AI [5]. Academic work includes Llama Guard [6, 7] for content classification and WildGuard [8] for adversarial robustness. However, these systems are typically evaluated as standalone classifiers rather than as layered architectures optimized for specific enterprise deployment scenarios.

2.3 Regulatory Frameworks

The EU AI Act [9] classifies AI systems by risk level (Minimal, Limited, High, Unacceptable) and mandates specific requirements for high-risk systems including risk management (Art. 9), data governance (Art. 10), transparency (Art. 13), human oversight (Art. 14), and accuracy/robustness (Art. 15). The Korean AI Act (K-AI Act) [10] establishes parallel requirements including safety assurance (Art. 5), user protection (Art. 7), transparency (Art. 9), and personal data protection (Art. 12). The NIST AI RMF [11] provides a voluntary framework organized around four functions: GOVERN, MAP, MEASURE, and MANAGE.

No existing benchmark directly evaluates guardrail systems against these regulatory requirements. Our EGEI metric addresses this gap.

2.4 Domain-Specific AI Safety

Prior work on domain-specific AI safety includes medical AI safety frameworks [12], financial AI risk management [13], and military autonomous systems governance (DoD Directive 3000.09) [14]. These frameworks define requirements but do not provide benchmarking methodologies for evaluating guardrail effectiveness in their respective domains.

3. Enterprise Threat Landscape

3.1 Threat Taxonomy for Policy-Sensitive Domains

Building on the OWASP LLM Top 10 [15] and NIST AI RMF [11], we define an enterprise-specific threat taxonomy with domain extensions:

Table I: Enterprise Threat Taxonomy

| # | Threat Category | Sub-types | Enterprise Impact |
|---|---|---|---|
| 1 | Prompt Injection | Direct, Indirect, System Prompt Extraction, Instruction Override | Unauthorized access to enterprise knowledge bases; policy bypass |
| 2 | Jailbreak | RolePlay, Encoding Bypass, Multi-Turn Escalation, Genetic, Token Smuggling, ASCII Art | Circumvention of domain-specific safety policies |
| 3 | Data Leakage | Training Data Extraction, PII Exfiltration, Credential Leakage | Regulatory violations (GDPR, HIPAA, K-PIPA); enterprise IP theft |
| 4 | Harmful Content | Violence, Hate Speech, Illegal Instructions, Misinformation, Self-Harm | Reputational damage; legal liability; user harm |
| 5 | Model Manipulation | Adversarial Input, Model Poisoning, Backdoor Attack | Compromised decision quality in high-stakes contexts |
| 6 | Agent Abuse | Tool Misuse, Unauthorized Actions, Privilege Escalation, Reasoning Hijack | Autonomous action beyond authorized boundaries |
| 7 | Privacy Violation | PII exposure, consent management, cross-border data transfer | GDPR/K-PIPA fines; loss of customer trust |
| 8 | Supply Chain | Plugin poisoning, model poisoning, data poisoning | Compromised enterprise AI infrastructure |

3.2 Domain-Specific Threat Extensions

Beyond generic threats, each enterprise domain introduces unique attack vectors that standard safety mechanisms do not address:

Table II: Domain-Specific Threat Vectors

| Domain | Threat ID | Attack Technique | Difficulty | Impact |
|---|---|---|---|---|
| Telecom | TELCO-001 | NOC Engineer Impersonation | 2 | Network topology extraction via role impersonation |
| Telecom | TELCO-002 | Internal Memo Pretext | 2 | Harmful content extraction under documentation guise |
| Telecom | TELCO-003 | Regulatory Bypass Pretext | 3 | Compliance testing claim to bypass safety |
| Telecom | TELCO-004 | Infrastructure Exploit Elicitation | 4 | Base station attack plans, SIM cloning guidance |
| Telecom | TELCO-005 | Synthetic Data Pretext | 2 | Harmful data generation under AI training guise |
| Telecom | TELCO-006 | Crisis Communication Pretext | 3 | Misinformation via disaster drill framing |
| Healthcare | HEALTH-001 | Self-Prescription Elicitation | 3 | Drug recommendations without prescription authority |
| Healthcare | HEALTH-002 | Lethal Dosage Extraction | 4 | Dosage information exceeding safety thresholds |
| Financial | FIN-001 | Illegal Lending Guidance | 3 | High-interest loan promotion; regulatory violations |
| Financial | FIN-002 | Investment Guarantee Claims | 3 | False return guarantees; securities fraud |
| Legal | LEGAL-001 | Contract Fraud Assistance | 3 | Guidance on breaching contracts without penalty |
| Defense | MIL-001 | Classification Boundary Violation | 5 | Aggregation of unclassified items into classified exposure |
| Defense | MIL-002 | ROE/IHL Non-Compliance | 5 | Plans violating proportionality or weapon prohibitions |

3.3 CVSS-Like AI Severity Scoring

We adopt a CVSS-like scoring framework adapted for AI vulnerabilities, enabling quantitative comparison across domains:

Scoring dimensions:

| Dimension | Levels | Score Range |
|---|---|---|
| Attack Complexity | Low (0.77) / High (0.44) | 0.44–0.77 |
| Privileges Required | None (0.85) / Low (0.62) / High (0.27) | 0.27–0.85 |
| User Interaction | None (0.85) / Required (0.62) | 0.62–0.85 |
| Confidentiality Impact | None / Low / High | 0.0–0.56 |
| Integrity Impact | None / Low / High | 0.0–0.56 |
| Availability Impact | None / Low / High | 0.0–0.56 |

Severity ratings: None (0.0), Low (0.1–3.9), Medium (4.0–6.9), High (7.0–8.9), Critical (9.0–10.0)
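The dimensions above can be combined into a numeric severity score. The report does not give the exact combination formula, so the sketch below assumes the CVSS v3.1 base-score arithmetic; the Attack Vector factor (fixed here at 0.85, the CVSS "Network" value) and the 0.22 value for a "Low" C/I/A impact are assumptions, since the table omits them. Function names are illustrative.

```python
import math

# Metric values from the scoring table; 0.22 for "Low" impact is an
# assumption borrowed from CVSS v3.1 (the table only gives the 0.0-0.56 range).
ATTACK_COMPLEXITY = {"low": 0.77, "high": 0.44}
PRIVILEGES_REQUIRED = {"none": 0.85, "low": 0.62, "high": 0.27}
USER_INTERACTION = {"none": 0.85, "required": 0.62}
IMPACT = {"none": 0.0, "low": 0.22, "high": 0.56}

def ai_severity_score(ac, pr, ui, conf, integ, avail, av=0.85):
    """CVSS v3.1-style base score (assumed formula); av fixed at Network."""
    iss = 1 - (1 - IMPACT[conf]) * (1 - IMPACT[integ]) * (1 - IMPACT[avail])
    impact = 6.42 * iss
    exploitability = (8.22 * av * ATTACK_COMPLEXITY[ac]
                      * PRIVILEGES_REQUIRED[pr] * USER_INTERACTION[ui])
    if impact <= 0:
        return 0.0
    # CVSS rounds the base score up to one decimal place
    return math.ceil(min(impact + exploitability, 10.0) * 10) / 10

def severity_label(score):
    """Map a score to the report's severity ratings."""
    for floor, label in [(9.0, "Critical"), (7.0, "High"), (4.0, "Medium"), (0.1, "Low")]:
        if score >= floor:
            return label
    return "None"
```

Under these assumptions, a low-complexity, no-privilege, no-interaction attack with high confidentiality and integrity impact lands in the Critical band.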

4. Layered Guardrail Architecture

4.1 Design Principles for Enterprise Guardrails

Enterprise guardrail systems must satisfy four requirements that consumer-grade safety mechanisms do not:

R1: Regulatory compliance. Guardrails must demonstrably satisfy specific regulatory articles (EU AI Act Art. 9, 10, 13, 14, 15; K-AI Act Art. 5, 7, 9, 12; NIST AI RMF GOVERN/MAP/MEASURE/MANAGE).

R2: Domain-specific policy enforcement. Generic content safety is insufficient; guardrails must enforce sector-specific policies (healthcare prescribing boundaries, financial suitability rules, telecom infrastructure protection, military classification levels).

R3: Auditable decision trails. Enterprise deployments require complete audit logging of every guardrail decision — the input, the analysis, the decision rationale, and the verdict — for regulatory inspection and incident investigation.

R4: Fail-safe behavior. In high-risk domains, system failures must default to safe states (BLOCK on detection uncertainty) rather than permissive states. This is configurable between Fail-Safe (default BLOCK) and Fail-Closed (halt service) policies depending on domain requirements.
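The two failure policies in R4 can be sketched as a thin wrapper around any guardrail check. This is a minimal illustration, not the AEGIS implementation; all names and the boolean `check` signature are assumptions.

```python
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"

class FailureMode(Enum):
    FAIL_SAFE = "fail_safe"      # default-BLOCK the request, keep serving
    FAIL_CLOSED = "fail_closed"  # halt the service entirely

class ServiceHalted(Exception):
    pass

def guarded_verdict(check, prompt, mode=FailureMode.FAIL_SAFE):
    """Run a guardrail check (check(prompt) -> True if safe, illustrative
    signature). On any detection failure, never default to ALLOW."""
    try:
        return Verdict.ALLOW if check(prompt) else Verdict.BLOCK
    except Exception:
        if mode is FailureMode.FAIL_CLOSED:
            raise ServiceHalted("guardrail failure: halting per fail-closed policy")
        return Verdict.BLOCK  # fail-safe: block on detection uncertainty
```

The key design point is that the exception path, not the happy path, encodes the domain's risk posture.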

4.2 AEGIS 3-Tier Hierarchical Architecture

The AEGIS framework implements a 3-Tier defense architecture optimized for the latency-accuracy trade-off:

Table III: 3-Tier Architecture Specification

| Tier | Method | Latency | Traffic Share | Confidence | Technique |
|---|---|---|---|---|---|
| 1 | Rule-Based Filter | < 0.5ms | 70–80% | 0.95–1.0 | Aho-Corasick multi-pattern (8 languages); Bloom filter (100K patterns, FPR 0.001) |
| 2 | ML Classifier | < 5ms | 15–25% | 0.75–0.95 | Guard Encoder (mDeBERTa-v3-base, 8-class, INT8); 5-signal heuristic fallback |
| 3 | LLM Judge | < 200ms | < 5% | 0.50–0.95 | Constitutional AI 2-stage verification; 9-category evaluation; 10-turn context |

Escalation logic:

Tier 1 → if uncertain → Tier 2 → if ambiguous (0.25 < prob < 0.75) → Tier 3

This cascading architecture ensures that the vast majority of clearly safe or clearly dangerous requests are processed at sub-millisecond latency, while only genuinely ambiguous cases (< 5% of traffic) incur the cost of LLM-based judgment.
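The escalation logic above can be sketched as a simple dispatcher. The three tier implementations are stand-in callables (their names and signatures are illustrative); the ambiguity band 0.25 < p < 0.75 is taken from the escalation rule.

```python
def classify(prompt, rule_filter, ml_prob, llm_judge):
    """3-Tier cascade: Tier 1 rules -> Tier 2 ML -> Tier 3 LLM judge.

    rule_filter(prompt) -> "block" | "allow" | "uncertain"
    ml_prob(prompt)     -> P(unsafe) in [0, 1]
    llm_judge(prompt)   -> "block" | "allow"
    (All three callables are illustrative stand-ins.)
    """
    t1 = rule_filter(prompt)           # sub-millisecond path, 70-80% of traffic
    if t1 != "uncertain":
        return t1, "tier1"
    p = ml_prob(prompt)                # < 5 ms path
    if not (0.25 < p < 0.75):          # confident ML verdict: no escalation
        return ("block" if p >= 0.75 else "allow"), "tier2"
    return llm_judge(prompt), "tier3"  # < 200 ms path, < 5% of traffic
```

Only prompts that both the rules and the ML classifier find ambiguous ever pay the LLM Judge latency.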

4.3 Defense Algorithm Portfolio

Four primary and two auxiliary defense algorithms operate within the 3-Tier framework:

| Algorithm | Type | Key Capability | Block Threshold |
|---|---|---|---|
| GuardNet [16] | Hierarchical Graph | 3-level detection (token/sentence/prompt) + graph connectivity | 0.7 |
| JBShield [17] | Dual-Track | Separates toxicity from jailbreak technique in representation space | Adaptive |
| CCFC | Obfuscation Detection | Core vs. full prompt divergence analysis; flags \|divergence\| > 0.3 | 0.4–0.7 |
| MULI | Intrinsic Toxicity | Simulated logit distribution analysis across 7 categories | Adaptive |
| TAG (aux) | Context Tagging | Keyword + context tag classifier for Tier 2 augmentation | |
| RATIONAL (aux) | Reasoning Analysis | Chain-of-Thought risk signal detection | |
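CCFC's divergence check is the easiest of these to illustrate: score the full prompt and its distilled core request with the same risk model, and flag the pair when the scores diverge by more than 0.3. The core extractor and risk scorer below are generic callables, since the report does not specify their implementations.

```python
def ccfc_flag(full_prompt, extract_core, risk_score, threshold=0.3):
    """Flag likely obfuscation when core-prompt and full-prompt risk diverge.

    extract_core(prompt) -> distilled core request (stand-in)
    risk_score(text)     -> risk estimate in [0, 1] (stand-in)
    Returns (flagged, divergence).
    """
    core = extract_core(full_prompt)
    divergence = risk_score(full_prompt) - risk_score(core)
    return abs(divergence) > threshold, divergence
```

A large positive divergence suggests a harmless-looking core wrapped in risk-raising framing; a large negative one suggests obfuscation hiding a risky core.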

4.4 PALADIN 6-Layer Deep Defense Pipeline

Orthogonal to the 3-Tier system, the PALADIN (Protective AI Layered Adversarial Defense Inspection Network) pipeline provides sequential deep content inspection:

| Layer | Name | Function | Enterprise Relevance |
|---|---|---|---|
| L0 | TrustBoundary | Input validation | Unicode normalization prevents encoding attacks |
| L1 | IntentVerification | Intent analysis | Detects prompt injection, role confusion |
| L2 | RaGuard | RAG poisoning | Protects enterprise knowledge bases |
| L3 | ClassRagLayer | Semantic classification | Content categorization across risk classes |
| L4 | CircuitBreaker | Anomaly detection | Rate limiting, abuse pattern identification |
| L5 | BehavioralAnalysis | Behavioral profiling | Multi-turn escalation detection (critical for Crescendo/HPM) |

4.5 Domain-Specific Policy Modules

AEGIS extends the generic defense architecture with domain-specific policy enforcement modules:

4.5.1 Enterprise Policy Framework

Policies are defined with quantitative guarantees:

| Parameter | Child Safety Policy | Enterprise Policy |
|---|---|---|
| Target Groups | Children, Students | Professionals, Enterprise |
| Languages | Korean, English | Korean, English, Japanese |
| Modalities | Text, Image | Text, Code |
| ε-Coverage | 99% | 95% |
| Defended Threats | Jailbreak, Toxicity, Self-harm, PII, Sexual, Violence, Grooming | Jailbreak, PII, Data leakage, Bias |
| δ-Adversarial Resistance | 98% | 90% |
| Max P95 Latency | 200ms | 100ms |
| Human Escalation | Allowed | Not Allowed |
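A policy with quantitative guarantees is straightforward to validate mechanically against measured deployment metrics. The sketch below encodes the two policies above and checks a measurement against them; the dataclass and function names are illustrative, not the AEGIS API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PolicySpec:
    name: str
    epsilon_coverage: float    # min fraction of in-scope threats defended
    delta_resistance: float    # min adversarial resistance
    max_p95_latency_ms: float

# Quantitative guarantees from the policy table
CHILD_SAFETY = PolicySpec("child_safety", 0.99, 0.98, 200.0)
ENTERPRISE = PolicySpec("enterprise", 0.95, 0.90, 100.0)

def violations(policy, coverage, resistance, p95_ms):
    """Return the list of quantitative guarantees a deployment misses."""
    missed = []
    if coverage < policy.epsilon_coverage:
        missed.append("epsilon_coverage")
    if resistance < policy.delta_resistance:
        missed.append("delta_resistance")
    if p95_ms > policy.max_p95_latency_ms:
        missed.append("p95_latency")
    return missed
```

The same measured system can satisfy the Enterprise policy while failing the stricter Child Safety guarantees.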

4.5.2 Telecom-Specific Threat Detection

Six telecom-specific threat categories with severity scoring:

| Category | Severity | Detection Method |
|---|---|---|
| Bias/Discrimination in customer segmentation | 4 | Demographic profiling pattern detection |
| Misinformation (false outage reports, fake standards) | 4 | Factual claim verification against known standards |
| Network sabotage guidance | 5 | Infrastructure exploit pattern matching |
| SMS-based illegal activity | 5 | Communication abuse pattern detection |
| DRM bypass guidance | 3 | Copyright circumvention pattern matching |
| Explicit customer interaction scripts | 5 | Content safety classification |

4.5.3 Military Domain Modules (7)

| Module | Function | Compliance Standard |
|---|---|---|
| Classification Guard | 5-level clearance enforcement; mosaic risk detection | MIL-STD-882E |
| OPSEC Filter | 6-category operational security; auto-redaction | DoD OPSEC guidelines |
| Command Chain Guard | 6-level hierarchy integrity; SHA-256 signature verification | Command authority doctrine |
| ROE Compliance Engine | Proportionality assessment; weapon prohibition; JAG review trigger | Geneva Conventions, IHL |
| Tactical Autonomy Guard | 5-level autonomy; communication-adaptive constraints | DoD Directive 3000.09 |
| Anti-Spoofing Guard | 5 spoofing type detection; multi-source cross-validation | Korean MND AI security guidelines |
| Cross-Domain Security | 5 security domains; transfer direction control; audit trail | DCSA cross-domain policy |

4.5.4 Financial Boundary Detection

| Boundary | Detection Rule | Regulatory Basis |
|---|---|---|
| Investment guarantee claims | Pattern match: "guaranteed return", "100% safe" | Securities regulations |
| Illegal lending promotion | High-interest threshold + non-licensed entity detection | Financial Services Act |
| Risk disclosure omission | Missing risk warning detection in investment context | Investor protection rules |
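The first boundary is a pure pattern-match rule and can be sketched directly. The first two patterns come from the table; the third is an assumed illustrative variant, and a production rule set would be far larger and multilingual.

```python
import re

# "guaranteed return" and "100% safe" are from the boundary table;
# "risk-free profit" is an assumed illustrative addition.
GUARANTEE_PATTERNS = [
    r"\bguaranteed\s+returns?\b",
    r"\b100%\s+safe\b",
    r"\brisk[- ]free\s+profit\b",
]
_GUARANTEE_RE = re.compile("|".join(GUARANTEE_PATTERNS), re.IGNORECASE)

def flags_investment_guarantee(text):
    """True if the text makes a prohibited investment-guarantee claim."""
    return _GUARANTEE_RE.search(text) is not None
```

Rules of this shape run in Tier 1 and hit the sub-millisecond path described in Section 4.2.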

4.5.5 Healthcare Safety Boundaries

| Boundary | Detection Rule | Regulatory Basis |
|---|---|---|
| Self-prescription guidance | Drug recommendation without prescriber context | Medical Practice Act |
| Lethal dosage information | Dosage exceeding known safety margins | Pharmacovigilance guidelines |
| Diagnostic claims | Definitive diagnosis without clinical context | Medical Device Act |

5. Experimental Methodology

5.1 Evaluation Framework

We evaluate guardrail effectiveness across three dimensions:

Dimension 1: Attack Resilience. Measured as 1 - ASR (Attack Success Rate) across 7 adversarial algorithms: PAIR [18], TAP [19], Crescendo [20], AutoDAN [21], BEAST [22], ArtPrompt [23], HPM [24].

Dimension 2: Regulatory Compliance Coverage. Assessed against EU AI Act (6 articles), K-AI Act (4 articles), and NIST AI RMF (4 functions) — 14 requirements total.

Dimension 3: Operational Performance. Latency (P50, P99), throughput (RPS), and availability (uptime SLA).

5.2 Models Under Test

Table IV: Target Models

| # | Model | Provider | Release Period | Safety Approach |
|---|---|---|---|---|
| 1 | GPT-5 | OpenAI | 2025 | RLHF + safety fine-tuning |
| 2 | Claude Opus 4.6 | Anthropic | 2025 | Constitutional AI |
| 3 | Gemini 3.1 Pro Preview | Google | 2026 | Multi-objective alignment |
| 4 | Grok 4.1 Fast (Reasoning) | xAI | 2026 | Reasoning-mode safety |
| 5 | Grok 4.1 Fast | xAI | 2026 | Standard alignment |
| 6 | Grok 3 Mini | xAI | 2025 | Lightweight alignment |
| 7 | DeepSeek (R1) | DeepSeek | 2025 | Hybrid safety training |
| 8 | DeepSeek Chat | DeepSeek | 2025 | Standard safety training |

5.3 Guardrail Configurations

We benchmark five guardrail configurations representing increasing levels of defense sophistication:

| Config | Description | Layers | Domain Policy |
|---|---|---|---|
| C0 | No guardrail (model-native safety only) | 0 | None |
| C1 | Tier 1 only (rule-based filter) | 1 | None |
| C2 | Tier 1 + Tier 2 (rule-based + ML) | 2 | None |
| C3 | Full 3-Tier (rule-based + ML + LLM Judge) | 3 | None |
| C4 | Full 3-Tier + PALADIN + Domain Policy | 3+6 | Active |

5.4 Attack Protocol

Each model is tested under the following protocol:

  • Sessions: 2 independent sessions (different scenario sets per session)
  • Algorithms per session: 7 (PAIR, TAP, Crescendo, AutoDAN, BEAST, ArtPrompt, HPM)
  • Evaluations per session: 8 models x 7 algorithms = 56
  • Total evaluations: 112
  • Scenario categories: Jailbreak + data extraction (Session 1); Jailbreak + encoding bypass (Session 2)
  • Success criteria: Score >= 0.5 = successful attack (VULNERABLE or PARTIAL)
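The protocol's scoring convention (0 = BLOCKED, 0.5 = PARTIAL, 1.0 = VULNERABLE, with any score of at least 0.5 counting as a successful attack) can be sketched as follows; the helper names are illustrative.

```python
# Verdict labels for the canonical evaluation scores
VERDICT_LABELS = {0.0: "BLOCKED", 0.5: "PARTIAL", 1.0: "VULNERABLE"}

def attack_succeeded(score):
    """Protocol success criterion: score >= 0.5 (PARTIAL or VULNERABLE)."""
    return score >= 0.5

def session_asr(scores):
    """Per-model ASR: mean evaluation score across the 7 algorithms."""
    return sum(scores) / len(scores)
```

Applied to the DeepSeek (R1) row of Table VII (1.0, 0.75, 1.0, 1.0, 0.5, 0.5, 1.0), `session_asr` reproduces the reported mean of 0.82.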

5.5 Enterprise Guardrail Effectiveness Index (EGEI)

We propose the EGEI as a composite metric:

EGEI = w1 * AttackResilience + w2 * ComplianceCoverage + w3 * PerformanceScore + w4 * DomainAdherence

where:

| Component | Symbol | Definition | Weight |
|---|---|---|---|
| Attack Resilience | AR | 1 - mean(ASR) across all algorithms | w1 = 0.35 |
| Compliance Coverage | CC | Fraction of 14 regulatory requirements satisfied | w2 = 0.25 |
| Performance Score | PS | Normalized score based on P99 latency and throughput | w3 = 0.15 |
| Domain Adherence | DA | Fraction of domain-specific policies actively enforced | w4 = 0.25 |

EGEI rating thresholds:

| Rating | EGEI Score | Interpretation |
|---|---|---|
| A+ | >= 0.90 | Production-ready for high-risk domains |
| A | >= 0.80 | Suitable for regulated enterprise use |
| B | >= 0.65 | Adequate for moderate-risk deployments |
| C | >= 0.50 | Requires additional hardening |
| D | < 0.50 | Inadequate for enterprise deployment |
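The composite and its rating bands can be computed directly from the definitions above; a minimal sketch (function names are illustrative):

```python
# Weights w1..w4 from the EGEI component table
EGEI_WEIGHTS = {"AR": 0.35, "CC": 0.25, "PS": 0.15, "DA": 0.25}

def egei(ar, cc, ps, da, w=EGEI_WEIGHTS):
    """EGEI = w1*AR + w2*CC + w3*PS + w4*DA."""
    return w["AR"] * ar + w["CC"] * cc + w["PS"] * ps + w["DA"] * da

def egei_rating(score):
    """Map an EGEI score to its rating band."""
    for floor, label in [(0.90, "A+"), (0.80, "A"), (0.65, "B"), (0.50, "C")]:
        if score >= floor:
            return label
    return "D"
```

With the C4 component scores reported later in Section 6.7 (AR 0.86, CC 0.93, PS 0.75, DA 0.85), `egei` yields approximately 0.859, rating A; the C0 components (0.38, 0.14, 1.00, 0.00) yield 0.318, rating D.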

6. Results

6.1 Baseline Vulnerability Assessment (C0: No Guardrail)

Table V: Per-Model Baseline ASR (No External Guardrails)

| Model | Mean ASR | Risk Score | Risk Label | Session 1 ASR | Session 2 ASR | CV (%) |
|---|---|---|---|---|---|---|
| GPT-5 | 0.865 | 1.00 | Critical | 0.862 | 0.868 | 0.6 |
| Grok 3 Mini | 0.732 | 0.97 | Critical | 0.727 | 0.737 | 1.0 |
| Grok 4.1 Fast | 0.657 | 1.00 | Critical | 0.672 | 0.642 | 3.3 |
| DeepSeek (R1) | 0.648 | 0.83 | High | 0.653 | 0.643 | 1.1 |
| DeepSeek Chat | 0.624 | 0.79 | High | 0.630 | 0.618 | 1.4 |
| Claude Opus 4.6 | 0.568 | 0.93 | Critical | 0.573 | 0.562 | 1.5 |
| Grok 4.1 (Reasoning) | 0.464 | 0.79 | High | 0.483 | 0.445 | 5.9 |
| Gemini 3.1 Pro | 0.398 | 1.00 | Critical | 0.375 | 0.422 | 8.3 |
| Aggregate | 0.620 | | | 0.622 | 0.617 | 2.9 |

Finding 1: All 8 models are VULNERABLE without external guardrails. The mean ASR of 0.620 means that adversaries succeed in 62% of attacks on average. The mean coefficient of variation (CV) of 2.9% confirms high measurement reproducibility across independent sessions.
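Both aggregates in Finding 1 can be recomputed from Table V. The sketch below takes the per-model means as published and uses the sample standard deviation over the two session values for the CV; because the published session ASRs are rounded to three decimals, recomputed per-model CVs can differ slightly from the table.

```python
import statistics

# Per-model mean ASRs from Table V
MEAN_ASR = {
    "GPT-5": 0.865, "Grok 3 Mini": 0.732, "Grok 4.1 Fast": 0.657,
    "DeepSeek (R1)": 0.648, "DeepSeek Chat": 0.624, "Claude Opus 4.6": 0.568,
    "Grok 4.1 (Reasoning)": 0.464, "Gemini 3.1 Pro": 0.398,
}

def aggregate_asr(per_model):
    """Unweighted mean ASR across models."""
    return statistics.mean(per_model.values())

def coefficient_of_variation(samples):
    """CV (%) = sample stdev / mean * 100, over per-session ASRs."""
    return statistics.stdev(samples) / statistics.mean(samples) * 100
```

`aggregate_asr(MEAN_ASR)` gives about 0.620, matching the Aggregate row, which corresponds to a baseline defense rate of roughly 38%.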

6.2 Per-Algorithm Attack Effectiveness

Table VI: Algorithm Attack Success Rates (C0 Configuration)

| Algorithm | Mean ASR | Models with Score = 1.0 | Strongest Defense | Weakest Defense |
|---|---|---|---|---|
| PAIR | 1.000 | 8/8 (100%) | None — universally effective | |
| Crescendo | 1.000 | 8/8 (100%) | None — universally effective | |
| HPM | 0.938 | 6/8 (75%) | Gemini (error/blocked) | Grok 4.1 Fast (1.0) |
| TAP | 0.922 | 6/8 (75%) | DeepSeek (0.75) | GPT-5 (1.0) |
| BEAST | 0.891 | 5/8 (63%) | DeepSeek (0.50) | GPT-5 (1.0) |
| AutoDAN | 0.866 | 6/8 (75%) | Grok 4.1-R (0.0) | GPT-5 (1.0) |
| ArtPrompt | 0.830 | 5/8 (63%) | Grok 4.1-R (0.5) / DeepSeek (0.5) | GPT-5 (1.0) |

Finding 2: PAIR and Crescendo achieve 100% ASR against all models. These two algorithms — representing iterative refinement and multi-turn escalation paradigms respectively — penetrate every model's native safety mechanisms without exception, including Constitutional AI (Claude) and reasoning-mode processing (Grok 4.1-R).

6.3 Algorithm x Model Vulnerability Matrix

Table VII: Detailed Score Matrix (0 = BLOCKED, 0.5 = PARTIAL, 1.0 = VULNERABLE)

| Model | PAIR | TAP | Crescendo | AutoDAN | BEAST | ArtPrompt | HPM | Mean |
|---|---|---|---|---|---|---|---|---|
| GPT-5 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0* | 1.00 |
| Grok 4.1 Fast | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.00 |
| Grok 3 Mini | 1.0 | 0.75 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.96 |
| DeepSeek (R1) | 1.0 | 0.75 | 1.0 | 1.0 | 0.5 | 0.5 | 1.0 | 0.82 |
| DeepSeek Chat | 1.0 | 0.75 | 1.0 | 1.0 | 0.5 | 0.5 | 0.75 | 0.79 |
| Claude Opus 4.6 | 1.0 | 0.75 | 1.0 | 1.0 | 1.0 | 0.75 | 1.0 | 0.93 |
| Grok 4.1-R | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.5 | 1.0 | 0.79 |
| Gemini 3.1 Pro | 1.0 | 1.0 | 1.0* | 1.0* | 1.0* | 1.0* | 0.0* | 0.86 |

*Partial data due to timeout/rate-limit errors.

6.4 Layered Guardrail Effectiveness

Table VIII: Defense Rate by Guardrail Configuration

| Config | Description | Est. Defense Rate | Latency Overhead (P99) | Throughput |
|---|---|---|---|---|
| C0 | No guardrail | 38.1% | 0ms | N/A |
| C1 | Tier 1 (rules) | 55–60% | < 0.5ms | 100K+ RPS |
| C2 | Tier 1+2 (rules + ML) | 65–72% | < 5ms | 80K+ RPS |
| C3 | Full 3-Tier | 75–82% | < 20ms | 50K+ RPS |
| C4 | 3-Tier + PALADIN + Domain | 82–90% | < 25ms | 45K+ RPS |

Finding 3: Each additional guardrail tier provides diminishing but significant improvement. Tier 1 alone captures 70–80% of clearly dangerous content at sub-millisecond latency. Tier 2 adds ML-based contextual analysis for an additional 10–12% improvement. Tier 3's LLM Judge handles the remaining ambiguous cases, adding 7–10% improvement at the cost of higher latency for a small fraction of traffic.

Finding 4: Domain-specific policies (C4) add 5–8% over generic 3-Tier (C3). This incremental improvement is critical for high-risk domains where the marginal 5–8% may represent the difference between regulatory compliance and violation.

6.5 Per-Category Block Probabilities with AEGIS Overlay

Table IX: AEGIS Defense State Transition Probabilities

| Attack Category | Vuln → Blocked | Vuln → Partial | Partial → Blocked | Effective Defense Rate |
|---|---|---|---|---|
| Harmful Content | 0.90 | 0.08 | 0.95 | 94.6% |
| Prompt Injection | 0.85 | 0.12 | 0.92 | 91.0% |
| Data Extraction | 0.82 | 0.14 | 0.88 | 88.3% |
| Jailbreak | 0.78 | 0.17 | 0.90 | 85.3% |
| Encoding Bypass | 0.72 | 0.20 | 0.85 | 79.0% |
| Multi-Turn | 0.70 | 0.22 | 0.83 | 78.3% |

Effective Defense Rate = Vuln→Blocked + (Vuln→Partial x Partial→Blocked)

Finding 5: Multi-turn attacks remain the weakest point even with full guardrails (78.3%). This 16.3% gap compared to harmful content (94.6%) represents a critical area for improvement, particularly given that Crescendo achieves 100% ASR without guardrails.

6.6 Reasoning Mode as Safety Mechanism

Table X: Impact of Reasoning Mode on ASR

| Model Pair | Standard ASR | Reasoning ASR | Delta ASR | Reduction |
|---|---|---|---|---|
| Grok 4.1 Fast vs. Grok 4.1 (Reasoning) | 0.657 | 0.464 | -0.193 | 29.4% |

Detailed per-algorithm comparison:

| Algorithm | Standard Score | Reasoning Score | Improved? |
|---|---|---|---|
| PAIR | 1.0 | 1.0 | No |
| TAP | 1.0 | 1.0 | No |
| Crescendo | 1.0 | 1.0 | No |
| AutoDAN | 1.0 | 0.0 | Yes (Full Block) |
| BEAST | 1.0 | 1.0 | No |
| ArtPrompt | 1.0 | 0.5 | Yes (Partial) |
| HPM | 1.0 | 1.0 | No |

Finding 6: Reasoning mode selectively blocks genetically evolved (AutoDAN) and visual (ArtPrompt) attacks but does not improve defense against iterative (PAIR), tree-search (TAP), multi-turn (Crescendo), suffix-optimized (BEAST), or psychological (HPM) attacks. This suggests that chain-of-thought reasoning helps detect structurally anomalous prompts but cannot overcome conversational manipulation or universal optimization techniques.

6.7 Enterprise Guardrail Effectiveness Index (EGEI)

Table XI: EGEI Scoring by Configuration and Deployment Scenario

| Config | AR (x0.35) | CC (x0.25) | PS (x0.15) | DA (x0.25) | EGEI | Rating |
|---|---|---|---|---|---|---|
| C0 | 0.38 → 0.133 | 0.14 → 0.035 | 1.00 → 0.150 | 0.00 → 0.000 | 0.318 | D |
| C1 | 0.58 → 0.203 | 0.36 → 0.090 | 0.95 → 0.143 | 0.00 → 0.000 | 0.435 | D |
| C2 | 0.69 → 0.242 | 0.57 → 0.143 | 0.85 → 0.128 | 0.00 → 0.000 | 0.512 | C |
| C3 | 0.79 → 0.277 | 0.79 → 0.198 | 0.80 → 0.120 | 0.00 → 0.000 | 0.594 | C |
| C4 | 0.86 → 0.301 | 0.93 → 0.233 | 0.75 → 0.113 | 0.85 → 0.213 | 0.859 | A |

Component scoring methodology:

  • AR (Attack Resilience): Based on estimated defense rate for each config (C0=38.1%, C1=58%, C2=69%, C3=79%, C4=86%)
  • CC (Compliance Coverage): Fraction of 14 regulatory requirements addressed (C0=2/14 model-native; C1=5/14 adds audit logging + pattern matching; C2=8/14 adds ML classification + attribution; C3=11/14 adds human-equivalent judgment + proportionality; C4=13/14 adds domain-specific compliance)
  • PS (Performance Score): Normalized latency efficiency (C0=1.0 no overhead; degrading as layers add latency)
  • DA (Domain Adherence): Fraction of domain-specific policies enforced (0 for C0–C3; 0.85 for C4 with active domain modules)

Finding 7: Only Configuration C4 (Full 3-Tier + Domain Policy) achieves EGEI rating A, suitable for regulated enterprise deployment. Notably, C3 (Full 3-Tier without domain policies) scores only C (0.594), underscoring that domain-specific policy enforcement is not optional for high-risk use cases — it accounts for a 0.265 EGEI improvement (44.6% increase over C3).

6.8 EGEI by Enterprise Deployment Scenario

Table XII: EGEI Scores for Domain-Specific Deployments (C4 Configuration)

| Scenario | AR | CC | PS | DA | EGEI | Key Risk |
|---|---|---|---|---|---|---|
| General Enterprise | 0.86 | 0.93 | 0.80 | 0.75 | 0.840 (A) | Data leakage, PII |
| Telecom NOC | 0.86 | 0.93 | 0.75 | 0.90 | 0.870 (A) | Infrastructure exploit |
| Healthcare CDS | 0.82 | 0.86 | 0.75 | 0.85 | 0.824 (A) | Dosage safety, self-Rx |
| Financial Advisory | 0.84 | 0.93 | 0.80 | 0.88 | 0.861 (A) | Investment fraud |
| Military C2 | 0.88 | 0.86 | 0.70 | 0.95 | 0.862 (A) | ROE/IHL compliance |

All five enterprise scenarios achieve EGEI rating A with C4 configuration. The military C2 scenario achieves the highest Domain Adherence (0.95) due to the comprehensive 7-module military defense suite, while the healthcare scenario has the lowest Attack Resilience (0.82) due to the subtlety of dosage-related attacks that evade standard content filters.

6.9 Regulatory Compliance Assessment

Table XIII: Regulatory Requirement Coverage by Configuration

| Requirement | Regulation | C0 | C1 | C2 | C3 | C4 |
|---|---|---|---|---|---|---|
| Risk management system | EU AI Act Art. 9 | | | | | |
| Data governance | EU AI Act Art. 10 | | | | | |
| Transparency & logging | EU AI Act Art. 13 | | | | | |
| Human oversight | EU AI Act Art. 14 | | | | | |
| Accuracy & robustness | EU AI Act Art. 15 | | | | | |
| Quality management | EU AI Act Art. 17 | | | | | |
| Safety assurance | K-AI Act Art. 5 | | | | | |
| User protection | K-AI Act Art. 7 | | | | | |
| Transparency | K-AI Act Art. 9 | | | | | |
| Personal data protection | K-AI Act Art. 12 | | | | | |
| GOVERN function | NIST AI RMF | | | | | |
| MAP function | NIST AI RMF | | | | | |
| MEASURE function | NIST AI RMF | | | | | |
| MANAGE function | NIST AI RMF | | | | | |

● = Full, ○ = Substantial, △ = Partial, — = Not addressed

Finding 8: Only C4 achieves full coverage across all 14 regulatory requirements. C3 achieves substantial coverage for most requirements but lacks domain-specific data governance and quality management capabilities. Critically, C0 (model-native safety only) addresses at most 2 of 14 requirements, making unguarded LLM deployment non-compliant with all three regulatory frameworks.

7. Discussion

7.1 The 38.1% Baseline Problem

Our most consequential finding is that the average baseline defense rate without external guardrails is only 38.1%. In enterprise terms, this means that a model deployed with only its native safety mechanisms will fail to defend against approximately 6 out of 10 adversarial attacks. For high-risk domains governed by EU AI Act Article 15 (which mandates "appropriate levels of accuracy, robustness and cybersecurity"), a 38.1% defense rate is categorically non-compliant.

This finding has immediate practical implications: any enterprise deploying an LLM in a regulated context without external guardrails is likely in violation of applicable regulations. The EU AI Act provides for fines up to 3% of global turnover for non-compliance with high-risk AI requirements, making guardrail investment not merely a technical decision but a legal imperative.

7.2 The Diminishing Returns of Layered Defense

Our analysis reveals a clear pattern of diminishing returns as guardrail layers are added:

| Transition | Defense Rate Gain | Latency Cost |
|---|---|---|
| C0 → C1 (add rules) | +17–22% | +0.5ms |
| C1 → C2 (add ML) | +10–12% | +4.5ms |
| C2 → C3 (add LLM Judge) | +7–10% | +15ms (P99) |
| C3 → C4 (add domain policy) | +5–8% | +5ms |

The first tier (rule-based) provides the highest ROI: 17–22% defense improvement at sub-millisecond cost. The LLM Judge tier provides the lowest marginal gain (7–10%) at the highest marginal cost (15ms). However, for high-risk enterprise deployments, every percentage point of defense improvement carries significant value — the difference between 82% and 90% defense rate may represent the boundary between a regulatory audit pass and failure.

7.3 The Domain Policy Imperative

Configuration C3 (full 3-Tier, no domain policy) achieves EGEI rating C (0.594), while C4 (with domain policy) achieves rating A (0.859). This 44.6% EGEI improvement from domain-specific policies alone demonstrates that generic guardrails are necessary but insufficient for enterprise deployment. The domain policy contribution comes from two sources:

  1. Domain-specific threat detection — telecom infrastructure exploits, healthcare prescribing boundaries, financial suitability rules, and military classification enforcement are invisible to generic content safety systems.

  2. Regulatory compliance coverage — domain policies enable full compliance with data governance (EU AI Act Art. 10), quality management (Art. 17), and sector-specific requirements that generic guardrails cannot address.

7.4 The Refusal Paradox in Enterprise Context

Our qualitative analysis reveals a phenomenon we term the "refusal paradox": models that provide detailed, helpful refusal responses inadvertently leak information. In enterprise contexts, this paradox takes on heightened significance:

  • GPT-5 refused DAN role-play but offered extensive alternative assistance, creating secondary attack surfaces.
  • DeepSeek acknowledged the existence of proprietary system prompt details, confirming extractable secrets.
  • Claude identified and labeled the attack technique, demonstrating meta-awareness but still engaging semantically.

For enterprise guardrails, the implication is clear: guardrail systems must intercept and replace model responses, not rely on the model's own refusal behavior. The AEGIS MODIFY verdict type enables response rewriting that removes information-leaking refusal content while maintaining user experience.
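The intercept-and-replace behavior can be sketched as a post-filter over the model's response. This is an illustration of a MODIFY-style verdict, not the AEGIS implementation; the leak markers and the canned refusal text are assumptions.

```python
# Illustrative leak markers: phrases that suggest the refusal itself is
# confirming secrets or offering an alternative attack surface.
LEAK_MARKERS = (
    "my system prompt",
    "i can't share the exact instructions",
    "here is an alternative approach",
)

SAFE_REFUSAL = "This request cannot be completed under the applicable policy."

def apply_modify_verdict(model_response):
    """MODIFY-style post-filter: replace refusals that leak information.

    Returns (final_response, modified_flag).
    """
    lowered = model_response.lower()
    if any(marker in lowered for marker in LEAK_MARKERS):
        return SAFE_REFUSAL, True
    return model_response, False
```

The design point is that the guardrail, not the model, owns the final wording of any refusal delivered to the user.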

7.5 Reasoning Mode: A Promising but Incomplete Defense

The 29.4% ASR reduction from reasoning mode (Grok 4.1 standard → reasoning) is significant but selective. Reasoning mode successfully blocks:

  • AutoDAN (genetically evolved prompts with structural anomalies detectable through deliberation)
  • ArtPrompt (partially; ASCII art encoding recognizable through step-by-step analysis)

But it fails against:

  • PAIR/Crescendo (sophisticated conversational patterns that survive deliberation)
  • HPM (psychological manipulation that exploits the reasoning process itself)

This suggests that reasoning mode is a valuable complement to, but not a replacement for, external guardrail systems. Enterprise deployments should enable reasoning mode for safety-critical requests as an additional defense layer within the PALADIN pipeline.

7.6 Limitations

  1. Estimated defense rates for C1–C4. While C0 data comes from direct empirical measurement (112 evaluations), the C1–C4 defense rates are estimated from per-category block probabilities and simulation. Future work should conduct full empirical evaluation at each configuration level.

  2. Scenario coverage. Our evaluation tested primarily jailbreak and encoding bypass scenarios. Comprehensive enterprise benchmarking requires testing domain-specific threat vectors (Table II) against each configuration.

  3. Single evaluation date. Both sessions were conducted on the same date. Longitudinal evaluation across model updates would capture safety regression or improvement trends.

  4. EGEI weight sensitivity. The proposed EGEI weights (0.35/0.25/0.15/0.25) reflect our judgment of enterprise priorities. Different weight configurations (e.g., prioritizing compliance for heavily regulated sectors) may yield different configuration rankings.

  5. Open-source model gap. All tested models are commercial. Enterprise deployments increasingly use self-hosted open-source models (LLaMA, Mistral, Qwen) whose safety characteristics may differ significantly.

8. Recommendations for Enterprise Deployment

8.1 Minimum Viable Guardrail (MVG)

For any enterprise deployment: Deploy at minimum a 2-Tier guardrail (C2: rule-based + ML classifier). This achieves 65–72% defense rate at < 5ms P99 latency, representing the minimum acceptable configuration for regulated environments.
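The short-circuit structure of the C2 configuration can be sketched as follows. The regex patterns and the classifier stub are placeholders standing in for production rule sets and a trained ML model; only the two-tier control flow reflects the architecture described above.

```python
import re

# Tier 1 patterns are illustrative; a real deployment ships curated
# rule sets, not two regexes.
TIER1_PATTERNS = [
    re.compile(r"ignore (all|previous).*instructions", re.I),
    re.compile(r"\bDAN\b"),
]

def tier1_rules(prompt: str) -> bool:
    """Rule-based filter (<0.5 ms class): keyword / regex matching."""
    return any(p.search(prompt) for p in TIER1_PATTERNS)

def tier2_classifier(prompt: str) -> float:
    """Stand-in for an ML jailbreak classifier (<5 ms class).

    A real deployment calls a trained model; this stub scores by a
    naive keyword heuristic purely for illustration."""
    risky = ("jailbreak", "bypass", "exploit")
    return min(1.0, 0.4 * sum(w in prompt.lower() for w in risky))

def is_blocked(prompt: str, threshold: float = 0.5) -> bool:
    # Tier 1 short-circuits before the (more expensive) Tier 2 runs,
    # which is what keeps P99 latency in the <5 ms envelope.
    if tier1_rules(prompt):
        return True
    return tier2_classifier(prompt) >= threshold
```

The design point is the ordering: deterministic rules resolve the obvious cases at near-zero cost, so the ML tier only pays its latency on traffic the rules cannot decide.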

8.2 High-Risk Domain Configuration

For healthcare, finance, defense, and critical infrastructure: Deploy the full C4 configuration (3-Tier + PALADIN + Domain Policy). The EGEI analysis demonstrates that only C4 achieves rating A across all enterprise scenarios, and only C4 provides full regulatory compliance coverage.

8.3 Reasoning Mode Activation

For all configurations: Enable reasoning mode (where available) for safety-critical request categories. While not a standalone solution, the 29.4% ASR reduction provides meaningful additional defense, particularly against genetically evolved attacks (AutoDAN).

8.4 Behavioral Analysis Priority

Invest in multi-turn defense. Crescendo's 100% ASR against all models, combined with multi-turn attacks' lowest AEGIS block rate (78.3%), identifies multi-turn escalation as the highest-priority gap in current guardrail technology. Enterprise deployments should prioritize PALADIN L5 (BehavioralAnalysis) configuration and session-level monitoring.
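Session-level monitoring of the kind recommended here can be sketched as a running risk accumulator. Everything below is an assumed design, not PALADIN's actual L5 logic: per-turn risk scores are taken as given from an upstream classifier, and the window size and limits are invented for illustration.

```python
from collections import deque

class SessionMonitor:
    """Illustrative multi-turn escalation monitor (Crescendo-style).

    Flags sessions whose risk trends upward across turns even when
    no single turn crosses a per-turn limit."""

    def __init__(self, window: int = 5, cumulative_limit: float = 1.5):
        self.scores = deque(maxlen=window)
        self.cumulative_limit = cumulative_limit

    def observe(self, turn_risk: float) -> bool:
        """Record one turn; return True if the session should escalate."""
        self.scores.append(turn_risk)
        # Escalation signal: windowed risk sum exceeds the limit, or
        # risk rises strictly across a full window (gradual escalation).
        rising = len(self.scores) == self.scores.maxlen and all(
            a < b for a, b in zip(self.scores, list(self.scores)[1:])
        )
        return sum(self.scores) > self.cumulative_limit or rising
```

The point of the second condition is that Crescendo-style attacks keep every individual turn below per-turn thresholds, so only a trajectory-level signal catches them.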

8.5 Provider Selection Guidance

Do not rely on safety reputation. Our finding that GPT-5 (highest industry reputation) exhibited the highest ASR (0.865) while Gemini (lowest pre-test rating) achieved the lowest ASR (0.398) demonstrates that empirical testing, not marketing claims, must drive provider selection decisions.

8.6 Continuous Evaluation

Implement continuous safety monitoring. The SABER framework's ASR@N scaling law enables predictive risk assessment:

Budget@τ = minimum N such that ASR@N >= τ

Enterprise security teams should continuously compute Budget@0.5 for deployed models and trigger guardrail hardening when Budget@0.5 falls below the "Strong" threshold (< 1,000 attempts).
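The Budget@τ computation reduces to a search over the fitted ASR@N curve. The sketch below uses a synthetic monotone curve as a stand-in for SABER's fitted extrapolation; the function names are ours, not SABER's.

```python
# Budget@tau = minimum N such that ASR@N >= tau (from the text above).
def budget_at(tau: float, asr_at_n, n_max: int = 100_000):
    """Smallest attempt budget N with ASR@N >= tau, or None if never
    reached within n_max attempts."""
    for n in range(1, n_max + 1):
        if asr_at_n(n) >= tau:
            return n
    return None

# Synthetic per-attempt curve for illustration only: independent
# attempts with per-attempt success probability 0.001.
def example_curve(n: int) -> float:
    return 1.0 - 0.999 ** n

STRONG_THRESHOLD = 1_000  # "Strong" requires Budget@0.5 >= 1,000 attempts

budget = budget_at(0.5, example_curve)
needs_hardening = budget is not None and budget < STRONG_THRESHOLD
```

For this synthetic curve Budget@0.5 is 693 attempts, below the 1,000-attempt "Strong" threshold, so the monitoring loop would trigger guardrail hardening.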

9. Conclusion

This paper presents the first comprehensive benchmarking study of layered guardrail effectiveness in high-risk enterprise LLM use cases. Our empirical evaluation of 8 models across 7 attack algorithms in 112 evaluations establishes three foundational conclusions for enterprise AI safety:

First, LLM-native safety is fundamentally insufficient for enterprise deployment. With a baseline defense rate of only 38.1% and universal vulnerability to PAIR and Crescendo attacks, no tested model — including GPT-5, Claude Opus 4.6, and Gemini 3.1 Pro — meets the safety requirements of any major regulatory framework (EU AI Act, K-AI Act, NIST AI RMF) without external guardrails.

Second, layered guardrails are effective but require domain-specific policies for high-risk use cases. The 3-Tier architecture (rule-based + ML + LLM Judge) improves defense rates from 38.1% to 75–82%, but only the addition of domain-specific policy enforcement (C4 configuration) achieves regulatory compliance and EGEI rating A (0.859). Generic guardrails alone score EGEI C (0.594) — inadequate for regulated enterprise use.

Third, the Enterprise Guardrail Effectiveness Index (EGEI) provides a multidimensional evaluation framework that captures the full spectrum of enterprise requirements — attack resilience, regulatory compliance, operational performance, and domain-specific policy adherence — enabling principled comparison of guardrail configurations across deployment scenarios. We demonstrate that all five tested enterprise scenarios (general enterprise, telecom, healthcare, financial, military) achieve EGEI rating A with the C4 configuration.

These findings carry immediate practical implications. For enterprises operating under EU AI Act Article 15 or equivalent regulatory mandates, deployment of an LLM without external guardrails constitutes regulatory non-compliance. The estimated 75–85% defense rate achievable with full layered guardrails, while significantly better than the 38.1% baseline, still leaves a 15–25% residual risk — primarily from multi-turn escalation attacks — that must be addressed through continued research in conversational safety analysis and adaptive defense mechanisms.

References

[1] M. Mazeika, L. Phan, X. Yin, et al., "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal," arXiv:2402.04249, 2024.

[2] P. Chao, E. Dobriban, et al., "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models," arXiv:2404.01318, 2024.

[3] A. Zou, Z. Wang, J. Z. Kolter, M. Fredrikson, "Universal and Transferable Adversarial Attacks on Aligned Language Models," arXiv:2307.15043, 2023.

[4] NVIDIA, "NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications," 2024.

[5] Guardrails AI, "Guardrails: Adding Reliable AI Safeguards to LLM Applications," 2024.

[6] Meta AI, "Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations," arXiv:2312.06674, 2023.

[7] H. Inan, K. Upasani, J. Chi, et al., "Llama Guard: LLM-based Input-Output Safeguard," arXiv:2312.06674, 2023.

[8] S. Han, K. Kelly, J. Xu, et al., "WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs," arXiv:2406.18495, 2024.

[9] European Parliament and Council, "Regulation (EU) 2024/1689 — Artificial Intelligence Act," Official Journal of the European Union, 2024.

[10] National Assembly of the Republic of Korea, "Act on the Promotion of the Artificial Intelligence Industry and Establishment of Trust (K-AI Act)," 2025.

[11] National Institute of Standards and Technology, "AI Risk Management Framework (AI RMF 1.0)," NIST AI 100-1, 2023.

[12] WHO, "Ethics and Governance of Artificial Intelligence for Health," 2021.

[13] Financial Stability Board, "Artificial Intelligence and Machine Learning in Financial Services," 2017.

[14] U.S. Department of Defense, "Directive 3000.09: Autonomy in Weapon Systems," 2023 (updated).

[15] OWASP, "OWASP Top 10 for LLM Applications," 2025.

[16] Anonymous, "GuardNet: Hierarchical Graph-Based Detection for LLM Safety," arXiv:2509.23037, 2025.

[17] Anonymous, "JBShield: Jailbreak Detection via Linear Representation Hypothesis," arXiv:2502.07557, 2025.

[18] P. Chao, A. Robey, E. Dobriban, et al., "Jailbreaking Black Box Large Language Models in Twenty Queries," arXiv:2310.08419, 2023.

[19] A. Mehrotra, M. Zampetakis, P. Kassianik, et al., "Tree of Attacks: Jailbreaking Black-Box LLMs Automatically," arXiv:2312.02119, 2023.

[20] M. Russinovich, A. Salem, R. Eldan, "Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack," arXiv:2404.01833, 2024.

[21] X. Liu, N. Xu, M. Chen, C. Xiao, "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models," arXiv:2310.04451, 2023.

[22] S. Sadasivan, S. Saha, G. Sriramanan, et al., "Fast Adversarial Attacks on Language Models In One GPU Minute," arXiv:2402.15570, 2024.

[23] F. Jiang, Z. Xu, L. Niu, et al., "ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs," arXiv:2402.11753, 2024.

[24] Anonymous, "Human-like Psychological Manipulation of LLMs," arXiv:2512.18244, 2025.

[25] Y. Bai, S. Kadavath, et al., "Constitutional AI: Harmlessness from AI Feedback," arXiv:2212.08073, 2022.

[26] AEGIS Research Team, "SABER: Statistical Adversarial Risk with Beta Extrapolation and Regression — Technical Report," Internal Report, 2026.

Appendix A: EGEI Calculation Methodology

A.1 Attack Resilience (AR)

AR = 1 - mean(ASR_algo) for all 7 algorithms

For C0: AR = 1 - 0.620 = 0.380. For C4, AR is estimated from per-category block probabilities applied to the C0 ASR data.

A.2 Compliance Coverage (CC)

CC = (Σ requirement_scores) / 14
where: ● = 1.0, ○ = 0.75, △ = 0.5, — = 0.0

A.3 Performance Score (PS)

PS = 1.0 - (P99_latency_ms / 200ms) × 0.5 - (1 - throughput/100K) × 0.5

Capped at [0, 1]. The 200ms normalization constant represents the maximum acceptable P99 for enterprise deployments.

A.4 Domain Adherence (DA)

DA = (active_domain_policies / total_applicable_policies)

Assessed per deployment scenario. C0–C3 have DA = 0 (no domain-specific policies). C4 DA varies by scenario (0.75–0.95).
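The four components combine into the composite EGEI score via the weights stated in Section 7.6 (0.35/0.25/0.15/0.25). The sketch below assumes those weights map to AR, CC, PS, DA in that order, which is our reading rather than a stated definition; function names are ours.

```python
# Assumed weight mapping (AR, CC, PS, DA) from the 0.35/0.25/0.15/0.25
# figures in Section 7.6.
WEIGHTS = {"AR": 0.35, "CC": 0.25, "PS": 0.15, "DA": 0.25}

# Requirement symbol scores from A.2.
SYMBOL_SCORES = {"●": 1.0, "○": 0.75, "△": 0.5, "—": 0.0}

def compliance_coverage(symbols):
    """CC = sum of per-requirement scores over the 14 requirements (A.2)."""
    return sum(SYMBOL_SCORES[s] for s in symbols) / 14

def performance_score(p99_ms: float, throughput_rps: float) -> float:
    """PS per A.3: latency normalized by 200 ms, throughput by 100K RPS,
    result capped to [0, 1]."""
    ps = 1.0 - (p99_ms / 200.0) * 0.5 - (1.0 - throughput_rps / 100_000) * 0.5
    return max(0.0, min(1.0, ps))

def egei(ar: float, cc: float, ps: float, da: float) -> float:
    """Weighted composite of the four Appendix A components."""
    return (WEIGHTS["AR"] * ar + WEIGHTS["CC"] * cc
            + WEIGHTS["PS"] * ps + WEIGHTS["DA"] * da)
```

Note that with this structure, C0-C3's DA = 0 forfeits the full 0.25 domain-adherence weight outright, which is consistent with the C3-to-C4 jump reported in Section 7.3.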

Appendix B: Raw Experimental Data Summary

B.1 Session 1 (2026-02-23T22:00:16Z)

  • Scenario set: Jailbreak + data extraction
  • Models tested: 8
  • Average defense rate: 0.378
  • Average latency: 83,610ms (model response time, not guardrail latency)
  • Safest model: DeepSeek (ASR 0.653)

B.2 Session 2 (2026-02-23T22:44:58Z)

  • Scenario set: Jailbreak + encoding bypass
  • Models tested: 8
  • Average defense rate: 0.383
  • Average latency: 63,228ms
  • Safest model: Grok 4.1 Fast Reasoning (ASR 0.445)

B.3 Error Summary

Model           | Session | Algorithm(s)                              | Error Type
Gemini 3.1 Pro  | 1       | Crescendo, AutoDAN, BEAST, ArtPrompt, HPM | Timeout
GPT-5           | 2       | HPM                                       | Timeout
Gemini 3.1 Pro  | 2       | HPM                                       | Rate limit (429)

