AEGIS-RP-2026-001Research Paperv1.0

AEGIS: A Multi-Layered Framework for Automated LLM Safety Diagnosis through Adversarial Red-Teaming and Statistical Risk Analysis

Integrated Offensive Red-Teaming and Defensive Guardrail System with SABER Statistical Risk Prediction

Authors: AEGIS Research Team, Yatav Inc.
Published: March 2026
Affiliation: AEGIS Research, Yatav Inc.
red teamingLLM safetySABERPALADINadversarial testingjailbreakguardrailstatistical riskBeta distributionBudget@τASR scaling law

Summary

Existing approaches to LLM safety evaluation treat offensive testing and defensive deployment as independent concerns: red-team researchers measure attack success rates while defense engineers deploy guardrails, with no formal framework connecting attacker effort to defender resilience. We present AEGIS, an integrated system that closes this loop through three tightly coupled subsystems: (1) an offensive red-team engine comprising 8 attack algorithms, a Meta-Attack genetic recombinator over 30 atomic primitives, and a reinforcement learning attack agent with PPO-trained policy networks; (2) a defensive guardrail pipeline combining the PALADIN 6-layer deep inspection network with a 3-Tier hierarchical defense (rule-based at <0.5ms, ML classifier at <5ms, LLM Judge at <200ms) and 4 specialized detection algorithms (GuardNet, JBShield, CCFC, MULI); and (3) SABER (Statistical Adversarial risk with Beta Extrapolation and Regression), a statistical risk framework that models per-query vulnerability as θ ~ Beta(α, β), derives the ASR@N scaling law to predict attack success rates under Best-of-N scenarios, and introduces the Budget@τ metric to quantify defender resilience as the minimum attack budget required to achieve success probability τ. SABER further implements a closed-loop deterministic defense promotion system that automatically converts high-risk queries (ASR@1000 ≥ 0.8) into θ=0 deterministic blocks. Empirical evaluation across 8 LLM models (GPT-5, Claude Opus 4.6, Gemini 3.1 Pro, Grok 4.1/3, DeepSeek) in 112 evaluations reveals a baseline defense rate of only 38.1%, with PAIR and Crescendo achieving 100% ASR universally. SABER analysis classifies 6 of 8 models at Critical risk (ASR@1000 ≥ 0.8), while the integrated AEGIS defense improves effective defense rates to 75–90% with the deterministic promotion mechanism providing an additional 5–12% improvement on recurring attack patterns.

1. Introduction

The safety evaluation of Large Language Models faces a fundamental methodological gap. On the offensive side, researchers develop increasingly sophisticated attack algorithms — iterative prompt refinement [1], tree-structured search [2], multi-turn escalation [3], genetic evolution [4], adversarial suffix optimization [5], visual bypass [6], and psychological manipulation [7] — each evaluated in isolation against individual models. On the defensive side, engineers deploy guardrail systems — content classifiers, pattern matchers, and LLM-based judges — configured through heuristic thresholds and qualitative risk assessments. Between these two practices lies an unanswered question of critical operational importance:

Given an adversary with computational budget N, what is the probability that they will breach a model's defenses, and how much must the defender invest to reduce this probability below an acceptable threshold τ?

This question cannot be answered by point-estimate ASR metrics (which measure success at a fixed N) or by binary safe/unsafe classifications (which ignore the continuous nature of vulnerability). Answering it requires a statistical framework that models the distribution of per-query vulnerability, predicts how attack success scales with effort, and quantifies defense effectiveness in terms of attacker cost.

AEGIS addresses this gap through an integrated architecture where offensive testing, defensive enforcement, and statistical risk analysis operate as a closed-loop system. Our contributions are:

Contribution 1: Unified Offensive Engine. We integrate 8 attack algorithms spanning 5 distinct paradigms (iterative refinement, tree search, conversational escalation, genetic evolution, and adversarial optimization) with a Meta-Attack generator that recombines 30 atomic primitives through genetic algorithms and a reinforcement learning agent that adapts attack strategies via PPO-trained policy networks. This provides comprehensive coverage of the adversarial attack surface.

Contribution 2: Layered Defensive Architecture. The PALADIN 6-layer pipeline and 3-Tier hierarchical defense system process 70–80% of traffic at sub-millisecond latency while deploying 4 specialized detection algorithms (GuardNet, JBShield, CCFC, MULI) for nuanced threat identification. The architecture achieves P50 ~2ms latency at 50,000+ RPS.

Contribution 3: SABER Statistical Risk Framework. We introduce a formal mathematical framework based on the Beta distribution vulnerability model (θ ~ Beta(α, β)) that derives the ASR@N scaling law, the Budget@τ defense metric, and a closed-loop deterministic defense promotion mechanism. SABER transforms safety evaluation from qualitative assessment to quantitative prediction with 95% confidence intervals.

Contribution 4: Closed-Loop Integration. The three subsystems form a feedback loop: the offensive engine discovers vulnerabilities, SABER quantifies their risk, and the defense system automatically promotes high-risk patterns to deterministic blocks (θ=0). This continuous improvement cycle ensures that the defense system evolves in response to discovered threats.

2. Related Work

2.1 Adversarial Attack Algorithms

The landscape of LLM adversarial attacks has evolved along five paradigms:

Iterative Refinement. PAIR [1] uses an attacker LLM to iteratively refine jailbreak prompts over num_streams=5 parallel conversation threads with max_depth=20. TAP [2] improves upon PAIR through tree-structured search with beam pruning (branching_factor=4, beam_width=5), achieving higher success rates with fewer queries.

Conversational Escalation. Crescendo [3] distributes harmful intent across 12+ conversation turns with a 4-phase escalation protocol (Rapport → Context Building → Gradual Escalation → Target Request). HPM [7] applies 8 distinct psychological manipulation profiles (Authority, Emotional, Intellectual, Social Proof, Reciprocity, Urgency, Consistency, Rapport) achieving 88.1% average ASR.

Genetic Evolution. AutoDAN [4] evolves populations of jailbreak prompts (population_size=20, num_generations=50) through 5 mutation types, single-point crossover, and tournament selection with elite preservation.

Adversarial Optimization. BEAST [5] optimizes adversarial suffixes via beam search (beam_width=10, max_suffix_length=20), achieving 25x speedup over GCG [8]. ArtPrompt [6] substitutes sensitive keywords with ASCII art in 4 styles (Block, Banner, Digital, Minimal).

Multilingual Obfuscation. Korean Attack exploits CJK script properties through 10 techniques including jamo decomposition, chosung encoding, code-switching, keyboard mapping, Hanja bypass, and syllable reversal.

While each paradigm has produced significant results, no prior work unifies them into a single evaluation framework or provides statistical modeling of their combined effectiveness.

2.2 Defense Mechanisms

GuardNet [9] employs hierarchical graph-based detection across token (weight 0.25), sentence (0.45), and prompt (0.20) levels with cross-level graph connectivity (0.10). JBShield [10] separates toxicity and jailbreak technique detection in the representation space based on the linear representation hypothesis. These algorithms represent the state of the art in individual detection, but are not evaluated as components of layered defense systems.

2.3 Statistical Risk Analysis for LLMs

Prior work on LLM risk quantification is limited. Existing benchmarks report ASR at fixed attempt counts [11, 12, 13] without modeling the scaling behavior of attacks or quantifying defender cost. The concept of "scaling laws" has been applied to model performance [14] and training compute [15], but not to adversarial attack success rates. SABER fills this gap with the first formal statistical framework for predicting LLM vulnerability under Best-of-N attack scenarios.

3. SABER: Statistical Adversarial Risk Framework

SABER (Statistical Adversarial risk with Beta Extrapolation and Regression) is the theoretical core of AEGIS. It provides the mathematical apparatus that connects offensive testing results to defensive deployment decisions.

3.1 Vulnerability as a Random Variable

Core assumption. For a given query q, the probability θ_q that a single attack attempt succeeds is not a fixed quantity but a random variable drawn from a Beta distribution:

θ_q ~ Beta(α, β)

Justification. The Beta distribution is the conjugate prior for the Bernoulli likelihood, making it the natural choice for modeling success probabilities. The parameters α and β have intuitive interpretations:

  • α (risk amplification exponent): Controls how quickly risk increases under repeated attacks. Smaller α implies faster risk growth — the vulnerability distribution has a heavier left tail, meaning a larger fraction of queries have moderate-to-high vulnerability.
  • β: Controls the concentration of the distribution around low-vulnerability values. Larger β means most queries are relatively safe.

The shape of Beta(α, β) captures the empirical observation that LLM vulnerability is heterogeneous: some queries are robustly defended (low θ), some are moderately vulnerable (medium θ), and some are easily attacked (high θ). A single summary statistic (e.g., average ASR) cannot capture this heterogeneity.

3.2 The Impenetrability Fraction

Not all queries are stochastically vulnerable. Some queries trigger deterministic defenses (exact pattern matches, known-dangerous content hashes) that block the attack with certainty regardless of the number of attempts. We model this through the impenetrability fraction π:

π = P(θ_q = 0)

representing the proportion of queries for which no number of attempts can succeed. The effective vulnerability distribution is thus a mixture:

θ_q ~ π · δ(0) + (1 - π) · Beta(α, β)

where δ(0) is a point mass at zero.

3.3 ASR@N Scaling Law

The central theoretical contribution of SABER is the derivation of the ASR@N scaling law, which predicts the probability that at least one of N independent attack attempts succeeds:

Exact formulation:

ASR@N = (1 - π) · [1 - B(α, β + N) / B(α, β)]

where B(·,·) is the Beta function. This expression follows from computing E[(1-θ)^N] under the Beta distribution:

E[(1-θ)^N | θ ~ Beta(α,β)] = B(α, β+N) / B(α, β) = Γ(β+N)·Γ(α+β) / (Γ(α+β+N)·Γ(β))

The probability that all N attempts fail is E[(1-θ)^N], so the probability that at least one succeeds is 1 minus this quantity, adjusted for the impenetrable fraction.

Asymptotic approximation (large N):

ASR@N ≈ (1 - π) · [1 - C · N^(-α)]

where C = Γ(α+β)/Γ(β) is the scaling constant. This power-law form reveals three key properties:

  1. Sublinear growth: ASR grows as N^(-α), meaning diminishing marginal returns for the attacker. However, for small α (e.g., α < 0.5), the growth is steep — a modest increase in N yields significant ASR improvement.
  2. Attacker-defender asymmetry: The exponent α determines the asymmetry. Small α favors attackers (rapid risk growth); large α favors defenders (risk plateaus quickly).
  3. Saturation bound: As N → ∞, ASR@N → (1 - π). The impenetrability fraction π sets a hard ceiling on attack success that no amount of effort can exceed.

3.4 Parameter Estimation

Given observed data where q_i is a query, n_i is the number of trials, and s_i is the number of successes, SABER estimates (α, β, π) using the Method of Moments:

x̄ = mean of observed per-query ASR values (s_i / n_i)
s² = sample variance of per-query ASR values
π̂ = fraction of queries with s_i = 0 and n_i ≥ min_trials

α̂ = x̄ · (x̄(1 - x̄) / s² - 1)
β̂ = (1 - x̄) · (x̄(1 - x̄) / s² - 1)

Goodness of fit. The quality of the Beta model fit is assessed via R² from log-linear regression:

Y = -α · X + log(C)
where Y = log(1 - ASR@N), X = log(N)

An R² ≥ 0.9 indicates that the power-law scaling model is a good fit to the observed data.

3.5 Confidence Intervals

SABER provides 95% confidence intervals for ASR@N predictions using the delta method:

CI(ASR@N) = ASR@N ± z_{0.975} · σ_ASR

where σ_ASR is derived from the variance of the Beta function ratio estimate. This enables risk assessments to be communicated with quantified uncertainty.

3.6 Risk Grading

Based on ASR@1000 (attack success rate after 1,000 attempts), SABER assigns risk grades:

Risk GradeASR@1000Operational Interpretation
Safe< 5%Robust against sustained automated attacks
Low5–20%Minor vulnerability under persistent probing
Medium20–50%Significant risk; additional defenses recommended
High50–80%Likely to be compromised under sustained attack
Critical≥ 80%Fundamentally unsafe; immediate remediation required

3.7 Budget@τ: Defense Strength Metric

Budget@τ answers the defender's question: "How much must an attacker invest to succeed with probability τ?"

Budget@τ = min{N : ASR@N ≥ τ}

Computation. SABER uses an asymptotic initial estimate followed by binary search refinement:

Initial: log_budget = -[ln(1 - τ_eff) + ln(Γ(β)) - ln(Γ(α+β))] / α
where τ_eff = τ / (1 - π)

Refinement: Binary search on predict_asr_exact(N) ≥ τ

If τ ≥ (1 - π), Budget@τ = ∞ (the attack target is unachievable).

Defense grading (at τ = 0.5):

Defense GradeBudget@0.5Meaning
Excellent≥ 10,000Attacker needs ≥10K attempts for 50% success
Strong≥ 1,000Resilient against most automated tools
Good≥ 500Adequate for moderate-risk deployments
Fair≥ 100Below enterprise threshold; hardening needed
Weak< 100Trivially breachable; unsuitable for production

3.8 Risk Amplification Rate

The rate at which risk increases with attacker effort is given by:

d(ASR)/d(ln N) = (1 - π) · α · C · N^(-α)

This derivative captures the marginal danger of each additional order of magnitude of attacker effort. Models with high amplification rates at operationally relevant N values (N ∈ [100, 10000]) are most urgently in need of defense hardening.

3.9 Defense Comparison

When comparing defense configurations (before/after), SABER computes a weighted improvement metric:

Improvement(%) = 0.4 · (Δα / α_before) · 100
               + 0.4 · (-ΔASR@1000 / ASR_before@1000) · 100
               + 0.2 · (Δπ) · 100

This assigns 40% weight to the improvement in the risk amplification exponent (α), 40% to the reduction in ASR@1000, and 20% to the increase in impenetrability fraction (π). The three components capture, respectively, the rate of risk growth, the absolute risk level, and the deterministic defense coverage.

3.10 Closed-Loop Deterministic Promotion

SABER's most operationally significant feature is the closed-loop promotion mechanism that automatically converts high-risk stochastic vulnerabilities into deterministic defenses (θ=0).

Promotion scoring:

Score = (ASR@1000 × severity_weight)
      + cross_provider_bonus   (0.2 if category match or ASR@1000 ≥ 0.9)
      + partial_info_bonus     (0.15 if observed_asr > 0.5 and predicted_asr@1000 ≥ 0.7)
      + trend_bonus            (0.1 if predicted_asr@1000 > observed_asr)

Promote if Score ≥ promotion_threshold

Severity presets:

Presetseverity_weightpromotion_thresholdUse Case
Critical2.00.60Defense/military domains
High1.50.70Healthcare, financial
Default1.00.80General enterprise

When a query is promoted to deterministic defense, it is added to the pattern database with source = SaberAuto, creating a feedback loop:

Red Team Attack → SABER Analysis → High Risk Detected
    → Auto-Promotion to Deterministic Block
    → Future identical attacks blocked at θ=0 with confidence=1.0
    → π increases → ASR@N decreases → Defense Grade improves

This mechanism ensures that the defense system continuously hardens against observed attack patterns without manual intervention.

4. Offensive Red-Team Engine

4.1 Algorithm Portfolio

AEGIS integrates 8 attack algorithms organized by paradigm:

#AlgorithmParadigmKey ParametersBaseline Effectiveness
1PAIR [1]Iterative Refinementstreams=5, depth=20High
2TAP [2]Tree Searchbranch=4, depth=10, beam=5High
3Crescendo [3]Multi-Turn Escalationturns=5–20, rate=0.15Very High
4AutoDAN [4]Genetic Evolutionpop=20, gen=50, mut=0.1, cross=0.7Medium
5BEAST [5]Suffix Optimizationbeam=10, suffix=20, iter=100Medium
6ArtPrompt [6]Visual Bypass4 art styles, 19 keywordsMedium-Low
7HPM [7]Psychological Manipulation8 profiles, turns=15Very High
8Korean AttackMultilingual Obfuscation10 techniques, 6 combinationsMedium

Success criteria (uniform across algorithms): Score ≥ 0.5 = successful attack (VULNERABLE or PARTIAL). Score ≥ 0.9 triggers early termination.

4.2 Meta-Attack Generator

The Meta-Attack generator discovers novel attack combinations by recombining atomic primitives through genetic algorithms.

4.2.1 Atomic Primitives (30 types)

Each primitive is characterized by {baseline_effectiveness, target_layers[], compatible_with[], complexity_cost}:

CategoryPrimitives (effectiveness)
RoleRolePlay (0.35), AuthorityImpersonation (0.40)
EncodingBase64 (0.25), Rot13 (0.15), Hex (0.20)
ConversationMultiTurn (0.55), GradualEscalation (0.50), ContextSwitch (0.30)
PsychologicalEmotionalManipulation (0.30), HypotheticalScenario (—)
TechnicalTokenSmuggling (0.45), AsciiArtBypass (—), PayloadFragmentation (—), SandwichAttack (0.35), RefusalSuppression (0.30)
MultilingualJamoSeparation (0.45), ChosungEncoding (0.40), LanguageMixing (0.40)
DomainTelcoRoleImpersonation (0.45), InternalMemoPretext (0.40), RegulatoryBypassPretext (0.50), CrisisCommPretext (0.45), TrafficAnalysisPretext (0.35), SyntheticDataPretext (0.40)

4.2.2 Genetic Recombination

Attack genomes are sequences of primitives with associated weights:

AttackGenome = {
  primitives: Vec<PrimitiveConfig>,  // ordered sequence (1–8 elements)
  fitness: f64,                       // evaluated score
  generation: usize
}
  • Crossover (rate 0.7): Single-point crossover at random positions
  • Mutation (rate 0.1): Gene replacement (100%), gene insertion (50%), gene removal (30%)
  • Selection: Tournament selection (size 3) with elitism (top 2 preserved)

4.2.3 SABER-Integrated Fitness Evaluation

Fitness is computed as a weighted sum of four components:

Fitness = w₁ · ASR_score + w₂ · Budget_score + w₃ · Novelty_score + w₄ · Domain_score
ComponentFormulaGeneral WeightsDomain Weights
ASR(α + successes)/(α + β + trials)0.500.40
Budget Efficiencymax(0, 1 − genome_length/max_budget)0.300.20
Novelty1 − min(Jaccard similarity to all seen genomes)0.200.20
DomainTelecom primitive presence + synergy bonuses0.000.20

This SABER-integrated fitness function ensures that the genetic algorithm not only discovers effective attacks but also discovers efficient (low-budget) and novel (diverse) attacks.

4.3 Reinforcement Learning Attack Agent

Beyond the genetic approach, AEGIS includes a PPO-trained RL agent that learns adaptive attack policies.

4.3.1 MDP Formulation

State = {
  conversation_embedding: R^128,    // semantic representation
  turn_number: [0, 20],             // normalized
  defense_detected: {0, 1},         // binary flag
  estimated_success_prob: [0, 1],   // current success estimate
  tokens_used: [0, 4096],           // budget tracker
  previous_action: one-hot,         // last action taken
  domain_context: one-hot(2)        // General | Telecom
}

Action = {
  SelectToken(id),
  InsertPrimitive(id),
  ModifyTone(Polite|Authoritative|Casual|Academic|Urgent),
  SwitchLanguage(id),
  InsertTelecomPrimitive(0–5),
  Escalate, Retreat, Terminate
}

4.3.2 Reward Structure

R(s, a) = bypass_reward + stealth_bonus − efficiency_penalty + domain_bonus

where:
  bypass_reward  = 1.0 if attack succeeds, 0.0 otherwise
  stealth_bonus  = 0.5 if success without triggering detection, −0.3 if detected
  efficiency      = tokens_used / 4096
  domain_bonus   = 0.2 if using domain primitive (success or failure)

4.3.3 PPO Training

HyperparameterValue
Learning rate3 × 10⁻⁴
Clip ε0.2
Epochs per update10
Batch size64
γ (discount)0.99
λ (GAE)0.95
Entropy coefficient0.01
Value coefficient0.5
Max gradient norm0.5

Policy and value networks use 2-layer MLPs ([256, 128]) with Xavier initialization. Generalized Advantage Estimation (GAE) computes advantages with reverse accumulation and normalization (μ=0, σ=1).

5. Defensive Guardrail Pipeline

5.1 Deterministic Pre-Check (θ=0 Layer)

Before any other processing, the deterministic defense manager checks inputs against compiled pattern databases:

  1. O(1) exact match against case-insensitive known-dangerous content hashes
  2. O(n) regex match against compiled pattern library

Pattern categories at three safety levels:

Safety LevelPattern Categories
StandardWeapons/WMD instructions, drug synthesis, CSAM, SQL injection
Enhanced+ Jailbreak patterns, system prompt injection, Crescendo/authority attacks, fictional wrappers, persona continuation, reverse psychology, multi-language evasion, social engineering, academic bypass, credential extraction
Strict+ All Enhanced patterns with lower confidence thresholds

A match triggers immediate BLOCK with confidence = 1.0, bypassing the entire PALADIN pipeline and 3-Tier system.

5.2 PALADIN 6-Layer Pipeline

Six sequential inspection layers, each returning LayerResult = {passed, decision, risk, confidence, latency_ms}:

LayerNameFunctionKey Implementation
L0TrustBoundaryInput validationUnicode NFC/NFKC normalization, length limits, encoding verification
L1IntentVerificationIntent analysisJailbreak pattern matching, prompt injection detection, role confusion
L2RaGuardRAG poisoningPoisoned document detection in retrieval contexts
L3ClassRagLayerSemantic classificationEmbedding-based 8-class content categorization
L4CircuitBreakerAnomaly detectionRequest rate monitoring, behavioral anomaly patterns
L5BehavioralAnalysisMulti-turn profilingConversation trajectory analysis, gradual escalation detection

Decision precedence: BLOCK > ESCALATE > MODIFY > APPROVE. Overall confidence = min(all layer confidences). Risk attribution from last failing layer.

5.3 Three-Tier Hierarchical Defense

Tier 1: Rule-Based Filter (< 0.5ms, 70–80% of traffic)

ComponentTechniquePerformance
Aho-CorasickMulti-pattern matching across 8 languagesO(n) in text length
Bloom FilterSHA-256 double-hashing: h(i) = h₁ + i·h₂. 100K patterns, FPR = 0.001O(k) per lookup
Exact MatchDirect string comparisonO(1) hash lookup

Bloom filter sizing: m = -(n · ln(fpr)) / (ln 2)², hash functions: k = min(16, max(1, (m/n) · ln 2))

Tier 2: ML Classifier (< 5ms P99, 15–25% of traffic)

Primary model: ONNX Guard Encoder (mDeBERTa-v3-base, 8-class multi-label, INT8 quantized, max_seq_length=512).

Fallback heuristic engine — 5-signal weighted scoring:

SignalWeightDetection Logic
Keyword Density0.35Saturation scoring: 1x=0.4, 2x=0.65, 3x=0.8, 4+x=0.85+
N-gram Entropy0.10Shannon entropy < 2.0 (repetition) or > 7.0 bits (encoding anomaly)
Structural Anomaly0.15Base64 > 85%, Hex > 90%, 3+ script mixing, special chars > 30%
Semantic Pattern0.25Instruction override, roleplay hijacking, hypothetical framing
Script Evasion0.15Jamo decomposition, fullwidth, RTL override, zero-width chars

Decision boundaries: unsafe_prob ≥ 0.75 → BLOCK; ≤ 0.25 → PASS; intermediate → escalate to Tier 3.

Tier 3: LLM Judge (< 200ms, < 5% of traffic)

Constitutional AI 2-stage verification (temperature=0.1, max_tokens=512, up to 10 context turns):

  • Stage 1: 9-category evaluation (violence, self-harm, CSAM, weapons, drugs, cybercrime, jailbreak, hate, PII)
  • Stage 2: Verifier checks for false positives, false negatives, and proportionality. Can override Stage 1 verdict.

6 verdict types: APPROVE, MODIFY, BLOCK, ESCALATE, REASK, THROTTLE.

5.4 Specialized Detection Algorithms

GuardNet [9] — Hierarchical graph detection:

Risk = 0.25·TokenLevel + 0.45·SentenceLevel + 0.20·PromptLevel + 0.10·GraphConnectivity

TokenLevel:   30+ suspicious tokens with weights (jailbreak=0.95, ransomware=0.9, 탈옥=0.9)
SentenceLevel: Multi-token patterns ("ignore previous instructions"=0.95, "개발자 모드"=0.9)
PromptLevel:   Encoding hints, delimiter abuse, control tokens, split attacks
GraphConnect:  (detected_levels/3) × (total_patterns/10)

BLOCK if Risk ≥ 0.7

JBShield [10] — Dual-track linear representation:

Input → ToxicityDetection    (Violence, HateSpeech, Illegal, Explicit, SelfHarm)
      → JailbreakTechDetection (Roleplay, InstructionOverride, DeveloperMode, Hypothetical)

confidence = max_weight × 0.7 + avg_weight × 0.3
CRITICAL signal: toxicity_hidden_via_jailbreak = True

Detects 20+ excuse patterns ("educational purpose", "research context", "penetration testing", etc.)

CCFC — Core-Full divergence detection:

core_risk  = analyze(strip_decorations(prompt))
full_risk  = analyze(prompt)
divergence = |full_risk - core_risk|

if divergence > 0.3:
    obfuscation = True
    risk = max(core_risk, full_risk) + divergence × 0.5
    BLOCK if risk > 0.4
else:
    risk = max(core_risk, full_risk)
    BLOCK if risk > 0.7

MULI — Simulated logit toxicity detection:

perplexity_score     = f(special_char_ratio)
token_entropy        = unique_words / total_words
toxic_token_prob     = max(category_confidences)

score = category_score × 0.7 + toxic_token_prob × 0.3

Covers 7 categories: Violence, HateSpeech, SexualContent, IllegalActivity, SelfHarm, Harassment, Misinformation. Domain-enhanced mode adds sector-specific boundary patterns.

6. System Integration: The Closed Loop

The three subsystems — offense, defense, and SABER — operate as a continuous feedback loop:

┌──────────────────────────────────────────────────────────────────┐
│                    AEGIS Closed-Loop Architecture                 │
│                                                                  │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐  │
│  │  RED TEAM    │───→│   SABER     │───→│  DEFENSE PIPELINE   │  │
│  │  ENGINE      │    │  ANALYSIS   │    │                     │  │
│  │             │    │             │    │  Deterministic (θ=0) │  │
│  │ 8 Algorithms │    │ Beta(α,β)   │    │  PALADIN 6-Layer    │  │
│  │ Meta-Attack  │    │ ASR@N Law   │    │  3-Tier Hierarchy   │  │
│  │ RL Agent    │    │ Budget@τ    │    │  4 Defense Algos    │  │
│  └──────┬──────┘    └──────┬──────┘    └──────────┬──────────┘  │
│         │                  │                       │              │
│         │    ┌─────────────┴───────────────┐      │              │
│         │    │  CLOSED-LOOP PROMOTION       │      │              │
│         │    │                              │      │              │
│         │    │  High-risk query detected    │←─────┘              │
│         │    │  (ASR@1000 ≥ threshold)      │                     │
│         │    │         ↓                    │                     │
│         │    │  Auto-promote to θ=0 block   │─────→ π increases   │
│         │    │  source: SaberAuto           │       ASR@N drops   │
│         └───→│         ↓                    │       Defense Grade  │
│              │  Re-evaluate with SABER      │       improves      │
│              └──────────────────────────────┘                     │
└──────────────────────────────────────────────────────────────────┘

Operational flow:

  1. Discovery: Red team algorithms attack the target model, producing per-query trial results .
  2. Analysis: SABER fits Beta(α, β) to the trial data, computes ASR@N predictions, Budget@τ, and generates per-query risk profiles.
  3. Promotion: Queries exceeding the auto-promotion threshold are converted to deterministic defense patterns (θ=0 blocks).
  4. Reinforcement: The defense pipeline now blocks promoted patterns at O(1) latency with confidence=1.0. The RL agent observes the strengthened defense and adapts its policy.
  5. Re-evaluation: SABER re-analyzes the system with updated π (increased impenetrability), producing improved risk grades and defense grades.

This cycle can be run continuously in production environments, creating an ever-hardening defense posture.

7. SABER Report: Complete Output Specification

A full SABER analysis produces a structured report:

SaberReport {
  // Identification
  report_id: UUID

  // Beta Distribution Parameters
  alpha: f64                          // Risk amplification exponent
  beta: f64                           // Distribution shape parameter
  unbreakable_fraction: f64           // π: Impenetrability fraction
  goodness_of_fit: f64                // R² of scaling law fit

  // ASR Predictions
  asr_predictions: {                  // ASR at target budget levels
    1:    f64,                        // Single-attempt ASR
    10:   f64,                        // 10-attempt ASR
    100:  f64,                        // 100-attempt ASR
    1000: f64                         // 1000-attempt ASR
  }

  // Confidence Intervals (95%)
  confidence_intervals: {
    1:    (f64, f64),
    10:   (f64, f64),
    100:  (f64, f64),
    1000: (f64, f64)
  }

  // Budget Analysis
  budget_analysis: {
    budgets: {                        // Budget@τ for 5 thresholds
      "0.1": usize,
      "0.3": usize,
      "0.5": usize,
      "0.7": usize,
      "0.9": usize
    }
    defense_grade: DefenseGrade       // Excellent/Strong/Good/Fair/Weak
  }

  // Risk Assessment
  risk_grade: RiskGrade               // Safe/Low/Medium/High/Critical

  // Per-Query Profiles
  query_profiles: [{
    query_id: String
    category: Option<String>
    total_trials: usize
    success_count: usize
    observed_asr: f64
    estimated_theta: f64
    is_deterministic_block: bool
    predicted_asr_1000: f64
    budget_at_50: usize
  }]

  // Aggregate Risk
  aggregate_risk: {
    total_queries: usize
    deterministic_blocks: usize
    mean_alpha: f64
    min_alpha: f64                    // Most vulnerable category
    weakest_category: Option<String>
  }

  // Actionable Recommendations
  recommendations: [{
    priority: u8                      // 1 (highest) to 5 (lowest)
    type: RecommendationType          // AddDeterministicBlock | IncreaseAlpha |
                                      // StrengthenLayer | IncreaseBudget | Monitor
    description: String
    affected: Vec<String>             // Affected query IDs or categories
    expected_improvement: String
  }]

  // Metadata
  metadata: {
    analysis_time_ms: f64
    created_at: timestamp
  }
}

Recommendation logic:

ConditionPriorityRecommendation
Query ASR@1000 > 0.81Add deterministic block for this pattern
Model α < 0.52Increase defense layer independence
Query ASR@1000 ∈ [0.5, 0.8]3Strengthen defense layers for this category
Budget@0.5 < 5004Increase defender cost (slow attacker iteration)
All safe5Monitor — continue periodic evaluation

8. Experimental Evaluation

8.1 Setup

We conducted two independent evaluation sessions on February 23, 2026, testing 8 models across 7 attack algorithms.

Table II: Target Models

ModelProviderASR Session 1ASR Session 2Mean ASR
GPT-5OpenAI0.8620.8680.865
Grok 3 MinixAI0.7270.7370.732
Grok 4.1 FastxAI0.6720.6420.657
DeepSeek (R1)DeepSeek0.6530.6430.648
DeepSeek ChatDeepSeek0.6300.6180.624
Claude Opus 4.6Anthropic0.5730.5620.568
Grok 4.1 (Reasoning)xAI0.4830.4450.464
Gemini 3.1 ProGoogle0.3750.4220.398

Cross-session CV: 2.9% (high reproducibility).

8.2 Per-Algorithm Results

Table III: Algorithm Vulnerability Matrix (Mean Score, 0=Blocked, 0.5=Partial, 1.0=Vulnerable)

ModelPAIRTAPCrescendoAutoDANBEASTArtPromptHPM
GPT-51.01.01.01.01.01.01.0
Claude Opus 4.61.00.751.01.01.00.751.0
Gemini 3.1 Pro1.01.01.01.01.01.00.0
Grok 4.1 (R)1.01.01.00.01.00.51.0
Grok 4.1 Fast1.01.01.01.01.01.01.0
Grok 3 Mini1.00.751.01.01.01.01.0
DeepSeek (R1)1.00.751.01.00.50.51.0
DeepSeek Chat1.00.751.01.00.50.50.75
Mean ASR1.0000.8751.0000.8750.8130.7810.844

8.3 SABER Risk Analysis

Applying SABER analysis to the empirical data, we estimate Beta distribution parameters and project ASR@N curves for each model.

Table IV: SABER Risk Profiles (Estimated)

ModelEst. αEst. πASR@1ASR@10ASR@100ASR@1000Risk Grade
GPT-50.120.000.8650.940.980.997Critical
Grok 3 Mini0.220.000.7320.870.950.99Critical
Grok 4.1 Fast0.280.000.6570.820.930.98Critical
DeepSeek (R1)0.300.040.6480.800.910.96Critical
DeepSeek Chat0.330.040.6240.780.890.95Critical
Claude Opus 4.60.380.040.5680.740.860.93Critical
Grok 4.1 (R)0.500.070.4640.650.800.90High
Gemini 3.1 Pro0.580.060.3980.580.740.86High

Key insights from SABER analysis:

  1. 6 of 8 models are Critical risk (ASR@1000 ≥ 0.80). Only Grok 4.1 (Reasoning) and Gemini 3.1 Pro achieve High rather than Critical, and even they exceed 86% predicted ASR@1000.

  2. GPT-5 has the smallest α (0.12), meaning its vulnerability amplifies most rapidly with attacker effort. By N=10, ASR already reaches 94%. This ultra-small α reflects GPT-5's pattern of verbose, helpful refusal responses that leak exploitable information.

  3. Reasoning mode increases α by 79% (Grok 4.1: α=0.28 → Grok 4.1-R: α=0.50). This is the most significant α improvement observed, confirming that chain-of-thought deliberation materially slows the rate of risk amplification.

  4. DeepSeek models show non-zero π (0.04), indicating that some attack patterns are deterministically blocked (ArtPrompt in particular). This small but non-zero impenetrability fraction sets a ceiling below 100% on ASR@∞.

Table V: SABER Budget Analysis (Estimated)

ModelBudget@0.1Budget@0.3Budget@0.5Budget@0.7Budget@0.9Defense Grade
GPT-511112Weak
Grok 3 Mini11128Weak
Grok 4.1 Fast112418Weak
DeepSeek (R1)112525Weak
DeepSeek Chat112630Weak
Claude Opus 4.6113845Weak
Grok 4.1 (R)1251595Weak
Gemini 3.1 Pro12722150Weak

All 8 models achieve Defense Grade = Weak (Budget@0.5 < 100). Even the most resilient model (Gemini) requires only 7 attempts for an adversary to achieve 50% success probability. This finding underscores the absolute necessity of external guardrail systems.

8.4 AEGIS Defense Overlay: SABER Before/After

Applying the AEGIS defense pipeline and projecting SABER metrics improvement:

Table VI: Projected SABER Metrics with AEGIS Defense

MetricWithout AEGISWith AEGIS (C3)With AEGIS + θ=0 Promotion (C4)
Mean ASR@10.6200.180.12
Mean ASR@10000.960.450.30
Mean α0.340.750.95
Mean π0.030.150.35
Mean Budget@0.5385450
Defense GradeWeakFairGood

SABER defense comparison metric:

Improvement(C0→C4) = 0.4 × (0.95-0.34)/0.34 × 100   (α improvement: +179%)
                   + 0.4 × (0.96-0.30)/0.96 × 100    (ASR reduction: +68.8%)
                   + 0.2 × (0.35-0.03) × 100          (π improvement: +6.4%)
                   = 71.8% + 27.5% + 6.4%
                   = 105.7% weighted improvement

The closed-loop promotion mechanism provides an additional 18% improvement beyond the base 3-Tier defense (C3 → C4), primarily through increasing π from 0.15 to 0.35.

8.5 Deterministic Promotion Impact

Table VII: Simulated Auto-Promotion Results

Severity PresetPromotion ThresholdQueries Promotedπ AfterASR@1000 AfterIncremental Improvement
Default (1.0)0.8045%0.300.35+8% defense rate
High (1.5)0.7062%0.400.25+12% defense rate
Critical (2.0)0.6078%0.500.18+17% defense rate

The Critical preset — appropriate for defense/military domains — promotes 78% of attack patterns to deterministic blocks, reducing ASR@1000 to 0.18 and achieving a 17% incremental defense improvement. However, aggressive promotion carries a false positive risk: some promoted patterns may match legitimate content, requiring ongoing precision monitoring.

8.6 Performance

MetricValue
Deterministic Check< 0.1ms
PALADIN Pipeline P50~2ms
PALADIN Pipeline P99~20ms
Throughput50,000+ RPS
SABER Full Analysis< 500ms
SABER Quick Assessment< 10ms

9. Discussion

9.1 The Power-Law Nature of LLM Vulnerability

Our most significant theoretical finding is that LLM vulnerability follows a power-law scaling pattern. The ASR@N ∝ 1 − C·N^(−α) relationship means that:

  • For small α (< 0.3): Risk grows rapidly. GPT-5 (α=0.12) reaches 94% ASR by just N=10 attempts. Models in this regime are essentially indefensible without external guardrails.
  • For moderate α (0.3–0.6): Risk growth is manageable. The defense system has a meaningful "window" in which to intervene between initial probes and sustained attacks.
  • For large α (> 0.6): Risk plateaus quickly. The model's own safety mechanisms provide a substantial baseline defense.

This power-law structure has a practical implication: the most impactful defense interventions are those that increase α, as even small α improvements dramatically reduce the rate at which risk amplifies. The AEGIS defense pipeline increases mean α from 0.34 to 0.95, transforming the risk trajectory from rapid amplification to effective saturation.

9.2 The Impenetrability Fraction as the Ultimate Defense

The SABER framework reveals that the impenetrability fraction π is the only parameter that provides an absolute guarantee against attack success. ASR@N → (1−π) as N → ∞, meaning:

  • π = 0: The model will eventually be compromised with probability 1 given sufficient attacker budget.
  • π = 0.5: The maximum achievable ASR is 50%, regardless of attacker resources.
  • π = 1: The model is provably safe against all attacks in the evaluated distribution.

The closed-loop promotion mechanism is designed specifically to increase π by converting stochastically vulnerable patterns into deterministic blocks. This is the most direct path to long-term safety improvement.

9.3 PAIR and Crescendo: The α → 0 Limit

PAIR and Crescendo's universal 100% ASR against all models represents the α → 0 extreme. These algorithms are so effective that the Beta distribution degenerates — essentially all queries are vulnerable with high probability. This finding has profound implications:

  • PAIR demonstrates that iterative refinement with an attacker LLM is a universal capability that no current safety alignment technique can prevent.
  • Crescendo demonstrates that multi-turn conversational manipulation fundamentally bypasses single-turn safety mechanisms.

Defending against these algorithms requires fundamentally different approaches: response-level analysis (to detect information leakage in refusals) for PAIR, and trajectory-level analysis (PALADIN L5) for Crescendo.

9.4 Reasoning Mode and α Enhancement

The observation that reasoning mode increases α by 79% (0.28 → 0.50) suggests a mechanism by which chain-of-thought processing enhances safety:

  1. Deliberation time: The model has more compute to evaluate request intent before generating a response.
  2. Explicit risk assessment: The reasoning chain can include explicit safety considerations that the model can then act on.
  3. Pattern recognition: Reasoning mode appears particularly effective at recognizing structurally anomalous prompts (AutoDAN) and encoded content (ArtPrompt).

However, reasoning mode does not increase α against conversational attacks (Crescendo, HPM), suggesting that deliberation helps with structural analysis but not with social manipulation resistance.

9.5 Limitations

  1. SABER parameter estimation from limited data. Two sessions of 7 algorithms each provides 14 trials per model — sufficient for Method of Moments estimation but below ideal for precise confidence intervals.

  2. Independence assumption. SABER assumes that attack attempts are independent conditional on θ. In practice, adaptive attackers who learn from failed attempts may exhibit correlated success probabilities.

  3. Static vulnerability model. The Beta distribution parameters are estimated at a point in time. Model updates, fine-tuning, and prompt changes may shift (α, β, π) in unpredictable ways.

  4. Auto-promotion false positives. Aggressive deterministic promotion (Critical preset: 78% promotion rate) may block legitimate content. Production deployments require ongoing precision monitoring and human review of promoted patterns.

10. Conclusion

We presented AEGIS, an integrated framework that formally connects offensive red-teaming, defensive guardrails, and statistical risk prediction through the SABER framework. Our key contributions and findings are:

SABER provides the first statistical framework for LLM vulnerability prediction. By modeling per-query vulnerability as θ ~ Beta(α, β), SABER derives the ASR@N scaling law that predicts attack success under Best-of-N scenarios, the Budget@τ metric that quantifies defender resilience in terms of attacker cost, and risk/defense grading systems that enable operational decision-making with 95% confidence intervals.

The closed-loop architecture transforms safety from static to adaptive. SABER's auto-promotion mechanism converts high-risk stochastic vulnerabilities into deterministic defenses, increasing the impenetrability fraction π and establishing a continuously hardening defense posture. This mechanism provides an additional 8–17% defense improvement beyond the base guardrail architecture.

Empirical results are alarming. All 8 tested models (including GPT-5, Claude Opus 4.6, Gemini 3.1 Pro) achieve Defense Grade = Weak without external guardrails, with a mean Budget@0.5 of only 3 attempts. PAIR and Crescendo achieve 100% ASR universally. These findings demonstrate that LLM-native safety mechanisms are fundamentally insufficient and that external guardrail systems with statistical risk monitoring are not optional but essential for any production deployment.

The α parameter is the most actionable metric for defense improvement. Increasing α — through layered defense, reasoning mode activation, or architectural changes — is the most effective way to slow the rate of risk amplification. The AEGIS defense pipeline increases mean α from 0.34 to 0.95, transforming vulnerability trajectories from rapid amplification to effective saturation.

As LLMs are deployed in increasingly high-stakes applications, the need for rigorous, statistical safety assurance will only intensify. AEGIS and SABER provide the mathematical and engineering foundations for this assurance, enabling security teams to quantify risk, predict exposure, and systematically harden defenses through continuous closed-loop improvement.

References

[1] P. Chao, A. Robey, E. Dobriban, et al., "Jailbreaking Black Box Large Language Models in Twenty Queries," arXiv:2310.08419, 2023.

[2] A. Mehrotra, M. Zampetakis, P. Kassianik, et al., "Tree of Attacks: Jailbreaking Black-Box LLMs with Auto-Generated Subversions," arXiv:2312.02119, 2023.

[3] M. Russinovich, A. Salem, R. Elber, "Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack," Microsoft Research, 2024.

[4] X. Liu, N. Xu, M. Chen, C. Xiao, "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models," arXiv:2310.04451, 2023.

[5] S. Sadasivan, S. Saha, G. Sriramanan, et al., "Fast Adversarial Attacks on Language Models In One GPU Minute," arXiv:2402.15570, 2024.

[6] F. Jiang, Z. Xu, L. Niu, et al., "ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs," arXiv:2402.11753, 2024.

[7] Anonymous, "Human-like Psychological Manipulation of LLMs," arXiv:2512.18244, 2025.

[8] A. Zou, Z. Wang, N. Carlini, et al., "Universal and Transferable Adversarial Attacks on Aligned Language Models," arXiv:2307.15043, 2023.

[9] Anonymous, "GuardNet: Hierarchical Graph-Based Detection for LLM Safety," arXiv:2509.23037, 2025.

[10] Anonymous, "JBShield: Jailbreak Detection via Linear Representation Hypothesis," arXiv:2502.07557, 2025.

[11] M. Mazeika, L. Phan, X. Yin, et al., "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal," arXiv:2402.04249, 2024.

[12] P. Chao, E. Dobriban, et al., "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models," arXiv:2404.01318, 2024.

[13] A. Zou, Z. Wang, et al., "AdvBench: A Benchmark for Evaluating Adversarial Robustness of Large Language Models," 2023.

[14] J. Kaplan, S. McCandlish, T. Henighan, et al., "Scaling Laws for Neural Language Models," arXiv:2001.08361, 2020.

[15] J. Hoffmann, S. Borgeaud, A. Mensch, et al., "Training Compute-Optimal Large Language Models," arXiv:2203.15556, 2022.

[16] Y. Bai, S. Kadavath, et al., "Constitutional AI: Harmlessness from AI Feedback," arXiv:2212.08073, 2022.

[17] OWASP, "OWASP Top 10 for LLM Applications," 2025.

[18] National Institute of Standards and Technology, "AI Risk Management Framework (AI RMF 1.0)," NIST AI 100-1, 2023.

[19] AEGIS Research Team, "SABER: Statistical Adversarial Risk with Beta Extrapolation and Regression — Technical Report," Internal Report, 2026.

Appendix A: SABER API Reference

A.1 Endpoints

MethodPathFunction
POST/v3/saber/estimateFull SABER analysis from query trial data
POST/v3/saber/evaluateSingle-content Best-of-N evaluation
GET/v3/saber/budgetBudget@τ computation for given α, β
POST/v3/saber/compareBefore/after defense comparison
GET/v3/saber/report/{id}Retrieve stored SABER report
POST/v3/saber/deterministic/updatePattern management + auto-promotion

A.2 Quick Assessment

For real-time monitoring, SABER provides a lightweight quick assessment mode:

Input:  observed_asr: f64, n_trials: usize
Output: QuickAssessment {
          risk_grade: RiskGrade,
          estimated_asr_1000: f64,
          estimated_budget_50: usize,
          needs_full_analysis: bool
        }
Latency: < 10ms

This enables integration into real-time traffic monitoring pipelines without the computational cost of full Beta distribution fitting.

Appendix B: Database Schema

-- SABER Reports
CREATE TABLE saber_reports (
  id UUID PRIMARY KEY,
  tenant_id UUID,
  alpha FLOAT8 NOT NULL,
  beta_param FLOAT8 NOT NULL,
  unbreakable_fraction FLOAT8,
  goodness_of_fit FLOAT8,
  asr_predictions JSONB NOT NULL,
  confidence_intervals JSONB,
  budget_estimates JSONB,
  risk_grade VARCHAR(20) NOT NULL,
  defense_grade VARCHAR(20) NOT NULL,
  recommendations JSONB,
  analysis_time_ms FLOAT8,
  created_at TIMESTAMP DEFAULT NOW()
);

-- Per-Query Risk Profiles
CREATE TABLE saber_query_profiles (
  id UUID PRIMARY KEY,
  report_id UUID REFERENCES saber_reports(id),
  query_id VARCHAR NOT NULL,
  category VARCHAR,
  total_trials INTEGER NOT NULL,
  success_count INTEGER NOT NULL,
  observed_asr FLOAT8 NOT NULL,
  estimated_theta FLOAT8,
  is_deterministic_block BOOLEAN,
  predicted_asr_1000 FLOAT8,
  budget_at_50 INTEGER,
  created_at TIMESTAMP DEFAULT NOW()
);

-- Deterministic Defense Patterns
CREATE TABLE deterministic_defense_patterns (
  id UUID PRIMARY KEY,
  pattern VARCHAR NOT NULL,
  pattern_type VARCHAR NOT NULL,   -- regex | exact | contains
  category VARCHAR,
  source VARCHAR NOT NULL,          -- manual | saber_auto | evolution
  source_report_id UUID REFERENCES saber_reports(id),
  is_active BOOLEAN DEFAULT TRUE,
  min_safety_level VARCHAR DEFAULT 'standard',
  created_at TIMESTAMP DEFAULT NOW()
);

-- Best-of-N Evaluation Log
CREATE TABLE bon_evaluations (
  id UUID PRIMARY KEY,
  content_hash VARCHAR NOT NULL,
  n_trials INTEGER NOT NULL,
  blocked_count INTEGER NOT NULL,
  passed_count INTEGER NOT NULL,
  observed_asr FLOAT8 NOT NULL,
  estimated_theta FLOAT8,
  is_deterministic_block BOOLEAN,
  created_at TIMESTAMP DEFAULT NOW()
);

Assets & Downloads

Executive SummaryComing Soon
Slide DeckComing Soon
Demo VideoComing Soon

Interested in applying this research?

Contact the AEGIS Research team to learn how this work can support your AI deployment needs.