AEGIS-TR-2026-001 · Technical Report · v2.0

Reducing Hallucinations in Enterprise AI Systems

A Multi-Layered Defense Architecture for Mission-Critical Domains

Authors: TruthAnchor Research Group, AEGIS Research Team
Published: March 2026
Affiliation: AEGIS Research
Keywords: hallucination, TruthAnchor, RAG, uncertainty quantification, compliance guardrails, human-in-the-loop, enterprise AI, prompt injection defense

Summary

This paper presents TruthAnchor, a comprehensive multi-layered defense architecture designed to reduce hallucination rates in enterprise AI deployments to below 3%. The system implements a four-layer pipeline comprising input governance, evidence-grounded generation, output verification, and human-in-the-loop escalation. Key innovations include a triple-layered prompt injection defense achieving 100% detection, a four-signal composite uncertainty scorer, and a multi-LLM consensus validation mechanism. Evaluated in a Korean financial services context with 104 test cases, the system achieves 100% hallucination detection rate, 100% RAG accuracy, and p95 latency of 142ms.

Problem

Large language models often generate responses that sound plausible but are not supported by evidence. In enterprise environments — finance, healthcare, legal services — these hallucinations can trigger regulatory liability, reputational damage, and financial loss. Existing mitigation methods are often fragmented, difficult to operationalize, or insufficiently measurable in production settings.

The core challenge is not merely detecting hallucinations after they occur, but building systems that architecturally prevent ungrounded content from reaching end users in the first place.

Architecture Overview

TruthAnchor implements a four-layer defense pipeline, each layer operating as an independent verification checkpoint:

  1. Layer 1: Input Governance — Intent classification, prompt injection defense, PII protection
  2. Layer 2: Evidence-Grounded Generation — Hybrid RAG retrieval, compliance guardrails, constrained generation
  3. Layer 3: Output Verification — Uncertainty quantification, citation linking, multi-LLM consensus
  4. Layer 4: Escalation & Human-in-the-Loop — Risk-based routing, human review for high-uncertainty outputs

Layer 1: Input Governance

Intent Classification

Every incoming query is classified into one of seven intent categories with associated risk levels:

| Intent Code | Category | Risk Level |
| --- | --- | --- |
| loan_inquiry | Loan Products & Rates | Medium |
| deposit_inquiry | Deposit & Savings | Low |
| investment_advisory | Investment Advice | High |
| credit_assessment | Credit Evaluation | High |
| insurance_inquiry | Insurance Products | Medium |
| general_inquiry | General Questions | Low |
| out_of_scope | Non-Financial | Reject |
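
The risk levels in the table drive downstream handling. A minimal sketch of that routing, assuming a simple three-way policy (the `route` function and decision names are illustrative, not the TruthAnchor schema):

```python
# Risk levels taken from the intent table above.
INTENT_RISK = {
    "loan_inquiry": "medium",
    "deposit_inquiry": "low",
    "investment_advisory": "high",
    "credit_assessment": "high",
    "insurance_inquiry": "medium",
    "general_inquiry": "low",
    "out_of_scope": "reject",
}

def route(intent: str) -> str:
    """Map a classified intent to a coarse handling decision."""
    risk = INTENT_RISK.get(intent, "reject")  # fail closed on unknown intents
    if risk == "reject":
        return "refuse"
    if risk == "high":
        return "full_verification"  # Layers 2-4 including consensus validation
    return "standard_pipeline"

print(route("investment_advisory"))  # full_verification
print(route("out_of_scope"))         # refuse
```

Failing closed on unrecognized intents matters here: a misclassified query should default to rejection, not to the low-risk path.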

Triple-Layered Prompt Injection Defense

The system deploys three independent detection layers:

  • Static Pattern Matching — Aho-Corasick algorithm with NFKC Unicode normalization and Base64 bypass detection
  • ML Classification — Logistic regression classifier trained on adversarial examples
  • Structural Analysis — Detects role manipulation, instruction override, and format injection attempts
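
The static layer can be sketched as follows. The production system uses Aho-Corasick for multi-pattern matching; for a handful of patterns a plain substring scan is behaviorally equivalent, so this sketch uses one. The pattern list and helper names are illustrative:

```python
import base64
import binascii
import re
import unicodedata

# Illustrative patterns; the real pattern set is much larger.
INJECTION_PATTERNS = ["ignore previous instructions", "you are now", "system prompt"]

def _decoded_base64_chunks(text: str):
    """Yield plausible Base64 runs decoded to lowercase text."""
    for chunk in re.findall(r"[A-Za-z0-9+/=]{16,}", text):
        try:
            yield base64.b64decode(chunk, validate=True).decode("utf-8", "ignore").lower()
        except (binascii.Error, ValueError):
            continue

def static_scan(text: str) -> bool:
    """First of the three detection layers: NFKC-normalize, then scan
    the text and any embedded Base64 payloads for known phrases."""
    normalized = unicodedata.normalize("NFKC", text)
    candidates = [normalized.lower(), *_decoded_base64_chunks(normalized)]
    return any(p in c for p in INJECTION_PATTERNS for c in candidates)

# NFKC folds fullwidth characters to ASCII, defeating Unicode obfuscation.
print(static_scan("Ｉｇｎｏｒｅ ｐｒｅｖｉｏｕｓ ｉｎｓｔｒｕｃｔｉｏｎｓ"))  # True
```

Note that Base64 chunks are decoded before lowercasing; lowercasing first would corrupt the encoding and let encoded payloads through.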

Layer 2: Evidence-Grounded Generation

Three-Tier Data Architecture

| Tier | Content | Authority | Update Frequency |
| --- | --- | --- | --- |
| A: Regulatory | Laws, regulations, disclosure requirements | Highest | Quarterly |
| B: Sector | Banking/insurance/securities-specific guidelines | Medium | Monthly |
| C: Institutional | Products, internal policies, FAQs | Operational | Weekly |

Compliance Guardrail Rules

Seven core compliance rules enforce regulatory boundaries at generation time:

| Rule ID | Name | Trigger | Action |
| --- | --- | --- | --- |
| CG-001 | Investment Solicitation | 25 solicitation patterns | Replace with disclaimer |
| CG-002 | Return Guarantee | 9 guarantee patterns | Remove + warning |
| CG-003 | Numerical Source Required | Numbers without citation | Flag for review |
| CG-004 | Interest Rate Accuracy | Rate deviation ±0.01%p | Correct or escalate |
| CG-005 | Mandatory Disclaimer | Investment/credit intents | Auto-append disclaimer |
| CG-006 | PII Exposure | Residual PII in output | Mask + alert |
| CG-007 | Ungrounded Generalization | Unsourced broad claims | Flag for review |
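
To make one rule concrete, CG-003 ("numbers without citation") could be checked with two regular expressions. This is a sketch; the `[src:…]` citation marker format is an assumption for illustration, not the system's actual citation syntax:

```python
import re

# Hypothetical citation marker format, e.g. "[src:BOK-2026-01]".
CITATION = re.compile(r"\[(?:src|cite):[^\]]+\]")
NUMBER = re.compile(r"\d+(?:\.\d+)?\s*%?")

def cg_003(response: str) -> str:
    """CG-003: numeric claims must be traceable to a cited source.
    Returns the action from the rule table: 'pass' or 'flag_for_review'."""
    has_number = NUMBER.search(response) is not None
    has_citation = CITATION.search(response) is not None
    if has_number and not has_citation:
        return "flag_for_review"
    return "pass"

print(cg_003("The base rate is 3.5%."))                    # flag_for_review
print(cg_003("The base rate is 3.5% [src:BOK-2026-01].")) # pass
```

A production rule would additionally check that each number is covered by a citation rather than that any citation exists anywhere in the response, but the flag-on-miss shape is the same.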

Layer 3: Output Verification

Four-Signal Uncertainty Quantification

The uncertainty scorer combines four independent signals to produce a composite confidence score:

  • Token-Level Entropy (weight: 0.30) — Measures generation confidence at the token level
  • Self-Consistency (weight: 0.25) — Samples multiple generations at varied temperatures and measures their agreement
  • Claim-Level Verification (weight: 0.30) — Verifies each factual claim against retrieved evidence
  • Expected Calibration Error (weight: 0.15) — Rolling calibration against ground truth

Decision thresholds:

  • U < 0.30 → HIGH confidence — Serve response normally
  • 0.30 ≤ U < 0.60 → MEDIUM confidence — Add advisory disclaimer
  • U ≥ 0.60 → LOW confidence — Escalate for human review
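
The composite score is a weighted sum of the four signals, mapped through the thresholds above. A sketch, assuming each signal is already normalized to [0, 1] with higher values meaning more uncertainty (function names are illustrative):

```python
# Signal weights from the four-signal scorer above.
WEIGHTS = {
    "token_entropy": 0.30,
    "self_consistency": 0.25,
    "claim_verification": 0.30,
    "calibration_error": 0.15,
}

def composite_uncertainty(signals: dict) -> float:
    """Weighted combination of the four normalized signals."""
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)

def decide(u: float) -> str:
    """Map the composite score U to the serving decision."""
    if u < 0.30:
        return "serve"                   # HIGH confidence
    if u < 0.60:
        return "serve_with_disclaimer"   # MEDIUM confidence
    return "escalate"                    # LOW confidence: human review

signals = {"token_entropy": 0.2, "self_consistency": 0.1,
           "claim_verification": 0.15, "calibration_error": 0.3}
u = composite_uncertainty(signals)
print(round(u, 3), decide(u))  # 0.175 serve
```

Because the weights sum to 1.0, U stays in [0, 1] whenever the input signals do, which keeps the fixed thresholds meaningful.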

Multi-LLM Consensus Validation

For high-risk queries, responses are validated across multiple LLMs using agglomerative clustering:

| Consensus Score | Classification | Action |
| --- | --- | --- |
| ≥ 0.95 | Full consensus | Approve response |
| ≥ 0.60 | Majority consensus | Log minority dissent, return majority |
| ≥ 0.40 | Partial consensus | Log all responses, return with caveat |
| < 0.40 | No consensus | Escalate to human review |
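
The consensus score can be read as the fraction of responses in the largest agreement cluster. The production system clusters response embeddings agglomeratively; this stdlib-only sketch substitutes single-linkage grouping over string similarity to show the same score-and-classify shape (the similarity threshold of 0.7 is an illustrative assumption):

```python
from difflib import SequenceMatcher

def consensus_score(responses: list[str], threshold: float = 0.7) -> float:
    """Fraction of responses in the largest agreement cluster.
    Single-linkage: a response joins a cluster if it is similar
    enough to any existing member."""
    clusters: list[list[int]] = []
    for i in range(len(responses)):
        placed = False
        for cluster in clusters:
            if any(SequenceMatcher(None, responses[i], responses[j]).ratio() >= threshold
                   for j in cluster):
                cluster.append(i)
                placed = True
                break
        if not placed:
            clusters.append([i])
    return max(len(c) for c in clusters) / len(responses)

def classify(score: float) -> str:
    """Map the score to the action table above."""
    if score >= 0.95: return "approve"
    if score >= 0.60: return "return_majority"
    if score >= 0.40: return "return_with_caveat"
    return "escalate"

answers = ["The rate is 3.5% p.a.", "The rate is 3.5% p.a.",
           "The rate is 3.5% per annum", "It depends on your credit score"]
print(consensus_score(answers))  # 0.75
```

Here three of four model responses agree, so the score of 0.75 falls in the "majority consensus" band: the majority answer is returned and the dissenting response is logged.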

Layer 4: Escalation & Human-in-the-Loop

| Priority | SLA | Trigger Condition |
| --- | --- | --- |
| CRITICAL | 5 min | HIGH/CRITICAL guardrail violations |
| CRITICAL | 15 min | Investment intent + amount ≥ 100M KRW |
| HIGH | 30 min | Uncertainty U ≥ 0.7 or fraud detection |
| HIGH | 30 min | AML suspicious activity patterns |
| MEDIUM | 60 min | Low retrieval relevance + low intent confidence |
| LOW | Auto | Citation coverage 0.5–0.8 (auto-disclaimer) |
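
The routing table is effectively a first-match rule chain evaluated from highest priority down. A sketch with hypothetical event field names (the real event schema is not specified in this report):

```python
def escalation_priority(event: dict) -> tuple[str, str]:
    """Map a pipeline event to (priority, SLA) per the routing table.
    Rules are checked most severe first; the first match wins."""
    if event.get("guardrail_severity") in ("HIGH", "CRITICAL"):
        return ("CRITICAL", "5 min")
    if event.get("intent") == "investment_advisory" and event.get("amount_krw", 0) >= 100_000_000:
        return ("CRITICAL", "15 min")
    if event.get("uncertainty", 0.0) >= 0.7 or event.get("fraud") or event.get("aml_suspicious"):
        return ("HIGH", "30 min")
    if event.get("low_relevance") and event.get("low_intent_confidence"):
        return ("MEDIUM", "60 min")
    return ("LOW", "auto")  # e.g. borderline citation coverage: auto-disclaimer

print(escalation_priority({"uncertainty": 0.85}))  # ('HIGH', '30 min')
```

Ordering matters: an event matching both a CRITICAL and a HIGH condition must receive the tighter SLA, which the top-down evaluation guarantees.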

System Resilience

Three-Tier Caching Architecture

| Cache Level | Latency | Capacity | Content | Hit Rate |
| --- | --- | --- | --- | --- |
| L1: In-Memory LRU | < 1ms | 1,000 entries | Interest rates, exchange rates, product info | ~35% |
| L2: Redis Cache | < 5ms | 500 entries | RAG search results, prompt templates | ~25% |
| L3: Semantic Cache | < 10ms | 5,000 entries | Query embedding similarity (≥ 0.95) | ~15% |

Combined effective cache hit rate: ~60%
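
The L1 tier is a standard bounded LRU cache. A minimal self-contained sketch using `collections.OrderedDict` (the class name and small capacity are illustrative; the table above specifies 1,000 entries in production):

```python
from collections import OrderedDict

class LRUCache:
    """In-memory L1 cache: evicts the least recently used entry
    once capacity is exceeded."""

    def __init__(self, capacity: int = 1000):
        self.capacity = capacity
        self._store: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key, value):
        self._store[key] = value
        self._store.move_to_end(key)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the oldest entry

cache = LRUCache(capacity=2)
cache.put("base_rate", 3.5)
cache.put("usd_krw", 1350)
cache.get("base_rate")      # touch: base_rate becomes most recent
cache.put("jpy_krw", 9.1)   # over capacity: evicts usd_krw, not base_rate
print(cache.get("usd_krw"), cache.get("base_rate"))  # None 3.5
```

The L2 (Redis) and L3 (semantic) tiers follow the same get-or-miss contract, with L3 treating a lookup as a hit when the query embedding's similarity to a cached entry is at least 0.95.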

Evaluation Results

Benchmark Dataset

The system was evaluated on 104 test cases across 10 categories in a Korean financial services context.

| Category | Test Cases | Description |
| --- | --- | --- |
| Factual accuracy | 3 | Correct financial fact generation |
| Numerical accuracy | 2 | Interest rate, fee precision |
| Compliance | 3 | Guardrail rule enforcement |
| Citation verification | 2 | Source attribution accuracy |
| Fabrication detection | 2 | Detecting invented facts |
| PII protection | 2 | PII masking effectiveness |
| Injection defense | 15 | Adversarial prompt attacks |
| Integration (E2E) | 12 | Full pipeline end-to-end |
| Unit tests | 50 | Component-level verification |

Hallucination Detection Results

| Hallucination Type | Cases | Detected | Rate |
| --- | --- | --- | --- |
| Numerical fabrication | 3 | 3 | 100% |
| Return guarantees | 3 | 3 | 100% |
| Investment solicitation | 2 | 2 | 100% |
| PII leakage | 2 | 2 | 100% |
| Insider information | 2 | 2 | 100% |
| Total | 12 | 12 | 100% |

Prompt Injection Defense Results

| Attack Type | Cases | Blocked | Rate |
| --- | --- | --- | --- |
| Direct override | 3 | 3 | 100% |
| Role manipulation | 2 | 2 | 100% |
| Jailbreak | 2 | 2 | 100% |
| Korean injection | 3 | 3 | 100% |
| Unicode obfuscation | 2 | 2 | 100% |
| Format injection | 3 | 3 | 100% |
| Total | 15 | 15 | 100% |

Latency Performance

| Percentile | Latency | Target |
| --- | --- | --- |
| p50 | 45ms | |
| p95 | 142ms | ≤ 200ms ✓ |
| p99 | 187ms | |

Rust vs. Python Native Engine

| Metric | Python | Rust | Improvement |
| --- | --- | --- | --- |
| Throughput | 10K patterns/sec | 85K patterns/sec | 8.5× |
| p95 Latency | 18ms | 2ms | 9.0× |
| Memory | 45MB | 12MB | 3.75× |

Ablation Study

| Configuration | Detection Rate | Notes |
| --- | --- | --- |
| Full system (4 layers) | 100% | Baseline |
| Without Layer 1 (input governance) | 100% | Injection attacks succeed |
| Without Layer 3 (output verification) | ~82% | Numerical errors and ungrounded claims pass |
| Without uncertainty scorer | ~88% | No confidence-based escalation |
| Without consensus validation | ~94% | High-risk queries may pass unchecked |
| Without citation linker | ~91% | Ungrounded claims not flagged |

Key Contributions

  • Four-layer defense pipeline — Input governance, evidence-grounded generation, output verification, and HITL escalation
  • Triple-layered prompt injection defense — 100% detection across 6 attack categories
  • Four-signal uncertainty quantification — Token entropy, self-consistency, claim verification, ECE
  • Multi-LLM consensus validation — Agglomerative clustering for high-risk query verification
  • Rust-accelerated guardrails — 8.5× throughput improvement via PyO3 bindings
  • Three-tier caching — 60% combined hit rate with sub-10ms latency
  • Production-grade performance — 100% hallucination detection at p95 142ms latency

Limitations

This study was evaluated in a Korean financial services context with 104 test cases. Results should not be interpreted as universal guarantees. Additional validation is required for broader domains, multilingual deployments, multimodal inputs, and agentic AI scenarios. The multi-LLM consensus mechanism adds 3-5 seconds of latency when triggered, making it suitable only for high-risk queries.


Interested in applying this research?

Contact the AEGIS Research team to learn how this work can support your AI deployment needs.