Problem
Large language models often generate responses that sound plausible but are not supported by evidence. In enterprise environments — finance, healthcare, legal services — these hallucinations can trigger regulatory liability, reputational damage, and financial loss. Existing mitigation methods are often fragmented, difficult to operationalize, or insufficiently measurable in production settings.
The core challenge is not merely detecting hallucinations after they occur, but building systems that architecturally prevent ungrounded content from reaching end users in the first place.
Architecture Overview
TruthAnchor implements a four-layer defense pipeline, with each layer operating as an independent verification checkpoint:
- Layer 1: Input Governance — Intent classification, prompt injection defense, PII protection
- Layer 2: Evidence-Grounded Generation — Hybrid RAG retrieval, compliance guardrails, constrained generation
- Layer 3: Output Verification — Uncertainty quantification, citation linking, multi-LLM consensus
- Layer 4: Escalation & Human-in-the-Loop — Risk-based routing, human review for high-uncertainty outputs
Layer 1: Input Governance
Intent Classification
Every incoming query is classified into one of seven intent categories with associated risk levels:
| Intent Code | Category | Risk Level |
|---|---|---|
| loan_inquiry | Loan Products & Rates | Medium |
| deposit_inquiry | Deposit & Savings | Low |
| investment_advisory | Investment Advice | High |
| credit_assessment | Credit Evaluation | High |
| insurance_inquiry | Insurance Products | Medium |
| general_inquiry | General Questions | Low |
| out_of_scope | Non-Financial | Reject |
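The mapping above can be encoded as a simple fail-closed lookup: unknown or unclassifiable intents default to rejection. The table data is from the source; the `route` helper itself is a hypothetical sketch, not the production classifier.

```python
# Intent-to-risk routing table, fail-closed on unknown intents.
RISK = {
    "loan_inquiry": "medium",
    "deposit_inquiry": "low",
    "investment_advisory": "high",
    "credit_assessment": "high",
    "insurance_inquiry": "medium",
    "general_inquiry": "low",
    "out_of_scope": "reject",
}

def route(intent: str) -> str:
    # Anything the classifier cannot name is treated as out of scope.
    return RISK.get(intent, "reject")
```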
Triple-Layered Prompt Injection Defense
The system deploys three independent detection layers:
- Static Pattern Matching — Aho-Corasick algorithm with NFKC Unicode normalization and Base64 bypass detection
- ML Classification — Logistic regression classifier trained on adversarial examples
- Structural Analysis — Detects role manipulation, instruction override, and format injection attempts
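The static-pattern layer can be illustrated as follows, assuming a small blocklist. A naive substring scan stands in for the Aho-Corasick automaton, but the two normalization passes shown (NFKC Unicode folding and a Base64 decode attempt per token) mirror the bypass defenses named above.

```python
import base64
import unicodedata

# Illustrative blocklist; the real pattern sets are curated and larger.
BLOCKLIST = ["ignore previous instructions", "you are now", "system prompt"]

def decoded_views(text: str) -> list[str]:
    """Return normalized variants of the input to scan for hidden payloads."""
    views = [unicodedata.normalize("NFKC", text).lower()]
    for token in text.split():
        try:  # surface Base64-obfuscated payloads
            views.append(base64.b64decode(token, validate=True).decode("utf-8").lower())
        except Exception:
            pass  # token was not valid Base64/UTF-8; ignore
    return views

def static_scan(text: str) -> bool:
    """True if any blocklisted pattern appears in any decoded view."""
    return any(p in view for view in decoded_views(text) for p in BLOCKLIST)
```

Production systems would replace the `any(... in ...)` scan with an Aho-Corasick automaton so all patterns are matched in one pass over the text.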
Layer 2: Evidence-Grounded Generation
Three-Tier Data Architecture
| Tier | Content | Authority | Update Frequency |
|---|---|---|---|
| A: Regulatory | Laws, regulations, disclosure requirements | Highest | Quarterly |
| B: Sector | Banking/insurance/securities-specific guidelines | Medium | Monthly |
| C: Institutional | Products, internal policies, FAQs | Operational | Weekly |
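One way to make the authority hierarchy operational is to weight retrieval scores by tier, so that Tier A (regulatory) evidence outranks lower-tier content at comparable similarity. The weights below are illustrative assumptions, not values from the system.

```python
# Authority weights per tier (assumed values for illustration).
TIER_WEIGHT = {"A": 1.0, "B": 0.8, "C": 0.6}

def rerank(hits: list[tuple[str, str, float]]) -> list[tuple[str, str, float]]:
    """Re-rank (doc_id, tier, similarity) hits by authority-weighted score."""
    return sorted(hits, key=lambda h: TIER_WEIGHT[h[1]] * h[2], reverse=True)
```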
Compliance Guardrail Rules
Seven core compliance rules enforce regulatory boundaries at generation time:
| Rule ID | Name | Trigger | Action |
|---|---|---|---|
| CG-001 | Investment Solicitation | 25 solicitation patterns | Replace with disclaimer |
| CG-002 | Return Guarantee | 9 guarantee patterns | Remove + warning |
| CG-003 | Numerical Source Required | Numbers without citation | Flag for review |
| CG-004 | Interest Rate Accuracy | Rate deviation beyond ±0.01 percentage points | Correct or escalate |
| CG-005 | Mandatory Disclaimer | Investment/credit intents | Auto-append disclaimer |
| CG-006 | PII Exposure | Residual PII in output | Mask + alert |
| CG-007 | Ungrounded Generalization | Unsourced broad claims | Flag for review |
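A guardrail pass of this shape can be sketched as a rule table of (trigger, action) pairs applied to the generated text. The regexes and the citation exemption below are simplified stand-ins; the actual rules (CG-001 through CG-007) use curated pattern sets such as the 9 guarantee patterns behind CG-002.

```python
import re

# Two simplified rules standing in for the full CG-001..CG-007 set.
RULES = [
    ("CG-002", re.compile(r"guaranteed\s+return", re.I), "remove_and_warn"),
    ("CG-003", re.compile(r"\b\d+(\.\d+)?\s*%"), "require_citation"),
]

def apply_guardrails(text: str, has_citation: bool) -> list[str]:
    """Return the list of triggered rule actions for a draft response."""
    violations = []
    for rule_id, pattern, action in RULES:
        if pattern.search(text):
            if rule_id == "CG-003" and has_citation:
                continue  # numbers are acceptable when a source is attached
            violations.append(f"{rule_id}:{action}")
    return violations
```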
Layer 3: Output Verification
Four-Signal Uncertainty Quantification
The uncertainty scorer combines four independent signals to produce a composite confidence score:
- Token-Level Entropy (weight: 0.30) — Measures generation confidence at the token level
- Self-Consistency (weight: 0.25) — Samples multiple completions at varied temperatures and measures their agreement
- Claim-Level Verification (weight: 0.30) — Verifies each factual claim against retrieved evidence
- Expected Calibration Error (weight: 0.15) — Rolling calibration against ground truth
Decision thresholds, where U is the composite uncertainty score:
- U < 0.30 → HIGH confidence — Serve response normally
- 0.30 ≤ U < 0.60 → MEDIUM confidence — Add advisory disclaimer
- U ≥ 0.60 → LOW confidence — Escalate for human review
Multi-LLM Consensus Validation
For high-risk queries, responses are validated across multiple LLMs using agglomerative clustering:
| Consensus Score | Classification | Action |
|---|---|---|
| ≥ 0.95 | Full consensus | Approve response |
| ≥ 0.60 | Majority consensus | Log minority dissent, return majority |
| ≥ 0.40 | Partial consensus | Log all responses, return with caveat |
| < 0.40 | No consensus | Escalate to human review |
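A toy version of the consensus score can be computed by clustering model answers on token-set Jaccard similarity and taking the largest cluster's share of all answers. The greedy single-linkage grouping below is a simplified stand-in for the agglomerative clustering the system uses, and the 0.6 similarity threshold is an assumption.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two answers."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def consensus_score(answers: list[str], threshold: float = 0.6) -> float:
    """Largest-cluster fraction under greedy single-linkage grouping."""
    clusters: list[list[str]] = []
    for ans in answers:
        for cluster in clusters:
            if any(jaccard(ans, member) >= threshold for member in cluster):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    return max(len(c) for c in clusters) / len(answers)
```

A score of 1.0 means all models agree; the result is then compared against the 0.95 / 0.60 / 0.40 bands in the table above.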
Layer 4: Escalation & Human-in-the-Loop
| Priority | SLA | Trigger Condition |
|---|---|---|
| CRITICAL | 5 min | HIGH/CRITICAL guardrail violations |
| CRITICAL | 15 min | Investment intent + amount ≥ 100M KRW |
| HIGH | 30 min | Uncertainty U ≥ 0.7 or fraud detection |
| HIGH | 30 min | AML suspicious activity patterns |
| MEDIUM | 60 min | Low retrieval relevance + low intent confidence |
| LOW | Auto | Citation coverage 0.5–0.8 (auto-disclaimer) |
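The routing table translates into an ordered priority check, evaluated top to bottom so the most severe trigger wins. The `Case` record and the 0.5 cutoffs for retrieval relevance and intent confidence are assumptions for illustration; the SLA triggers themselves follow the table.

```python
from dataclasses import dataclass

@dataclass
class Case:
    guardrail_severity: str = "NONE"   # NONE/LOW/MEDIUM/HIGH/CRITICAL
    intent: str = "general_inquiry"
    amount_krw: int = 0
    uncertainty: float = 0.0
    fraud_flag: bool = False
    aml_flag: bool = False
    retrieval_relevance: float = 1.0
    intent_confidence: float = 1.0
    citation_coverage: float = 1.0

def priority(c: Case) -> str:
    """Assign an escalation priority, most severe trigger first."""
    if c.guardrail_severity in ("HIGH", "CRITICAL"):
        return "CRITICAL"
    if c.intent == "investment_advisory" and c.amount_krw >= 100_000_000:
        return "CRITICAL"
    if c.uncertainty >= 0.7 or c.fraud_flag or c.aml_flag:
        return "HIGH"
    if c.retrieval_relevance < 0.5 and c.intent_confidence < 0.5:
        return "MEDIUM"
    if 0.5 <= c.citation_coverage <= 0.8:
        return "LOW"    # auto-disclaimer path, no human SLA
    return "NONE"
```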
System Resilience
Three-Tier Caching Architecture
| Cache Level | Latency | Capacity | Content | Hit Rate |
|---|---|---|---|---|
| L1: In-Memory LRU | < 1ms | 1,000 entries | Interest rates, exchange rates, product info | ~35% |
| L2: Redis Cache | < 5ms | 500 entries | RAG search results, prompt templates | ~25% |
| L3: Semantic Cache | < 10ms | 5,000 entries | Query embedding similarity (≥ 0.95) | ~15% |
Combined effective cache hit rate: ~60%
Evaluation Results
Benchmark Dataset
The system was evaluated on 104 test cases across 10 categories in a Korean financial services context.
| Category | Test Cases | Description |
|---|---|---|
| Factual accuracy | 3 | Correct financial fact generation |
| Numerical accuracy | 2 | Interest rate, fee precision |
| Compliance | 3 | Guardrail rule enforcement |
| Citation verification | 2 | Source attribution accuracy |
| Fabrication detection | 2 | Detecting invented facts |
| PII protection | 2 | PII masking effectiveness |
| Injection defense | 15 | Adversarial prompt attacks |
| Integration (E2E) | 12 | Full pipeline end-to-end |
| Unit tests | 50 | Component-level verification |
Hallucination Detection Results
| Hallucination Type | Cases | Detected | Rate |
|---|---|---|---|
| Numerical fabrication | 3 | 3 | 100% |
| Return guarantees | 3 | 3 | 100% |
| Investment solicitation | 2 | 2 | 100% |
| PII leakage | 2 | 2 | 100% |
| Insider information | 2 | 2 | 100% |
| Total | 12 | 12 | 100% |
Prompt Injection Defense Results
| Attack Type | Cases | Blocked | Rate |
|---|---|---|---|
| Direct override | 3 | 3 | 100% |
| Role manipulation | 2 | 2 | 100% |
| Jailbreak | 2 | 2 | 100% |
| Korean injection | 3 | 3 | 100% |
| Unicode obfuscation | 2 | 2 | 100% |
| Format injection | 3 | 3 | 100% |
| Total | 15 | 15 | 100% |
Latency Performance
| Percentile | Latency | Target |
|---|---|---|
| p50 | 45ms | — |
| p95 | 142ms | ≤ 200ms ✓ |
| p99 | 187ms | — |
Rust vs. Python Native Engine
| Metric | Python | Rust | Improvement |
|---|---|---|---|
| Throughput | 10K patterns/sec | 85K patterns/sec | 8.5× |
| p95 Latency | 18ms | 2ms | 9.0× |
| Memory | 45MB | 12MB | 3.75× |
Ablation Study
| Configuration | Hallucination Detection Rate | Notes |
|---|---|---|
| Full system (4 layers) | 100% | Baseline |
| Without Layer 1 (input governance) | 100% | Injection attacks succeed |
| Without Layer 3 (output verification) | ~82% | Numerical errors and ungrounded claims pass |
| Without uncertainty scorer | ~88% | No confidence-based escalation |
| Without consensus validation | ~94% | High-risk queries may pass unchecked |
| Without citation linker | ~91% | Ungrounded claims not flagged |
Key Contributions
- Four-layer defense pipeline — Input governance, evidence-grounded generation, output verification, and HITL escalation
- Triple-layered prompt injection defense — 100% detection across 6 attack categories
- Four-signal uncertainty quantification — Token entropy, self-consistency, claim verification, ECE
- Multi-LLM consensus validation — Agglomerative clustering for high-risk query verification
- Rust-accelerated guardrails — 8.5× throughput improvement via PyO3 bindings
- Three-tier caching — 60% combined hit rate with sub-10ms latency
- Production-grade performance — 100% hallucination detection at p95 142ms latency
Limitations
The system was evaluated in a Korean financial services context with 104 test cases, so the results should not be interpreted as universal guarantees. Additional validation is required for broader domains, multilingual deployments, multimodal inputs, and agentic AI scenarios. The multi-LLM consensus mechanism adds 3–5 seconds of latency when triggered, making it suitable only for high-risk queries.