Problem
As LLM-based services like ChatGPT, Claude, and Gemini rapidly proliferate, users increasingly submit dangerous prompts — intentionally or unintentionally — that can bypass AI safety mechanisms or leak sensitive information. Existing server-side filtering approaches suffer from three critical limitations:
- Network latency — delayed threat detection
- Privacy violations — user prompts transmitted to external servers
- Service dependency — tied to specific AI service providers
AEGINEL Guard addresses these limitations by adopting a client-side, on-device inference pipeline that runs entirely within the user's browser, ensuring real-time protection without compromising privacy.
Our Approach
We defined the task as multi-label text classification across 7 categories (including "safe"), targeting 6 threat types:
- Jailbreak — attempts to bypass AI safety restrictions
- Prompt Injection — instructions to override system prompts
- Harmful Content — requests for dangerous or illegal information
- Script Evasion — code injection and XSS-style attacks
- Social Engineering — manipulation through social context
- Encoding Bypass — obfuscation techniques to evade filters
We built a custom dataset of 188,109 samples across 8 languages (Korean, English, Arabic, Spanish, Russian, Malay, Chinese, Japanese) and compared three Transformer models under browser extension deployment constraints (< 150 MB).
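As a minimal sketch of the multi-label setup, each prompt maps to a 7-dimensional multi-hot target vector (6 threat types plus "safe"), so several threats can co-occur on one sample. The label identifiers below are assumptions that mirror the category names above, not the dataset's exact field names:

```python
# Multi-hot target encoding for the 7-label scheme (6 threats + "safe").
# Label identifiers are illustrative, mirroring the categories above.
LABELS = [
    "safe",
    "jailbreak",
    "prompt_injection",
    "harmful_content",
    "script_evasion",
    "social_engineering",
    "encoding_bypass",
]

def encode_targets(active_labels):
    """Return a 7-dim multi-hot vector; multiple threats may co-occur."""
    active = set(active_labels)
    unknown = active - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {unknown}")
    return [1.0 if label in active else 0.0 for label in LABELS]

# A single prompt can carry more than one threat type at once:
encode_targets(["prompt_injection", "encoding_bypass"])
# -> [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
```

Multi-hot targets like this are what make Binary Cross-Entropy with Logits (rather than softmax cross-entropy) the natural loss choice, as noted in the evaluation design below.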
System Architecture
```
User Input (AI Service Textbox)
        │
        ▼
Content Script → Service Worker → Offscreen Document
        │                                │
        │              Transformers.js + ONNX Runtime
        │                                │
        ▼                                ▼
Warning Banner ←──────────── Classification Result
```
Key Contributions
- Built a 188,109-sample, 8-language dataset for multi-label prompt security classification
- Conducted comparative experiments across three multilingual Transformer models (XLM-RoBERTa, DistilBERT, mDeBERTa) with quantitative performance-size-speed tradeoff analysis
- Achieved 100% binary detection accuracy (safe vs. harmful, 9/9 test prompts) with the final model
- Delivered a 129.5 MB INT8 ONNX model with 7.6 ms/sample CPU inference — fully deployable in Chrome extensions via Transformers.js
- Identified and resolved critical bugs in multi-label sigmoid inference and model path configuration
Key Findings
Model Comparison Results
| Model | Parameters | F1-micro | F1-macro | INT8 ONNX Size | Training Time |
|---|---|---|---|---|---|
| XLM-RoBERTa-base | 278M | 0.99969 | 0.99974 | 265.9 MB | 43 min |
| DistilBERT-multilingual | 135M | 0.99780 | 0.99813 | 129.5 MB | 16 min |
| mDeBERTa-v3-base | 279M | 0.99969 | 0.99974 | 322.8 MB | 52 min |
Performance-Size Tradeoff
The F1-micro difference between the top-performing models (XLM-RoBERTa, mDeBERTa) and DistilBERT is only 0.189 percentage points, while DistilBERT is 2.05x smaller than XLM-RoBERTa and 2.49x smaller than mDeBERTa.
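The tradeoff figures quoted above follow directly from the comparison table; a quick arithmetic check:

```python
# Reproduce the tradeoff numbers from the model comparison table.
f1_top, f1_distil = 0.99969, 0.99780                          # F1-micro
size_xlmr, size_distil, size_mdeberta = 265.9, 129.5, 322.8   # INT8 ONNX, MB

f1_gap_pp = (f1_top - f1_distil) * 100   # gap in percentage points
ratio_xlmr = size_xlmr / size_distil     # how much smaller DistilBERT is
ratio_mdeberta = size_mdeberta / size_distil

print(f"{f1_gap_pp:.3f} pp")     # 0.189 pp
print(f"{ratio_xlmr:.2f}x")      # 2.05x
print(f"{ratio_mdeberta:.2f}x")  # 2.49x
```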
Inference Test Results
| Metric | Value |
|---|---|
| Label-specific accuracy | 66.7% (6/9) |
| Binary detection accuracy (safe vs. harmful) | 100% (9/9) |
| Multilingual prompt injection detection | 100% (3/3) |
| Average inference latency (CPU, INT8) | 7.6 ms/sample |
All label confusion cases occurred within the same threat cluster — the binary "is this dangerous?" decision achieved perfect accuracy across all test prompts, including Korean, English, Japanese, and even French inputs (French is outside the 8 training languages).
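This is why within-cluster label confusion leaves binary accuracy untouched: any fired threat label yields the same harmful verdict. A minimal sketch of the collapse, with label names assumed from the taxonomy above:

```python
# Collapse a multi-label prediction to the binary safe/harmful decision.
# Label identifiers are assumptions mirroring the 6 threat categories.
THREAT_LABELS = {"jailbreak", "prompt_injection", "harmful_content",
                 "script_evasion", "social_engineering", "encoding_bypass"}

def is_harmful(predicted_labels):
    """True if any threat label fired, regardless of *which* one."""
    return bool(THREAT_LABELS & set(predicted_labels))

# Confusing encoding_bypass with script_evasion changes the label-level
# score but not the binary verdict:
is_harmful({"script_evasion"})   # True (even if gold was encoding_bypass)
is_harmful({"safe"})             # False
```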
Evaluation Design
- Dataset: 188,109 samples, self-constructed multilingual prompt security dataset
- Languages: 8 (ko 24.9%, en 21.6%, ar 9.5%, es 9.1%, ru 9.1%, ms 8.8%, zh 8.5%, ja 8.5%)
- Train/Val Split: 90:10 (169,298 / 18,811)
- Multi-label samples: 21,544 (11.5%)
- Loss Function: Binary Cross-Entropy with Logits
- Inference Threshold: sigmoid >= 0.5
- Optimization Metric: F1-micro
- Quantization: INT8 Dynamic Quantization (AVX-512 VNNI target)
- Hardware: NVIDIA RTX A5000 x2, CUDA 12.8
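The decision rule in the bullets above — an independent sigmoid per label with a 0.5 threshold, rather than a softmax over labels — can be sketched in pure Python. Label names and logit values here are illustrative:

```python
import math

THRESHOLD = 0.5  # sigmoid(z) >= 0.5  is equivalent to  z >= 0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decode(logits, labels):
    """Multi-label decode: each label's sigmoid is thresholded
    independently, so several labels may fire at once (unlike softmax,
    which would force a single winner)."""
    return [lab for lab, z in zip(labels, logits)
            if sigmoid(z) >= THRESHOLD]

labels = ["safe", "jailbreak", "prompt_injection"]
decode([-3.1, 2.4, 0.7], labels)   # -> ["jailbreak", "prompt_injection"]
```

Getting this step wrong (e.g. applying a softmax or an argmax to the logits) is exactly the kind of multi-label sigmoid inference bug flagged in the contributions above.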
Final Model Selection Criteria
| Priority | Criterion | Requirement | DistilBERT Result |
|---|---|---|---|
| 1 | Deployment size | < 150 MB | 129.5 MB |
| 2 | Binary detection accuracy | 100% | 100% |
| 3 | Inference speed | Real-time | 7.6 ms/sample |
| 4 | Label-specific accuracy | Best effort | 99.78% F1-micro |
Business Relevance
This research demonstrates that enterprise-grade prompt security can be delivered entirely on-device without server-side dependencies, privacy trade-offs, or network latency. Key implications for organizations:
- Privacy-first AI safety: No user prompts leave the local device
- Universal protection: Works across any LLM service (ChatGPT, Claude, Gemini, etc.)
- Deployment-ready: Complete Chrome extension pipeline validated with production build
- Multilingual coverage: 8 languages supported out of the box, covering major global markets
- Cost-effective: No API costs or server infrastructure required for prompt filtering
The AEGINEL Guard technology directly extends the AEGIS Guardrail product family, providing a complementary client-side protection layer alongside server-side safety systems.
Limitations
- Label-specific accuracy: 66.7% — confusion between encoding_bypass, social_engineering, and script_evasion categories due to overlapping threat characteristics
- Browser inference benchmarks: Actual Transformers.js WASM inference speed in Chrome not yet measured (the 7.6 ms server-CPU baseline may differ in-browser)
- Synthetic data bias: Training dataset is primarily synthetic; real-world prompt distribution may differ
- Language coverage: 8 languages covered; additional languages require dataset expansion
- Adversarial robustness: Not yet tested against advanced adversarial prompt techniques specifically designed to evade the classifier