AEGIS-TR-2026-002 · Technical Report · v1.0

AEGINEL Guard: Multilingual AI Prompt Security Classifier for Browser Extensions

Development of a lightweight on-device multi-label threat classifier for real-time prompt safety

Author: Kwangil Kim, AEGIS Research Team
Published: March 2026
Affiliation: AEGIS Research
Tags: prompt-security · guardrail · multilingual · on-device · browser-extension · ONNX · jailbreak-detection

Abstract

This research presents the full development pipeline of AEGINEL Guard, a multilingual AI prompt security classifier designed to run entirely on-device within Chrome browser extensions. The classifier detects six threat categories — Jailbreak, Prompt Injection, Harmful Content, Script Evasion, Social Engineering, and Encoding Bypass — using a multi-label classification approach. Trained on 188,109 samples across 8 languages, the final DistilBERT-based model achieves 100% binary detection accuracy at 7.6 ms/sample inference speed in a 129.5 MB INT8 ONNX package.

Problem

As LLM-based services like ChatGPT, Claude, and Gemini rapidly proliferate, users increasingly submit dangerous prompts — intentionally or unintentionally — that can bypass AI safety mechanisms or leak sensitive information. Existing server-side filtering approaches suffer from three critical limitations:

  1. Network latency — delayed threat detection
  2. Privacy violations — user prompts transmitted to external servers
  3. Service dependency — tied to specific AI service providers

AEGINEL Guard addresses these limitations by adopting a client-side, on-device inference pipeline that runs entirely within the user's browser, ensuring real-time protection without compromising privacy.

Our Approach

We defined the task as multi-label text classification across 7 categories (including "safe"), targeting 6 threat types:

  • Jailbreak — attempts to bypass AI safety restrictions
  • Prompt Injection — instructions to override system prompts
  • Harmful Content — requests for dangerous or illegal information
  • Script Evasion — code injection and XSS-style attacks
  • Social Engineering — manipulation through social context
  • Encoding Bypass — obfuscation techniques to evade filters

We built a custom dataset of 188,109 samples across 8 languages (Korean, English, Arabic, Spanish, Russian, Malay, Chinese, Japanese) and compared three Transformer models under browser extension deployment constraints (< 150 MB).
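Because a single prompt can belong to several threat categories at once, each sample is tagged with a multi-hot vector over the 7-category scheme above. The sketch below illustrates this encoding; the label names come from the report, but the helper function and sample data are illustrative assumptions, not the actual dataset pipeline.

```python
# Multi-hot label encoding for the 7-category scheme (6 threats + "safe").
# Label names are from the report; the helper is an illustrative assumption.
LABELS = [
    "safe", "jailbreak", "prompt_injection", "harmful_content",
    "script_evasion", "social_engineering", "encoding_bypass",
]
LABEL_INDEX = {name: i for i, name in enumerate(LABELS)}

def encode_labels(active: list[str]) -> list[int]:
    """Return a multi-hot vector: 1 for each active category, else 0."""
    vec = [0] * len(LABELS)
    for name in active:
        vec[LABEL_INDEX[name]] = 1
    return vec

# A sample tagged with two threat types at once (the multi-label case,
# 11.5% of the dataset):
print(encode_labels(["prompt_injection", "encoding_bypass"]))
# [0, 0, 1, 0, 0, 0, 1]
```

Unlike single-label classification, the vector is not one-hot: any subset of categories may be active, which is why training uses a per-label binary loss rather than softmax cross-entropy.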

System Architecture

User Input (AI Service Textbox)
        │
        ▼
Content Script → Service Worker → Offscreen Document
        │                              │
        │                    Transformers.js + ONNX Runtime
        │                              │
        ▼                              ▼
Warning Banner ←── Classification Result

Key Contributions

  • Built a 188,109-sample, 8-language dataset for multi-label prompt security classification
  • Conducted comparative experiments across three multilingual Transformer models (XLM-RoBERTa, DistilBERT, mDeBERTa) with quantitative performance-size-speed tradeoff analysis
  • Achieved 100% binary detection accuracy (safe vs. harmful) with the final model
  • Delivered a 129.5 MB INT8 ONNX model with 7.6 ms/sample CPU inference — fully deployable in Chrome extensions via Transformers.js
  • Identified and resolved critical bugs in multi-label sigmoid inference and model path configuration

Key Findings

Model Comparison Results

Model                     Parameters   F1-micro   F1-macro   INT8 ONNX Size   Training Time
XLM-RoBERTa-base          278M         0.99969    0.99974    265.9 MB         43 min
DistilBERT-multilingual   135M         0.99780    0.99813    129.5 MB         16 min
mDeBERTa-v3-base          279M         0.99969    0.99974    322.8 MB         52 min

Performance-Size Tradeoff

The F1-micro difference between the top-performing models (XLM-RoBERTa, mDeBERTa) and DistilBERT is only 0.189 percentage points, while DistilBERT is 2.05x smaller than XLM-RoBERTa and 2.49x smaller than mDeBERTa.
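The tradeoff figures can be verified directly from the comparison table; the snippet below is a quick arithmetic check using the reported sizes and F1 scores.

```python
# Quick check of the reported tradeoff numbers (values taken from the
# model comparison table above).
f1_top, f1_distil = 0.99969, 0.99780
size_xlmr, size_distil, size_mdeberta = 265.9, 129.5, 322.8  # MB, INT8 ONNX

gap_pp = (f1_top - f1_distil) * 100           # F1-micro gap in percentage points
print(round(gap_pp, 3))                       # 0.189
print(round(size_xlmr / size_distil, 2))      # 2.05  (XLM-RoBERTa vs DistilBERT)
print(round(size_mdeberta / size_distil, 2))  # 2.49  (mDeBERTa vs DistilBERT)
```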

Inference Test Results

Metric                                          Value
Label-specific accuracy                         66.7% (6/9)
Binary detection accuracy (safe vs. harmful)    100% (9/9)
Multilingual prompt injection detection         100% (3/3)
Average inference latency (CPU, INT8)           7.6 ms/sample

All label confusion cases occurred within the same threat cluster — the binary "is this dangerous?" decision achieved perfect accuracy across all test prompts, including Korean, English, Japanese, and French inputs.
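The gap between the two accuracy figures follows from how the binary decision is derived: a prompt counts as harmful if any threat label fires, even when the specific label misses the gold category. A minimal sketch of that collapse, using the report's label set (the function name and sample predictions are illustrative):

```python
# Collapsing a multi-label prediction into the binary "is this dangerous?"
# decision. The label set is from the report; sample inputs are illustrative.
THREAT_LABELS = {
    "jailbreak", "prompt_injection", "harmful_content",
    "script_evasion", "social_engineering", "encoding_bypass",
}

def is_harmful(predicted: set[str]) -> bool:
    """Harmful if any threat label fires, regardless of whether the
    specific label matches the gold category."""
    return bool(predicted & THREAT_LABELS)

# A within-cluster confusion: gold = encoding_bypass, predicted = script_evasion.
# Label-specific accuracy scores this as a miss; binary detection as a hit.
print(is_harmful({"script_evasion"}))  # True
print(is_harmful({"safe"}))            # False
```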

Evaluation Design

  • Dataset: 188,109 samples, self-constructed multilingual prompt security dataset
  • Languages: 8 (ko 24.9%, en 21.6%, ar 9.5%, es 9.1%, ru 9.1%, ms 8.8%, zh 8.5%, ja 8.5%)
  • Train/Val Split: 90:10 (169,298 / 18,811)
  • Multi-label samples: 21,544 (11.5%)
  • Loss Function: Binary Cross-Entropy with Logits
  • Inference Threshold: sigmoid >= 0.5
  • Optimization Metric: F1-micro
  • Quantization: INT8 Dynamic Quantization (AVX-512 VNNI target)
  • Hardware: NVIDIA RTX A5000 x2, CUDA 12.8
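The loss and threshold listed above imply an independent sigmoid per label at inference time, rather than a softmax over categories. A minimal pure-Python sketch of that decision rule follows; the logit values are illustrative, not real model outputs.

```python
import math

# Multi-label decision rule from the evaluation design: sigmoid per logit,
# emit every label whose probability clears the 0.5 threshold.
LABELS = [
    "safe", "jailbreak", "prompt_injection", "harmful_content",
    "script_evasion", "social_engineering", "encoding_bypass",
]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def predict(logits: list[float], threshold: float = 0.5) -> list[str]:
    """Independent sigmoid per label (not softmax), so several labels
    can fire on the same prompt."""
    return [lbl for lbl, z in zip(LABELS, logits) if sigmoid(z) >= threshold]

# Illustrative logits: three labels clear the threshold simultaneously.
logits = [-4.1, 0.8, 2.3, -3.0, -2.2, -1.7, 1.1]
print(predict(logits))  # ['jailbreak', 'prompt_injection', 'encoding_bypass']
```

Applying a softmax here instead of per-label sigmoids would force the label probabilities to compete, breaking the multi-label semantics; this distinction is exactly the kind of inference bug the contributions section notes was identified and fixed.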

Final Model Selection Criteria

Priority   Criterion                   Requirement   DistilBERT Result
1          Deployment size             < 150 MB      129.5 MB
2          Binary detection accuracy   100%          100%
3          Inference speed             Real-time     7.6 ms/sample
4          Label-specific accuracy     Best effort   99.78% F1-micro

Business Relevance

This research demonstrates that enterprise-grade prompt security can be delivered entirely on-device without server-side dependencies, privacy trade-offs, or network latency. Key implications for organizations:

  • Privacy-first AI safety: No user prompts leave the local device
  • Universal protection: Works across any LLM service (ChatGPT, Claude, Gemini, etc.)
  • Deployment-ready: Complete Chrome extension pipeline validated with production build
  • Multilingual coverage: 8 languages supported out of the box, covering major global markets
  • Cost-effective: No API costs or server infrastructure required for prompt filtering

The AEGINEL Guard technology directly extends the AEGIS Guardrail product family, providing a complementary client-side protection layer alongside server-side safety systems.

Limitations

  • Label-specific accuracy: 66.7% — confusion between encoding_bypass, social_engineering, and script_evasion categories due to overlapping threat characteristics
  • Browser inference benchmarks: Actual Transformers.js WASM inference speed in Chrome not yet measured (server CPU baseline of 7.6 ms may differ)
  • Synthetic data bias: Training dataset is primarily synthetic; real-world prompt distribution may differ
  • Language coverage: 8 languages covered; additional languages require dataset expansion
  • Adversarial robustness: Not yet tested against advanced adversarial prompt techniques specifically designed to evade the classifier

Resources & Downloads

GitHub repository — try it out
Executive summary — in preparation
Slide deck — in preparation
Demo video — in preparation

Interested in applying this research?

Contact the AEGIS Research team to learn how this work can support your AI deployment needs.