AEGIS-TR-2026-002 · Technical Report · v1.0

AEGINEL Guard: Multilingual AI Prompt Security Classifier for Browser Extensions

Development of a lightweight on-device multi-label threat classifier for real-time prompt safety

Authors: Kwangil Kim, AEGIS Research Team
Published: March 2026
Affiliation: AEGIS Research
Tags: prompt-security, guardrail, multilingual, on-device, browser-extension, ONNX, jailbreak-detection

Summary

This research presents the full development pipeline of AEGINEL Guard, a multilingual AI prompt security classifier designed to run entirely on-device within Chrome browser extensions. The classifier detects six threat categories — Jailbreak, Prompt Injection, Harmful Content, Script Evasion, Social Engineering, and Encoding Bypass — using a multi-label classification approach. Trained on 188,109 samples across 8 languages, the final DistilBERT-based model achieves 100% binary (safe vs. harmful) detection accuracy on the inference test set, with 7.6 ms/sample CPU inference in a 129.5 MB INT8 ONNX package.

Problem

As LLM-based services like ChatGPT, Claude, and Gemini rapidly proliferate, users increasingly submit dangerous prompts — intentionally or unintentionally — that can bypass AI safety mechanisms or leak sensitive information. Existing server-side filtering approaches suffer from three critical limitations:

  1. Network latency — delayed threat detection
  2. Privacy violations — user prompts transmitted to external servers
  3. Service dependency — tied to specific AI service providers

AEGINEL Guard addresses these limitations by adopting a client-side, on-device inference pipeline that runs entirely within the user's browser, ensuring real-time protection without compromising privacy.

Our Approach

We defined the task as multi-label text classification across 7 categories (including "safe"), targeting 6 threat types:

  • Jailbreak — attempts to bypass AI safety restrictions
  • Prompt Injection — instructions to override system prompts
  • Harmful Content — requests for dangerous or illegal information
  • Script Evasion — code injection and XSS-style attacks
  • Social Engineering — manipulation through social context
  • Encoding Bypass — obfuscation techniques to evade filters
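Because a single prompt can exhibit several of these threats at once (e.g. a jailbreak delivered through an encoding trick), each sample carries an independent 0/1 target per category rather than one exclusive class. The following is a minimal sketch of that multi-hot encoding; the label names match the categories above, but `LABELS` and `to_multi_hot` are illustrative names, not identifiers from the AEGINEL Guard codebase.

```python
# Illustrative label set: the six threat categories plus "safe".
LABELS = ["safe", "jailbreak", "prompt_injection", "harmful_content",
          "script_evasion", "social_engineering", "encoding_bypass"]

def to_multi_hot(active):
    """Encode a set of active category names as a 0/1 vector over LABELS."""
    return [1 if name in active else 0 for name in LABELS]

# A prompt can carry several threat labels at once:
to_multi_hot({"jailbreak", "encoding_bypass"})  # → [0, 1, 0, 0, 0, 0, 1]
```

Each position in the vector is then trained and thresholded independently, which is what distinguishes this multi-label setup from ordinary single-label classification.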

We built a custom dataset of 188,109 samples across 8 languages (Korean, English, Arabic, Spanish, Russian, Malay, Chinese, Japanese) and compared three Transformer models under browser extension deployment constraints (< 150 MB).

System Architecture

User Input (AI Service Textbox)
        │
        ▼
Content Script → Service Worker → Offscreen Document
        │                              │
        │                    Transformers.js + ONNX Runtime
        │                              │
        ▼                              ▼
Warning Banner ←── Classification Result

Key Contributions

  • Built a 188,109-sample, 8-language dataset for multi-label prompt security classification
  • Conducted comparative experiments across three multilingual Transformer models (XLM-RoBERTa, DistilBERT, mDeBERTa) with quantitative performance-size-speed tradeoff analysis
  • Achieved 100% binary detection accuracy (safe vs. harmful) with the final model
  • Delivered a 129.5 MB INT8 ONNX model with 7.6 ms/sample CPU inference — fully deployable in Chrome extensions via Transformers.js
  • Identified and resolved critical bugs in multi-label sigmoid inference and model path configuration

Key Findings

Model Comparison Results

Model                     Parameters   F1-micro   F1-macro   INT8 ONNX Size   Training Time
XLM-RoBERTa-base          278M         0.99969    0.99974    265.9 MB         43 min
DistilBERT-multilingual   135M         0.99780    0.99813    129.5 MB         16 min
mDeBERTa-v3-base          279M         0.99969    0.99974    322.8 MB         52 min

Performance-Size Tradeoff

The F1-micro difference between the top-performing models (XLM-RoBERTa, mDeBERTa) and DistilBERT is only 0.189 percentage points, while DistilBERT is 2.05x smaller than XLM-RoBERTa and 2.49x smaller than mDeBERTa.
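The tradeoff figures can be reproduced directly from the comparison table; this short check is just arithmetic on the reported numbers, not output from the training pipeline.

```python
# F1-micro gap between the top models and DistilBERT, in percentage points.
f1_top, f1_distil = 0.99969, 0.99780
gap_pp = round((f1_top - f1_distil) * 100, 3)   # 0.189 pp

# Size ratios of the INT8 ONNX exports relative to DistilBERT (MB).
size_xlmr, size_distil, size_mdeberta = 265.9, 129.5, 322.8
ratio_xlmr = round(size_xlmr / size_distil, 2)        # 2.05x
ratio_mdeberta = round(size_mdeberta / size_distil, 2)  # 2.49x
```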

Inference Test Results

Metric                                          Value
Label-specific accuracy                         66.7% (6/9)
Binary detection accuracy (safe vs. harmful)    100% (9/9)
Multilingual prompt injection detection         100% (3/3)
Average inference latency (CPU, INT8)           7.6 ms/sample

All label confusion cases occurred within the same threat cluster; the binary "is this dangerous?" decision was correct for every test prompt, including Korean, English, Japanese, and French inputs (French being outside the eight training languages).
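This is why within-cluster label confusion does not hurt the binary verdict: the "safe vs. harmful" decision only asks whether *any* threat label clears the threshold, not *which* one. A minimal sketch of that reduction, with an illustrative score layout (label name → sigmoid probability) that is an assumption, not the extension's actual output format:

```python
def is_dangerous(scores, threshold=0.5):
    """Binary safe-vs-harmful verdict: harmful if any label other than
    'safe' clears the sigmoid threshold. Score layout is illustrative."""
    return any(p >= threshold
               for label, p in scores.items() if label != "safe")

# Confusing script_evasion with encoding_bypass still yields the
# correct binary verdict, since both are threat labels:
is_dangerous({"safe": 0.02, "script_evasion": 0.81, "encoding_bypass": 0.11})  # → True
```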

Evaluation Design

  • Dataset: 188,109 samples, self-constructed multilingual prompt security dataset
  • Languages: 8 (ko 24.9%, en 21.6%, ar 9.5%, es 9.1%, ru 9.1%, ms 8.8%, zh 8.5%, ja 8.5%)
  • Train/Val Split: 90:10 (169,298 / 18,811)
  • Multi-label samples: 21,544 (11.5%)
  • Loss Function: Binary Cross-Entropy with Logits
  • Inference Threshold: sigmoid >= 0.5
  • Optimization Metric: F1-micro
  • Quantization: INT8 Dynamic Quantization (AVX-512 VNNI target)
  • Hardware: NVIDIA RTX A5000 x2, CUDA 12.8
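The loss and thresholding choices above can be sketched in plain Python: each label gets an independent sigmoid and its own binary cross-entropy term, and at inference time every label whose sigmoid reaches 0.5 fires. This is a minimal reference implementation of the math, not the training code (which would use `BCEWithLogitsLoss` in a framework such as PyTorch).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over per-label logits (targets are 0/1)."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, targets):
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)

def predict(logits, threshold=0.5):
    """Multi-label decode: each label fires independently at sigmoid >= 0.5."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]

predict([2.0, -3.0, 0.0])  # → [1, 0, 1]: sigmoid(0) = 0.5 meets the threshold
```

Unlike softmax classification, nothing forces the per-label probabilities to sum to one, which is what allows the 11.5% of samples carrying multiple labels to be modeled directly.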

Final Model Selection Criteria

Priority   Criterion                   Requirement   DistilBERT Result
1          Deployment size             < 150 MB      129.5 MB
2          Binary detection accuracy   100%          100%
3          Inference speed             Real-time     7.6 ms/sample
4          Label-specific accuracy     Best effort   99.78% F1-micro

Business Relevance

This research demonstrates that enterprise-grade prompt security can be delivered entirely on-device without server-side dependencies, privacy trade-offs, or network latency. Key implications for organizations:

  • Privacy-first AI safety: No user prompts leave the local device
  • Universal protection: Works across any LLM service (ChatGPT, Claude, Gemini, etc.)
  • Deployment-ready: Complete Chrome extension pipeline validated with production build
  • Multilingual coverage: 8 languages supported out of the box, covering major global markets
  • Cost-effective: No API costs or server infrastructure required for prompt filtering

The AEGINEL Guard technology directly extends the AEGIS Guardrail product family, providing a complementary client-side protection layer alongside server-side safety systems.

Limitations

  • Label-specific accuracy: 66.7% — confusion between encoding_bypass, social_engineering, and script_evasion categories due to overlapping threat characteristics
  • Browser inference benchmarks: Actual Transformers.js WASM inference speed in Chrome not yet measured (server CPU baseline of 7.6 ms may differ)
  • Synthetic data bias: Training dataset is primarily synthetic; real-world prompt distribution may differ
  • Language coverage: 8 languages covered; additional languages require dataset expansion
  • Adversarial robustness: Not yet tested against advanced adversarial prompt techniques specifically designed to evade the classifier

Assets & Downloads

  • GitHub Repository: Try It Out
  • Executive Summary: Coming Soon
  • Slide Deck: Coming Soon
  • Demo Video: Coming Soon

Interested in applying this research?

Contact the AEGIS Research team to learn how this work can support your AI deployment needs.