AEGIS-TR-2026-002 · Technical Report · v1.0

AEGINEL Guard: Multilingual AI Prompt Security Classifier for Browser Extensions

Development of a lightweight on-device multi-label threat classifier for real-time prompt safety

Author: Kwangil Kim, AEGIS Research Team
Published: March 2026
Affiliation: AEGIS Research
Tags: prompt-security · guardrail · multilingual · on-device · browser-extension · ONNX · jailbreak-detection

Abstract

This research presents the full development pipeline of AEGINEL Guard, a multilingual AI prompt security classifier designed to run entirely on-device within Chrome browser extensions. The classifier detects six threat categories — Jailbreak, Prompt Injection, Harmful Content, Script Evasion, Social Engineering, and Encoding Bypass — using a multi-label classification approach. Trained on 188,109 samples across 8 languages, the final DistilBERT-based model achieves 100% binary detection accuracy at 7.6 ms/sample inference speed in a 129.5 MB INT8 ONNX package.

Problem

As LLM-based services like ChatGPT, Claude, and Gemini rapidly proliferate, users increasingly submit dangerous prompts — intentionally or unintentionally — that can bypass AI safety mechanisms or leak sensitive information. Existing server-side filtering approaches suffer from three critical limitations:

  1. Network latency — delayed threat detection
  2. Privacy violations — user prompts transmitted to external servers
  3. Service dependency — tied to specific AI service providers

AEGINEL Guard addresses these limitations by adopting a client-side, on-device inference pipeline that runs entirely within the user's browser, ensuring real-time protection without compromising privacy.

Our Approach

We defined the task as multi-label text classification across 7 categories (including "safe"), targeting 6 threat types:

  • Jailbreak — attempts to bypass AI safety restrictions
  • Prompt Injection — instructions to override system prompts
  • Harmful Content — requests for dangerous or illegal information
  • Script Evasion — code injection and XSS-style attacks
  • Social Engineering — manipulation through social context
  • Encoding Bypass — obfuscation techniques to evade filters

We built a custom dataset of 188,109 samples across 8 languages (Korean, English, Arabic, Spanish, Russian, Malay, Chinese, Japanese) and compared three Transformer models under browser extension deployment constraints (< 150 MB).
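Because a single prompt can belong to several threat categories at once, each sample is tagged with a multi-hot vector over the 7-category scheme above. The sketch below illustrates this encoding; the label names come from the report, but the helper function and sample data are illustrative assumptions, not the actual dataset pipeline.

```python
# Multi-hot label encoding for the 7-category scheme (6 threats + "safe").
# Label names are from the report; the helper is an illustrative assumption.
LABELS = [
    "safe", "jailbreak", "prompt_injection", "harmful_content",
    "script_evasion", "social_engineering", "encoding_bypass",
]
LABEL_INDEX = {name: i for i, name in enumerate(LABELS)}

def encode_labels(active: list[str]) -> list[int]:
    """Return a multi-hot vector: 1 for each active category, else 0."""
    vec = [0] * len(LABELS)
    for name in active:
        vec[LABEL_INDEX[name]] = 1
    return vec

# A sample tagged with two threat types at once (the multi-label case,
# 11.5% of the dataset):
print(encode_labels(["prompt_injection", "encoding_bypass"]))
# [0, 0, 1, 0, 0, 0, 1]
```

Unlike single-label classification, the vector is not one-hot: any subset of categories may be active, which is why training uses a per-label binary loss rather than softmax cross-entropy.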

System Architecture

User Input (AI Service Textbox)
        │
        ▼
Content Script → Service Worker → Offscreen Document
        │                              │
        │                    Transformers.js + ONNX Runtime
        │                              │
        ▼                              ▼
Warning Banner ←── Classification Result

Key Contributions

  • Built a 188,109-sample, 8-language dataset for multi-label prompt security classification
  • Conducted comparative experiments across three multilingual Transformer models (XLM-RoBERTa, DistilBERT, mDeBERTa) with quantitative performance-size-speed tradeoff analysis
  • Achieved 100% binary detection accuracy (safe vs. harmful) with the final model
  • Delivered a 129.5 MB INT8 ONNX model with 7.6 ms/sample CPU inference — fully deployable in Chrome extensions via Transformers.js
  • Identified and resolved critical bugs in multi-label sigmoid inference and model path configuration

Key Findings

Model Comparison Results

Model                     Parameters   F1-micro   F1-macro   INT8 ONNX Size   Training Time
XLM-RoBERTa-base          278M         0.99969    0.99974    265.9 MB         43 min
DistilBERT-multilingual   135M         0.99780    0.99813    129.5 MB         16 min
mDeBERTa-v3-base          279M         0.99969    0.99974    322.8 MB         52 min

Performance-Size Tradeoff

The F1-micro difference between the top-performing models (XLM-RoBERTa, mDeBERTa) and DistilBERT is only 0.189 percentage points, while DistilBERT is 2.05x smaller than XLM-RoBERTa and 2.49x smaller than mDeBERTa.
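The tradeoff figures can be verified directly from the comparison table; the snippet below is a quick arithmetic check using the reported sizes and F1 scores.

```python
# Quick check of the reported tradeoff numbers (values taken from the
# model comparison table above).
f1_top, f1_distil = 0.99969, 0.99780
size_xlmr, size_distil, size_mdeberta = 265.9, 129.5, 322.8  # MB, INT8 ONNX

gap_pp = (f1_top - f1_distil) * 100           # F1-micro gap in percentage points
print(round(gap_pp, 3))                       # 0.189
print(round(size_xlmr / size_distil, 2))      # 2.05  (XLM-RoBERTa vs DistilBERT)
print(round(size_mdeberta / size_distil, 2))  # 2.49  (mDeBERTa vs DistilBERT)
```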

Inference Test Results

Metric                                          Value
Label-specific accuracy                         66.7% (6/9)
Binary detection accuracy (safe vs. harmful)    100% (9/9)
Multilingual prompt injection detection         100% (3/3)
Average inference latency (CPU, INT8)           7.6 ms/sample

All label confusion cases occurred within the same threat cluster — the binary "is this dangerous?" decision achieved perfect accuracy across all test prompts, including Korean, English, Japanese, and French inputs.
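The gap between the two accuracy figures follows from how the binary decision is derived: a prompt counts as harmful if any threat label fires, even when the specific label misses the gold category. A minimal sketch of that collapse, using the report's label set (the function name and sample predictions are illustrative):

```python
# Collapsing a multi-label prediction into the binary "is this dangerous?"
# decision. The label set is from the report; sample inputs are illustrative.
THREAT_LABELS = {
    "jailbreak", "prompt_injection", "harmful_content",
    "script_evasion", "social_engineering", "encoding_bypass",
}

def is_harmful(predicted: set[str]) -> bool:
    """Harmful if any threat label fires, regardless of whether the
    specific label matches the gold category."""
    return bool(predicted & THREAT_LABELS)

# A within-cluster confusion: gold = encoding_bypass, predicted = script_evasion.
# Label-specific accuracy scores this as a miss; binary detection as a hit.
print(is_harmful({"script_evasion"}))  # True
print(is_harmful({"safe"}))            # False
```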

Evaluation Design

  • Dataset: 188,109 samples, self-constructed multilingual prompt security dataset
  • Languages: 8 (ko 24.9%, en 21.6%, ar 9.5%, es 9.1%, ru 9.1%, ms 8.8%, zh 8.5%, ja 8.5%)
  • Train/Val Split: 90:10 (169,298 / 18,811)
  • Multi-label samples: 21,544 (11.5%)
  • Loss Function: Binary Cross-Entropy with Logits
  • Inference Threshold: sigmoid >= 0.5
  • Optimization Metric: F1-micro
  • Quantization: INT8 Dynamic Quantization (AVX-512 VNNI target)
  • Hardware: NVIDIA RTX A5000 x2, CUDA 12.8
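The loss and threshold listed above imply an independent sigmoid per label at inference time, rather than a softmax over categories. A minimal pure-Python sketch of that decision rule follows; the logit values are illustrative, not real model outputs.

```python
import math

# Multi-label decision rule from the evaluation design: sigmoid per logit,
# emit every label whose probability clears the 0.5 threshold.
LABELS = [
    "safe", "jailbreak", "prompt_injection", "harmful_content",
    "script_evasion", "social_engineering", "encoding_bypass",
]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def predict(logits: list[float], threshold: float = 0.5) -> list[str]:
    """Independent sigmoid per label (not softmax), so several labels
    can fire on the same prompt."""
    return [lbl for lbl, z in zip(LABELS, logits) if sigmoid(z) >= threshold]

# Illustrative logits: three labels clear the threshold simultaneously.
logits = [-4.1, 0.8, 2.3, -3.0, -2.2, -1.7, 1.1]
print(predict(logits))  # ['jailbreak', 'prompt_injection', 'encoding_bypass']
```

Applying a softmax here instead of per-label sigmoids would force the label probabilities to compete, breaking the multi-label semantics; this distinction is exactly the kind of inference bug the contributions section notes was identified and fixed.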

Final Model Selection Criteria

Priority   Criterion                   Requirement   DistilBERT Result
1          Deployment size             < 150 MB      129.5 MB
2          Binary detection accuracy   100%          100%
3          Inference speed             Real-time     7.6 ms/sample
4          Label-specific accuracy     Best effort   99.78% F1-micro

Business Relevance

This research demonstrates that enterprise-grade prompt security can be delivered entirely on-device without server-side dependencies, privacy trade-offs, or network latency. Key implications for organizations:

  • Privacy-first AI safety: No user prompts leave the local device
  • Universal protection: Works across any LLM service (ChatGPT, Claude, Gemini, etc.)
  • Deployment-ready: Complete Chrome extension pipeline validated with production build
  • Multilingual coverage: 8 languages supported out of the box, covering major global markets
  • Cost-effective: No API costs or server infrastructure required for prompt filtering

The AEGINEL Guard technology directly extends the AEGIS Guardrail product family, providing a complementary client-side protection layer alongside server-side safety systems.

Limitations

  • Label-specific accuracy: 66.7% — confusion between encoding_bypass, social_engineering, and script_evasion categories due to overlapping threat characteristics
  • Browser inference benchmarks: Actual Transformers.js WASM inference speed in Chrome not yet measured (server CPU baseline of 7.6 ms may differ)
  • Synthetic data bias: Training dataset is primarily synthetic; real-world prompt distribution may differ
  • Language coverage: 8 languages covered; additional languages require dataset expansion
  • Adversarial robustness: Not yet tested against advanced adversarial prompt techniques specifically designed to evade the classifier

Resources & Downloads

GitHub repository — try it out
Executive summary — in preparation
Slide deck — in preparation
Demo video — in preparation

Interested in applying this research?

Contact the AEGIS Research team to learn how this work can support your AI deployment needs.