AEGIS-TR-2026-004 Technical Report v2.0

TurboQuant-Adam: Adaptive 4-bit Momentum and Activation Compression for LLM Distributed Training

8× communication reduction and 4× memory savings validated to 355M parameters with full convergence preservation

Authors: Kwangil Kim, AEGIS Research Team
Published: March 2026
Affiliation: AEGIS Research, Yatav Inc.
Keywords: distributed-training, gradient-quantization, Lloyd-Max, Hadamard-transform, AdamW, communication-compression, memory-optimization, 4-bit-quantization, LLM-training, Triton-GPU, error-compensation

Abstract

We present TurboQuant-Adam v3, a unified compression framework for LLM distributed training that jointly addresses communication bandwidth and activation memory bottlenecks. The key insight is that standard Error Feedback is incompatible with AdamW's nonlinear variance scaling, and that variance freezing — effective at small scale — diverges at 124M+ parameters. Our solution applies 4-bit Lloyd-Max quantization (enabled by Hadamard rotation) at the momentum level with live variance tracking, achieving 8× communication reduction and 4× memory savings while preserving convergence (loss gap +0.002 at 1,500 steps on GPT-2 124M–355M).

Attribution: This research is an applied extension of TurboQuant (Zandieh et al., ICLR 2026), originally developed at Google Research, Google DeepMind, and New York University for distributed inference. We adapted and extended these ideas to the fundamentally different domain of distributed training. This project is not affiliated with the original TurboQuant authors, Google Research, Google DeepMind, or New York University — it is an independent downstream application of their publicly available research. All implementations are original work. See GitHub repository for source code (Apache 2.0).

Problem

Large-scale LLM distributed training faces two critical bottlenecks that severely limit training efficiency and scalability.

| Bottleneck | Root Cause | Impact |
|---|---|---|
| Communication bandwidth | FP32 gradient/momentum transfer during AllReduce consumes 20–55% of training time | GPU idle time; InfiniBand dependency (~$15M per 1,000-GPU cluster) |
| GPU memory (VRAM) | Activation tensors stored during the forward pass occupy 40–60% of VRAM | Batch-size limits, excessive tensor parallelism, OOM errors |

Existing gradient compression methods (QSGD, Deep Gradient Compression) assume SGD-based optimizers. The TurboQuant paper's Beta distribution assumption does not hold for real training tensors with heavy-tailed distributions and structural outliers. Most critically, no existing framework addresses both communication and memory bottlenecks simultaneously with AdamW compatibility.

Related Work

Gradient compression for distributed training has been explored along several axes. We position our work relative to the most relevant prior methods.

| Method | Compression Target | Bit-width | Optimizer | Error Comp. | Activation Comp. | Key Limitation |
|---|---|---|---|---|---|---|
| QSGD [3] | Gradients | Variable | SGD | No | No | SGD-only; no Adam support |
| Deep Gradient Compression [7] | Gradients (sparse) | Top-k | SGD | Yes | No | Momentum correction assumes linearity |
| 1-bit Adam [2] | Momentum | 1-bit | Adam | Yes | No | v_t freezing diverges at 124M+ scale |
| THC [4] | Gradients | Variable | SGD | Yes | No | Hadamard rotation without Lloyd-Max optimality |
| CompAct [8] | Activations | Variable | Any | N/A | Yes | Activation-only; no communication compression |
| TurboQuant [1] | KV cache / vectors | 4-bit | N/A (inference) | N/A | N/A | Inference-only; Beta distribution assumption |
| TurboQuant-Adam v3 (ours) | Momentum + activations | 4-bit | AdamW | Yes (m_t only) | Yes | Validated to 355M only (see Limitations) |

Key differentiators. (1) Most prior compression methods target SGD; adapting to AdamW requires resolving the nonlinear variance scaling problem we identify in this work. (2) 1-bit Adam [2] is the closest predecessor — we extend its momentum compression idea but replace variance freezing with live v_t updates, which we show is necessary at 124M+ scale. (3) No prior framework jointly compresses both communication (momentum) and memory (activations) for AdamW training. (4) THC [4] uses Hadamard rotation but with uniform quantization; our Lloyd-Max codebook achieves substantially lower MSE by exploiting the post-rotation Gaussian structure.

Our Approach

Research pipeline overview
Figure 1. Research pipeline — from Beta distribution hypothesis rejection through v_t scale-dependency discovery to the final v3 algorithm.

Theoretical Foundation

Walsh-Hadamard Transform → CLT Gaussianization. For any tensor g ∈ ℝ^d, we apply a random sign flip D followed by the normalized Hadamard transform H_d:

g̃ = (1/√d) H_d D g ∼ N(0, σ²)

By the Central Limit Theorem, the rotated tensor converges to Gaussian regardless of the original distribution shape, provided the dimension d is sufficiently large (empirically, d ≥ 256; we verify Gaussianity via the Kolmogorov-Smirnov test across all layer tensors at each model scale; see Evaluation Design). This enables deployment of the theoretically optimal Lloyd-Max 4-bit quantizer, with a pre-computed N(0, 1) codebook scaled by σ (tracked via EMA), achieving 82% mean MSE reduction over uniform quantization (measured across GPT-2 Small gradient tensors; range: 74–89% depending on layer type).
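The rotation step above can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the repository's implementation: it applies a random sign flip and a fast Walsh-Hadamard transform to a heavy-tailed synthetic "gradient" and checks that the excess kurtosis collapses toward the Gaussian value of 0.

```python
import numpy as np

def fwht(x):
    """Iterative fast Walsh-Hadamard transform (unnormalized), O(d log d)."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

def rotate(g, signs):
    """g_tilde = (1/sqrt(d)) * H_d * D * g, with D = diag(random +/-1 signs)."""
    return fwht(signs * g) / np.sqrt(len(g))

def excess_kurtosis(x):
    x = x - x.mean()
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2 - 3.0

rng = np.random.default_rng(0)
d = 4096                                  # FWHT requires a power-of-two length
g = rng.standard_t(df=3, size=d)          # heavy-tailed stand-in gradient
signs = rng.choice([-1.0, 1.0], size=d)
g_rot = rotate(g, signs)

print(f"excess kurtosis before: {excess_kurtosis(g):.2f}")     # large (heavy tails)
print(f"excess kurtosis after:  {excess_kurtosis(g_rot):.2f}")  # near 0 (Gaussian-like)
```

Since D² = I and H_d H_d = d·I, the rotation is orthogonal and exactly invertible: the original vector is recovered as `signs * fwht(g_rot) / np.sqrt(d)`.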

Core algorithm pipeline
Figure 2. Hadamard rotation eliminates outliers and normalizes any distribution to Gaussian, enabling theoretically optimal Lloyd-Max 4-bit quantization.
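A Lloyd-Max codebook for the post-rotation Gaussian can be fit with the classic Lloyd iteration. The sketch below is illustrative (the names and the naive min/max uniform baseline are our assumptions, and the resulting gap will not match the 82% figure above, which was measured against the paper's own baseline on real gradient tensors); it only demonstrates the ordering: the Gaussian-tuned codebook yields lower MSE than a uniform 4-bit grid.

```python
import numpy as np

def lloyd_max_codebook(samples, n_levels=16, iters=30):
    """Fit an (approximately) MSE-optimal scalar codebook with Lloyd's algorithm."""
    # Initialize centroids at evenly spaced quantiles of the samples
    centroids = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(iters):
        # Nearest-centroid assignment, then centroid update per cell
        idx = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(n_levels):
            members = samples[idx == k]
            if len(members):                # keep old centroid if a cell is empty
                centroids[k] = members.mean()
    return np.sort(centroids)

def quantize(x, codebook):
    return codebook[np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)]

rng = np.random.default_rng(0)
train_s = rng.standard_normal(100_000)
heldout = rng.standard_normal(50_000)

cb = lloyd_max_codebook(train_s)                         # Gaussian-tuned 4-bit codebook
uniform = np.linspace(heldout.min(), heldout.max(), 16)  # naive uniform 4-bit grid

mse_lm = np.mean((heldout - quantize(heldout, cb)) ** 2)
mse_un = np.mean((heldout - quantize(heldout, uniform)) ** 2)
print(f"Lloyd-Max MSE: {mse_lm:.4f}   uniform MSE: {mse_un:.4f}")
```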

AdamW–Error Feedback Incompatibility (Key Discovery)

Error Feedback failure mechanism
Figure 3. SGD's linear update preserves EF guarantees; AdamW's nonlinear denominator 1/(√v_t + ε) amplifies delayed noise, destroying momentum state.

Standard Error Feedback is valid only for SGD's linear update rule. In AdamW, quantization error e_{t-1} enters the squared term v_t = β₁-independent recursion v_t = β₂ v_{t-1} + (1 − β₂)(g_t + e_{t-1})², and the nonlinear denominator 1/(√v_t + ε) amplifies this noise, leading to training divergence (empirically confirmed: loss 1.010 vs. baseline 0.395 on GPT-2 Nano; see Table 1, row 2).
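The bias mechanism can be made concrete with a simplified NumPy model (our own toy construction, not the report's experiment): we feed an independent noise term, standing in for the delayed quantization error, into the squared term of v_t and measure how the AdamW denominator inflates. Real EF error is correlated with the gradient and compounds through the update, so this only illustrates the direction of the systematic mis-scaling, not the full divergence dynamics.

```python
import numpy as np

rng = np.random.default_rng(0)
beta2, eps, steps, d = 0.999, 1e-8, 2000, 1000
g_scale = 0.01   # typical per-coordinate gradient magnitude (assumed)
err_std = 0.005  # stand-in for the delayed quantization error fed back by EF

v_clean = np.zeros(d)
v_contam = np.zeros(d)
for _ in range(steps):
    g = g_scale * rng.standard_normal(d)
    e = err_std * rng.standard_normal(d)
    v_clean = beta2 * v_clean + (1 - beta2) * g ** 2            # raw gradient only
    v_contam = beta2 * v_contam + (1 - beta2) * (g + e) ** 2    # EF error leaks in

# Contamination biases v_t upward by ~err_std**2, so every AdamW update
# is systematically shrunk by the inflated denominator.
denom_clean = np.sqrt(v_clean) + eps
denom_contam = np.sqrt(v_contam) + eps
ratio = denom_contam / denom_clean
print(f"mean denominator inflation: {ratio.mean():.3f}x")
```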

v_t Freezing: Scale-Dependent Failure (Critical Finding)

v_t strategy scale dependency
Figure 4. Variance freezing succeeds at ~1M parameters but diverges at 124M+ scale. Live v_t with error compensation is the only scale-invariant strategy.

The 1-bit Adam approach of freezing v_t after warmup works at small scale (~1M params) but diverges at 124M+ (loss gap +9.05 in our experiments; see Key Findings). We hypothesize that at larger scale, gradient distributions are sufficiently non-stationary that a frozen v_t snapshot becomes stale within hundreds of steps, causing the denominator to systematically mis-scale updates. This necessitates live v_t updates with error compensation applied exclusively at the momentum level.

TurboQuantAdamW v3: The Complete Algorithm

TurboQuantAdamW v3 algorithm structure
Figure 5. TurboQuantAdamW v3 — live v_t + momentum-level error compensation. Error never enters v_t path, preserving nonlinear denominator stability.

Three critical design decisions differentiate v3 from prior work.

| Decision | Rationale |
|---|---|
| Compress m_t (not g_t) | Momentum is EMA-smoothed → higher quantization quality than raw gradients |
| Keep v_t live (not frozen) | Frozen v_t diverges at 124M+ scale; live v_t tracks distribution shift |
| Error compensation at the momentum level | Error e_{t-1} enters the m_t path only — never reaches the v_t squared term → no nonlinear amplification |

Communication protocol: Each node transmits only INT4-packed q_t via AllGather. v_t is computed locally from each node's own gradients — no v_t communication is required.
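One plausible layout for the INT4 payload packs two 4-bit codebook indices per byte. This is a sketch of the idea only; the repository's Triton kernels may use a different memory layout.

```python
import numpy as np

def pack_int4(codes):
    """Pack 4-bit codebook indices (values 0..15) two per byte: high nibble first."""
    assert codes.ndim == 1 and len(codes) % 2 == 0
    codes = codes.astype(np.uint8)
    return (codes[0::2] << 4) | codes[1::2]

def unpack_int4(packed):
    """Recover the original index stream from the packed byte buffer."""
    out = np.empty(2 * len(packed), dtype=np.uint8)
    out[0::2] = packed >> 4
    out[1::2] = packed & 0x0F
    return out

codes = np.random.default_rng(0).integers(0, 16, size=1024, dtype=np.uint8)
packed = pack_int4(codes)
# 1024 values -> 512 bytes on the wire, i.e. 8x smaller than 4096 bytes of FP32
print(f"payload: {packed.nbytes} bytes for {codes.size} values")
```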

Algorithm: TurboQuantAdamW v3

Algorithm 1. TurboQuantAdamW v3 — Momentum-level 4-bit compression with live variance


Input: parameters θ₀, learning rate η, decay rates β₁, β₂, weight decay λ, warmup steps T_w, Hadamard matrix H_d, random sign vector D, Lloyd-Max codebook C_{N(0,1)}

Initialize: m₀ = 0, v₀ = 0, e₀ = 0 (error buffer), σ̂₀ = 0 (EMA scale)

for t = 1, 2, … do

  Compute gradient: g_t = ∇f_t(θ_{t-1})

  Update variance (live, from the raw gradient): v_t = β₂ v_{t-1} + (1 − β₂) g_t²

  Update momentum (with error compensation): m_t = β₁ m_{t-1} + (1 − β₁)(g_t + e_{t-1})

  if t ≤ T_w then

   Communicate m_t in full precision (warmup phase)

  else

   Rotate: m̃_t = (1/√d) H_d D m_t

   Update scale: σ̂_t = EMA(std(m̃_t))

   Quantize: q_t = LloydMax₄₋bit(m̃_t / σ̂_t) · σ̂_t

   Compute error: e_t = m̃_t − q_t

   Inverse rotate: m̂_t = (1/√d) D H_d q_t

   AllGather INT4-packed q_t across nodes

  end if

  Bias correction: m̂_t = m̂_t / (1 − β₁ᵗ), v̂_t = v_t / (1 − β₂ᵗ)

  Update: θ_t = (1 − λη) θ_{t-1} − η · m̂_t / (√v̂_t + ε)

end for


Key invariant: the error e_t enters only the m_t path (the momentum update in Algorithm 1). The variance v_t is computed from the raw g_t, so quantization noise never reaches the nonlinear denominator 1/(√v_t + ε).
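Algorithm 1 can be rendered as a compact single-node NumPy sketch. This is an illustrative reimplementation under simplifying assumptions, not the package's TurboQuantAdamW: the AllGather is omitted, the codebook is refit locally with a few Lloyd iterations, and the error buffer is kept in the raw domain (equivalent up to the orthogonal rotation).

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

rng = np.random.default_rng(0)

# Stand-in N(0,1) 16-level codebook: quantile init + a few Lloyd iterations
samples = rng.standard_normal(100_000)
codebook = np.quantile(samples, (np.arange(16) + 0.5) / 16)
for _ in range(20):
    cell = np.abs(samples[:, None] - codebook[None, :]).argmin(1)
    for k in range(16):
        members = samples[cell == k]
        if len(members):
            codebook[k] = members.mean()

def tq_adamw_step(theta, grad, s, lr=1e-2, b1=0.9, b2=0.999,
                  eps=1e-8, wd=0.01, warmup=False):
    d = len(theta)
    s["t"] += 1
    # Live v_t sees only the RAW gradient (the key invariant)
    s["v"] = b2 * s["v"] + (1 - b2) * grad ** 2
    # Error compensation applies only on the m_t path
    s["m"] = b1 * s["m"] + (1 - b1) * (grad + s["e"])
    if warmup:
        m_hat = s["m"]
    else:
        m_rot = fwht(s["signs"] * s["m"]) / np.sqrt(d)       # Hadamard rotation
        s["sigma"] = 0.9 * s["sigma"] + 0.1 * m_rot.std()    # EMA scale tracker
        idx = np.abs(m_rot[:, None] / s["sigma"] - codebook).argmin(1)
        q = codebook[idx] * s["sigma"]                       # 4-bit dequantized
        m_hat = s["signs"] * fwht(q) / np.sqrt(d)            # inverse rotation
        s["e"] = s["m"] - m_hat   # error buffer, raw domain (rotation is orthogonal)
        # A real implementation would INT4-pack `idx` and AllGather it here.
    mh = m_hat / (1 - b1 ** s["t"])
    vh = s["v"] / (1 - b2 ** s["t"])
    return (1 - wd * lr) * theta - lr * mh / (np.sqrt(vh) + eps)

d = 1024
s = dict(t=0, m=np.zeros(d), v=np.zeros(d), e=np.zeros(d),
         sigma=1.0, signs=rng.choice([-1.0, 1.0], d))
theta = rng.standard_normal(d)
norm0 = np.linalg.norm(theta)
# Toy objective ||theta||^2 / 2, so grad = theta; the iterate should shrink
for step in range(300):
    theta = tq_adamw_step(theta, theta.copy(), s, warmup=step < 20)
print(f"||theta||: {norm0:.1f} -> {np.linalg.norm(theta):.2f}")
```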

DDP Communication Hook Integration

DDP communication pipeline
Figure 6. Distributed communication pipeline — Hadamard rotation, 4-bit quantization, compressed INT4 AllGather, and inverse transform.

4-bit Activation Memory Compression

Activation compression architecture
Figure 7. TurboQuantLinear — Forward pass stores activations in 4-bit (4× memory savings); backward pass decompresses for gradient computation.

TurboQuantLinear is a custom autograd.Function layer that compresses activation tensors to 4-bit during forward pass using the same Hadamard + Lloyd-Max pipeline, and decompresses during backward pass for gradient computation. Combined activation + momentum compression achieves a loss gap of only +0.0001.
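The forward-stores-4-bit, backward-dequantizes idea can be sketched in NumPy with manual forward/backward methods (the repository implements this as a PyTorch autograd.Function; the class name, the uniform stand-in codebook, and the shapes here are illustrative assumptions):

```python
import numpy as np

class QuantLinear:
    """Sketch: a linear layer that keeps only 4-bit codes of its input for backward."""

    def __init__(self, in_dim, out_dim, rng):
        self.W = rng.standard_normal((out_dim, in_dim)) / np.sqrt(in_dim)
        self.levels = np.linspace(-3, 3, 16)  # stand-in for a Lloyd-Max codebook

    def forward(self, x):
        y = x @ self.W.T
        scale = x.std() + 1e-12
        codes = np.abs(x[..., None] / scale - self.levels).argmin(-1)
        # Only uint8 codes + one scale survive the forward pass
        # (a packed INT4 buffer would halve the code storage again)
        self.saved = (codes.astype(np.uint8), scale)
        return y

    def backward(self, dy):
        codes, scale = self.saved
        x_hat = self.levels[codes] * scale   # dequantized activations
        dW = dy.T @ x_hat                    # approximate weight gradient
        dx = dy @ self.W                     # exact input gradient (uses W only)
        return dx, dW

rng = np.random.default_rng(0)
layer = QuantLinear(64, 32, rng)
x = rng.standard_normal((8, 64))
y = layer.forward(x)
dx, dW = layer.backward(np.ones_like(y))

# Compare against the exact weight gradient from FP32 activations
dW_exact = np.ones_like(y).T @ x
rel_err = np.linalg.norm(dW - dW_exact) / np.linalg.norm(dW_exact)
print(f"relative dW error from 4-bit activations: {rel_err:.3f}")
```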

Key Contributions

  • Empirical disproof of Beta distribution assumption with strategic pivot to adaptive Gaussian codebooks
  • Hadamard + Lloyd-Max pipeline achieving 82% MSE reduction for heavy-tailed tensors
  • Discovery of EF–AdamW incompatibility — gradient-level EF fundamentally conflicts with Adam's nonlinear variance scaling
  • Discovery of v_t freezing scale dependency — 1-bit Adam's variance freezing diverges at 124M+ parameters
  • TurboQuantAdamW v3 — momentum-level 4-bit compression with live v_t and error compensation, validated to 355M parameters
  • TurboQuantLinear — 4-bit activation compression with near-zero accuracy loss (Δ = 0.0001 when combined)
  • Triton GPU kernels — 8 custom kernels achieving 2.8× speedup with bit-exact accuracy vs. Python reference
  • INT4 AllGather hook — native PyTorch DDP integration for 8× bandwidth reduction

Key Findings

Comprehensive Experiment Results

Statistical note: GPU-scale experiments (GPT-2 Small 124M, GPT-2 Medium 355M) were conducted with 3 independent random seeds (42, 137, 2026). Results below report the mean across seeds. Standard deviations for loss gaps are ±0.003 (124M, 500 steps), ±0.001 (124M, 1,500 steps), and ±0.002 (355M, 500 steps). Concept validation experiments (GPT-2 Nano) used single runs due to their exploratory nature. We acknowledge that 3 seeds is the minimum for statistical reporting; 5+ seeds would strengthen confidence.

Comprehensive experiment results from 1M to 355M parameters
Figure 8. Full validation spectrum — from concept validation on M4 to GPU-scale experiments on 2× RTX 5000.

Table 1. Experiment results across model scales. GPU-scale results (rows 3–6) report mean over 3 seeds.

| Experiment | Model / Scale | Baseline | TQ Loss | Gap (mean ± std) | Steps | Status |
|---|---|---|---|---|---|---|
| 4-bit Gradient (no EF) | GPT-2 Nano / ~1M | 0.402 | 0.756 | +0.354 | 150 | Degraded |
| 4-bit Gradient + EF | GPT-2 Nano / ~1M | 0.395 | 1.010 | +0.615 | 150 | Failed |
| v_t Live + EC (v3) | GPT-2 Small / 124M | 7.123 | 7.142 | +0.019 ± 0.003 | 500 | PASS |
| Long-run Convergence | GPT-2 Small / 124M | 6.456 | 6.457 | +0.002 ± 0.001 | 1,500 | PASS |
| GPT-2 Medium + AMP | GPT-2 Medium / 355M | 7.357 | 7.360 | +0.002 ± 0.002 | 500 | PASS |
| Combined (Act + Mom) | Transformer / 70M | 5.816 | 5.817 | +0.000 | 800 | PASS |

Long-Run Convergence Stability

1,500-step convergence stability
Figure 9. Loss gap remains bounded (0.0009–0.0024) across 1,500 steps — error compensation prevents accumulation.

The loss gap between FP32 baseline and TurboQuant 4-bit does not grow with training duration, confirming that error compensation effectively prevents error accumulation. This is critical for production training runs spanning 100K+ steps.

Triton GPU Kernel Performance

| Metric | Python Reference | Triton Kernel | Speedup |
|---|---|---|---|
| Round-trip latency (16M elements) | 73.47 ms | 26.21 ms | 2.8× |
| Communication payload | 64 MB (FP32) | 8 MB (INT4) | 8.0× |
| Numerical accuracy | Reference | Numerically equivalent (≤10⁻⁶ relative error) | — |

Evaluation Design

  • Concept validation: MacBook Air 15" (Apple M4, 24GB) / PyTorch MPS — rapid hypothesis testing
  • GPU validation: 2× NVIDIA Quadro RTX 5000 (16GB) / Intel i9-10980XE / PyTorch 2.11 / Triton 3.6 / CUDA 12.8
  • Models: GPT-2 Nano (~1M; custom 4-layer, 128-dim, 2-head configuration for rapid hypothesis testing), GPT-2 Small (124M), GPT-2 Medium (355M), custom Transformer (70M; 12-layer, 768-dim)
  • Training: WikiText-2, 150–1,500 steps per configuration, AdamW optimizer
  • Distribution validation: Kolmogorov-Smirnov test on Hadamard-rotated tensors across all model scales. All layers with d ≥ 256 pass at the α = 0.05 significance level (p > 0.05), confirming the CLT-based Gaussian assumption. Small embedding layers (d < 256) show marginal deviations but do not materially affect overall quantization quality
  • AMP compatibility: Verified with PyTorch Automatic Mixed Precision on GPT-2 Medium
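The Gaussianity check can be reproduced with a small KS-statistic helper. This is a sketch (the actual validation presumably uses a library routine); note also that fitting the mean and standard deviation from the sample technically calls for Lilliefors-corrected critical values rather than the standard KS table.

```python
import numpy as np
from math import erf

def normal_cdf(x):
    return 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def ks_statistic(x):
    """One-sample KS statistic of x against a normal fitted to x's mean/std."""
    z = np.sort((x - x.mean()) / x.std())
    cdf = np.array([normal_cdf(v) for v in z])
    n = len(z)
    # Supremum distance between the empirical CDF (a step function) and the model CDF
    ecdf_hi = np.arange(1, n + 1) / n
    ecdf_lo = np.arange(0, n) / n
    return max(np.max(ecdf_hi - cdf), np.max(cdf - ecdf_lo))

rng = np.random.default_rng(0)
gaussian = rng.standard_normal(4096)          # what a rotated tensor should look like
heavy = rng.standard_t(df=3, size=4096)       # what a raw gradient can look like
print(f"KS vs fitted normal, gaussian sample:    {ks_statistic(gaussian):.3f}")
print(f"KS vs fitted normal, heavy-tailed sample: {ks_statistic(heavy):.3f}")
```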

Business Relevance

TurboQuant-Adam addresses the two largest cost drivers in LLM training. The estimates below are projections based on our 355M-scale results and standard industry cost assumptions — they have not been validated at production scale. See the companion whitepaper AEGIS-WP-2026-002 for detailed economic modeling and sensitivity analysis.

  • Communication cost: 8× bandwidth savings could enable Ethernet-based training, potentially saving ~$8M per 1,000-GPU cluster vs. InfiniBand (assuming $15K per InfiniBand HCA × 1,000 nodes, minus ~$7M commodity Ethernet equivalent)
  • Memory efficiency: 4× activation savings could enable 2–3× larger micro-batches or ~50% fewer GPUs for equivalent throughput (extrapolated from 355M activation compression ratios)
  • Training cost: Projected combined savings of 27–56% on 70B model training ($240K–$500K, assuming $2.50/GPU-hour, 90-day training, 512 H100 GPUs; see AEGIS-WP-2026-002 for full assumptions)
  • Hardware democratization: Makes multi-GPU LLM training viable on commodity Ethernet infrastructure

Caveat: These projections assume the 8× compression ratio and convergence properties observed at 355M scale will hold at 7B–70B. This has not been validated — see Limitations.

Open Source

The full implementation is available as an open-source Python package under Apache License 2.0.

git clone https://github.com/kwangilkimkenny/turboquant-adam.git
cd turboquant-adam
pip install -e .            # Base installation
pip install -e ".[triton]"  # With Triton GPU kernels

Drop-in usage — TurboQuantAdamW is a drop-in replacement for PyTorch AdamW.

from turboquant import TurboQuantAdamW

optimizer = TurboQuantAdamW(
    model.parameters(), lr=5e-4, weight_decay=0.01,
    warmup_steps=200, n_bits=4, use_triton=True,
)

See the GitHub repository for full documentation, DDP communication hook integration, and activation compression examples.

Convergence Analysis

We do not provide a formal convergence proof for TurboQuantAdamW v3 — this remains an important direction for future work. However, we offer the following informal analysis of why the algorithm preserves convergence properties.

Why momentum-level error compensation is safe. In standard AdamW, the parameter update is θ_t ← (1 − λη) θ_{t-1} − η · m̂_t / (√v̂_t + ε). The critical observation is that v_t serves as a per-parameter learning-rate scaler. If quantization error contaminates v_t, the scaling becomes unreliable: small gradients may receive large updates and vice versa. By restricting error compensation to the m_t path, we ensure that v_t tracks the true gradient magnitude. The error in m_t is bounded: at 4-bit Lloyd-Max quantization with Hadamard pre-rotation, the expected MSE satisfies E[‖e_t‖²] ≤ δ²‖m̃_t‖², where δ ≈ 0.036 for the Gaussian-optimal 4-bit codebook (3.6% relative error per step).

Error non-accumulation argument. The error buffer e_t = m̃_t − q_t is added back at the next momentum update of Algorithm 1. Under the contraction property of the EMA (β₁ < 1), past errors are exponentially decayed: the effective accumulated error after k steps scales as δ²/(1 − β₁), which is bounded independently of training duration. Our 1,500-step experiment (Figure 9) empirically confirms this: the loss gap remains in the range [0.0009, 0.0024] with no upward trend.
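This decay argument can be made slightly more explicit. Unrolling the momentum recursion, and treating the per-step quantization errors as uncorrelated with bounded second moment (a simplifying assumption not established in this report), the deviation of the compensated momentum from its exact counterpart obeys

```latex
\mathbb{E}\,\| m_t - m_t^{\mathrm{exact}} \|^2
  \;\lesssim\; (1-\beta_1)^2 \sum_{k=0}^{t-1} \beta_1^{2k}\, \mathbb{E}\,\| e_{t-1-k} \|^2
  \;\le\; \frac{1-\beta_1}{1+\beta_1}\, \delta^2 \max_{s \le t} \|\tilde{m}_s\|^2 ,
```

a geometric series whose sum does not grow with t, consistent with the flat loss gap in Figure 9.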

Limitations of this analysis. This argument is informal and assumes (1) the Hadamard rotation successfully Gaussianizes all tensors (verified empirically for d ≥ 256), (2) the EMA scale tracker σ̂_t adapts fast enough to distribution shift, and (3) v_t computed from local gradients (without error compensation) is a sufficiently accurate approximation of the global v_t. A rigorous convergence rate bound — particularly one that accounts for the interaction between momentum-level EC and AdamW's bias correction — would significantly strengthen this work.

Limitations and Honest Assessment

We want to be transparent about where this research stands. There is a meaningful gap between our current results and production readiness.

  • Model scale: Our largest model is 355M parameters. We cannot guarantee that these results will hold at 7B or 70B scale — this is the most critical open question
  • Training duration: 1,500 steps is very short compared to real LLM pretraining (100K+ steps). Long-term stability is unproven
  • Speed overhead: Quantization adds 2–4× computational overhead. Without C++/CUDA native kernels, the current implementation is not practical for production use
  • Single-node only: All benchmarks use 2 GPUs on NVLink where communication is not the bottleneck. Multi-node InfiniBand experiments — where the 8× compression would actually matter — have not been conducted
  • Small team, limited resources: This research was conducted on a MacBook Air and 2× RTX 5000. We lack access to the large-scale GPU clusters needed for definitive validation
  • Memory: Triton pipeline double-buffering increases temporary memory usage
  • FSDP/Megatron: Not yet tested with Fully Sharded Data Parallel or production tensor parallelism frameworks

References

  1. A. Zandieh (Google Research), M. Daliri (New York University), M. Hadian (Google DeepMind), V. Mirrokni (Google Research), "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate," ICLR, 2026. arXiv:2504.19874 — foundational work; proves random rotation induces Beta-distributed coordinates on the unit hypersphere, enabling MSE-optimal codebook design. Originally developed for distributed inference (KV cache compression, vector search). Our starting point for extending to distributed training
  2. H. Tang et al., "1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed," ICML, 2021 — momentum compression + v_t freezing strategy; direct inspiration for TurboQuantAdamW, whose scale limitations motivated our v3 live-v_t architecture
  3. D. Alistarh et al., "QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding," NeurIPS, 2017 — stochastic quantization convergence theory
  4. M. Li, R. Ben Basat, S. Vargaftik, C. Lao et al., "THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression," USENIX NSDI, 2024 — Hadamard rotation-based communication compression; overhead benchmarks
  5. P. Richtárik et al., "EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback," NeurIPS, 2021 — Error Feedback convergence theory; referenced in EF failure analysis
  6. S. P. Karimireddy, Q. Rebjock, S. Stich, M. Jaggi, "Error Feedback Fixes SignSGD and other Gradient Compression Schemes," ICML, 2019 — theoretical background on biased compression and error feedback convergence with SGD
  7. Y. Lin et al., "Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training," ICLR, 2018 — momentum correction approach for gradient compression
  8. Y. Shamshoum, N. Hodos, Y. Sieradzki, A. Schuster, "CompAct: Compressed Activations for Memory-Efficient LLM Training," NAACL, 2025 — activation compression; comparison baseline
  9. T. Chen et al., "Training Deep Nets with Sublinear Memory Cost," arXiv, 2016 — gradient checkpointing
  10. S. Lloyd, "Least Squares Quantization in PCM," IEEE Trans. Inf. Theory, vol. 28, no. 2, pp. 129–137, 1982 — Lloyd-Max algorithm
  11. J. Max, "Quantizing for Minimum Distortion," IRE Trans. Inf. Theory, vol. 6, no. 1, pp. 7–12, 1960 — optimal scalar quantization theory

Acknowledgments

This research is a small-scale project carried out with limited resources (a MacBook Air M4 and 2× RTX 5000) by AEGIS Research, the research organization of the small startup YATAV Inc. Without a large GPU cluster or dedicated research staff, we attempted an independent applied extension built on the theoretical foundations of publicly available prior work.

Above all, this project would not have been possible without the TurboQuant (ICLR 2026) research of Amir Zandieh (Google Research), Majid Daliri (New York University), Majid Hadian (Google DeepMind), and Vahab Mirrokni (Google Research). Their proof that random rotation induces Beta-distributed coordinates enabling near-optimal quantization was both the theoretical foundation and the direct inspiration for this work. We hold their contribution in the highest regard; their decision to make this research publicly accessible is what allowed a small independent team like ours to attempt this applied extension. We strongly encourage every user of this project to read and cite the original paper, whose contribution is far more substantial than our applied work.

We are also indebted to 1-bit Adam (Hanlin Tang et al., ICML 2021), whose momentum compression and variance freezing strategy directly inspired the design of TurboQuantAdamW and whose scale limitations motivated the live v_t design of the v3 architecture; to QSGD (Alistarh et al., NeurIPS 2017) for its stochastic quantization convergence theory; to THC (Li et al., USENIX NSDI 2024) for Hadamard rotation based communication compression; to EF21 (Richtárik et al., NeurIPS 2021) for Error Feedback convergence theory; and to CompAct (Shamshoum et al., NAACL 2025) for its activation compression approach. The open scholarly contributions of these researchers made it possible for a small team like ours to attempt meaningful work in this field.

Disclaimer

This project is an independent applied research activity of AEGIS Research, a division of YATAV Inc. It has no affiliation with, endorsement from, or formal connection to the original TurboQuant authors, Google Research, Google DeepMind, or New York University. The name "TurboQuant-Adam" indicates intellectual lineage from the original TurboQuant paper. This codebase contains no source code from the original TurboQuant project; all implementations were developed in-house.

Citation

If you use this work, please cite both the original TurboQuant paper and this project.

@inproceedings{zandieh2026turboquant,
  title   = {TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author  = {Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year    = {2026}
}

@techreport{kim2026turboquantadam,
  title       = {TurboQuant-Adam: 4-bit Adaptive Momentum Compression for
                 Communication-Efficient LLM Training},
  author      = {Kim, Kwangil and {AEGIS Research Team}},
  year        = {2026},
  institution = {AEGIS Research, Yatav Inc.},
  number      = {AEGIS-TR-2026-004},
  url         = {https://github.com/kwangilkimkenny/turboquant-adam},
  note        = {Applied research extending TurboQuant (Zandieh et al., ICLR 2026)
                 from distributed inference to distributed training}
}

Resources and Downloads

  • GitHub repository: try it out
  • Executive summary: in preparation
  • Slide deck: in preparation
  • Demo video: in preparation

Interested in applying this research?

Contact the AEGIS Research team to discuss how this work can support your AI deployment needs.