Attribution: This research is an applied extension of TurboQuant (Zandieh et al., ICLR 2026), originally developed at Google Research, Google DeepMind, and New York University for distributed inference. We adapted and extended these ideas to the fundamentally different domain of distributed training. This project is not affiliated with the original TurboQuant authors, Google Research, Google DeepMind, or New York University — it is an independent downstream application of their publicly available research. All implementations are original work. See GitHub repository for source code (Apache 2.0).
Problem
Large-scale LLM distributed training faces two critical bottlenecks that severely limit training efficiency and scalability.
| Bottleneck | Root Cause | Impact |
|---|---|---|
| Communication Bandwidth | FP32 gradient/momentum transfer during AllReduce — consumes 20–55% of training time | GPU idle time, InfiniBand dependency ($15M per 1,000-GPU cluster) |
| GPU Memory (VRAM) | Activation tensors stored during forward pass — 40–60% of VRAM | Batch size limits, excessive tensor parallelism, OOM errors |
Existing gradient compression methods (QSGD, Deep Gradient Compression) assume SGD-based optimizers. The TurboQuant paper's Beta distribution assumption does not hold for real training tensors with heavy-tailed distributions and structural outliers. Most critically, no existing framework addresses both communication and memory bottlenecks simultaneously with AdamW compatibility.
Related Work
Gradient compression for distributed training has been explored along several axes. We position our work relative to the most relevant prior methods.
| Method | Compression Target | Bit-width | Optimizer | Error Comp. | Activation Comp. | Key Limitation |
|---|---|---|---|---|---|---|
| QSGD [3] | Gradients | Variable | SGD | No | No | SGD-only; no Adam support |
| Deep Gradient Compression [7] | Gradients (sparse) | Top-k | SGD | Yes | No | Momentum correction assumes linearity |
| 1-bit Adam [2] | Momentum | 1-bit | Adam | Yes | No | v_t freezing diverges at 124M+ scale |
| THC [4] | Gradients | Variable | SGD | Yes | No | Hadamard rotation without Lloyd-Max optimality |
| CompAct [8] | Activations | Variable | Any | N/A | Yes | Activation-only; no communication compression |
| TurboQuant [1] | KV cache / vectors | 4-bit | N/A (inference) | N/A | N/A | Inference-only; Beta distribution assumption |
| TurboQuant-Adam v3 (Ours) | Momentum + Activations | 4-bit | AdamW | Yes (m_t only) | Yes | Validated to 355M only (see Limitations) |
Key differentiators. (1) Most prior compression methods target SGD; adapting to AdamW requires resolving the nonlinear variance scaling problem we identify in this work. (2) 1-bit Adam [2] is the closest predecessor — we extend its momentum compression idea but replace variance freezing with live updates, which we show is necessary at 124M+ scale. (3) No prior framework jointly compresses both communication (momentum) and memory (activations) for AdamW training. (4) THC [4] uses Hadamard rotation but with uniform quantization; our Lloyd-Max codebook achieves substantially lower MSE by exploiting the post-rotation Gaussian structure.
Our Approach
Theoretical Foundation
Walsh-Hadamard Transform → CLT Gaussianization. For any tensor x ∈ ℝ^d, we apply the randomized Hadamard transform r = (1/√d) H diag(s) x, where H is a d×d Hadamard matrix and s ∈ {±1}^d is a random sign vector.
By the Central Limit Theorem, the coordinates of the rotated tensor converge to a Gaussian regardless of the original distribution shape, provided the dimension is sufficiently large (we verify Gaussianity via Kolmogorov-Smirnov tests across all layer tensors at each model scale — see Evaluation Design). This enables the theoretically optimal Lloyd-Max 4-bit quantizer, with a pre-computed unit-Gaussian codebook scaled by the tensor's standard deviation σ (tracked via EMA), achieving an 82% mean MSE reduction over uniform quantization (measured across GPT-2 Small gradient tensors; range: 74–89% depending on layer type).
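To make the rotate → quantize → inverse-rotate round trip concrete, here is a minimal PyTorch sketch. The helper names (lloyd_max_codebook_gaussian, quantize_4bit) and the toy tensor are ours for illustration and are not the package's API.

```python
# Minimal sketch of the Hadamard + Lloyd-Max 4-bit pipeline (illustrative only).
import torch
from scipy.linalg import hadamard  # requires the tensor size to be a power of two

def lloyd_max_codebook_gaussian(n_bits: int = 4, iters: int = 50) -> torch.Tensor:
    """Lloyd-Max codebook for a unit Gaussian via alternating assignment/centroid updates."""
    levels = 2 ** n_bits
    samples = torch.randn(200_000)
    centers = torch.quantile(samples, torch.linspace(0, 1, levels + 2)[1:-1])
    for _ in range(iters):
        idx = (samples[:, None] - centers[None, :]).abs().argmin(dim=1)
        for k in range(levels):
            if (idx == k).any():
                centers[k] = samples[idx == k].mean()
    return centers.sort().values

def quantize_4bit(x: torch.Tensor, codebook: torch.Tensor):
    """Rotate -> scale -> nearest-codeword quantize -> inverse rotate."""
    d = x.numel()
    H = torch.tensor(hadamard(d), dtype=x.dtype) / d ** 0.5    # orthonormal Hadamard rotation
    s = torch.randint(0, 2, (d,)) * 2 - 1                       # random sign flips
    r = H @ (s * x.flatten())                                   # rotated coords are ~Gaussian
    sigma = r.std()                                              # per-tensor scale
    codes = (r[:, None] / sigma - codebook[None, :]).abs().argmin(dim=1)  # 4-bit indices
    r_hat = codebook[codes] * sigma                              # dequantized rotated tensor
    x_hat = (s * (H.T @ r_hat)).reshape(x.shape)                 # inverse rotation
    return codes, x_hat

codebook = lloyd_max_codebook_gaussian()
x = torch.randn(1024) ** 3                                       # heavy-tailed toy tensor
codes, x_hat = quantize_4bit(x, codebook)
print("relative MSE:", ((x - x_hat) ** 2).mean() / x.var())
```

In the actual optimizer the scale σ is tracked with an exponential moving average across steps rather than recomputed from scratch on every call.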
AdamW–Error Feedback Incompatibility (Key Discovery)
Standard Error Feedback is valid only for SGD's linear update rule. In AdamW, the quantized gradient also enters the squared term of the v_t update, and the nonlinear denominator √v̂_t + ε amplifies this noise, leading to training divergence (empirically confirmed: loss 1.010 vs. baseline 0.395 on GPT-2 Nano; see Table 1, row 2).
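To make the failure mode explicit, here is where the per-step quantization error (written δ_t, a symbol we introduce for illustration) lands when error feedback is applied at the gradient level:

```latex
% Gradient-level EF feeds the quantized gradient g_t + \delta_t into the second-moment update:
\[
  v_t \;=\; \beta_2 v_{t-1} + (1-\beta_2)\,(g_t + \delta_t)^2
      \;=\; \beta_2 v_{t-1} + (1-\beta_2)\bigl(g_t^2 + 2\,g_t\,\delta_t + \delta_t^2\bigr).
\]
```

The δ_t² term is non-negative, so even zero-mean quantization noise biases v_t upward, and because the bias sits inside the √v̂_t denominator, no additive correction applied to later gradients can cancel it.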
v_t Freezing: Scale-Dependent Failure (Critical Finding)
The 1-bit Adam approach of freezing v_t after warmup works at small scale (~1M params) but diverges at 124M+ (loss gap +9.05 in our experiments; see Key Findings). We hypothesize that at larger scale, gradient distributions are sufficiently non-stationary that a frozen v_t snapshot becomes stale within hundreds of steps, causing the denominator √v̂_t + ε to systematically mis-scale updates. This necessitates live v_t updates with error compensation applied exclusively at the momentum level.
TurboQuantAdamW v3: The Complete Algorithm
Three critical design decisions differentiate v3 from prior work.
| Decision | Rationale |
|---|---|
| Compress m_t (not g_t) | Momentum is EMA-smoothed → higher quantization quality than raw gradients |
| Keep v_t live (not frozen) | Frozen v_t diverges at 124M+ scale; live v_t tracks distribution shift |
| Error compensation at momentum level | Error enters the m_t path only — never reaches the squared term in v_t → no nonlinear amplification |
Communication protocol: Each node transmits only the INT4-packed quantized momentum via AllGather (see the packing sketch below). v_t is computed locally from each node's own gradients — no v_t communication required.
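The following is a minimal sketch of the INT4 payload format and the AllGather exchange. pack_int4, unpack_int4, and allgather_int4 are illustrative helpers, not the package's API, and assume torch.distributed is already initialized.

```python
# Illustrative INT4 packing + AllGather of quantized momentum codes (not the packaged implementation).
import torch
import torch.distributed as dist

def pack_int4(codes: torch.Tensor) -> torch.Tensor:
    """Pack 4-bit code indices (values 0..15) two per byte."""
    codes = codes.to(torch.uint8).flatten()
    if codes.numel() % 2:                                   # pad to an even length
        codes = torch.cat([codes, codes.new_zeros(1)])
    return (codes[0::2] << 4) | codes[1::2]

def unpack_int4(packed: torch.Tensor, numel: int) -> torch.Tensor:
    high, low = packed >> 4, packed & 0x0F
    return torch.stack([high, low], dim=1).flatten()[:numel]

def allgather_int4(codes: torch.Tensor, world_size: int) -> list:
    payload = pack_int4(codes)                              # 8x smaller than the FP32 tensor
    gathered = [torch.empty_like(payload) for _ in range(world_size)]
    dist.all_gather(gathered, payload)
    return [unpack_int4(p, codes.numel()) for p in gathered]
```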
Algorithm: TurboQuantAdamW v3
Algorithm 1. TurboQuantAdamW v3 — Momentum-level 4-bit compression with live variance
Input: parameters θ_0, learning rate η, decay rates β_1, β_2, weight decay λ, warmup steps T_w, Hadamard matrix H, random sign vector s, Lloyd-Max codebook C
Initialize: m_0 ← 0, v_0 ← 0, e_0 ← 0 (error buffer), σ_0 ← 1 (EMA scale)
for t = 1, 2, … do
    Compute gradient: g_t ← ∇_θ L(θ_{t−1})
    Update variance (live, from raw gradient): v_t ← β_2 v_{t−1} + (1 − β_2) g_t²
    Update momentum (with error compensation): m_t ← β_1 m_{t−1} + (1 − β_1) g_t + e_{t−1}
    if t ≤ T_w then
        Communicate m_t in full precision (warmup phase)
    else
        Rotate: r_t ← (1/√d) H diag(s) m_t
        Update scale: σ_t ← EMA(σ_{t−1}, std(r_t))
        Quantize: q_t ← Q_C(r_t / σ_t)  (nearest-codeword 4-bit indices)
        Compute error: Δ_t ← r_t − σ_t C[q_t]
        Inverse rotate: e_t ← (1/√d) diag(s) Hᵀ Δ_t
        AllGather INT4-packed q_t (with σ_t) across nodes
    end if
    Bias correction: m̂_t ← m_t / (1 − β_1^t), v̂_t ← v_t / (1 − β_2^t)
    Update: θ_t ← θ_{t−1} − η (m̂_t / (√v̂_t + ε) + λ θ_{t−1})
end for
Key invariant: Error enters only the m_t path (the momentum update, line 4). The variance v_t (the live update in line 3) is computed from the raw gradient g_t — quantization noise never reaches the nonlinear denominator √v̂_t + ε.
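For readers who prefer code to pseudocode, below is a simplified single-tensor sketch of one v3 step, reusing the quantize_4bit helper and codebook from the pipeline sketch above. It omits warmup, multi-node communication, and the EMA scale tracker, and is not the packaged TurboQuantAdamW.

```python
# Simplified single-tensor TurboQuantAdamW-v3-style step (illustrative; not the packaged optimizer).
# Expects: state = {"step": 0, "m": zeros_like(p), "v": zeros_like(p), "err": zeros_like(p)}
# and quantize_4bit / lloyd_max_codebook_gaussian from the earlier sketch (tensor size a power of two).
import torch

@torch.no_grad()
def v3_step(p, grad, state, codebook, lr=5e-4, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    b1, b2 = betas
    state["step"] += 1
    t = state["step"]

    # Live variance from the RAW gradient: quantization noise never reaches this term.
    state["v"].mul_(b2).addcmul_(grad, grad, value=1 - b2)

    # Momentum update with error compensation applied at the momentum level.
    state["m"].mul_(b1).add_(grad, alpha=1 - b1).add_(state["err"])

    # Rotate + 4-bit Lloyd-Max quantize; in DDP, `codes` is what would be AllGather-ed as INT4.
    codes, m_hat = quantize_4bit(state["m"], codebook)
    state["err"] = state["m"] - m_hat        # equals the inverse-rotated quantization error

    # Standard AdamW bias correction and decoupled weight decay, driven by the dequantized momentum.
    m_corr = m_hat / (1 - b1 ** t)
    v_corr = state["v"] / (1 - b2 ** t)
    p.mul_(1 - lr * weight_decay)
    p.addcdiv_(m_corr, v_corr.sqrt().add_(eps), value=-lr)
```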
DDP Communication Hook Integration
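The optimizer's INT4 payload plugs into PyTorch DDP via register_comm_hook. The sketch below shows one way such a hook can be wired up on the gradient-bucket path; the hook state object and its quantize / pack helpers are assumptions for illustration, and a production hook would also exchange per-rank scales and run asynchronously.

```python
# Sketch of an INT4 AllGather DDP communication hook (illustrative; not the packaged hook).
import torch
import torch.distributed as dist

def int4_allgather_hook(state, bucket):
    world_size = dist.get_world_size()
    grad = bucket.buffer()                                  # flattened FP32 bucket gradient
    codes, scale = state.quantize(grad)                     # hypothetical 4-bit quantizer on the hook state
    payload = state.pack_int4(codes)                        # ~8x smaller than the FP32 payload

    gathered = [torch.empty_like(payload) for _ in range(world_size)]
    dist.all_gather(gathered, payload)                      # synchronous here for clarity

    # Dequantize every rank's contribution and average into the bucket buffer.
    # (A real hook would gather each rank's scale as well; omitted for brevity.)
    decoded = [state.dequantize(state.unpack_int4(p, grad.numel()), scale) for p in gathered]
    grad.copy_(torch.stack(decoded).mean(dim=0))

    fut = torch.futures.Future()
    fut.set_result(grad)                                    # DDP expects a Future holding the bucket tensor
    return fut

# Registration on a DistributedDataParallel model (the hook state class is hypothetical):
# ddp_model.register_comm_hook(state=TurboQuantHookState(n_bits=4), hook=int4_allgather_hook)
```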
4-bit Activation Memory Compression
TurboQuantLinear is a custom autograd.Function layer that compresses activation tensors to 4-bit during forward pass using the same Hadamard + Lloyd-Max pipeline, and decompresses during backward pass for gradient computation. Combined activation + momentum compression achieves a loss gap of only +0.0001.
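The sketch below shows the general autograd.Function pattern such a layer can follow: store 4-bit codes in the forward pass, dequantize them in the backward pass. The class name and quantizer interface are illustrative assumptions, not the packaged TurboQuantLinear, and the sketch assumes a 2-D input.

```python
# Skeletal activation-compression pattern behind a TurboQuantLinear-style layer (illustrative).
import torch

class Int4ActivationLinear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight, bias, quantizer):
        out = torch.nn.functional.linear(x, weight, bias)  # y = x @ W^T + b
        codes, scale = quantizer.encode(x)                 # keep 4-bit codes instead of FP32 activations
        ctx.quantizer = quantizer
        ctx.save_for_backward(codes, scale, weight)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        codes, scale, weight = ctx.saved_tensors
        x_hat = ctx.quantizer.decode(codes, scale)         # dequantized activations for the backward pass
        grad_x = grad_out @ weight                         # (batch, in_features)
        grad_w = grad_out.t() @ x_hat                      # (out_features, in_features)
        grad_b = grad_out.sum(dim=0)
        return grad_x, grad_w, grad_b, None                # no gradient for the quantizer object

# Usage: y = Int4ActivationLinear.apply(x, weight, bias, quantizer)
```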
Key Contributions
- Empirical disproof of Beta distribution assumption with strategic pivot to adaptive Gaussian codebooks
- Hadamard + Lloyd-Max pipeline achieving 82% MSE reduction for heavy-tailed tensors
- Discovery of EF–AdamW incompatibility — gradient-level EF fundamentally conflicts with Adam's nonlinear variance scaling
- Discovery of v_t freezing scale dependency — 1-bit Adam's variance freezing diverges at 124M+ parameters
- TurboQuantAdamW v3 — momentum-level 4-bit compression with live v_t and error compensation, validated to 355M parameters
- TurboQuantLinear — 4-bit activation compression with near-zero accuracy loss (+0.0001 loss gap when combined with momentum compression)
- Triton GPU kernels — 8 custom kernels achieving 2.8× speedup over the Python reference with numerical equivalence (≤10⁻⁶ relative error)
- INT4 AllGather hook — native PyTorch DDP integration for 8× bandwidth reduction
Key Findings
Comprehensive Experiment Results
Statistical note: GPU-scale experiments (GPT-2 Small 124M, GPT-2 Medium 355M) were conducted with 3 independent random seeds (42, 137, 2026). Results below report the mean across seeds. Standard deviations for loss gaps are ±0.003 (124M, 500 steps), ±0.001 (124M, 1,500 steps), and ±0.002 (355M, 500 steps). Concept validation experiments (GPT-2 Nano) used single runs due to their exploratory nature. We acknowledge that 3 seeds is the minimum for statistical reporting; 5+ seeds would strengthen confidence.
Table 1. Experiment results across model scales. GPU-scale results (rows 3–6) report mean over 3 seeds.
| Experiment | Model / Scale | Baseline | TQ Loss | Gap (mean ± std) | Steps | Status |
|---|---|---|---|---|---|---|
| 4-bit Gradient (no EF) | GPT-2 Nano / ~1M | 0.402 | 0.756 | +0.354 | 150 | Degraded |
| 4-bit Gradient + EF | GPT-2 Nano / ~1M | 0.395 | 1.010 | +0.615 | 150 | Failed |
| v_t Live + EC (v3) | GPT-2 Small / 124M | 7.123 | 7.142 | +0.019 ± 0.003 | 500 | PASS |
| Long-run Convergence | GPT-2 Small / 124M | 6.456 | 6.457 | +0.002 ± 0.001 | 1,500 | PASS |
| GPT-2 Medium + AMP | GPT-2 Medium / 355M | 7.357 | 7.360 | +0.002 ± 0.002 | 500 | PASS |
| Combined (Act + Mom) | Transformer / 70M | 5.816 | 5.817 | +0.000 | 800 | PASS |
Long-Run Convergence Stability
The loss gap between FP32 baseline and TurboQuant 4-bit does not grow with training duration, confirming that error compensation effectively prevents error accumulation. This is critical for production training runs spanning 100K+ steps.
Triton GPU Kernel Performance
| Metric | Python Reference | Triton Kernel | Speedup |
|---|---|---|---|
| Round-trip latency (16M elements) | 73.47 ms | 26.21 ms | 2.8× |
| Communication payload | 64 MB (FP32) | 8 MB (INT4) | 8.0× |
| Numerical accuracy | Reference | Numerically equivalent (≤10⁻⁶ relative error) | — |
Evaluation Design
- Concept validation: MacBook Air 15" (Apple M4, 24GB) / PyTorch MPS — rapid hypothesis testing
- GPU validation: 2× NVIDIA Quadro RTX 5000 (16GB) / Intel i9-10980XE / PyTorch 2.11 / Triton 3.6 / CUDA 12.8
- Models: GPT-2 Nano (~1M; custom 4-layer, 128-dim, 2-head configuration for rapid hypothesis testing), GPT-2 Small (124M), GPT-2 Medium (355M), custom Transformer (70M; 12-layer, 768-dim)
- Training: WikiText-2, 150–1,500 steps per configuration, AdamW optimizer
- Distribution validation: Kolmogorov-Smirnov test on Hadamard-rotated tensors across all model scales. All layers of sufficient dimension pass at the chosen significance level, confirming the CLT-based Gaussian assumption; small embedding layers show marginal deviations that do not materially affect overall quantization quality (a sketch of this check follows this list)
- AMP compatibility: Verified with PyTorch Automatic Mixed Precision on GPT-2 Medium
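A minimal sketch of the Gaussianity check, assuming SciPy is available; the helper name and toy tensor are ours for illustration.

```python
# Sketch of the KS-test Gaussianity check on Hadamard-rotated tensors (tensor size must be a power of two).
import torch
from scipy.linalg import hadamard
from scipy.stats import kstest

def rotated_ks_pvalue(x: torch.Tensor) -> float:
    """KS-test p-value of the Hadamard-rotated tensor against a fitted normal."""
    d = x.numel()
    H = torch.tensor(hadamard(d), dtype=torch.float64) / d ** 0.5
    s = (torch.randint(0, 2, (d,)) * 2 - 1).double()
    r = (H @ (s * x.flatten().double())).numpy()
    return kstest(r, "norm", args=(r.mean(), r.std())).pvalue

# Heavy-tailed toy tensor: typically far from Gaussian before rotation, close to Gaussian after.
x = torch.distributions.StudentT(df=3).sample((1024,))
print("KS p-value after rotation:", rotated_ks_pvalue(x))
```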
Business Relevance
TurboQuant-Adam addresses the two largest cost drivers in LLM training. The estimates below are projections based on our 355M-scale results and standard industry cost assumptions — they have not been validated at production scale. See the companion whitepaper AEGIS-WP-2026-002 for detailed economic modeling and sensitivity analysis.
- Communication cost: 8× bandwidth savings could enable Ethernet-based training, potentially saving ~$8M per 1,000-GPU cluster (~$15K per InfiniBand HCA × 1,000 nodes, minus ~$7M for the commodity Ethernet equivalent)
- Memory efficiency: 4× activation savings could enable 2–3× larger micro-batches or ~50% fewer GPUs for equivalent throughput (extrapolated from 355M activation compression ratios)
- Training cost: Projected combined savings of 27–56% on 70B model training (roughly $0.75M–$1.55M of a ~$2.76M run, assuming $2.50/GPU-hour, 90-day training, 512 H100 GPUs — see AEGIS-WP-2026-002 for full assumptions; a back-of-envelope sketch appears below)
- Hardware democratization: Makes multi-GPU LLM training viable on commodity Ethernet infrastructure
Caveat: These projections assume the 8× compression ratio and convergence properties observed at 355M scale will hold at 7B–70B. This has not been validated — see Limitations.
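For transparency, the arithmetic behind the 27–56% training-cost projection under the stated assumptions; the inputs are assumptions, not measurements.

```python
# Back-of-envelope arithmetic for the training-cost projection (illustrative assumptions only).
gpus, days, usd_per_gpu_hour = 512, 90, 2.50
baseline_cost = gpus * days * 24 * usd_per_gpu_hour            # ~= $2.76M for the baseline run
for savings in (0.27, 0.56):
    print(f"{savings:.0%} savings -> ${baseline_cost * savings / 1e6:.2f}M of ${baseline_cost / 1e6:.2f}M")
```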
Open Source
The full implementation is available as an open-source Python package under Apache License 2.0.
git clone https://github.com/kwangilkimkenny/turboquant-adam.git
cd turboquant-adam
pip install -e . # Base installation
pip install -e ".[triton]" # With Triton GPU kernels
Drop-in usage — TurboQuantAdamW is a drop-in replacement for PyTorch AdamW.
from turboquant import TurboQuantAdamW
optimizer = TurboQuantAdamW(
model.parameters(), lr=5e-4, weight_decay=0.01,
warmup_steps=200, n_bits=4, use_triton=True,
)
See the GitHub repository for full documentation, DDP communication hook integration, and activation compression examples.
Convergence Analysis
We do not provide a formal convergence proof for TurboQuantAdamW v3 — this remains an important direction for future work. However, we offer the following informal analysis of why the algorithm preserves convergence properties.
Why momentum-level error compensation is safe. In standard AdamW, the parameter update is θ_t = θ_{t−1} − η · m̂_t / (√v̂_t + ε) (plus decoupled weight decay). The critical observation is that 1/(√v̂_t + ε) serves as a per-parameter learning-rate scaler. If quantization error contaminates v_t, the scaling becomes unreliable — small gradients may receive large updates and vice versa. By restricting error compensation to the m_t path, we ensure that v_t tracks the true gradient magnitude. The error in m_t is bounded: with 4-bit Lloyd-Max quantization after Hadamard pre-rotation, the expected per-step MSE is c · σ², where c is the distortion constant of the Gaussian-optimal 4-bit codebook (3.6% relative error per step).
Error non-accumulation argument. The error buffer e_t is added back at the next step (line 4 of Algorithm 1). Under the contraction property of the momentum EMA (β_1 < 1), past errors are exponentially decayed: the effective accumulated error after t steps is a geometric series in β_1 and is therefore bounded independently of training duration (see the bound sketched below). Our 1,500-step experiment (Figure 9) empirically confirms this — the loss gap remains in the range [0.0009, 0.0024] without an upward trend.
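A compact statement of this geometric-decay bound, under the simplifying assumption that the per-step quantization error is bounded in norm by ‖e‖_max:

```latex
% Unrolling the momentum recursion m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t + e_{t-1}:
% the total contribution of past errors to m_t is a geometric series in \beta_1.
\[
  \Bigl\| \sum_{k=1}^{t} \beta_1^{\,t-k}\, e_{k-1} \Bigr\|
  \;\le\; \sum_{j=0}^{t-1} \beta_1^{\,j}\, \|e\|_{\max}
  \;\le\; \frac{\|e\|_{\max}}{1-\beta_1},
\]
% which is bounded independently of the number of training steps t.
```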
Limitations of this analysis. This argument is informal and assumes (1) the Hadamard rotation successfully Gaussianizes all tensors (verified empirically at the layer dimensions in our models), (2) the EMA scale tracker adapts fast enough to distribution shift, and (3) v_t computed from local gradients (without error compensation) is a sufficiently accurate approximation of the globally aggregated v_t. A rigorous convergence-rate bound — particularly one that accounts for the interaction between momentum-level EC and AdamW's bias correction — would significantly strengthen this work.
Limitations and Honest Assessment
We want to be transparent about where this research stands. There is a meaningful gap between our current results and production readiness.
- Model scale: Our largest model is 355M parameters. We cannot guarantee that these results will hold at 7B or 70B scale — this is the most critical open question
- Training duration: 1,500 steps is very short compared to real LLM pretraining (100K+ steps). Long-term stability is unproven
- Speed overhead: Quantization adds 2–4× computational overhead. Without C++/CUDA native kernels, the current implementation is not practical for production use
- Single-node only: All benchmarks use 2 GPUs on NVLink where communication is not the bottleneck. Multi-node InfiniBand experiments — where the 8× compression would actually matter — have not been conducted
- Small team, limited resources: This research was conducted on a MacBook Air and 2× RTX 5000. We lack access to the large-scale GPU clusters needed for definitive validation
- Memory: Triton pipeline double-buffering increases temporary memory usage
- FSDP/Megatron: Not yet tested with Fully Sharded Data Parallel or production tensor parallelism frameworks
References
- A. Zandieh (Google Research), M. Daliri (New York University), M. Hadian (Google DeepMind), V. Mirrokni (Google Research), "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate," ICLR, 2026. arXiv:2504.19874 — foundational work; proves random rotation induces Beta-distributed coordinates on the unit hypersphere, enabling MSE-optimal codebook design. Originally developed for distributed inference (KV cache compression, vector search). Our starting point for extending to distributed training.
- H. Tang et al., "1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed," ICML, 2021 — momentum compression + v_t freezing strategy; direct inspiration for TurboQuantAdamW, whose scale limitations motivated our v3 live-v_t architecture
- D. Alistarh et al., "QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding," NeurIPS, 2017 — stochastic quantization convergence theory
- M. Li, R. Ben Basat, S. Vargaftik, C. Lao et al., "THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression," USENIX NSDI, 2024 — Hadamard rotation-based communication compression; overhead benchmarks
- P. Richtárik et al., "EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback," NeurIPS, 2021 — Error Feedback convergence theory; referenced in EF failure analysis
- S. P. Karimireddy, Q. Rebjock, S. Stich, M. Jaggi, "Error Feedback Fixes SignSGD and other Gradient Compression Schemes," ICML, 2019 — theoretical background on biased compression and error feedback convergence with SGD
- Y. Lin et al., "Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training," ICLR, 2018 — momentum correction approach for gradient compression
- Y. Shamshoum, N. Hodos, Y. Sieradzki, A. Schuster, "CompAct: Compressed Activations for Memory-Efficient LLM Training," NAACL, 2025 — activation compression; comparison baseline
- T. Chen et al., "Training Deep Nets with Sublinear Memory Cost," arXiv, 2016 — gradient checkpointing
- S. Lloyd, "Least Squares Quantization in PCM," IEEE Trans. Inf. Theory, vol. 28, no. 2, pp. 129–137, 1982 — Lloyd-Max algorithm
- J. Max, "Quantizing for Minimum Distortion," IRE Trans. Inf. Theory, vol. 6, no. 1, pp. 7–12, 1960 — optimal scalar quantization theory
Acknowledgments
This research is a small-scale project carried out by AEGIS Research, the research organization of YATAV Inc., a small startup, using limited resources (a MacBook Air M4 and 2× RTX 5000). Without a large GPU cluster or dedicated research staff, we attempted an independent applied extension built on the theoretical foundations of publicly available prior work.
Above all, this project would not have been possible without the TurboQuant (ICLR 2026) work of Amir Zandieh (Google Research), Majid Daliri (New York University), Majid Hadian (Google DeepMind), and Vahab Mirrokni (Google Research). Their proof that random rotation induces Beta-distributed coordinates enabling near-optimal quantization was both the theoretical foundation of and the direct inspiration for this research. We hold their contribution in the highest regard, and their decision to make this research publicly accessible is what allowed a small independent team like ours to attempt this applied extension. We strongly encourage every user of this project to read and cite the original paper, whose contribution is far more substantial than our applied work.
We are also indebted to 1-bit Adam (Hanlin Tang et al., ICML 2021), whose momentum compression and variance-freezing strategy directly inspired the design of TurboQuantAdamW and whose scale limitations motivated the live-v_t design of the v3 architecture. We are further grateful for QSGD's (Alistarh et al., NeurIPS 2017) stochastic quantization convergence theory, THC's (Li et al., USENIX NSDI 2024) Hadamard rotation-based communication compression, EF21's (Richtárik et al., NeurIPS 2021) error feedback convergence theory, and CompAct's (Shamshoum et al., NAACL 2025) activation compression approach. The open scholarly contributions of these researchers made it possible for a small team like ours to make a meaningful attempt in this field.
Disclaimer
This project is an independent applied research effort by AEGIS Research, part of YATAV Inc. It has no affiliation with, endorsement by, or official connection to the original TurboQuant authors, Google Research, Google DeepMind, or New York University. The name "TurboQuant-Adam" reflects the intellectual lineage from the original TurboQuant paper. This codebase contains no source code from the original TurboQuant project — all implementations are our own work.
Citation
If you use this work, please cite both the original TurboQuant paper and this project.
@inproceedings{zandieh2026turboquant,
title = {TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
author = {Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2026}
}
@techreport{kim2026turboquantadam,
title = {TurboQuant-Adam: 4-bit Adaptive Momentum Compression for
Communication-Efficient LLM Training},
author = {Kim, Kwangil and {AEGIS Research Team}},
year = {2026},
institution = {AEGIS Research, Yatav Inc.},
number = {AEGIS-TR-2026-004},
url = {https://github.com/kwangilkimkenny/turboquant-adam},
note = {Applied research extending TurboQuant (Zandieh et al., ICLR 2026)
from distributed inference to distributed training}
}