Attribution: This research is an applied extension of TurboQuant (Zandieh et al., ICLR 2026), originally developed at Google Research, Google DeepMind, and New York University for distributed inference. We adapted and extended these ideas to the fundamentally different domain of distributed training. This project is not affiliated with the original TurboQuant authors, Google Research, Google DeepMind, or New York University — it is an independent downstream application of their publicly available research. All implementations are original work. See GitHub repository for source code (Apache 2.0).
Problem
Large-scale LLM distributed training faces two critical bottlenecks that severely limit training efficiency and scalability.
| Bottleneck | Root Cause | Impact |
|---|---|---|
| Communication Bandwidth | FP32 gradient/momentum transfer during AllReduce — consumes 20–55% of training time | GPU idle time, InfiniBand dependency ($15M per 1,000-GPU cluster) |
| GPU Memory (VRAM) | Activation tensors stored during forward pass — 40–60% of VRAM | Batch size limits, excessive tensor parallelism, OOM errors |
Existing gradient compression methods (QSGD, Deep Gradient Compression) assume SGD-based optimizers. The TurboQuant paper's Beta distribution assumption does not hold for real training tensors with heavy-tailed distributions and structural outliers. Most critically, no existing framework addresses both communication and memory bottlenecks simultaneously with AdamW compatibility.
Related Work
Gradient compression for distributed training has been explored along several axes. We position our work relative to the most relevant prior methods.
| Method | Compression Target | Bit-width | Optimizer | Error Comp. | Activation Comp. | Key Limitation |
|---|---|---|---|---|---|---|
| QSGD [3] | Gradients | Variable | SGD | No | No | SGD-only; no Adam support |
| Deep Gradient Compression [7] | Gradients (sparse) | Top-k | SGD | Yes | No | Momentum correction assumes linearity |
| 1-bit Adam [2] | Momentum | 1-bit | Adam | Yes | No | v_t freezing diverges at 124M+ scale |
| THC [4] | Gradients | Variable | SGD | Yes | No | Hadamard rotation without Lloyd-Max optimality |
| CompAct [8] | Activations | Variable | Any | N/A | Yes | Activation-only; no communication compression |
| TurboQuant [1] | KV cache / vectors | 4-bit | N/A (inference) | N/A | N/A | Inference-only; Beta distribution assumption |
| TurboQuant-Adam v3 (Ours) | Momentum + Activations | 4-bit | AdamW | Yes (m_t only) | Yes | Validated to 355M only (see Limitations) |
Key differentiators. (1) Most prior compression methods target SGD; adapting to AdamW requires resolving the nonlinear variance scaling problem we identify in this work. (2) 1-bit Adam [2] is the closest predecessor — we extend its momentum compression idea but replace variance freezing with live updates, which we show is necessary at 124M+ scale. (3) No prior framework jointly compresses both communication (momentum) and memory (activations) for AdamW training. (4) THC [4] uses Hadamard rotation but with uniform quantization; our Lloyd-Max codebook achieves substantially lower MSE by exploiting the post-rotation Gaussian structure.
Our Approach
Theoretical Foundation
Walsh-Hadamard Transform → CLT Gaussianization. For any tensor x ∈ ℝ^d, we apply the randomized Hadamard transform r = (1/√d) H diag(s) x, where H is a d×d Hadamard matrix and s ∈ {±1}^d is a random sign vector.
By the Central Limit Theorem, the coordinates of the rotated tensor converge to a Gaussian regardless of the original distribution shape, provided the dimension is sufficiently large (we verify Gaussianity via Kolmogorov-Smirnov tests across all layer tensors at each model scale — see Evaluation Design). This enables the theoretically optimal Lloyd-Max 4-bit quantizer, with a pre-computed unit-Gaussian codebook scaled by the tensor's standard deviation σ (tracked via EMA), achieving an 82% mean MSE reduction over uniform quantization (measured across GPT-2 Small gradient tensors; range: 74–89% depending on layer type).
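To make the rotate → quantize → inverse-rotate round trip concrete, here is a minimal PyTorch sketch. The helper names (lloyd_max_codebook_gaussian, quantize_4bit) and the toy tensor are ours for illustration and are not the package's API.

```python
# Minimal sketch of the Hadamard + Lloyd-Max 4-bit pipeline (illustrative only).
import torch
from scipy.linalg import hadamard  # requires the tensor size to be a power of two

def lloyd_max_codebook_gaussian(n_bits: int = 4, iters: int = 50) -> torch.Tensor:
    """Lloyd-Max codebook for a unit Gaussian via alternating assignment/centroid updates."""
    levels = 2 ** n_bits
    samples = torch.randn(200_000)
    centers = torch.quantile(samples, torch.linspace(0, 1, levels + 2)[1:-1])
    for _ in range(iters):
        idx = (samples[:, None] - centers[None, :]).abs().argmin(dim=1)
        for k in range(levels):
            if (idx == k).any():
                centers[k] = samples[idx == k].mean()
    return centers.sort().values

def quantize_4bit(x: torch.Tensor, codebook: torch.Tensor):
    """Rotate -> scale -> nearest-codeword quantize -> inverse rotate."""
    d = x.numel()
    H = torch.tensor(hadamard(d), dtype=x.dtype) / d ** 0.5    # orthonormal Hadamard rotation
    s = torch.randint(0, 2, (d,)) * 2 - 1                       # random sign flips
    r = H @ (s * x.flatten())                                   # rotated coords are ~Gaussian
    sigma = r.std()                                              # per-tensor scale
    codes = (r[:, None] / sigma - codebook[None, :]).abs().argmin(dim=1)  # 4-bit indices
    r_hat = codebook[codes] * sigma                              # dequantized rotated tensor
    x_hat = (s * (H.T @ r_hat)).reshape(x.shape)                 # inverse rotation
    return codes, x_hat

codebook = lloyd_max_codebook_gaussian()
x = torch.randn(1024) ** 3                                       # heavy-tailed toy tensor
codes, x_hat = quantize_4bit(x, codebook)
print("relative MSE:", ((x - x_hat) ** 2).mean() / x.var())
```

In the actual optimizer the scale σ is tracked with an exponential moving average across steps rather than recomputed from scratch on every call.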
AdamW–Error Feedback Incompatibility (Key Discovery)
Standard Error Feedback is valid only for SGD's linear update rule. In AdamW, the quantized gradient also enters the squared term of the v_t update, and the nonlinear denominator √v̂_t + ε amplifies this noise, leading to training divergence (empirically confirmed: loss 1.010 vs. baseline 0.395 on GPT-2 Nano; see Table 1, row 2).
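To make the failure mode explicit, here is where the per-step quantization error (written δ_t, a symbol we introduce for illustration) lands when error feedback is applied at the gradient level:

```latex
% Gradient-level EF feeds the quantized gradient g_t + \delta_t into the second-moment update:
\[
  v_t \;=\; \beta_2 v_{t-1} + (1-\beta_2)\,(g_t + \delta_t)^2
      \;=\; \beta_2 v_{t-1} + (1-\beta_2)\bigl(g_t^2 + 2\,g_t\,\delta_t + \delta_t^2\bigr).
\]
```

The δ_t² term is non-negative, so even zero-mean quantization noise biases v_t upward, and because the bias sits inside the √v̂_t denominator, no additive correction applied to later gradients can cancel it.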
v_t Freezing: Scale-Dependent Failure (Critical Finding)
The 1-bit Adam approach of freezing v_t after warmup works at small scale (~1M params) but diverges at 124M+ (loss gap +9.05 in our experiments; see Key Findings). We hypothesize that at larger scale, gradient distributions are sufficiently non-stationary that a frozen v_t snapshot becomes stale within hundreds of steps, causing the denominator √v̂_t + ε to systematically mis-scale updates. This necessitates live v_t updates with error compensation applied exclusively at the momentum level.
TurboQuantAdamW v3: The Complete Algorithm
Three critical design decisions differentiate v3 from prior work.
| Decision | Rationale |
|---|---|
| Compress m_t (not g_t) | Momentum is EMA-smoothed → higher quantization quality than raw gradients |
| Keep v_t live (not frozen) | Frozen v_t diverges at 124M+ scale; live v_t tracks distribution shift |
| Error compensation at momentum level | Error enters the m_t path only — never reaches the squared term in v_t → no nonlinear amplification |
Communication protocol: Each node transmits only the INT4-packed quantized momentum via AllGather (see the packing sketch below). v_t is computed locally from each node's own gradients — no v_t communication required.
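The following is a minimal sketch of the INT4 payload format and the AllGather exchange. pack_int4, unpack_int4, and allgather_int4 are illustrative helpers, not the package's API, and assume torch.distributed is already initialized.

```python
# Illustrative INT4 packing + AllGather of quantized momentum codes (not the packaged implementation).
import torch
import torch.distributed as dist

def pack_int4(codes: torch.Tensor) -> torch.Tensor:
    """Pack 4-bit code indices (values 0..15) two per byte."""
    codes = codes.to(torch.uint8).flatten()
    if codes.numel() % 2:                                   # pad to an even length
        codes = torch.cat([codes, codes.new_zeros(1)])
    return (codes[0::2] << 4) | codes[1::2]

def unpack_int4(packed: torch.Tensor, numel: int) -> torch.Tensor:
    high, low = packed >> 4, packed & 0x0F
    return torch.stack([high, low], dim=1).flatten()[:numel]

def allgather_int4(codes: torch.Tensor, world_size: int) -> list:
    payload = pack_int4(codes)                              # 8x smaller than the FP32 tensor
    gathered = [torch.empty_like(payload) for _ in range(world_size)]
    dist.all_gather(gathered, payload)
    return [unpack_int4(p, codes.numel()) for p in gathered]
```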
Algorithm: TurboQuantAdamW v3
Algorithm 1. TurboQuantAdamW v3 — Momentum-level 4-bit compression with live variance
Input: parameters θ_0, learning rate η, decay rates β_1, β_2, weight decay λ, warmup steps T_w, Hadamard matrix H, random sign vector s, Lloyd-Max codebook C
Initialize: m_0 ← 0, v_0 ← 0, e_0 ← 0 (error buffer), σ_0 ← 1 (EMA scale)
for t = 1, 2, … do
    Compute gradient: g_t ← ∇_θ L(θ_{t−1})
    Update variance (live, from raw gradient): v_t ← β_2 v_{t−1} + (1 − β_2) g_t²
    Update momentum (with error compensation): m_t ← β_1 m_{t−1} + (1 − β_1) g_t + e_{t−1}
    if t ≤ T_w then
        Communicate m_t in full precision (warmup phase)
    else
        Rotate: r_t ← (1/√d) H diag(s) m_t
        Update scale: σ_t ← EMA(σ_{t−1}, std(r_t))
        Quantize: q_t ← Q_C(r_t / σ_t)  (nearest-codeword 4-bit indices)
        Compute error: Δ_t ← r_t − σ_t C[q_t]
        Inverse rotate: e_t ← (1/√d) diag(s) Hᵀ Δ_t
        AllGather INT4-packed q_t (with σ_t) across nodes
    end if
    Bias correction: m̂_t ← m_t / (1 − β_1^t), v̂_t ← v_t / (1 − β_2^t)
    Update: θ_t ← θ_{t−1} − η (m̂_t / (√v̂_t + ε) + λ θ_{t−1})
end for
Key invariant: Error enters only the m_t path (the momentum update, line 4). The variance v_t (the live update in line 3) is computed from the raw gradient g_t — quantization noise never reaches the nonlinear denominator √v̂_t + ε.
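For readers who prefer code to pseudocode, below is a simplified single-tensor sketch of one v3 step, reusing the quantize_4bit helper and codebook from the pipeline sketch above. It omits warmup, multi-node communication, and the EMA scale tracker, and is not the packaged TurboQuantAdamW.

```python
# Simplified single-tensor TurboQuantAdamW-v3-style step (illustrative; not the packaged optimizer).
# Expects: state = {"step": 0, "m": zeros_like(p), "v": zeros_like(p), "err": zeros_like(p)}
# and quantize_4bit / lloyd_max_codebook_gaussian from the earlier sketch (tensor size a power of two).
import torch

@torch.no_grad()
def v3_step(p, grad, state, codebook, lr=5e-4, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    b1, b2 = betas
    state["step"] += 1
    t = state["step"]

    # Live variance from the RAW gradient: quantization noise never reaches this term.
    state["v"].mul_(b2).addcmul_(grad, grad, value=1 - b2)

    # Momentum update with error compensation applied at the momentum level.
    state["m"].mul_(b1).add_(grad, alpha=1 - b1).add_(state["err"])

    # Rotate + 4-bit Lloyd-Max quantize; in DDP, `codes` is what would be AllGather-ed as INT4.
    codes, m_hat = quantize_4bit(state["m"], codebook)
    state["err"] = state["m"] - m_hat        # equals the inverse-rotated quantization error

    # Standard AdamW bias correction and decoupled weight decay, driven by the dequantized momentum.
    m_corr = m_hat / (1 - b1 ** t)
    v_corr = state["v"] / (1 - b2 ** t)
    p.mul_(1 - lr * weight_decay)
    p.addcdiv_(m_corr, v_corr.sqrt().add_(eps), value=-lr)
```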
DDP Communication Hook Integration
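The optimizer's INT4 payload plugs into PyTorch DDP via register_comm_hook. The sketch below shows one way such a hook can be wired up on the gradient-bucket path; the hook state object and its quantize / pack helpers are assumptions for illustration, and a production hook would also exchange per-rank scales and run asynchronously.

```python
# Sketch of an INT4 AllGather DDP communication hook (illustrative; not the packaged hook).
import torch
import torch.distributed as dist

def int4_allgather_hook(state, bucket):
    world_size = dist.get_world_size()
    grad = bucket.buffer()                                  # flattened FP32 bucket gradient
    codes, scale = state.quantize(grad)                     # hypothetical 4-bit quantizer on the hook state
    payload = state.pack_int4(codes)                        # ~8x smaller than the FP32 payload

    gathered = [torch.empty_like(payload) for _ in range(world_size)]
    dist.all_gather(gathered, payload)                      # synchronous here for clarity

    # Dequantize every rank's contribution and average into the bucket buffer.
    # (A real hook would gather each rank's scale as well; omitted for brevity.)
    decoded = [state.dequantize(state.unpack_int4(p, grad.numel()), scale) for p in gathered]
    grad.copy_(torch.stack(decoded).mean(dim=0))

    fut = torch.futures.Future()
    fut.set_result(grad)                                    # DDP expects a Future holding the bucket tensor
    return fut

# Registration on a DistributedDataParallel model (the hook state class is hypothetical):
# ddp_model.register_comm_hook(state=TurboQuantHookState(n_bits=4), hook=int4_allgather_hook)
```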
4-bit Activation Memory Compression
TurboQuantLinear is a custom autograd.Function layer that compresses activation tensors to 4-bit during forward pass using the same Hadamard + Lloyd-Max pipeline, and decompresses during backward pass for gradient computation. Combined activation + momentum compression achieves a loss gap of only +0.0001.
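The sketch below shows the general autograd.Function pattern such a layer can follow: store 4-bit codes in the forward pass, dequantize them in the backward pass. The class name and quantizer interface are illustrative assumptions, not the packaged TurboQuantLinear, and the sketch assumes a 2-D input.

```python
# Skeletal activation-compression pattern behind a TurboQuantLinear-style layer (illustrative).
import torch

class Int4ActivationLinear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight, bias, quantizer):
        out = torch.nn.functional.linear(x, weight, bias)  # y = x @ W^T + b
        codes, scale = quantizer.encode(x)                 # keep 4-bit codes instead of FP32 activations
        ctx.quantizer = quantizer
        ctx.save_for_backward(codes, scale, weight)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        codes, scale, weight = ctx.saved_tensors
        x_hat = ctx.quantizer.decode(codes, scale)         # dequantized activations for the backward pass
        grad_x = grad_out @ weight                         # (batch, in_features)
        grad_w = grad_out.t() @ x_hat                      # (out_features, in_features)
        grad_b = grad_out.sum(dim=0)
        return grad_x, grad_w, grad_b, None                # no gradient for the quantizer object

# Usage: y = Int4ActivationLinear.apply(x, weight, bias, quantizer)
```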
Key Contributions
- Empirical disproof of Beta distribution assumption with strategic pivot to adaptive Gaussian codebooks
- Hadamard + Lloyd-Max pipeline achieving 82% MSE reduction for heavy-tailed tensors
- Discovery of EF–AdamW incompatibility — gradient-level EF fundamentally conflicts with Adam's nonlinear variance scaling
- Discovery of v_t freezing scale dependency — 1-bit Adam's variance freezing diverges at 124M+ parameters
- TurboQuantAdamW v3 — momentum-level 4-bit compression with live v_t and error compensation, validated to 355M parameters
- TurboQuantLinear — 4-bit activation compression with near-zero accuracy loss (+0.0001 loss gap when combined with momentum compression)
- Triton GPU kernels — 8 custom kernels achieving 2.8× speedup over the Python reference with numerical equivalence (≤10⁻⁶ relative error)
- INT4 AllGather hook — native PyTorch DDP integration for 8× bandwidth reduction
Key Findings
Comprehensive Experiment Results
Statistical note: GPU-scale experiments (GPT-2 Small 124M, GPT-2 Medium 355M) were conducted with 3 independent random seeds (42, 137, 2026). Results below report the mean across seeds. Standard deviations for loss gaps are ±0.003 (124M, 500 steps), ±0.001 (124M, 1,500 steps), and ±0.002 (355M, 500 steps). Concept validation experiments (GPT-2 Nano) used single runs due to their exploratory nature. We acknowledge that 3 seeds is the minimum for statistical reporting; 5+ seeds would strengthen confidence.
Table 1. Experiment results across model scales. GPU-scale results (rows 3–6) report mean over 3 seeds.
| Experiment | Model / Scale | Baseline | TQ Loss | Gap (mean ± std) | Steps | Status |
|---|---|---|---|---|---|---|
| 4-bit Gradient (no EF) | GPT-2 Nano / ~1M | 0.402 | 0.756 | +0.354 | 150 | Degraded |
| 4-bit Gradient + EF | GPT-2 Nano / ~1M | 0.395 | 1.010 | +0.615 | 150 | Failed |
| v_t Live + EC (v3) | GPT-2 Small / 124M | 7.123 | 7.142 | +0.019 ± 0.003 | 500 | PASS |
| Long-run Convergence | GPT-2 Small / 124M | 6.456 | 6.457 | +0.002 ± 0.001 | 1,500 | PASS |
| GPT-2 Medium + AMP | GPT-2 Medium / 355M | 7.357 | 7.360 | +0.002 ± 0.002 | 500 | PASS |
| Combined (Act + Mom) | Transformer / 70M | 5.816 | 5.817 | +0.000 | 800 | PASS |
Long-Run Convergence Stability
The loss gap between FP32 baseline and TurboQuant 4-bit does not grow with training duration, confirming that error compensation effectively prevents error accumulation. This is critical for production training runs spanning 100K+ steps.
Triton GPU Kernel Performance
| Metric | Python Reference | Triton Kernel | Speedup |
|---|---|---|---|
| Round-trip latency (16M elements) | 73.47 ms | 26.21 ms | 2.8× |
| Communication payload | 64 MB (FP32) | 8 MB (INT4) | 8.0× |
| Numerical accuracy | Reference | Numerically equivalent (≤10⁻⁶ relative error) | — |
Evaluation Design
- Concept validation: MacBook Air 15" (Apple M4, 24GB) / PyTorch MPS — rapid hypothesis testing
- GPU validation: 2× NVIDIA Quadro RTX 5000 (16GB) / Intel i9-10980XE / PyTorch 2.11 / Triton 3.6 / CUDA 12.8
- Models: GPT-2 Nano (~1M; custom 4-layer, 128-dim, 2-head configuration for rapid hypothesis testing), GPT-2 Small (124M), GPT-2 Medium (355M), custom Transformer (70M; 12-layer, 768-dim)
- Training: WikiText-2, 150–1,500 steps per configuration, AdamW optimizer
- Distribution validation: Kolmogorov-Smirnov test on Hadamard-rotated tensors across all model scales. All layers of sufficient dimension pass at the chosen significance level, confirming the CLT-based Gaussian assumption; small embedding layers show marginal deviations that do not materially affect overall quantization quality (a sketch of this check follows this list)
- AMP compatibility: Verified with PyTorch Automatic Mixed Precision on GPT-2 Medium
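A minimal sketch of the Gaussianity check, assuming SciPy is available; the helper name and toy tensor are ours for illustration.

```python
# Sketch of the KS-test Gaussianity check on Hadamard-rotated tensors (tensor size must be a power of two).
import torch
from scipy.linalg import hadamard
from scipy.stats import kstest

def rotated_ks_pvalue(x: torch.Tensor) -> float:
    """KS-test p-value of the Hadamard-rotated tensor against a fitted normal."""
    d = x.numel()
    H = torch.tensor(hadamard(d), dtype=torch.float64) / d ** 0.5
    s = (torch.randint(0, 2, (d,)) * 2 - 1).double()
    r = (H @ (s * x.flatten().double())).numpy()
    return kstest(r, "norm", args=(r.mean(), r.std())).pvalue

# Heavy-tailed toy tensor: typically far from Gaussian before rotation, close to Gaussian after.
x = torch.distributions.StudentT(df=3).sample((1024,))
print("KS p-value after rotation:", rotated_ks_pvalue(x))
```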
Business Relevance
TurboQuant-Adam addresses the two largest cost drivers in LLM training. The estimates below are projections based on our 355M-scale results and standard industry cost assumptions — they have not been validated at production scale. See the companion whitepaper AEGIS-WP-2026-002 for detailed economic modeling and sensitivity analysis.
- Communication cost: 8× bandwidth savings could enable Ethernet-based training, potentially saving ~$8M per 1,000-GPU cluster (~$15K per InfiniBand HCA × 1,000 nodes, minus ~$7M for the commodity Ethernet equivalent)
- Memory efficiency: 4× activation savings could enable 2–3× larger micro-batches or ~50% fewer GPUs for equivalent throughput (extrapolated from 355M activation compression ratios)
- Training cost: Projected combined savings of 27–56% on 70B model training (roughly $0.75M–$1.55M of a ~$2.76M run, assuming $2.50/GPU-hour, 90-day training, 512 H100 GPUs — see AEGIS-WP-2026-002 for full assumptions; a back-of-envelope sketch appears below)
- Hardware democratization: Makes multi-GPU LLM training viable on commodity Ethernet infrastructure
Caveat: These projections assume the 8× compression ratio and convergence properties observed at 355M scale will hold at 7B–70B. This has not been validated — see Limitations.
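For transparency, the arithmetic behind the 27–56% training-cost projection under the stated assumptions; the inputs are assumptions, not measurements.

```python
# Back-of-envelope arithmetic for the training-cost projection (illustrative assumptions only).
gpus, days, usd_per_gpu_hour = 512, 90, 2.50
baseline_cost = gpus * days * 24 * usd_per_gpu_hour            # ~= $2.76M for the baseline run
for savings in (0.27, 0.56):
    print(f"{savings:.0%} savings -> ${baseline_cost * savings / 1e6:.2f}M of ${baseline_cost / 1e6:.2f}M")
```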
Open Source
The full implementation is available as an open-source Python package under Apache License 2.0.
git clone https://github.com/kwangilkimkenny/turboquant-adam.git
cd turboquant-adam
pip install -e . # Base installation
pip install -e ".[triton]" # With Triton GPU kernels
Drop-in usage — TurboQuantAdamW is a drop-in replacement for PyTorch AdamW.
from turboquant import TurboQuantAdamW
optimizer = TurboQuantAdamW(
model.parameters(), lr=5e-4, weight_decay=0.01,
warmup_steps=200, n_bits=4, use_triton=True,
)
See the GitHub repository for full documentation, DDP communication hook integration, and activation compression examples.
Convergence Analysis
We do not provide a formal convergence proof for TurboQuantAdamW v3 — this remains an important direction for future work. However, we offer the following informal analysis of why the algorithm preserves convergence properties.
Why momentum-level error compensation is safe. In standard AdamW, the parameter update is θ_t = θ_{t−1} − η · m̂_t / (√v̂_t + ε) (plus decoupled weight decay). The critical observation is that 1/(√v̂_t + ε) serves as a per-parameter learning-rate scaler. If quantization error contaminates v_t, the scaling becomes unreliable — small gradients may receive large updates and vice versa. By restricting error compensation to the m_t path, we ensure that v_t tracks the true gradient magnitude. The error in m_t is bounded: with 4-bit Lloyd-Max quantization after Hadamard pre-rotation, the expected per-step MSE is c · σ², where c is the distortion constant of the Gaussian-optimal 4-bit codebook (3.6% relative error per step).
Error non-accumulation argument. The error buffer e_t is added back at the next step (line 4 of Algorithm 1). Under the contraction property of the momentum EMA (β_1 < 1), past errors are exponentially decayed: the effective accumulated error after t steps is a geometric series in β_1 and is therefore bounded independently of training duration (see the bound sketched below). Our 1,500-step experiment (Figure 9) empirically confirms this — the loss gap remains in the range [0.0009, 0.0024] without an upward trend.
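A compact statement of this geometric-decay bound, under the simplifying assumption that the per-step quantization error is bounded in norm by ‖e‖_max:

```latex
% Unrolling the momentum recursion m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t + e_{t-1}:
% the total contribution of past errors to m_t is a geometric series in \beta_1.
\[
  \Bigl\| \sum_{k=1}^{t} \beta_1^{\,t-k}\, e_{k-1} \Bigr\|
  \;\le\; \sum_{j=0}^{t-1} \beta_1^{\,j}\, \|e\|_{\max}
  \;\le\; \frac{\|e\|_{\max}}{1-\beta_1},
\]
% which is bounded independently of the number of training steps t.
```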
Limitations of this analysis. This argument is informal and assumes (1) the Hadamard rotation successfully Gaussianizes all tensors (verified empirically at the layer dimensions in our models), (2) the EMA scale tracker adapts fast enough to distribution shift, and (3) v_t computed from local gradients (without error compensation) is a sufficiently accurate approximation of the globally aggregated v_t. A rigorous convergence-rate bound — particularly one that accounts for the interaction between momentum-level EC and AdamW's bias correction — would significantly strengthen this work.
Limitations and Honest Assessment
We want to be transparent about where this research stands. There is a meaningful gap between our current results and production readiness.
- Model scale: Our largest model is 355M parameters. We cannot guarantee that these results will hold at 7B or 70B scale — this is the most critical open question
- Training duration: 1,500 steps is very short compared to real LLM pretraining (100K+ steps). Long-term stability is unproven
- Speed overhead: Quantization adds 2–4× computational overhead. Without C++/CUDA native kernels, the current implementation is not practical for production use
- Single-node only: All benchmarks use 2 GPUs on NVLink where communication is not the bottleneck. Multi-node InfiniBand experiments — where the 8× compression would actually matter — have not been conducted
- Small team, limited resources: This research was conducted on a MacBook Air and 2× RTX 5000. We lack access to the large-scale GPU clusters needed for definitive validation
- Memory: Triton pipeline double-buffering increases temporary memory usage
- FSDP/Megatron: Not yet tested with Fully Sharded Data Parallel or production tensor parallelism frameworks
References
- A. Zandieh (Google Research), M. Daliri (New York University), M. Hadian (Google DeepMind), V. Mirrokni (Google Research), "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate," ICLR, 2026. arXiv:2504.19874 — foundational work; proves random rotation induces Beta-distributed coordinates on the unit hypersphere, enabling MSE-optimal codebook design. Originally developed for distributed inference (KV cache compression, vector search). Our starting point for extending to distributed training.
- H. Tang et al., "1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed," ICML, 2021 — momentum compression + v_t freezing strategy; direct inspiration for TurboQuantAdamW, whose scale limitations motivated our v3 live-v_t architecture
- D. Alistarh et al., "QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding," NeurIPS, 2017 — stochastic quantization convergence theory
- M. Li, R. Ben Basat, S. Vargaftik, C. Lao et al., "THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression," USENIX NSDI, 2024 — Hadamard rotation-based communication compression; overhead benchmarks
- P. Richtárik et al., "EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback," NeurIPS, 2021 — Error Feedback convergence theory; referenced in EF failure analysis
- S. P. Karimireddy, Q. Rebjock, S. Stich, M. Jaggi, "Error Feedback Fixes SignSGD and other Gradient Compression Schemes," ICML, 2019 — theoretical background on biased compression and error feedback convergence with SGD
- Y. Lin et al., "Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training," ICLR, 2018 — momentum correction approach for gradient compression
- Y. Shamshoum, N. Hodos, Y. Sieradzki, A. Schuster, "CompAct: Compressed Activations for Memory-Efficient LLM Training," NAACL, 2025 — activation compression; comparison baseline
- T. Chen et al., "Training Deep Nets with Sublinear Memory Cost," arXiv, 2016 — gradient checkpointing
- S. Lloyd, "Least Squares Quantization in PCM," IEEE Trans. Inf. Theory, vol. 28, no. 2, pp. 129–137, 1982 — Lloyd-Max algorithm
- J. Max, "Quantizing for Minimum Distortion," IRE Trans. Inf. Theory, vol. 6, no. 1, pp. 7–12, 1960 — optimal scalar quantization theory
Acknowledgments
This research is a small-scale project carried out by AEGIS Research, the research organization of YATAV Inc., a small startup, using limited resources (a MacBook Air M4 and 2× RTX 5000). Without a large GPU cluster or dedicated research staff, we attempted an independent applied extension built on the theoretical foundations of publicly available prior work.
Above all, this project would not have been possible without the TurboQuant (ICLR 2026) work of Amir Zandieh (Google Research), Majid Daliri (New York University), Majid Hadian (Google DeepMind), and Vahab Mirrokni (Google Research). Their proof that random rotation induces Beta-distributed coordinates enabling near-optimal quantization was both the theoretical foundation of and the direct inspiration for this research. We hold their contribution in the highest regard, and their decision to make this research publicly accessible is what allowed a small independent team like ours to attempt this applied extension. We strongly encourage every user of this project to read and cite the original paper, whose contribution is far more substantial than our applied work.
We are also indebted to 1-bit Adam (Hanlin Tang et al., ICML 2021), whose momentum compression and variance-freezing strategy directly inspired the design of TurboQuantAdamW and whose scale limitations motivated the live-v_t design of the v3 architecture. We are further grateful for QSGD's (Alistarh et al., NeurIPS 2017) stochastic quantization convergence theory, THC's (Li et al., USENIX NSDI 2024) Hadamard rotation-based communication compression, EF21's (Richtárik et al., NeurIPS 2021) error feedback convergence theory, and CompAct's (Shamshoum et al., NAACL 2025) activation compression approach. The open scholarly contributions of these researchers made it possible for a small team like ours to make a meaningful attempt in this field.
Disclaimer
This project is an independent applied research effort by AEGIS Research, part of YATAV Inc. It has no affiliation with, endorsement by, or official connection to the original TurboQuant authors, Google Research, Google DeepMind, or New York University. The name "TurboQuant-Adam" reflects the intellectual lineage from the original TurboQuant paper. This codebase contains no source code from the original TurboQuant project — all implementations are our own work.
Citation
If you use this work, please cite both the original TurboQuant paper and this project.
@inproceedings{zandieh2026turboquant,
title = {TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
author = {Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2026}
}
@techreport{kim2026turboquantadam,
title = {TurboQuant-Adam: 4-bit Adaptive Momentum Compression for
Communication-Efficient LLM Training},
author = {Kim, Kwangil and {AEGIS Research Team}},
year = {2026},
institution = {AEGIS Research, Yatav Inc.},
number = {AEGIS-TR-2026-004},
url = {https://github.com/kwangilkimkenny/turboquant-adam},
note = {Applied research extending TurboQuant (Zandieh et al., ICLR 2026)
from distributed inference to distributed training}
}