Google TurboQuant: 8x Faster, 6x Less Memory — The Compression That Will Change LLM Inference Costs
Google Research unveils TurboQuant, an algorithm that cuts KV cache memory 6x and speeds up attention 8x on H100 — with zero accuracy loss and no fine-tuning. Presented at ICLR 2026.

Compress an LLM's memory footprint by 6x and speed up inference by 8x — without losing a single bit of accuracy, without retraining a single weight. That's what Google Research just published with TurboQuant, presented at ICLR 2026 and AISTATS 2026. The implications are immediate: fewer GPUs, longer context windows, slashed inference costs — for every model, not just Gemini.
The KV cache: the bottleneck nobody talked about
When an LLM generates text, each new token must attend to the entire preceding context. The KV cache (Key-Value cache) stores the attention keys and values of every previous token in GPU memory so they don't have to be recomputed at each step. It's what allows the model to "remember" the conversation.
The problem: this memory explodes with context length. At 32,000 tokens, the KV cache already consumes several gigabytes of VRAM on an H100 GPU (NVIDIA's flagship AI accelerator). At 1 million tokens — Gemini 2.5 Pro's context window — it becomes unmanageable without aggressive compression.
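A back-of-the-envelope estimate makes the problem concrete. This sketch uses Llama-3.1-8B's published shapes (32 layers, 8 KV heads under grouped-query attention, head dimension 128) and fp16 storage; adjust the parameters for other models:

```python
# Rough KV cache size estimate. Default shapes are Llama-3.1-8B's
# published config: 32 layers, 8 KV heads (GQA), head_dim 128, fp16.
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    # 2x because both keys AND values are stored,
    # for every layer, KV head, and token in the context.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

print(f"32k tokens: {kv_cache_bytes(32_000) / 1e9:.1f} GB")       # ~4.2 GB
print(f"1M tokens:  {kv_cache_bytes(1_000_000) / 1e9:.0f} GB")    # ~131 GB
```

At a million tokens the cache alone overflows an 80 GB H100, which is why aggressive compression becomes mandatory.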
Until now, quantization (compressing data by reducing the number of bits per value) carried 1 to 2 bits of hidden overhead per compressed value: the scale and zero-point metadata that each block of values needs in order to be decoded. This overhead cancelled out much of the compression benefit. TurboQuant removes the trade-off: maximum compression, strictly zero memory overhead.
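To see where those 1 to 2 bits come from, here is the arithmetic for typical block quantization (the block size and fp16 metadata widths are illustrative assumptions, not figures from the paper):

```python
# Metadata overhead of classic block quantization: each block of values
# stores one scale and one zero point (assumed fp16 here, i.e. 16 bits each).
def overhead_bits_per_value(block_size, scale_bits=16, zero_point_bits=16):
    return (scale_bits + zero_point_bits) / block_size

print(overhead_bits_per_value(32))  # blocks of 32 -> 1.0 extra bit per value
print(overhead_bits_per_value(16))  # blocks of 16 -> 2.0 extra bits per value
```

Under these assumptions, a "4-bit" quantizer with blocks of 16 actually spends 6 bits per value.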
The algorithm comes from Amir Zandieh (Research Scientist) and Vahab Mirrokni (VP and Google Fellow). It was tested on Llama-3.1-8B, Gemma, and Mistral, and the official Google Research paper explicitly cites Gemini as a direct application.
PolarQuant: storing angles instead of coordinates
TurboQuant's first algorithm is called PolarQuant. Its core idea is simple: change how vectors are represented.
A GPS metaphor works well here. Imagine giving directions in a city. The traditional method says "3 blocks East, 4 blocks North." PolarQuant says "5 blocks at 53°." The destination is identical. But the polar version has a decisive advantage: grid boundaries are fixed and universal.
In Cartesian coordinates, each block of compressed values must store its own grid boundaries (the scale and zero point): that's the overhead that plagues classical methods. In polar coordinates, the boundaries are mathematically predictable, because angles always live on a fixed interval. Zero extra bits to store.
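A minimal 2D sketch of the idea (not the paper's actual algorithm, which operates on high-dimensional vectors): once the angle lives on the fixed interval [0, 2π), the quantization grid is universal, so no per-vector boundaries need to be stored.

```python
import math

def quantize_angle(x, y, bits=8):
    """Encode a 2D point as (radius, quantized angle).
    The angle grid over [0, 2*pi) is fixed and universal:
    no per-vector min/max metadata to store."""
    r = math.hypot(x, y)
    theta = math.atan2(y, x) % (2 * math.pi)
    step = 2 * math.pi / (1 << bits)
    return r, int(theta / step)

def dequantize(r, code, bits=8):
    step = 2 * math.pi / (1 << bits)
    theta = (code + 0.5) * step          # midpoint of the grid cell
    return r * math.cos(theta), r * math.sin(theta)

r, code = quantize_angle(3.0, 4.0)       # "5 blocks at ~53 degrees"
x_hat, y_hat = dequantize(r, code)
print(r, x_hat, y_hat)                   # radius 5.0, point close to (3, 4)
```

With 8 bits for the angle, the reconstructed point lands within a few hundredths of the original, and the grid never changes from one vector to the next.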
Result: PolarQuant captures 99% of the original vector's information. With zero memory overhead. That's already remarkable. But there's still that 1% residual.
QJL: 1 bit to eliminate the residual
Enter QJL — Quantized Johnson-Lindenstrauss. The Johnson-Lindenstrauss lemma is a mathematical result guaranteeing you can "project" a vector into a smaller space while preserving relative distances between points.
QJL pushes this idea to the extreme. It reduces each residual vector — the 1% PolarQuant didn't capture — to a single binary value. Plus one or minus one. A single bit.
The combination gives TurboQuant: PolarQuant captures 99% of the signal in polar coordinates, QJL eliminates the residual in 1 bit. Total: 3 bits per value are enough for perfect accuracy. Classical methods require 16 or 32 bits for the same result.
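The sign-bit mechanism can be sketched as follows. This is a generic sign-projection estimator in the spirit of QJL, not the paper's exact construction: each key is stored as its norm plus one sign bit per random projection, and inner products against a full-precision query are recovered up to the known Gaussian constant sqrt(π/2).

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 4096                       # vector dim, number of 1-bit projections
S = rng.standard_normal((m, d))       # shared random projection, data-independent

def encode(k):
    """Keep only sign bits (1 bit per projection) plus the key's norm."""
    return np.sign(S @ k), np.linalg.norm(k)

def estimate_dot(bits, norm, q):
    """Unbiased estimate of <q, k> from the 1-bit code.
    For Gaussian g: E[sign(g.k) * (g.q)] = sqrt(2/pi) * <q, k> / ||k||,
    hence the sqrt(pi/2) correction below."""
    return np.sqrt(np.pi / 2) / m * norm * (bits @ (S @ q))

k = rng.standard_normal(d)
q = rng.standard_normal(d)
bits, norm = encode(k)
print(estimate_dot(bits, norm, q), k @ q)   # the two values are close
```

The estimate concentrates around the true inner product as the number of projections grows, which is what lets a 1-bit code absorb the residual.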
8x on H100: what this changes in production
The benchmarks leave no room for debate. On an NVIDIA H100 GPU, TurboQuant delivers an 8x speedup on attention logit computation in 4-bit mode, compared to a 32-bit unquantized baseline. KV cache memory is reduced by at least 6x.
Business translation: with the same GPU infrastructure, you can serve 6x more parallel requests. Or offer 6x longer context for the same cost. No fine-tuning, no model modification, no retraining.
Results are validated across five reference benchmarks: LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval. The Needle In A Haystack test — finding a specific piece of information buried in millions of tokens — is particularly telling. TurboQuant maintains perfect performance with 6x memory reduction.
This is exactly the kind of compression that AI agents need for extended memory management. And it's what makes Gemini 2.5 Pro's million-token context technically viable.
| Metric | Result |
|---|---|
| KV cache memory reduction | 6x minimum |
| Attention speedup (4-bit, H100) | 8x vs 32-bit |
| Bits for zero accuracy loss | 3 bits |
| Fine-tuning required | ❌ None |
| Accuracy loss | ❌ None |
| Models tested | Llama-3.1-8B, Gemma, Mistral |
Beyond inference: vector search at Google scale
TurboQuant isn't limited to LLM KV caches. The algorithm also applies to vector search (semantic search) — the technology powering Google Search, YouTube, and recommendation systems at the scale of billions of embedding vectors (numerical representations of meaning).
On this front, TurboQuant outperforms state-of-the-art methods: PQ (Product Quantization) and RaBitQ. The key difference: TurboQuant works without dataset-specific codebooks. It applies directly, with no prior calibration.
This is a technology that touches every Google product using embeddings. Which is nearly all of them.
| Criterion | Classical quantization | PolarQuant alone | TurboQuant |
|---|---|---|---|
| Memory overhead | 1-2 bits/value | 0 bits | 0 bits |
| Accuracy preserved | ⚠️ Partial | ✅ 99% | ✅ 100% |
| Fine-tuning required | ✅ Often | ❌ No | ❌ No |
| Speedup on H100 | Variable | Partial | 8x |
| Applicable to vector search | ❌ No | ❌ No | ✅ Yes |
Key takeaways
- TurboQuant is a Google Research vector compression algorithm for LLM KV caches, presented at ICLR 2026 and AISTATS 2026
- It reduces KV cache memory by 6x minimum and speeds up attention computation by 8x on H100, with zero accuracy loss
- It combines PolarQuant (polar coordinates, zero overhead) and QJL (1-bit residual, zero overhead) to achieve 3 bits per compressed value
- No fine-tuning required: compatible with any existing LLM — Llama-3.1-8B, Gemma, Mistral, Gemini
- Also applicable to vector search at the scale of billions of vectors for Google Search and YouTube
The LLM race has long been a war of parameters. But in 2026, the real differentiator is no longer model size — it's inference efficiency. TurboQuant isn't just another academic paper: it's the algorithm that lets Gemini offer 1 million tokens of context without collapsing Google's data centers. While DeepSeek stacks trillions of parameters and NVIDIA builds superclusters, Google is playing a different game: doing more with less. Every compression point gained here translates to millions of dollars saved at production scale. The most powerful AI won't be the one with the most parameters — it'll be the one that does the most with the least.


