Google TurboQuant: 8x Faster, 6x Less Memory — The Compression That Will Change LLM Inference Costs
Google Research unveils TurboQuant, an algorithm that cuts KV cache memory 6x and speeds up attention 8x on H100 — with zero accuracy loss and no fine-tuning. Presented at ICLR 2026.

Compress an LLM's memory footprint by 6x and speed up inference by 8x — without losing a single bit of accuracy, without retraining a single weight. That's what Google Research just published with TurboQuant, presented at ICLR 2026 and AISTATS 2026. The implications are immediate: fewer GPUs, longer context windows, slashed inference costs — for every model, not just Gemini.
The KV cache: the bottleneck nobody talked about
When an LLM generates text, each new token must attend to the entire preceding context. The KV cache (Key-Value cache) stores the attention keys and values of every previous token in GPU memory so they don't have to be recomputed at each step. It's what allows the model to "remember" the conversation.
The problem: this memory explodes with context length. At 32,000 tokens, the KV cache already consumes several gigabytes of VRAM on an H100 GPU (NVIDIA's flagship AI accelerator). At 1 million tokens — Gemini 2.5 Pro's context window — it becomes unmanageable without aggressive compression.
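A back-of-the-envelope estimate makes the problem concrete. This sketch uses Llama-3.1-8B's published shapes (32 layers, 8 KV heads under grouped-query attention, head dimension 128) and fp16 storage; adjust the parameters for other models:

```python
# Rough KV cache size estimate. Default shapes are Llama-3.1-8B's
# published config: 32 layers, 8 KV heads (GQA), head_dim 128, fp16.
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    # 2x because both keys AND values are stored,
    # for every layer, KV head, and token in the context.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

print(f"32k tokens: {kv_cache_bytes(32_000) / 1e9:.1f} GB")       # ~4.2 GB
print(f"1M tokens:  {kv_cache_bytes(1_000_000) / 1e9:.0f} GB")    # ~131 GB
```

At a million tokens the cache alone overflows an 80 GB H100, which is why aggressive compression becomes mandatory.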
Until now, quantization (compressing data by reducing the number of bits per value) carried 1 to 2 bits of hidden overhead per compressed value: the scale and zero-point metadata that each block of values needs in order to be decoded. This overhead cancelled out much of the compression benefit. TurboQuant removes the trade-off: maximum compression, strictly zero memory overhead.
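To see where those 1 to 2 bits come from, here is the arithmetic for typical block quantization (the block size and fp16 metadata widths are illustrative assumptions, not figures from the paper):

```python
# Metadata overhead of classic block quantization: each block of values
# stores one scale and one zero point (assumed fp16 here, i.e. 16 bits each).
def overhead_bits_per_value(block_size, scale_bits=16, zero_point_bits=16):
    return (scale_bits + zero_point_bits) / block_size

print(overhead_bits_per_value(32))  # blocks of 32 -> 1.0 extra bit per value
print(overhead_bits_per_value(16))  # blocks of 16 -> 2.0 extra bits per value
```

Under these assumptions, a "4-bit" quantizer with blocks of 16 actually spends 6 bits per value.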
The algorithm comes from Amir Zandieh (Research Scientist) and Vahab Mirrokni (VP and Google Fellow). It was tested on Llama-3.1-8B, Gemma, and Mistral, and the official Google Research paper explicitly cites Gemini as a direct application.
PolarQuant: storing angles instead of coordinates
TurboQuant's first algorithm is called PolarQuant. Its core idea is simple: change how vectors are represented.
A GPS metaphor works well here. Imagine giving directions in a city. The traditional method says "3 blocks East, 4 blocks North." PolarQuant says "5 blocks at 53°." The destination is identical. But the polar version has a decisive advantage: grid boundaries are fixed and universal.
In Cartesian coordinates, each block of compressed values must store its own grid boundaries (the scale and zero point): that's the overhead that plagues classical methods. In polar coordinates, the boundaries are mathematically predictable, because angles always live on a fixed interval. Zero extra bits to store.
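A minimal 2D sketch of the idea (not the paper's actual algorithm, which operates on high-dimensional vectors): once the angle lives on the fixed interval [0, 2π), the quantization grid is universal, so no per-vector boundaries need to be stored.

```python
import math

def quantize_angle(x, y, bits=8):
    """Encode a 2D point as (radius, quantized angle).
    The angle grid over [0, 2*pi) is fixed and universal:
    no per-vector min/max metadata to store."""
    r = math.hypot(x, y)
    theta = math.atan2(y, x) % (2 * math.pi)
    step = 2 * math.pi / (1 << bits)
    return r, int(theta / step)

def dequantize(r, code, bits=8):
    step = 2 * math.pi / (1 << bits)
    theta = (code + 0.5) * step          # midpoint of the grid cell
    return r * math.cos(theta), r * math.sin(theta)

r, code = quantize_angle(3.0, 4.0)       # "5 blocks at ~53 degrees"
x_hat, y_hat = dequantize(r, code)
print(r, x_hat, y_hat)                   # radius 5.0, point close to (3, 4)
```

With 8 bits for the angle, the reconstructed point lands within a few hundredths of the original, and the grid never changes from one vector to the next.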
Result: PolarQuant captures 99% of the original vector's information. With zero memory overhead. That's already remarkable. But there's still that 1% residual.
QJL: 1 bit to eliminate the residual
Enter QJL — Quantized Johnson-Lindenstrauss. The Johnson-Lindenstrauss lemma is a mathematical result guaranteeing you can "project" a vector into a smaller space while preserving relative distances between points.
QJL pushes this idea to the extreme. It reduces each residual vector — the 1% PolarQuant didn't capture — to a single binary value. Plus one or minus one. A single bit.
The combination gives TurboQuant: PolarQuant captures 99% of the signal in polar coordinates, QJL eliminates the residual in 1 bit. Total: 3 bits per value are enough for perfect accuracy. Classical methods require 16 or 32 bits for the same result.
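The sign-bit mechanism can be sketched as follows. This is a generic sign-projection estimator in the spirit of QJL, not the paper's exact construction: each key is stored as its norm plus one sign bit per random projection, and inner products against a full-precision query are recovered up to the known Gaussian constant sqrt(π/2).

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 4096                       # vector dim, number of 1-bit projections
S = rng.standard_normal((m, d))       # shared random projection, data-independent

def encode(k):
    """Keep only sign bits (1 bit per projection) plus the key's norm."""
    return np.sign(S @ k), np.linalg.norm(k)

def estimate_dot(bits, norm, q):
    """Unbiased estimate of <q, k> from the 1-bit code.
    For Gaussian g: E[sign(g.k) * (g.q)] = sqrt(2/pi) * <q, k> / ||k||,
    hence the sqrt(pi/2) correction below."""
    return np.sqrt(np.pi / 2) / m * norm * (bits @ (S @ q))

k = rng.standard_normal(d)
q = rng.standard_normal(d)
bits, norm = encode(k)
print(estimate_dot(bits, norm, q), k @ q)   # the two values are close
```

The estimate concentrates around the true inner product as the number of projections grows, which is what lets a 1-bit code absorb the residual.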
8x on H100: what this changes in production
The benchmarks leave no room for debate. On an NVIDIA H100 GPU, TurboQuant delivers an 8x speedup on attention logit computation in 4-bit mode, compared to a 32-bit unquantized baseline. KV cache memory is reduced by at least 6x.
Business translation: with the same GPU infrastructure, you can serve 6x more parallel requests. Or offer 6x longer context for the same cost. No fine-tuning, no model modification, no retraining.
Results are validated across five reference benchmarks: LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval. The Needle In A Haystack test — finding a specific piece of information buried in millions of tokens — is particularly telling. TurboQuant maintains perfect performance with 6x memory reduction.
This is exactly the kind of compression that AI agents need for extended memory management. And it's what makes Gemini 2.5 Pro's million-token context technically viable.
| Metric | Result |
|---|---|
| KV cache memory reduction | 6x minimum |
| Attention speedup (4-bit, H100) | 8x vs 32-bit |
| Bits for zero accuracy loss | 3 bits |
| Fine-tuning required | ❌ None |
| Accuracy loss | ❌ None |
| Models tested | Llama-3.1-8B, Gemma, Mistral |
Beyond inference: vector search at Google scale
TurboQuant isn't limited to LLM KV caches. The algorithm also applies to vector search (semantic search) — the technology powering Google Search, YouTube, and recommendation systems at the scale of billions of embedding vectors (numerical representations of meaning).
On this front, TurboQuant outperforms state-of-the-art methods: PQ (Product Quantization) and RaBitQ. The key difference: TurboQuant works without dataset-specific codebooks. It applies directly, with no prior calibration.
This is a technology that touches every Google product using embeddings. Which is nearly all of them.
| Criterion | Classical quantization | PolarQuant alone | TurboQuant |
|---|---|---|---|
| Memory overhead | 1-2 bits/value | 0 bits | 0 bits |
| Accuracy preserved | ⚠️ Partial | ✅ 99% | ✅ 100% |
| Fine-tuning required | ✅ Often | ❌ No | ❌ No |
| Speedup on H100 | Variable | Partial | 8x |
| Applicable to vector search | ❌ No | ❌ No | ✅ Yes |
Key takeaways
- TurboQuant is a Google Research vector compression algorithm for LLM KV caches, presented at ICLR 2026 and AISTATS 2026
- It reduces KV cache memory by 6x minimum and speeds up attention computation by 8x on H100, with zero accuracy loss
- It combines PolarQuant (polar coordinates, zero overhead) and QJL (1-bit residual, zero overhead) to achieve 3 bits per compressed value
- No fine-tuning required: compatible with any existing LLM — Llama-3.1-8B, Gemma, Mistral, Gemini
- Also applicable to vector search at the scale of billions of vectors for Google Search and YouTube
The LLM race has long been a war of parameters. But in 2026, the real differentiator is no longer model size — it's inference efficiency. TurboQuant isn't just another academic paper: it's the algorithm that lets Gemini offer 1 million tokens of context without collapsing Google's data centers. While DeepSeek stacks trillions of parameters and NVIDIA builds superclusters, Google is playing a different game: doing more with less. Every compression point gained here translates to millions of dollars saved at production scale. The most powerful AI won't be the one with the most parameters — it'll be the one that does the most with the least.


