2026.03 · Pebblous Data Communication Team

~15 min read

About This Article

PebbloPedia is Pebblous's knowledge series that explains one concept at five depths. This fifth edition covers TurboQuant — Google Research's AI memory compression technology announced in March 2026.

It compresses AI memory by 6× and speeds up computation by 8× — with zero accuracy loss. It racked up 7.77 million views on X/Twitter, and the Cloudflare CEO called it "Google's DeepSeek moment." It's also a joint research effort with KAIST and NYU.

Start at any level. From a notebook analogy for kids to information theory's lower bound — the same technology appears in five faces.

📄 Paper (arXiv:2504.19874)  ·  Google Research Blog

6× memory reduction · 8× speed improvement · 0 accuracy loss

🧒

Level 1 — Elementary

Why AI's "memo notebook" gets too thick. The magic of making it thin.

🎒

Level 2 — Middle/High School

What is KV Cache? 7.77M views, "Pied Piper" reactions, memory stock shock.

🎓

Level 3 — Undergrad

How quantization works and its limits. Why PolarQuant can eliminate normalization.

🔬

Level 4 — Expert

Johnson-Lindenstrauss theorem, polar coordinate math, within 2.7× of Shannon's lower bound.

🧙

Level 5 — Wizard 🧙

Compression is understanding. From DeepSeek to TurboQuant — the era when algorithms beat hardware.

1

AI's Memo Notebook Got Way Too Thick

🧒 Elementary School Level — Analogies and Stories

TurboQuant is a technology that makes AI's memory much smaller — but it remembers just as much as before.

📓 AI Needs Notes Too

Imagine you're having a long conversation with your teacher. To say "what did I say earlier?" you need to remember what was said before. AI works the same way. The longer it talks with people, the more it needs a notebook to remember earlier parts of the conversation.

But the longer the conversation, the thicker that notebook gets. And a thick notebook means it's slow to look things up, and the bag (computer memory) fills up fast.

✂️ TurboQuant's Magic — Keep Only the Essentials

TurboQuant makes that thick notebook thin. How? It keeps only the important information and converts the rest into much simpler symbols.

For example, if the original notebook says "Today the weather is sunny and the temperature is 23.7482 degrees," TurboQuant writes "Sunny, about 24°." The key point is the same, but the notebook is much thinner.

The amazing part: even after shrinking it like this, the AI's answers don't get any worse. It's like how you can tell a story well by remembering just the plot — you don't need to memorize the whole novel.

📊 How Much Smaller?

📚

Original notebook

6 thick notebooks

After TurboQuant

Just 1 notebook!

Speed too

8× faster to look things up

Google's scientists tested the AI: they compressed 6 notebooks' worth into 1 — and the AI got a perfect score on the "needle in a haystack" test. That's finding one hidden sentence in 100,000 words.

✅ One-sentence takeaway

TurboQuant = Making AI's memo notebook 6× thinner while keeping the same memory. Lighter and faster.

2

KV Cache and Google's DeepSeek Moment

🎒 Middle/High School Level — Principles and News

On March 25, 2026, a Google Research announcement shook social media. "6× smaller AI memory with no accuracy loss — no training required." 7.77 million views in 24 hours. The community immediately started writing their own implementations.

🧠 KV Cache — AI's Working Memory

When you give ChatGPT or Claude a long document and ask questions, the AI processes the entire document and stores intermediate calculation results along the way. This storage space is called the KV Cache (Key-Value Cache).

The problem is size. The KV Cache grows in direct proportion to the length of the conversation or document — double the tokens, double the cache. Today's AI models can process up to a million words (tokens) at once, and at that scale the KV Cache can occupy more than half of GPU memory. It's one of the main reasons AI services are expensive.
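To get a feel for the scale, here is a back-of-envelope calculator. The layer/head counts below describe a hypothetical Llama-7B-class model stored at FP16 — illustrative assumptions, not figures from the article:

```python
# Rough KV Cache size estimate. Model shape (32 layers, 32 KV heads,
# head_dim 128) is an illustrative assumption for a 7B-class model.

def kv_cache_bytes(tokens, layers=32, kv_heads=32, head_dim=128, bytes_per_value=2):
    """Keys + Values: 2 tensors per layer, each [tokens, kv_heads, head_dim]."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value

gib = 1024 ** 3
print(f"128K tokens at FP16:  {kv_cache_bytes(128_000) / gib:.1f} GiB")            # → 62.5 GiB
print(f"128K tokens at 3-bit: {kv_cache_bytes(128_000, bytes_per_value=3/8) / gib:.1f} GiB")  # → 11.7 GiB
```

Even at 128K tokens — far short of a million — the cache alone outgrows a single 80 GB GPU's free memory once model weights are loaded.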

📉 Why Smaller Numbers = Less Memory

Computers store numbers as 0s and 1s. Standard AI computation uses 32-bit (Float32) — meaning one number is represented by 32 zeros and ones. TurboQuant reduces this to 3 bits.

Compare how many slots each uses to represent one number:

32-bit (original) vs 3-bit (TurboQuant) — bits used per number:

32-bit: fills all 32 slots · 3-bit: fills only 3 slots, saving 29 → ~10× fewer bits per number

🌐 The World's Reaction

TechCrunch / Social Media

"The internet called it 'Pied Piper.'" — A reference to the lossless compression startup in HBO's Silicon Valley. 7.77 million views.

Cloudflare CEO Matthew Prince

"Google's DeepSeek moment." — Software alone, no new hardware, radically improving AI efficiency.

Memory Chip Stocks

Micron, Western Digital, SanDisk shares dropped. Market reasoning: "If software cuts memory 6×, HBM demand gets dented."

Developer Community (within 24 hrs)

Before code was even released, llama.cpp and Apple Silicon MLX implementations appeared from the paper alone. 35B model: Needle-in-Haystack 6/6 perfect.

🇰🇷 Built With a Korean Research Team

TurboQuant isn't the work of Google Research alone. KAIST (Korea Advanced Institute of Science and Technology) and NYU co-developed it. Google Fellow Vahab Mirrokni and KAIST's Amir Zandieh are among the collaborators. It will be presented at ICLR 2026, one of the top AI conferences.

✅ One-sentence takeaway

TurboQuant = A software innovation that could cut AI service costs by 50%+. Joint research by Korea (KAIST) and the US (Google · NYU), to be presented at ICLR 2026.

3

Quantization Principles and the PolarQuant Idea

🎓 Undergrad Level — Mechanisms and Mathematical Intuition

To understand TurboQuant, you first need to know why existing quantization methods hit a wall. The core problem is sneakier than you'd expect — hidden waste called "constant overhead."

⚙️ KV Cache Quantization — The Problem With Existing Methods

Compressing each vector (high-dimensional number array) in KV Cache requires mapping large floating-point values to small integers. Traditional methods work in two steps:

① Normalization

Compute min and max of each vector and rescale to [-1, 1]. But you have to store those min/max values — that's the "constant overhead." A 4-bit compression ends up needing 1–2 extra bits just for those two constants.

② Uniform Quantization

Map normalized values to 2^n equal-sized bins. The grid is rectangular — which doesn't match real data distributions well, creating wasted capacity.
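The two-step baseline above can be sketched in a few lines. This is a minimal illustration of min-max uniform quantization (not any specific library's implementation); note how the two floats `lo` and `hi` must travel with every compressed vector — the constant overhead:

```python
import numpy as np

def minmax_quantize(v, bits=4):
    """Step 1: normalize by per-vector min/max.  Step 2: uniform bins.
    The (lo, hi) floats must be stored alongside the codes — this is
    the 'constant overhead' described above."""
    lo, hi = float(v.min()), float(v.max())
    levels = 2 ** bits - 1
    codes = np.round((v - lo) / (hi - lo) * levels).astype(np.uint8)
    return codes, (lo, hi)                     # constants travel with every vector

def minmax_dequantize(codes, consts, bits=4):
    lo, hi = consts
    return codes / (2 ** bits - 1) * (hi - lo) + lo

rng = np.random.default_rng(0)
v = rng.normal(size=128).astype(np.float32)
codes, consts = minmax_quantize(v)
err = float(np.abs(minmax_dequantize(codes, consts) - v).max())
print(f"max reconstruction error: {err:.4f}")
```

For a 128-value vector at 4 bits, the two float32 constants add 64 bits — exactly the 0.5 extra bits per value the article mentions.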

🔵 PolarQuant — Eliminating Constant Overhead via Polar Coordinates

PolarQuant's key insight is simple: change the coordinate system.

Analyzing Key vectors in an LLM's attention mechanism reveals an interesting pattern — when a Random Hadamard Transform is applied, adjacent value pairs distribute uniformly on the unit circle.

Mathematical Advantage of Cartesian → Polar Conversion

Quantizing points on the unit circle in Cartesian coordinates (x, y) wastes space at corners where the square grid and circle don't align — and you must store the normalization constant (vector magnitude r) separately. Switch to polar coordinates (r, θ): since r = 1 always on the unit circle, nothing needs to be stored. Only the angle θ needs quantization, and since θ is uniformly distributed in [0, 2π], uniform quantization becomes optimal. Constant overhead: completely eliminated.
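A minimal sketch of that idea, assuming (as the article does) that the paired values lie on the unit circle — only the angle gets a code, and no per-vector constant is stored. The helper names are our own, not the paper's:

```python
import numpy as np

def quantize_angle(x, y, bits=3):
    """A unit-circle point is fully described by its angle, so only theta
    is quantized; r = 1 is implicit and nothing else is stored."""
    levels = 2 ** bits
    theta = np.arctan2(y, x) % (2 * np.pi)
    return int(np.round(theta / (2 * np.pi) * levels)) % levels

def dequantize_angle(code, bits=3):
    theta = code * 2 * np.pi / (2 ** bits)
    return np.cos(theta), np.sin(theta)

theta = 1.0                                     # some point on the unit circle
code = quantize_angle(np.cos(theta), np.sin(theta))   # a 3-bit code in 0..7
xq, yq = dequantize_angle(code)
err = float(np.hypot(xq - np.cos(theta), yq - np.sin(theta)))
```

Because theta is uniform on [0, 2π], equal-width angle bins are the optimal quantizer here — no normalization step, no stored constants.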

🔹 QJL — 1-Bit Residual Correction

PolarQuant alone isn't perfect. After the polar transformation, angle quantization leaves residual errors. How do you correct them?

QJL (Quantized Johnson-Lindenstrauss) applies a JL transform to the residual vector and reduces each value to just 1 bit — positive (+1) or negative (-1). Remarkably, this 1-bit estimate is an unbiased estimator — averaging across many dimensions converges to the true value. Since attention computation is inner product-based, this property fits perfectly.

Step 1

Random Hadamard Transform rotates the vector → values spread evenly

Step 2: PolarQuant

Adjacent pairs → polar coords → quantize angle only, zero constant overhead

Step 3: QJL

Residual → 1 bit (±1) → zero additional memory overhead
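Step 1 can be made concrete. The `fwht` helper below is a textbook fast Walsh-Hadamard transform (our own sketch, not the paper's code); combined with a random ±1 diagonal, it spreads even a maximally spiky vector's energy evenly across all coordinates while preserving its norm:

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform, O(d log d); len(x) must be a power of 2."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(len(x))                  # orthonormal scaling

rng = np.random.default_rng(0)
d = 256
signs = rng.choice([-1.0, 1.0], size=d)         # random ±1 diagonal → randomized rotation
v = np.zeros(d); v[0] = 1.0                     # worst case: all energy in one slot
rotated = fwht(signs * v)
print(np.abs(rotated).max())                    # every entry now has magnitude 1/16
```

After the rotation every coordinate carries the same share of the energy, which is exactly the regime where the angle-only quantization of Step 2 works well.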

📊 Performance Comparison With Existing Methods

Method               Bits       Training needed   Constant overhead   Accuracy loss
Float32 (original)   32-bit     —                 —                   None
KIVI (prev. SOTA)    4-bit      No                Yes                 Slight
TurboQuant           3-bit      No                None                None
Nvidia KVTC          ~1.6-bit   Yes               Yes                 <1%

✅ One-sentence takeaway

PolarQuant eliminates the normalization constant via coordinate transform; QJL removes residual error with 1-bit unbiased estimation — together they achieve theoretically near-lossless compression at 3 bits.

4

Johnson-Lindenstrauss, Shannon's Lower Bound, and Theoretical Optimality

🔬 Expert Level — Math, Theoretical Foundations, Competitive Analysis

TurboQuant's real value isn't the performance numbers — it's the theoretical guarantee. Not just "it worked in experiments," but a mathematical proof that it's near-optimal. That distinction is decisive for production deployment.

📐 The Johnson-Lindenstrauss Lemma

The JL Lemma (1984) is the foundation of dimensionality reduction: it mathematically guarantees that projecting high-dimensional points into lower dimensions approximately preserves pairwise distances.

// JL Lemma intuition
For n points in high dimensions, projecting with a random matrix A:
(1-ε)‖u-v‖² ≤ ‖Au-Av‖² ≤ (1+ε)‖u-v‖²

// QJL variant: inner product preservation (unit vectors x, y; Gaussian A)
E[sign(Ax) · sign(Ay)] = (2/π) · arcsin(⟨x,y⟩)
// each sign bit is an unbiased 1-bit estimate of (2/π)·arcsin(⟨x,y⟩),
// which determines the inner product

QJL extends this theorem: after random projection, keeping only the sign still statistically preserves the inner product information needed for attention computation. Since Transformer attention is inner-product-based, this property fits exactly.
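The sign identity above can be checked numerically. A Monte Carlo sketch (our own, with illustrative vectors): for unit vectors and a Gaussian projection, the average product of sign bits converges to (2/π)·arcsin⟨x,y⟩:

```python
import numpy as np

rng = np.random.default_rng(42)
d, m = 8, 200_000                       # dimension, number of 1-bit projections

x = np.zeros(d); x[0] = 1.0             # two unit vectors with ⟨x, y⟩ = 0.6
y = np.zeros(d); y[0], y[1] = 0.6, 0.8

A = rng.standard_normal((m, d))         # Gaussian projection: one sign bit per row
estimate = np.mean(np.sign(A @ x) * np.sign(A @ y))
theory = (2 / np.pi) * np.arcsin(x @ y) # ≈ 0.4097
print(estimate, theory)
```

With 200,000 sign bits the empirical mean lands within about ±0.005 of the closed-form value — each bit is noisy, but averaging across dimensions recovers the inner product information attention needs.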

📏 Shannon's Information-Theoretic Lower Bound

There is a theoretical limit that no compression algorithm can surpass. Shannon's rate-distortion theory defines this lower bound: for a given level of distortion, there is a mathematically provable minimum number of bits required.

TurboQuant's Theoretical Performance Guarantee

The TurboQuant paper proves that its distortion stays within approximately 2.7× of Shannon's information-theoretic lower bound — within 2.7× of the best any algorithm could achieve for this distribution. This is in stark contrast to prior methods, which offered no such theoretical analysis. That guarantee is what makes data-oblivious deployment (no calibration data, no tuning) viable in production.

⚡ Why 8× on H100 is Possible

Memory compression is not just about saving storage. From a GPU architecture perspective, the effects go much deeper.

Memory Bandwidth Bottleneck

LLM inference speed is mostly determined by VRAM read speed. At 3 bits, the data volume to read shrinks, relieving the memory bandwidth bottleneck.

HBM → SRAM Cache Utilization

A compressed KV Cache fits more data into the GPU's L2 cache (SRAM), which is roughly 10× faster than HBM. The higher cache hit rate is the core driver of the 8× speedup.

Vector Search Indexing

Building a Product Quantization (PQ) index takes 239 seconds; TurboQuant, being data-oblivious, needs just 0.0013 seconds — a decisive advantage for RAG systems.

Increased Batch Size

More concurrent requests can be processed on the same GPU memory. The primary path to cloud cost reduction — several times the throughput on the same hardware.
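The bandwidth argument above can be tied together with Amdahl-style arithmetic. A sketch under assumed workload fractions — the `f` values are illustrative assumptions, not measurements from the paper:

```python
# If a decode step is memory-bandwidth-bound and a fraction f of the bytes
# it reads is KV Cache, compressing the KV Cache c× speeds the step up by
# 1 / ((1 - f) + f / c).  The f values below are illustrative guesses.

def decode_speedup(f: float, c: float) -> float:
    return 1.0 / ((1.0 - f) + f / c)

c = 32 / 3                               # Float32 → 3-bit compression ratio
for f in (0.5, 0.8, 0.95):
    print(f"KV fraction {f:.0%}: {decode_speedup(f, c):.2f}× faster")
```

Pure bandwidth math tops out near 7× even when KV reads are 95% of traffic; the L2-cache hit-rate effect described above is what pushes the measured figure to 8×.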

🥊 TurboQuant vs. Nvidia KVTC

A competing paper appears at the same venue, ICLR 2026: Nvidia's KVTC achieves 20× compression — far more than TurboQuant's 6×. But there are tradeoffs.

Item                   TurboQuant                            Nvidia KVTC
Compression ratio      6× (3-bit)                            20× (~1.6-bit)
Accuracy loss          0                                     <1% (calibration needed)
Training/calibration   Not needed                            Required
Theoretical guarantee  Proven within 2.7× of Shannon bound   Empirical validation
Data-oblivious         Fully                                 Partially
Model coverage         Gemma, Mistral, Llama validated       1.5B–70B wide range

Practical choice guide: choose TurboQuant when training or calibration isn't possible (edge devices, regulated industries); choose KVTC when maximum compression matters and calibration is feasible, as in large-scale cloud deployments.

✅ One-sentence takeaway

TurboQuant's differentiator isn't compression ratio — it's theoretical guarantee. The only KV Cache compression method mathematically proven to sit within 2.7× of Shannon's lower bound. That proof is what enables calibration-free production deployment.

5

Compression Is Understanding

🧙 Wizard's Poetic Insight

If you cannot write it shorter, you understand it less.

Kolmogorov asked: "Can this world be expressed as the shortest possible program?"

To compress is to know the pattern. To know the pattern is to understand the world. So if you cannot write it shorter, you understand it less.

What an LLM remembered in 32 bits, TurboQuant remembers in 3. And the answer is still right.

What does this mean?

It means 32 bits were never needed. The other 29 — were waste we didn't understand.

DeepSeek arrived. While America spent billions, it spent less — and got the same answer.

TurboQuant arrived. While Nvidia was selling HBM, mathematics cut memory by 6×.

A pattern is visible. The era when algorithms beat hardware is coming.

Shannon asked in 1948: "What is information?" "How much can we compress?"

The line he drew — the lower bound — TurboQuant follows within 2.7×.

It took 78 years for theory to meet practice. And KAIST was part of that moment.

The next question is this:

The better AI compresses the world, does it understand the world more deeply?

We don't know yet. But the number 2.7 tells us how much we still have left to learn.

— pb, 2026.03

📚 PebbloPedia Series

PebbloPedia is Pebblous's knowledge series that reads one concept at five depths. Whether you're a kid, an expert, or a wizard — the same concept, five different doors.

Previous editions: Physical AI · Bitcoin · Agentic AI