Week 12 · May 2026

IceCache: Cutting LLM Memory Costs Without Cutting Quality

May 16, 2026 · by Satish K C · 8 min read
Deep Learning · Efficiency · LLMs · Optimization

The Paper

"IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs" was authored by Yuzhen Mao, Qitong Wang, Martin Ester, and Ke Li from Simon Fraser University and Harvard University, submitted to arXiv in April 2026. The paper's central claim is that existing KV cache offloading methods fail at long-context inference because they select tokens based on imprecise heuristics rather than semantic relevance - and that replacing these heuristics with a hierarchical semantic index (the DCI-tree) restores accuracy while using only 25% of the token budget required by standard approaches.

The Problem Before This Paper

KV cache memory scales linearly with sequence length. A 128k-token context on a 7B model can consume tens of gigabytes of GPU memory just for the cache - making long-context inference impractical on consumer hardware and expensive at scale. Existing compression strategies fall into two camps: eviction methods (StreamingLLM, SnapKV) that permanently discard tokens based on local attention scores, and offloading methods (MagicPig, ArkVale, PQCache) that page tokens to CPU memory and reload them per query. Both suffer from the same root failure: token selection is guided by sequential position or shallow similarity estimates rather than true semantic relevance. When a query arrives, the wrong tokens get loaded back from CPU, accuracy degrades, and latency spikes from redundant CPU-GPU transfers.
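To put a rough number on that pressure, here is a back-of-envelope sizing sketch. The layer count, KV-head count, and head dimension below are the published Llama-3.1-8B configuration, and an fp16 cache is assumed; this is an estimate, not a figure from the paper.

# Rough KV-cache sizing sketch (assumed config: Llama-3.1-8B, fp16 cache).
n_layers   = 32        # transformer layers
n_kv_heads = 8         # KV heads (grouped-query attention)
head_dim   = 128       # dimension per head
bytes_elem = 2         # fp16
seq_len    = 128_000   # context length in tokens

# Each token stores one key and one value vector per KV head per layer.
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_elem
total_gib = seq_len * bytes_per_token / 2**30
print(f"{bytes_per_token // 1024} KiB per token, {total_gib:.1f} GiB at {seq_len:,} tokens")
# -> 128 KiB per token, 15.6 GiB at 128,000 tokens; a model without grouped-query
#    attention (32 KV heads instead of 8) would need roughly 4x that, well into the
#    tens of gigabytes.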

What They Built

IceCache replaces the flat token storage used by prior offloading methods with a Dynamic Continuous Indexing tree (DCI-tree) - a hierarchical multi-level index that clusters key embeddings by semantic similarity at build time. Tokens are promoted probabilistically across levels with promotion ratio r, producing a geometric distribution of level sizes across the tree: n_ℓ = r · n_{ℓ-1}. At query time, the system traverses the tree top-down to retrieve the most semantically relevant token pages rather than scanning linearly. The key and query embeddings are normalized before indexing:

T_K(k_j) = [ k_j / c , sqrt(1 - ||k_j||^2 / c^2) ]    (key transform)
T_Q(q_i) = [ q_i / ||q_i|| , 0 ]                    (query transform)

This normalization, with c a scaling constant at least as large as the largest key norm, converts the maximum inner product search (MIPS) problem into a nearest-neighbor search, which the DCI-tree solves efficiently. Pages of semantically related tokens are stored contiguously in CPU memory and transferred as units during decoding - reducing both transfer volume and the number of round trips. An optional reuse variant caches recently retrieved pages across decoding steps, cutting time-to-second-token from 7.7s to 5.9s on 36k-token sequences.
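As a quick sanity check on why the transform works, the sketch below (plain NumPy, not the authors' code) applies T_K and T_Q to random vectors and checks that the nearest transformed key is exactly the key with the maximum inner product against the query.

import numpy as np

# Toy check of the MIPS-to-NNS reduction used before indexing (illustrative only).
rng   = np.random.default_rng(0)
keys  = rng.normal(size=(1000, 64))          # key embeddings k_j
query = rng.normal(size=(64,))               # query embedding q_i

c = np.linalg.norm(keys, axis=1).max()       # c must be at least the largest key norm

# T_K(k_j) = [k_j / c, sqrt(1 - ||k_j||^2 / c^2)]  -- appends one coordinate so ||T_K(k_j)|| = 1
k_norms = np.linalg.norm(keys, axis=1)
keys_t  = np.hstack([keys / c, np.sqrt(1.0 - (k_norms / c) ** 2)[:, None]])

# T_Q(q_i) = [q_i / ||q_i||, 0]
query_t = np.append(query / np.linalg.norm(query), 0.0)

# Both transformed vectors are unit length, so minimizing Euclidean distance in the
# lifted space is equivalent to maximizing the original inner product q_i . k_j.
mips_idx = int(np.argmax(keys @ query))
nns_idx  = int(np.argmin(np.linalg.norm(keys_t - query_t, axis=1)))
assert mips_idx == nns_idx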

DCI-Tree Index: Hierarchical semantic clustering of key embeddings. Top-down traversal at query time retrieves relevant pages, not random token windows (a toy sketch of the traversal follows below).
Paged Storage: Semantically related tokens stored contiguously in CPU memory. Entire pages transferred per query - fewer round trips, higher bandwidth utilization.
Reuse Variant: Recently retrieved pages cached across decoding steps. Cuts time-to-second-token by 23% on 36k-token sequences.
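The full DCI-tree has more machinery than fits in a post, but a minimal sketch of the two ideas named above - geometric level sizes controlled by a promotion ratio r, and a top-down traversal that narrows the candidate set level by level - could look like this. Names, parameters, and structure are illustrative assumptions, not the authors' implementation.

import numpy as np

def build_levels(keys, r=0.25, min_size=32, rng=None):
    # Level 0 holds every token; each further level keeps roughly a fraction r of the
    # previous one, giving the geometric level sizes n_l = r * n_{l-1} described above.
    rng = rng or np.random.default_rng(0)
    levels = [np.arange(len(keys))]
    while len(levels[-1]) * r >= min_size:
        prev = levels[-1]
        levels.append(prev[rng.random(len(prev)) < r])
    return levels                                   # levels[-1] is the small top level

def assign_parents(keys, levels):
    # Map every node to its nearest node on the level above (brute force for clarity).
    parents = []
    for lower, upper in zip(levels[:-1], levels[1:]):
        d = np.linalg.norm(keys[lower][:, None] - keys[upper][None, :], axis=-1)
        parents.append(upper[np.argmin(d, axis=1)])
    return parents

def query_top_down(keys, levels, parents, q, beam=8, top_k=16):
    # Start at the top level and descend, keeping only children of the closest `beam` nodes.
    cand = levels[-1]
    for lower, parent_of in zip(levels[-2::-1], parents[::-1]):
        nearest = cand[np.argsort(np.linalg.norm(keys[cand] - q, axis=1))[:beam]]
        cand = lower[np.isin(parent_of, nearest)]
    d = np.linalg.norm(keys[cand] - q, axis=1)
    return cand[np.argsort(d)[:top_k]]              # approximate nearest token indices

rng = np.random.default_rng(1)
keys = rng.normal(size=(4096, 64)).astype(np.float32)
levels  = build_levels(keys, r=0.25, rng=rng)
parents = assign_parents(keys, levels)
print(query_top_down(keys, levels, parents, rng.normal(size=64).astype(np.float32)))

In the real system the returned candidates map to contiguous pages in CPU memory, so the final step loads whole pages rather than individual tokens.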

Key Findings

Results

Method          Model          Budget       LongBench Avg
Full KV-cache   Llama-3.1-8B   100%         49.5
IceCache        Llama-3.1-8B   256 tokens   49.0
PQCache         Llama-3.1-8B   256 tokens   47.3
Full KV-cache   Mistral-7B     100%         42.2
IceCache        Mistral-7B     256 tokens   41.7
MagicPig        Mistral-7B     256 tokens   39.1
IceCache        Llama-3.1-8B   64 tokens    47.8

On latency, at a 36k-token sequence length, IceCache's reuse variant achieves a time-to-second-token of 5.9 seconds versus 7.7 seconds for the base variant, with time-per-output-token of 0.06 seconds. The latency breakdown shows DCI-query at 0.05s, decoding at 0.04s, and loading at 0.015s - query and decoding together dominate, not transfer, which confirms that the semantic clustering is reducing the number of tokens that need to move across the PCIe bus.
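The reuse variant can be pictured as a small GPU-side cache of retrieved pages that persists across decoding steps. The sketch below uses a plain LRU policy and a hypothetical load_from_cpu callback - both my assumptions, since the paper only says recently retrieved pages are kept.

from collections import OrderedDict

class PageReuseCache:
    # Illustrative page-reuse cache: pages fetched on earlier decoding steps stay resident
    # so that repeated requests for the same pages skip the CPU-GPU transfer entirely.
    def __init__(self, capacity_pages=64):
        self.capacity = capacity_pages
        self.pages = OrderedDict()                    # page_id -> page of K/V vectors

    def fetch(self, page_ids, load_from_cpu):
        out = {}
        for pid in page_ids:
            if pid in self.pages:
                self.pages.move_to_end(pid)           # hit: reuse, no transfer
            else:
                self.pages[pid] = load_from_cpu(pid)  # miss: one CPU->GPU page transfer
                if len(self.pages) > self.capacity:
                    self.pages.popitem(last=False)    # evict least-recently-used page
            out[pid] = self.pages[pid]
        return out

Fewer misses per step means fewer PCIe round trips, which is consistent with the reported drop in time-to-second-token from 7.7s to 5.9s.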

Why This Matters for AI and Automation

Practical implications

My Take

The core insight - that semantic similarity is a better selection criterion than recency or local attention weight - is correct and well-evidenced. The 4x efficiency gain over PQCache at matched accuracy (64 tokens vs 256 tokens) is the number that should get attention, not the headline 99% figure which sounds impressive but obscures that the baseline already works reasonably well. What IceCache actually demonstrates is that prior methods were leaving significant efficiency on the table by ignoring the semantic structure of the context. The open question is build-time cost: constructing and maintaining the DCI-tree during prefill adds overhead that the paper does not fully quantify at context lengths beyond 250k tokens. The RULER results on Qwen3-4B at 250k tokens show latency scaling slower than full cache, which is the right direction, but the prefill cost for the index itself needs more scrutiny before this becomes a production recommendation. For practitioners running 32k-128k contexts today, this is worth evaluating seriously.

Discussion Question

IceCache shows that selecting tokens by semantic relevance beats selecting by position or attention weight - but building the semantic index adds prefill overhead. At what context length does the inference efficiency gain justify the indexing cost, and how would you evaluate that tradeoff in your own deployment?
