Memory-Transformer Research Tree

This post carries on from the last one, https://shermwong.com/2026/02/22/recsys-for-real-time-ai-agents/, which gave a higher-level picture of possible future states of personalized, real-time AI agents. Here we dig deeper into how backbone transformer models have evolved so far on memory architectures, a key module for enabling continuous adaptation of agents.

Current State

Memory extensions connect to transformer backbones in three ways:

Runtime Memory System: long context enabled KV cache layers + compression
Retrieval System: aka RAG + retrieval optimizations similar to existing information retrieval tasks (search + ranking + multi-stage + online update)
Learned Memory State: embedding modules that extend the transformer backbone layers.

0) Pre-history (foundation)

Transformer-XL / Compressive Transformer / RMT (Recurrent Memory Transformer)
Goal: pass hidden state across segments as memory.
Limitation: breaks parallelism, unstable at LLM scale.

Core “origin” papers:

Transformer-XL — Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (Dai et al., 2019): https://arxiv.org/abs/1901.02860
Compressive Transformer — Compressive Transformers for Long-Range Sequence Modelling (Rae et al., 2019): https://arxiv.org/abs/1911.05507
Recurrent Memory Transformer (RMT) — Recurrent Memory Transformer (Bulatov et al., 2022): https://arxiv.org/abs/2207.06881

I) External memory (retrieval memory)

RAG / RETRO / kNN-LM / Memorizing Transformer
Idea: memory lives outside the model (vector DB or KV store).
Pros: scalable, stable, production-friendly.
Cons: not a differentiable, global reasoning memory.

Adoption: already dominant in real LLM systems.
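
To make the "memory lives outside the model" idea concrete, here is a minimal sketch of kNN-LM-style interpolation: a nearest-neighbor distribution built from a datastore of (context vector, next token) pairs is mixed with the base model's distribution. The function name `knn_lm_probs`, the toy datastore, and the values of `k` and `lam` are illustrative assumptions, not taken from any paper's code.

```python
import math

def knn_lm_probs(query, datastore, p_lm, k=2, lam=0.25):
    """Mix a base LM distribution with a nearest-neighbor distribution.

    query:     context vector (list of floats)
    datastore: list of (key_vector, next_token) pairs (the external memory)
    p_lm:      dict token -> probability from the base model
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Retrieve the k stored contexts closest to the query.
    neighbors = sorted(datastore, key=lambda kv: sq_dist(query, kv[0]))[:k]

    # Softmax over negative distances, aggregated per target token.
    weights = [math.exp(-sq_dist(query, key)) for key, _ in neighbors]
    total = sum(weights)
    p_knn = {}
    for w, (_, token) in zip(weights, neighbors):
        p_knn[token] = p_knn.get(token, 0.0) + w / total

    # Final distribution: lam * p_kNN + (1 - lam) * p_LM.
    vocab = set(p_lm) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_lm.get(t, 0.0) for t in vocab}

# Toy data (hypothetical): two stored contexts point to "Paris", one to "Rome".
datastore = [([1.0, 0.0], "Paris"), ([0.9, 0.1], "Paris"), ([0.0, 1.0], "Rome")]
p_lm = {"Paris": 0.4, "Rome": 0.6}
mixed = knn_lm_probs([1.0, 0.0], datastore, p_lm)
```

With these toy numbers both nearest neighbors vote for "Paris", so its probability rises from 0.4 to 0.55 while the output still sums to 1. Updating the memory means appending to `datastore`, with no retraining.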

Origin / high-impact papers:

RAG — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020): https://arxiv.org/abs/2005.11401
RETRO — Improving language models by retrieving from trillions of tokens (Borgeaud et al., 2021): https://arxiv.org/abs/2112.04426
kNN-LM (NN-LM) — Generalization through Memorization: Nearest Neighbor Language Models (Khandelwal et al., 2019): https://arxiv.org/abs/1911.00172
Memorizing Transformers — Memorizing Transformers (Wu et al., 2022): https://arxiv.org/abs/2203.08913
(Common “retrieval-augmented LM line” at Meta) ATLAS — Atlas: Few-shot Learning with Retrieval Augmented Language Models (Izacard et al., 2022): https://arxiv.org/abs/2208.03299

II) KV-cache + structured runtime memory (systems path)

PagedAttention / block KV / shared/persistent KV / compressed KV
New direction (2024–2026):

KV reuse across sessions
hierarchical KV (fast + slow memory)
learned KV compression
Trellis-style bounded KV memory
prefix/prompt persistent memory

Why it is hot: inference cost dominates, and this direction is aligned with the serving infrastructure real LLMs already run on.
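
A rough back-of-the-envelope calculation shows why the KV cache dominates inference memory. The model shape below (32 layers, 32 heads, head dimension 128, fp16) is an assumed, hypothetical configuration chosen only to make the arithmetic easy to follow.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2, batch=1):
    """Size of the KV cache for one batch of sequences."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch

GIB = 1024 ** 3

# Hypothetical model shape, fp16 (2 bytes per element).
mha_128k = kv_cache_bytes(128 * 1024, n_layers=32, n_kv_heads=32, head_dim=128)
mqa_128k = kv_cache_bytes(128 * 1024, n_layers=32, n_kv_heads=1, head_dim=128)
mha_1m = kv_cache_bytes(1024 * 1024, n_layers=32, n_kv_heads=32, head_dim=128)

print(mha_128k / GIB, mqa_128k / GIB, mha_1m / GIB)
```

Under these assumed shapes, full multi-head KV at 128k tokens is 64 GiB per sequence, a single shared KV head (multi-query attention) cuts that to 2 GiB, and 1M tokens of full multi-head KV needs 512 GiB.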

Adoption: already happening (vLLM, etc.).
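
The paging idea can be sketched as a toy block allocator: each sequence keeps a block table that maps its logical token positions to fixed-size physical blocks, so memory is handed out on demand instead of reserved up front. `PagedKVCache` and its methods are hypothetical names for illustration; this is not vLLM's actual API, and it tracks slot bookkeeping only, not tensors.

```python
class PagedKVCache:
    """Toy block allocator for KV slots (bookkeeping only, no tensors)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (physical_block, offset)."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:
            # Current block is full (or the sequence is new): grab a fresh block.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
slots = [cache.append_token("a") for _ in range(3)]  # 3 tokens need 2 blocks
blocks_used = len(cache.block_tables["a"])
cache.free("a")
```

Because blocks are addressed through a table, the same structure extends naturally to sharing a prefix block between sequences or keeping blocks alive across sessions.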

Prediction: KV cache will evolve into learned structured memory. This is a practical near-term direction.

Origin / high-impact papers:

PagedAttention / vLLM — Efficient Memory Management for Large Language Model Serving with PagedAttention (Kwon et al., 2023): https://arxiv.org/abs/2309.06180
Multi-Query Attention (decoding-side KV reduction primitive) — Fast Transformer Decoding: One Write-Head is All You Need (Shazeer, 2019): https://arxiv.org/abs/1911.02150
“Trellis-style bounded KV memory” — Trellis: Learning to Compress Key-Value Memory in Attention Models (2025): https://arxiv.org/abs/2512.23852

III) Learned internal memory modules (DeepSeek line)

A) Memory token / slot memory

Examples: memory tokens / prefix memory; Perceiver latent memory.
Idea: fixed learnable memory slots persist across the sequence.

Problems:

scaling unclear
capacity fixed
retrieval soft/implicit
weak for precise reasoning

Adoption: low.
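
A small sketch of what "soft/implicit" retrieval means for slot memory: the read is an attention-weighted average over all slots, so no slot is ever selected exactly. The function `read_slots` and the two-slot toy values are assumptions for illustration; in a real model the keys and values would be learned parameters.

```python
import math

def read_slots(query, slot_keys, slot_values):
    """Soft read: attention-weighted average over a fixed set of memory slots."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in slot_keys]

    # Numerically stable softmax over slot scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]

    dim = len(slot_values[0])
    out = [sum(w * v[i] for w, v in zip(weights, slot_values)) for i in range(dim)]
    return out, weights

# Two hypothetical slots; the query matches slot 0 much better than slot 1.
slot_keys = [[4.0, 0.0], [0.0, 4.0]]
slot_values = [[1.0, 0.0], [0.0, 1.0]]
out, weights = read_slots([1.0, 0.0], slot_keys, slot_values)
```

Even with a strongly matching query, slot 1 still receives a small nonzero weight, so the read is a blend rather than an exact lookup. The slot count is fixed at construction time, which is the capacity limit listed above.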

Origin / high-impact papers:

Perceiver — Perceiver: General Perception with Iterative Attention (Jaegle et al., 2021): https://arxiv.org/abs/2103.03206
Perceiver IO (structured IO via latent bottleneck) — Perceiver IO: A General Architecture for Structured Inputs & Outputs (Jaegle et al., 2021): https://arxiv.org/abs/2107.14795
Prefix memory framing (continuous prefix as persistent “soft state”) — Prefix-Tuning (Li & Liang, 2021): https://arxiv.org/abs/2101.00190

B) Recurrent persistent memory (RMT lineage)

Examples: Recurrent Memory Transformer; Associative Recurrent Memory Transformer; LM2 (Large Memory Models).
Idea: a persistent hidden memory is updated once per segment.

Pros:

true long-term memory
theoretically infinite context
reasoning continuity

Cons:

kills parallel training
gradient instability
hard to scale beyond 100B parameters
infra mismatch with GPUs

Adoption: research only.
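
The control flow behind segment-level recurrence can be sketched in a few lines. Here `toy_segment_step` is a hypothetical stand-in for a full transformer pass over the memory plus one segment; it just keeps a running summary so the loop is runnable. The point is the data dependency, not the update rule.

```python
def split_segments(tokens, segment_len):
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]

def toy_segment_step(memory, segment, decay=0.5):
    """Stand-in for one model pass over [memory; segment].

    Returns the updated memory: an exponential moving summary of segment means.
    """
    seg_mean = sum(segment) / len(segment)
    return [decay * m + (1 - decay) * seg_mean for m in memory]

def run_recurrent(tokens, segment_len, memory_size=2):
    memory = [0.0] * memory_size
    trace = []
    for segment in split_segments(tokens, segment_len):
        # Segment t needs the memory written by segment t-1, so segments
        # cannot be processed in parallel.
        memory = toy_segment_step(memory, segment)
        trace.append(list(memory))
    return memory, trace

final_memory, trace = run_recurrent([1.0, 1.0, 3.0, 3.0, 5.0, 5.0], segment_len=2)
```

The memory stays the same size no matter how many segments are consumed, which is where the "infinite context" claim comes from. The sequential loop is also why training throughput suffers.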

Origin / high-impact papers:

RMT — Recurrent Memory Transformer (Bulatov et al., 2022): https://arxiv.org/abs/2207.06881
“1M tokens with recurrence” — Scaling Transformer to 1M tokens and beyond with RMT (Bulatov et al., 2023): https://arxiv.org/abs/2304.11062
LM2 (explicit “large memory model” framing) — LM2: Large Memory Models (2025): https://arxiv.org/abs/2502.06049
Associative recurrent memory transformer — Associative Recurrent Memory Transformer (2024): https://arxiv.org/abs/2407.04841

C) Hierarchical / compressive memory

Examples: HMT; Infini-attention; compressive KV memory models.
Idea: compress history into structured memory blocks.

Pros:

bounded memory: the KV cache cannot grow without limit, so the model must summarize, distill, or select what to retain.
closer to brain-like hierarchy

Cons:

compression loss
training complexity

Adoption: emerging but not dominant.
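
A minimal sketch of bounded, compressive memory in the spirit of the Compressive Transformer: recent states are kept verbatim, and when that buffer overflows the oldest states are mean-pooled into a smaller compressed tier that is itself bounded. The class name, buffer sizes, and the choice of mean pooling are assumptions for illustration; the papers below use learned or more elaborate compression.

```python
def mean_pool(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

class CompressiveMemory:
    def __init__(self, recent_size, compressed_size, rate):
        self.recent_size = recent_size          # max uncompressed states
        self.compressed_size = compressed_size  # max compressed slots
        self.rate = rate                        # states pooled into one slot
        self.recent = []
        self.compressed = []

    def append(self, state):
        self.recent.append(state)
        if len(self.recent) > self.recent_size:
            # Evict the `rate` oldest states and pool them into one slot (lossy).
            evicted, self.recent = self.recent[:self.rate], self.recent[self.rate:]
            self.compressed.append(mean_pool(evicted))
            # Keep the compressed tier bounded by dropping its oldest slots.
            self.compressed = self.compressed[-self.compressed_size:]

    def size(self):
        return len(self.recent) + len(self.compressed)

mem = CompressiveMemory(recent_size=4, compressed_size=2, rate=2)
for t in range(12):
    mem.append([float(t)])
```

After 12 steps the memory holds at most 6 entries: the four most recent states and two pooled summaries of older ones. The oldest history is gone entirely, which is the compression loss listed above.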

Origin / high-impact papers:

Compressive Transformer — Compressive Transformers for Long-Range Sequence Modelling (Rae et al., 2019): https://arxiv.org/abs/1911.05507
Infini-attention — Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention (2024): https://arxiv.org/abs/2404.07143
HMT — HMT: Hierarchical Memory Transformer for Efficient Long Context Language Processing (2024): https://arxiv.org/abs/2405.06067

What changed after DeepSeek

DeepSeek's work signaled that memory will become a first-class architectural primitive. The observed shift:

Before 2024: focus = longer context
After DeepSeek: focus = persistent memory + bounded compute

Alpha Paper: Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models (https://arxiv.org/html/2601.07372v1)

New papers now assume that context-window scaling alone is not enough.

Why memory transformers are still not dominant

Scaling law economics: increasing context from 128k to 1M tokens is easier than redesigning the architecture.
Parallel training constraint: persistent memory introduces recurrence, which hurts GPU throughput.
Stability: memory update rules can destabilize trillion-token training.
Infra lock-in: everything is built around a stateless transformer + KV cache; changing this is extremely expensive.

Convergence path (plausible trajectory, no single bet)

Short term (2025-2026)

Likely evolution: Transformer + production KV-cache improvements + retrieval augmentation.

KV cache becomes more systemized (paging / sharing / reuse) and begins to look like a runtime memory layer.
RAG remains the dominant mechanism for adding/updating external knowledge without retraining.

Mid term (2027-2029)

Likely evolution: Transformer + hierarchical KV memory + learned compression + limited/safe parameter updates.

Hierarchical KV: multiple tiers (fast recent-context + slower persistent tiers).
Learned compression: distill long context into compact state that can be carried forward.
Partial sparse updates: update only a small, well-isolated subset of parameters/state (vs. global dense finetuning).
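
One way to picture the hierarchical KV tier is a small fast tier backed by a larger slow tier, with demotion on overflow and promotion on access. The sketch below uses plain LRU as the policy and hypothetical names (`TieredKV`, string placeholders for KV blocks); a learned policy or learned compression would replace the LRU rule in the scenario described above.

```python
from collections import OrderedDict

class TieredKV:
    """Two-tier store: bounded fast tier (LRU) plus an unbounded slow tier."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # least recently used entry first
        self.slow = {}

    def put(self, key, value):
        self.slow.pop(key, None)
        self.fast[key] = value
        self.fast.move_to_end(key)
        while len(self.fast) > self.fast_capacity:
            # Demote the least recently used entry instead of discarding it.
            old_key, old_value = self.fast.popitem(last=False)
            self.slow[old_key] = old_value

    def get(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)
            return self.fast[key], "fast"
        if key in self.slow:
            value = self.slow[key]
            self.put(key, value)  # promote back into the fast tier
            return value, "slow"
        return None, "miss"

store = TieredKV(fast_capacity=2)
for block_id in range(3):
    store.put(block_id, f"kv{block_id}")

first = store.get(0)   # block 0 was demoted, so this hits the slow tier
second = store.get(0)  # now promoted, so this hits the fast tier
```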

Long term (5–10+ yrs)

One plausible direction (RecSys-inspired, but not guaranteed): a mostly-stable dense backbone plus massive sparse memory tables that support continual + personalized updates.

Dense core stays largely frozen for stability and compatibility.
Sparse memory tables (user / task / knowledge) carry most of the continually-updated, personalized signal.
Updates are sparse/localized (RecSys-style): only a tiny subset of the memory is touched per step.
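
The RecSys-style update pattern can be sketched as an embedding table where each step writes only the rows it touched. `SparseMemoryTable`, the key naming scheme (`user:1`, `task:x`), and the plain SGD rule are hypothetical choices for illustration; how such rows would be addressed and trained for language tasks is exactly the open question.

```python
class SparseMemoryTable:
    """Keyed embedding table with row-level (sparse) updates."""

    def __init__(self, dim):
        self.dim = dim
        self.rows = {}  # key -> vector; rows are created lazily

    def lookup(self, key):
        return self.rows.get(key, [0.0] * self.dim)

    def sparse_update(self, grads, lr=0.1):
        """Apply SGD only to the rows named in `grads` (key -> gradient vector).

        Returns the number of rows written this step.
        """
        for key, grad in grads.items():
            row = self.lookup(key)
            self.rows[key] = [w - lr * g for w, g in zip(row, grad)]
        return len(grads)

table = SparseMemoryTable(dim=2)
table.rows = {"user:1": [1.0, 1.0], "user:2": [2.0, 2.0], "task:x": [3.0, 3.0]}

# One step that only involves user 1: every other row is left untouched.
touched = table.sparse_update({"user:1": [1.0, -1.0]}, lr=0.5)
```

The dense backbone does not appear in the update at all, which matches the "frozen core, sparse personalized memory" split described above.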

Why this is only a possibility (not a claim about the future)

It requires a good answer to what should live in sparse memory for language tasks (and how it is addressed/read/written).
It requires safe online update rules that don’t destabilize the base model or create rapid distribution drift.
It requires clean integration with runtime memory (KV/hierarchical KV) and retrieval memory (RAG).
