Memory-Transformer Research Tree

Carrying on from the last blog: https://shermwong.com/2026/02/22/recsys-for-real-time-ai-agents/ – which is a more high level depiction of possible future states of personalized, real-time AI agents. Here we dig into a deeper topic of how backbone transformer models has evolved so far on memory architectures, which is a key module to enable continuous adaptation of agents.

Current State

The way memory extensions are connected to transformer backbones are 3 ways:

  • Runtime Memory System: long context enabled KV cache layers + compression
  • Retrieval System: aka RAG + retrieval optimizations similar to existing information retrieval tasks (search + ranking + multi-stage + online update)
  • Learned Memory State: embedding modules that extended from transformer backbone layers.

0) Pre-history (foundation)

Transformer-XL / Compressive Transformer / RMT (Recurrent Memory Transformer) Goal: pass hidden state across segments as memory. Limitation: breaks parallelism, unstable at LLM scale. 

Core “origin” papers:


I) External memory (retrieval memory)

RAG / RETRO / kNN-LM / Memorizing Transformer Idea: memory lives outside the model (vector DB or KV store). Pros: scalable, stable, production-friendly. Cons: not differentiable global reasoning memory. 

Adoption: already dominant in real LLM systems. 

Origin / high-impact papers:


II) KV-cache + structured runtime memory (systems path)

PagedAttention / block KV / shared/persistent KV / compressed KV New direction (2024–2026):

  • KV reuse across sessions
  • hierarchical KV (fast + slow memory)
  • learned KV compression
  • Trellis-style bounded KV memory
  • prefix/prompt persistent memory

Why hot: inference cost dominates. This is infra-aligned with how real LLMs run. 

Adoption: already happening (vLLM, etc.). 

Prediction: KV cache will evolve into learned structured memory. This is a practical near-term direction. 

Origin / high-impact papers:

  • PagedAttention / vLLMEfficient Memory Management for Large Language Model Serving with PagedAttention (Kwon et al., 2023): https://arxiv.org/abs/2309.06180
  • Multi-Query Attention (decoding-side KV reduction primitive)Fast Transformer Decoding: One Write-Head is All You Need (Shazeer, 2019): https://arxiv.org/abs/1911.02150
  • “Trellis-style bounded KV memory”Trellis: Learning to Compress Key-Value Memory in Attention Models (2025): https://arxiv.org/abs/2512.23852

III) Learned internal memory modules (DeepSeek line)

A) Memory token / slot memory

Examples: memory tokens / prefix memory; Perceiver latent memory. Idea: fixed learnable memory slots persist across sequence.

Problems:

  • scaling unclear
  • capacity fixed
  • retrieval soft/implicit
  • weak for precise reasoning 

Adoption: low. 

Origin / high-impact papers:

B) Recurrent persistent memory (RMT lineage)

Examples: Recurrent Memory Transformer; associative memory transformer; LM2 large memory model. Idea: persistent hidden memory updated per segment.

Pros:

  • true long-term memory
  • infinite context theoretically
  • reasoning continuity 

Cons:

  • kills parallel training
  • gradient instability
  • hard to scale >100B
  • infra mismatch with GPUs

Adoption: research only.

Origin / high-impact papers:

C) Hierarchical / compressive memory

Examples: HMT; Infini-attention; compressive KV memory models. Idea: compress history into structured memory blocks.

Pros:

  • bounded memory: no unbounded KV-cache; must summarize / distill / select what to retain.
  • closer to brain-like hierarchy

Cons:

  • compression loss
  • training complexity 

Adoption: emerging but not dominant. 

Origin / high-impact papers:


What changed after DeepSeek

DeepSeek basically signaled: memory will become a first-class architecture primitive. Shift observed:

  • Before 2024: focus = longer context
  • After DeepSeek: focus = persistent memory + bounded compute

Alpha Paper: Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models (https://arxiv.org/html/2601.07372v1)

New papers now assume: context window scaling alone is not enough. 


Why memory transformers still not dominant

  1. Scaling law economics: increasing context from 128k → 1M is easier than redesigning architecture.
  2. Parallel training constraint: persistent memory introduces recurrence → bad for GPU throughput.
  3. Stability: memory update rules destabilize trillion-token training.
  4. Infra lock-in: everything is built around stateless transformer + KV cache; changing this is extremely expensive.

Convergence path (plausible trajectory, no single bet)

Short term (2025-2026)

Likely evolution: Transformer + production KV-cache improvements + retrieval augmentation.

  • KV cache becomes more systemized (paging / sharing / reuse) and begins to look like a runtime memory layer.
  • RAG remains the dominant mechanism for adding/updating external knowledge without retraining.

Mid term (2027-2029)

Likely evolution: Transformer + hierarchical KV memory + learned compression + limited/safe parameter updates.

  • Hierarchical KV: multiple tiers (fast recent-context + slower persistent tiers).
  • Learned compression: distill long context into compact state that can be carried forward.
  • Partial sparse updates: update only a small, well-isolated subset of parameters/state (vs. global dense finetuning).

Long term (5–10+ yrs)

One plausible direction (RecSys-inspired, but not guaranteed): a mostly-stable dense backbone plus massive sparse memory tables that support continual + personalized updates.

  • Dense core stays largely frozen for stability and compatibility.
  • Sparse memory tables (user / task / knowledge) carry most of the continually-updated, personalized signal.
  • Updates are sparse/localized (RecSys-style): only a tiny subset of the memory is touched per step.

Why this is only a possibility (not a claim about the future)

  • It requires a good answer to what should live in sparse memory for language tasks (and how it is addressed/read/written).
  • It requires safe online update rules that don’t destabilize the base model or create rapid distribution drift.
  • It requires clean integration with runtime memory (KV/hierarchical KV) and retrieval memory (RAG).

Leave a comment