Smart context
I wrote about context management in 2026 a couple weeks ago, covering compaction, truncation, subagents, tool search, and all the techniques people are using to keep agent context windows under control. But I left out the one approach that feels most obvious and least explored: just drop the stuff that's no longer relevant. Not summarize it. Not compress it. Drop it.
The append-only problem
Every agent framework today treats the context window like an append-only log. Messages go in, tool results go in, file contents go in. Nothing comes out unless you compact the entire thing or start a new session.

This is insane when you think about it. If you're coding and you load file A, work on it for ten turns, then move to file B, file A is still sitting in your context. Every token of it. Competing for the model's attention on every single inference call, even though you haven't touched it in fifteen minutes. The agent has tools. It can read file A again if it needs to. It will notice the file isn't in context and just go fetch it. So why are we paying the attention tax on tokens that aren't contributing anything?

Anthropic's engineering team calls this a finite "attention budget." The transformer architecture forces every token to attend to every other token. As context grows, that budget gets stretched thin. Google's Gemini team saw this directly when their agent played Pokémon: beyond 100k tokens, the agent started repeating past actions instead of developing new strategies. The context that was supposed to help became the thing holding it back.

Compaction and truncation are the standard answers, but they're both reactive. They kick in when you're already at the limit. What if instead you were continuously pruning the context as you go, keeping only what's actively relevant?
What smart eviction looks like
The idea is simple in concept: before every inference call, evaluate each block of context for relevance to the current task. If it's cold, drop it. If the agent needs it later, it re-fetches.

This mirrors how operating systems manage memory. Your OS doesn't keep every file you've ever opened in RAM. It maintains a working set of pages that are actively being used, and swaps cold pages to disk. When you access a swapped-out page, it gets loaded back in transparently. The context window should work the same way.

There are a few ways to implement the relevance scoring:

Semantic similarity. Before each inference pass, compute embeddings for each context block and for the current user message plus recent turns. Blocks below a similarity threshold are candidates for eviction. This is cheap (a fast embedding model adds milliseconds), works without access to model internals, and handles topic shifts naturally. The downside is that it can miss context that's semantically distant but logically important, like an architectural decision from early in the session that constrains everything you're doing now.

Task tagging and dependency tracking. Tag each context block with metadata: which task it relates to, which files it references, when it was last "used" (referenced in the model's output). When the user shifts from task A to task B, blocks tagged to task A become eviction candidates. This handles multi-task workflows well and mirrors how developers actually think about context, but it requires robust task-shift detection.

Attention-based scoring. After each inference pass, look at which parts of the context the model actually attended to. If a block hasn't received meaningful attention for N turns, it's stale. This is the most direct signal: you're literally measuring what the model finds useful. But it requires access to attention weights, which most API providers don't expose.

The manifest pattern. This is the one I find most elegant. Instead of fully evicting a block, you replace it with a one-line summary in a manifest section: "File A (auth module) was loaded at turn 3. Use readFile to access again if needed." The model retains awareness that the information exists and where to find it, but you've gone from 500 tokens to 15. If the model needs that file, it reads the manifest entry, calls the tool, and gets fresh content.

The manifest pattern is essentially a page table for your context window. Cold pages get swapped out, but their entries remain in the table so the system knows they exist.
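To make the manifest pattern concrete, here's a minimal sketch in Python. Everything in it is an assumption for illustration: the ContextBlock structure, the embed function (any fast text-embedding model), and the 0.35 threshold would all need tuning against a real agent loop.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ContextBlock:
    block_id: str
    source: str        # e.g. "readFile:src/auth.py"
    content: str
    turn_loaded: int
    pinned: bool = False


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def evict_cold_blocks(blocks, query_text, embed, threshold=0.35):
    """Replace low-relevance blocks with one-line manifest entries.

    `embed` is any text -> vector function; the threshold is illustrative.
    Returns the blocks to keep plus a manifest string to inject into context.
    """
    query_vec = embed(query_text)
    kept, manifest = [], []
    for block in blocks:
        if block.pinned or cosine(embed(block.content), query_vec) >= threshold:
            kept.append(block)
        else:
            # Swap the block out, but leave a breadcrumb so the model knows
            # the content exists and how to get it back.
            manifest.append(
                f"- {block.source} (loaded at turn {block.turn_loaded}): evicted; "
                "re-fetch with the original tool if needed."
            )
    manifest_text = ""
    if manifest:
        manifest_text = "Evicted context manifest:\n" + "\n".join(manifest)
    return kept, manifest_text
```

The manifest string goes back into the prompt in place of the evicted blocks, which is what keeps re-fetching possible.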
The people working on this
This isn't entirely theoretical. There's active research and implementation happening at multiple levels.

At the model level, Locret (presented at NeurIPS 2025) trains lightweight "retaining heads" that score which KV cache entries to evict during inference. It achieves 128K+ context on a single NVIDIA 4090 without quality loss. KeyDiff (also NeurIPS 2025) takes a different approach, using key similarity in the attention mechanism to identify which tokens are geometrically distinctive and worth keeping. Both operate at the KV cache level, below the application layer, but they prove the core concept: you can selectively evict context without destroying coherence.

At the application level, Cursor's dynamic context discovery pattern is the closest to a production implementation. Their philosophy is that agents should fetch context on demand rather than front-loading everything. The natural extension, which they haven't fully shipped yet, is actively evicting context that was fetched earlier but is no longer needed.

JetBrains Research published work in late 2025 specifically studying how agent-generated context becomes noise over time. Their finding was clear: selective pruning outperforms letting context accumulate, even when the context window isn't full yet. The problem isn't just running out of space. It's that irrelevant context actively degrades performance.

The "Bill of Lading" architecture proposes a manifest-based system where eviction decisions accumulate during the conversation and the KV cache gets rebuilt asynchronously in the background while the user is typing. You never wait for eviction because it happens continuously.
Why nobody has shipped this yet
If the concept is sound and the research exists, why isn't every agent framework doing this? The answer is integration.
In Claude Code, Cursor, ChatGPT, or any hosted agent, the conversation history is an internal messages array that gets passed to the model on every turn. There is no external hook to mutate it. MCP tools are called by the agent. They don't wrap around the agent's inference loop. You can't build a standard MCP server that reaches into the host's context and removes messages.
This creates a layered set of integration options, each with different tradeoffs:
The proxy approach is the deepest external integration. You sit between the agent and the LLM API, intercepting every request. You maintain a manifest of all context blocks, score them for relevance, reconstruct the messages array with cold blocks evicted, and forward the pruned request. LiteLLM proxy already sits in this position for many deployments, making it a natural extension point. The risk is that the agent might get confused when it "remembers" asking about something but can't see the content. That's why the manifest pattern matters: you leave breadcrumbs.
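As a sketch of what the proxy would do on each request (this is not LiteLLM's actual hook API): rather than deleting cold tool results outright, it stubs their content, which preserves the tool-call/tool-result pairing that most chat APIs validate while still reclaiming the tokens. The `is_cold` callback is a placeholder for whichever relevance scorer you use.

```python
def prune_request(messages: list[dict], is_cold) -> list[dict]:
    """Stub out cold tool results before forwarding the request upstream.

    `is_cold(msg, index, messages)` is a placeholder for any relevance
    scorer: embeddings, task tags, or last-referenced-turn tracking.
    """
    pruned = []
    for i, msg in enumerate(messages):
        if msg.get("role") == "tool" and is_cold(msg, i, messages):
            # Keep the message so the tool_call/result pairing stays valid,
            # but replace its body with a breadcrumb.
            pruned.append({
                **msg,
                "content": "[evicted: stale tool result; re-run the tool if needed]",
            })
        else:
            pruned.append(msg)
    return pruned
```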
The MCP server approach is more practical but indirect. You expose tools like loadContext, switchTask, and getRelevantContext that act as a smart file manager. The agent still has old content in its history, but the MCP server guides it toward re-fetching fresh content instead of relying on stale cached context. You pair this with a skill file that instructs: "Before using file content from earlier in the conversation, call getRelevantContext to check if it's still current." You're not removing tokens, but you're influencing behavior.
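A sketch of that server, assuming the MCP Python SDK's FastMCP interface; the in-memory store and staleness check are placeholders, and a real version would compare file mtimes, turn counters, or relevance scores.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("context-manager")

# Hypothetical in-memory record of what the agent has loaded and when.
loaded: dict[str, dict] = {}


def is_stale(entry: dict) -> bool:
    # Placeholder: compare file mtimes, turn counters, or relevance scores.
    return entry.get("stale", True)


@mcp.tool()
def getRelevantContext(path: str) -> str:
    """Return fresh content if the copy from earlier in the conversation is stale."""
    entry = loaded.get(path)
    if entry is None or is_stale(entry):
        with open(path) as f:
            content = f.read()
        loaded[path] = {"content": content, "stale": False}
        return content
    return "The copy already in context is still current."


if __name__ == "__main__":
    mcp.run()
```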
The skill approach is the lightest touch. A SKILL.md file that instructs the agent to treat old context as stale and re-read files instead of relying on memory. This is "soft eviction" through behavioral instructions. It sounds too simple to work, but it's surprisingly effective because the model will actually deprioritize content it's been told is irrelevant.
The native approach is where the real opportunity is. If you own the inference loop, you own the context. You can tag every block that enters context with an ID, source, timestamp, and relevance score. You can score on every turn. You can evict or compress cold blocks into manifest entries. You can track re-fetch patterns and bump frequently-accessed blocks' priority so they stay resident.
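If you do own the loop, the bookkeeping isn't exotic. A rough sketch, with illustrative thresholds and a `score` callback standing in for whichever relevance signal you choose:

```python
import time
from dataclasses import dataclass, field


@dataclass
class TrackedBlock:
    block_id: str
    source: str                      # tool call or file that produced the block
    content: str
    added_at: float = field(default_factory=time.time)
    relevance: float = 1.0
    refetch_count: int = 0
    pinned: bool = False


EVICT_BELOW = 0.3          # illustrative relevance threshold
PIN_AFTER_REFETCHES = 2    # frequent re-fetches earn permanent residency


def rescore_and_evict(blocks, score, query):
    """Run before each inference call: keep hot blocks, manifest the rest."""
    resident, manifest = [], []
    for b in blocks:
        b.relevance = score(b, query)   # embeddings, task tags, attention, ...
        if b.pinned or b.relevance >= EVICT_BELOW:
            resident.append(b)
        else:
            manifest.append(f"- {b.source}: evicted, re-fetch if needed")
    return resident, manifest


def record_refetch(block: TrackedBlock, fresh_content: str) -> None:
    """The agent re-fetched something we evicted: count it, pin repeat offenders."""
    block.content = fresh_content
    block.refetch_count += 1
    if block.refetch_count >= PIN_AFTER_REFETCHES:
        block.pinned = True
```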
The hard questions
Smart eviction sounds clean in theory. In practice, there are genuinely difficult problems to solve.

Sleeper context. Some information seems irrelevant for twenty turns and then suddenly becomes critical. An architectural decision from the start of the session constrains a choice you're making now. If you evicted it at turn five, the agent might make an inconsistent decision at turn twenty-five. The manifest pattern helps here: the model can see that the decision existed and re-fetch it. But it requires the model to know it should look.

Eviction threshold tuning. Too aggressive and the agent wastes turns re-fetching things it just dropped. Too conservative and you're back to context bloat. The right threshold probably varies by task type: coding sessions have different context dynamics than research tasks or creative writing.

Re-fetch cost. Every eviction that triggers a re-fetch costs a tool call, which costs tokens and latency. If you're evicting and re-fetching the same block repeatedly, you're worse off than if you'd just kept it. A good system needs to track re-fetch patterns and promote frequently-accessed blocks to "pinned" status.

Evaluation. How do you measure whether your eviction policy is actually helping? You need metrics that capture both token efficiency and task completion quality. A system that uses 50% fewer tokens but gets the wrong answer 10% more often isn't a win.
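For the evaluation question, the minimum viable setup is an offline comparison across logged sessions that tracks both sides of the tradeoff. The log fields below are assumptions for illustration, not a real logging format:

```python
from dataclasses import dataclass


@dataclass
class SessionLog:
    prompt_tokens: int     # total tokens sent to the model across all turns
    tool_calls: int        # all tool calls, including eviction-triggered re-fetches
    refetches: int         # re-reads of content that had been evicted
    task_completed: bool


def summarize(logs: list[SessionLog]) -> dict:
    """Aggregate one policy's sessions into token, re-fetch, and success metrics."""
    n = len(logs)
    return {
        "avg_prompt_tokens": sum(l.prompt_tokens for l in logs) / n,
        "refetch_rate": sum(l.refetches for l in logs) / max(1, sum(l.tool_calls for l in logs)),
        "success_rate": sum(l.task_completed for l in logs) / n,
    }


# Compare an eviction policy against a no-eviction baseline: a policy that
# saves tokens but drops success_rate is not a win.
```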
Where this is going
I think smart context eviction is one of the most underexplored areas in agent engineering right now. Everyone's focused on making models smarter or context windows bigger, but the real leverage is in what goes into the window in the first place. The teams that crack this, whether at the framework level, the proxy level, or the model level, will build agents that maintain coherence over hours instead of minutes. That's the difference between a tool you use for quick tasks and a tool you trust with sustained, complex work.

Context windows will keep growing. Models will keep getting better at handling long inputs. But the fundamental tension between context size and attention quality is architectural. It's not going away with the next model release. The solution isn't bigger windows. It's smarter management of what's inside them.
References
- Anthropic, "Effective context engineering for AI agents," Sep 2025. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- Drew Breunig, "How long contexts fail," Jun 2025. https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html
- Chroma Research, "Context Rot," 2025. https://research.trychroma.com/context-rot
- Locret, "Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads," NeurIPS 2025. https://arxiv.org/html/2410.01805v2
- KeyDiff, "Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference," NeurIPS 2025. https://neurips.cc/virtual/2025/poster/115521
- JetBrains Research, "Cutting through the noise: Smarter context management for LLM-powered agents," Dec 2025. https://blog.jetbrains.com/research/2025/12/efficient-context-management/
- Michael Bee, "The Bill of Lading: A Better Architecture for LLM Context Management," 2026. https://medium.com/@mbonsign/the-bill-of-lading-a-better-architecture-for-llm-context-management-834708af5ae0
- Pankaj, "Dynamic Context Discovery: How AI Agents Are Learning to Fetch What They Need," Jan 2026. https://medium.com/@pankaj_pandey/dynamic-context-discovery-how-ai-agents-are-learning-to-fetch-what-they-need-bbd387dc8ed8
- Teresa Torres, "Context Rot: Why AI Gets Worse the Longer You Chat," Feb 2026. https://www.producttalk.org/context-rot/