Skip to main content

Conversation Memory Planner

Sliding window versus running summary: token and cost math for chat memory designs.

100% client-side⌗ exact o200k tokenizer⌁ zero uploads
60
600
400
12
1.5k
8
$1.39

for the cheaper strategy — running summary wins this 60-turn session on Claude Sonnet 4.5, saving $0.15 over the alternative.

Sliding window (12 turns)$1.54

392.4k input · 24k output · drops everything older than 12 turns

Running summary (1.5k)$1.39

291.3k input · 34.5k output · 7 refresh calls included

The summary wins because re-sending 12 turns of raw history every call costs more than maintaining a 1.5k digest — even with 7 refresh calls billed at full rates.

Both strategies priced at Claude Sonnet 4.5 rates with no caching. Summary refreshes are modeled as extra calls: old summary + pending turns in, new summary out.

o200k
exact GPT tokenizer, in-browser
≈3.6
chars/token Claude estimate, documented
18
models in the cost dataset
0
network requests per keystroke

How it works

Every chat product over a few turns long needs a memory strategy, because models are stateless: whatever the assistant should "remember" must be re-sent as input tokens on every call. The two workhorse designs are the sliding window — keep the last N turns, drop the rest — and the running summary — maintain a compressed digest, refreshed periodically by an extra model call. Both work. They just have very different cost curves, and most teams pick one by vibes.

This planner runs both strategies over the same hypothetical session and prices them with verified per-token rates. For the window, turn k sends min(k, N) × average-turn-tokens of history. For the summary, each turn sends the digest plus the turns accumulated since the last refresh, and every Nth turn an additional call folds those pending turns into a new digest — that refresh call is billed honestly, input and output, at full rates.

The interesting output is the crossover. With the defaults — 60 turns, 600 tokens per turn, a 12-turn window against a 1,500-token summary refreshed every 8 turns — the strategies land close enough that the winner flips as you drag any slider. Shorter sessions favor the window; longer ones favor the summary; chattier turns accelerate the flip. The verdict line states which side of the crossover your configuration sits on and why.

Stated assumptions: turns are uniform in size (real conversations are lumpy, but uniform math brackets the decision), no prompt caching on either side, and summary refreshes never lose information they should keep — quality is your judgment call, cost is ours. Output tokens per turn are identical across strategies, so the difference you see is purely the memory design.

If your sessions resemble agentic coding more than chat — context growing monotonically with file reads and tool output — the Context Compaction Savings calculator models that shape directly, and the Agent Session Cost Estimator prices the uncompacted baseline. For measuring what your deployed assistant actually spends per conversation, that is what FORG does.

Frequently asked questions

What is the difference between a sliding window and a running summary?

A sliding window keeps the last N turns verbatim and silently drops everything older — simple, lossless within the window, total amnesia beyond it. A running summary maintains a compressed digest of the whole conversation, periodically refreshed by an extra model call. The window pays for raw recency; the summary pays refresh overhead for unlimited (lossy) history.

When does each strategy win on cost?

Short sessions and small windows favor the window: a few hundred tokens of recent history beats maintaining any summary. Long sessions, chatty turns, or wide windows flip it — re-sending twelve 600-token turns on every call costs more than a 1,500-token digest plus occasional refreshes. Slide the session length and watch the bars cross; that crossover is the real answer for your numbers.

Are the summary refresh calls included in the cost?

Yes, at full price. Every refresh is modeled as a real API call that reads the old summary plus all unsummarized turns as input and writes the new summary as output, at the selected model's verified rates. Calculators that treat summarization as free systematically flatter the summary strategy — ours bills it.

Which strategy is better for quality, not just cost?

They fail differently. Windows preserve exact wording but forget commitments made before the cutoff, which users notice as the bot contradicting itself. Summaries remember the whole arc but blur specifics — exact numbers, code identifiers, precise phrasing. Many production systems hybridize: a summary for the distant past plus a short verbatim window for recent turns.

Why no prompt caching in this comparison?

Caching helps both strategies and helps them differently per provider, which would bury the structural comparison under provider-specific assumptions. The uncached numbers isolate the memory-design decision. Once you pick a strategy, the Prompt Caching ROI Calculator will tell you what caching does to its absolute cost.

Built by FORG — AI cost observability for agentic coding. Free tools, no signup, nothing leaves your browser.

Learn about FORG